{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Maximum Likelihood Estimation (Generic models)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial explains how to quickly implement new maximum likelihood models in `statsmodels`. We give two examples: \n", "\n", "1. Probit model for binary dependent variables\n", "2. Negative binomial model for count data\n", "\n", "The `GenericLikelihoodModel` class eases the process by providing tools such as automatic numeric differentiation and a unified interface to ``scipy`` optimization functions. Using ``statsmodels``, users can fit new MLE models simply by \"plugging-in\" a log-likelihood function. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Probit model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:22.815627Z", "iopub.status.busy": "2021-02-02T06:56:22.814837Z", "iopub.status.idle": "2021-02-02T06:56:25.802721Z", "shell.execute_reply": "2021-02-02T06:56:25.803961Z" } }, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "import statsmodels.api as sm\n", "from statsmodels.base.model import GenericLikelihoodModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``Spector`` dataset is distributed with ``statsmodels``. You can access a vector of values for the dependent variable (``endog``) and a matrix of regressors (``exog``) like this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:25.809191Z", "iopub.status.busy": "2021-02-02T06:56:25.807749Z", "iopub.status.idle": "2021-02-02T06:56:25.849105Z", "shell.execute_reply": "2021-02-02T06:56:25.850116Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "::\n", "\n", " Number of Observations - 32\n", "\n", " Number of Variables - 4\n", "\n", " Variable name definitions::\n", "\n", " Grade - binary variable indicating whether or not a student's grade\n", " improved. 1 indicates an improvement.\n", " TUCE - Test score on economics test\n", " PSI - participation in program\n", " GPA - Student's grade point average\n", "\n", " GPA TUCE PSI\n", "0 2.66 20.0 0.0\n", "1 2.89 22.0 0.0\n", "2 3.28 24.0 0.0\n", "3 2.92 12.0 0.0\n", "4 4.00 21.0 0.0\n" ] } ], "source": [ "data = sm.datasets.spector.load_pandas()\n", "exog = data.exog\n", "endog = data.endog\n", "print(sm.datasets.spector.NOTE)\n", "print(data.exog.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Them, we add a constant to the matrix of regressors:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:25.855157Z", "iopub.status.busy": "2021-02-02T06:56:25.853763Z", "iopub.status.idle": "2021-02-02T06:56:25.871869Z", "shell.execute_reply": "2021-02-02T06:56:25.872873Z" } }, "outputs": [], "source": [ "exog = sm.add_constant(exog, prepend=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create your own Likelihood Model, you simply need to overwrite the loglike method." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:25.877425Z", "iopub.status.busy": "2021-02-02T06:56:25.876008Z", "iopub.status.idle": "2021-02-02T06:56:25.897880Z", "shell.execute_reply": "2021-02-02T06:56:25.898898Z" } }, "outputs": [], "source": [ "class MyProbit(GenericLikelihoodModel):\n", " def loglike(self, params):\n", " exog = self.exog\n", " endog = self.endog\n", " q = 2 * endog - 1\n", " return stats.norm.logcdf(q*np.dot(exog, params)).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimate the model and print a summary:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:25.903549Z", "iopub.status.busy": "2021-02-02T06:56:25.902148Z", "iopub.status.idle": "2021-02-02T06:56:26.273584Z", "shell.execute_reply": "2021-02-02T06:56:26.274617Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.400588\n", " Iterations: 292\n", " Function evaluations: 494\n", " MyProbit Results \n", "==============================================================================\n", "Dep. Variable: GRADE Log-Likelihood: -12.819\n", "Model: MyProbit AIC: 33.64\n", "Method: Maximum Likelihood BIC: 39.50\n", "Date: Tue, 02 Feb 2021 \n", "Time: 06:56:26 \n", "No. Observations: 32 \n", "Df Residuals: 28 \n", "Df Model: 3 \n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -7.4523 2.542 -2.931 0.003 -12.435 -2.469\n", "GPA 1.6258 0.694 2.343 0.019 0.266 2.986\n", "TUCE 0.0517 0.084 0.617 0.537 -0.113 0.216\n", "PSI 1.4263 0.595 2.397 0.017 0.260 2.593\n", "==============================================================================\n" ] } ], "source": [ "sm_probit_manual = MyProbit(endog, exog).fit()\n", "print(sm_probit_manual.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare your Probit implementation to ``statsmodels``' \"canned\" implementation:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.279376Z", "iopub.status.busy": "2021-02-02T06:56:26.277964Z", "iopub.status.idle": "2021-02-02T06:56:26.303115Z", "shell.execute_reply": "2021-02-02T06:56:26.302034Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.400588\n", " Iterations 6\n" ] } ], "source": [ "sm_probit_canned = sm.Probit(endog, exog).fit()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.310241Z", "iopub.status.busy": "2021-02-02T06:56:26.309476Z", "iopub.status.idle": "2021-02-02T06:56:26.317669Z", "shell.execute_reply": "2021-02-02T06:56:26.318686Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "const -7.452320\n", "GPA 1.625810\n", "TUCE 0.051729\n", "PSI 1.426332\n", "dtype: float64\n", "[-7.45233176 1.62580888 0.05172971 1.42631954]\n" ] } ], "source": [ "print(sm_probit_canned.params)\n", "print(sm_probit_manual.params)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": 
"2021-02-02T06:56:26.322838Z", "iopub.status.busy": "2021-02-02T06:56:26.322044Z", "iopub.status.idle": "2021-02-02T06:56:26.341419Z", "shell.execute_reply": "2021-02-02T06:56:26.342395Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " const GPA TUCE PSI\n", "const 6.464166 -1.169668 -0.101173 -0.594792\n", "GPA -1.169668 0.481473 -0.018914 0.105439\n", "TUCE -0.101173 -0.018914 0.007038 0.002472\n", "PSI -0.594792 0.105439 0.002472 0.354070\n", "[[ 6.46416770e+00 -1.16966617e+00 -1.01173181e-01 -5.94789009e-01]\n", " [-1.16966617e+00 4.81472117e-01 -1.89134591e-02 1.05438228e-01]\n", " [-1.01173181e-01 -1.89134591e-02 7.03758403e-03 2.47189233e-03]\n", " [-5.94789009e-01 1.05438228e-01 2.47189233e-03 3.54069514e-01]]\n" ] } ], "source": [ "print(sm_probit_canned.cov_params())\n", "print(sm_probit_manual.cov_params())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the ``GenericMaximumLikelihood`` class provides automatic differentiation, so we did not have to provide Hessian or Score functions in order to calculate the covariance estimates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Example 2: Negative Binomial Regression for Count Data\n", "\n", "Consider a negative binomial regression model for count data with\n", "log-likelihood (type NB-2) function expressed as:\n", "\n", "$$\n", " \\mathcal{L}(\\beta_j; y, \\alpha) = \\sum_{i=1}^n y_i ln \n", " \\left ( \\frac{\\alpha exp(X_i'\\beta)}{1+\\alpha exp(X_i'\\beta)} \\right ) -\n", " \\frac{1}{\\alpha} ln(1+\\alpha exp(X_i'\\beta)) + ln \\Gamma (y_i + 1/\\alpha) - ln \\Gamma (y_i+1) - ln \\Gamma (1/\\alpha)\n", "$$\n", "\n", "with a matrix of regressors $X$, a vector of coefficients $\\beta$,\n", "and the negative binomial heterogeneity parameter $\\alpha$. 
\n", "\n", "Using the ``nbinom`` distribution from ``scipy``, we can write this likelihood\n", "simply as:\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.347351Z", "iopub.status.busy": "2021-02-02T06:56:26.345957Z", "iopub.status.idle": "2021-02-02T06:56:26.357615Z", "shell.execute_reply": "2021-02-02T06:56:26.358620Z" } }, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import nbinom" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.365090Z", "iopub.status.busy": "2021-02-02T06:56:26.361445Z", "iopub.status.idle": "2021-02-02T06:56:26.368085Z", "shell.execute_reply": "2021-02-02T06:56:26.369062Z" } }, "outputs": [], "source": [ "def _ll_nb2(y, X, beta, alph):\n", " mu = np.exp(np.dot(X, beta))\n", " size = 1/alph\n", " prob = size/(size+mu)\n", " ll = nbinom.logpmf(y, size, prob)\n", " return ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New Model Class\n", "\n", "We create a new model class which inherits from ``GenericLikelihoodModel``:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.374767Z", "iopub.status.busy": "2021-02-02T06:56:26.372091Z", "iopub.status.idle": "2021-02-02T06:56:26.381637Z", "shell.execute_reply": "2021-02-02T06:56:26.382862Z" } }, "outputs": [], "source": [ "from statsmodels.base.model import GenericLikelihoodModel" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.392938Z", "iopub.status.busy": "2021-02-02T06:56:26.392153Z", "iopub.status.idle": "2021-02-02T06:56:26.413339Z", "shell.execute_reply": "2021-02-02T06:56:26.414347Z" } }, "outputs": [], "source": [ "class NBin(GenericLikelihoodModel):\n", " def __init__(self, endog, exog, **kwds):\n", " super(NBin, self).__init__(endog, exog, **kwds)\n", " \n", " def nloglikeobs(self, params):\n", " alph = params[-1]\n", " beta = params[:-1]\n", " ll = _ll_nb2(self.endog, self.exog, beta, alph)\n", " return -ll \n", " \n", " def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):\n", " # we have one additional parameter and we need to add it for summary\n", " self.exog_names.append('alpha')\n", " if start_params == None:\n", " # Reasonable starting values\n", " start_params = np.append(np.zeros(self.exog.shape[1]), .5)\n", " # intercept\n", " start_params[-2] = np.log(self.endog.mean())\n", " return super(NBin, self).fit(start_params=start_params, \n", " maxiter=maxiter, maxfun=maxfun, \n", " **kwds) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two important things to notice: \n", "\n", "+ ``nloglikeobs``: This function should return one evaluation of the negative log-likelihood function per observation in your dataset (i.e. rows of the endog/X matrix). \n", "+ ``start_params``: A one-dimensional array of starting values needs to be provided. The size of this array determines the number of parameters that will be used in optimization.\n", " \n", "That's it! You're done!\n", "\n", "### Usage Example\n", "\n", "The [Medpar](https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/doc/COUNT/medpar.html)\n", "dataset is hosted in CSV format at the [Rdatasets repository](https://raw.githubusercontent.com/vincentarelbundock/Rdatasets). 
We use the ``read_csv``\n", "function from the [Pandas library](https://pandas.pydata.org) to load the data\n", "in memory. We then print the first few rows: \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.419157Z", "iopub.status.busy": "2021-02-02T06:56:26.417757Z", "iopub.status.idle": "2021-02-02T06:56:26.423157Z", "shell.execute_reply": "2021-02-02T06:56:26.424165Z" } }, "outputs": [], "source": [ "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:56:26.429139Z", "iopub.status.busy": "2021-02-02T06:56:26.427792Z", "iopub.status.idle": "2021-02-02T06:56:26.578138Z", "shell.execute_reply": "2021-02-02T06:56:26.578766Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | los | \n", "hmo | \n", "white | \n", "died | \n", "age80 | \n", "type | \n", "type1 | \n", "type2 | \n", "type3 | \n", "provnum | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "4 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
1 | \n", "9 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
2 | \n", "3 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
3 | \n", "9 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
4 | \n", "1 | \n", "0 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "