{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Maximum Likelihood Estimation (Generic models)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial explains how to quickly implement new maximum likelihood models in `statsmodels`. We give two examples: \n", "\n", "1. Probit model for binary dependent variables\n", "2. Negative binomial model for count data\n", "\n", "The `GenericLikelihoodModel` class eases the process by providing tools such as automatic numeric differentiation and a unified interface to ``scipy`` optimization functions. Using ``statsmodels``, users can fit new MLE models simply by \"plugging-in\" a log-likelihood function. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Probit model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:02.244468Z", "iopub.status.busy": "2022-11-02T17:02:02.243988Z", "iopub.status.idle": "2022-11-02T17:02:03.035527Z", "shell.execute_reply": "2022-11-02T17:02:03.034852Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "import statsmodels.api as sm\n", "from statsmodels.base.model import GenericLikelihoodModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``Spector`` dataset is distributed with ``statsmodels``. You can access a vector of values for the dependent variable (``endog``) and a matrix of regressors (``exog``) like this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.041139Z", "iopub.status.busy": "2022-11-02T17:02:03.039811Z", "iopub.status.idle": "2022-11-02T17:02:03.063535Z", "shell.execute_reply": "2022-11-02T17:02:03.062885Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "::\n", "\n", " Number of Observations - 32\n", "\n", " Number of Variables - 4\n", "\n", " Variable name definitions::\n", "\n", " Grade - binary variable indicating whether or not a student's grade\n", " improved. 1 indicates an improvement.\n", " TUCE - Test score on economics test\n", " PSI - participation in program\n", " GPA - Student's grade point average\n", "\n", " GPA TUCE PSI\n", "0 2.66 20.0 0.0\n", "1 2.89 22.0 0.0\n", "2 3.28 24.0 0.0\n", "3 2.92 12.0 0.0\n", "4 4.00 21.0 0.0\n" ] } ], "source": [ "data = sm.datasets.spector.load_pandas()\n", "exog = data.exog\n", "endog = data.endog\n", "print(sm.datasets.spector.NOTE)\n", "print(data.exog.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Them, we add a constant to the matrix of regressors:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.098156Z", "iopub.status.busy": "2022-11-02T17:02:03.097802Z", "iopub.status.idle": "2022-11-02T17:02:03.105076Z", "shell.execute_reply": "2022-11-02T17:02:03.104442Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "exog = sm.add_constant(exog, prepend=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create your own Likelihood Model, you simply need to overwrite the loglike method." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.108549Z", "iopub.status.busy": "2022-11-02T17:02:03.108105Z", "iopub.status.idle": "2022-11-02T17:02:03.112887Z", "shell.execute_reply": "2022-11-02T17:02:03.112307Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "class MyProbit(GenericLikelihoodModel):\n", " def loglike(self, params):\n", " exog = self.exog\n", " endog = self.endog\n", " q = 2 * endog - 1\n", " return stats.norm.logcdf(q*np.dot(exog, params)).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimate the model and print a summary:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.116309Z", "iopub.status.busy": "2022-11-02T17:02:03.115884Z", "iopub.status.idle": "2022-11-02T17:02:03.185137Z", "shell.execute_reply": "2022-11-02T17:02:03.184524Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.400588\n", " Iterations: 292\n", " Function evaluations: 494\n", " MyProbit Results \n", "==============================================================================\n", "Dep. Variable: GRADE Log-Likelihood: -12.819\n", "Model: MyProbit AIC: 33.64\n", "Method: Maximum Likelihood BIC: 39.50\n", "Date: Wed, 02 Nov 2022 \n", "Time: 17:02:03 \n", "No. Observations: 32 \n", "Df Residuals: 28 \n", "Df Model: 3 \n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -7.4523 2.542 -2.931 0.003 -12.435 -2.469\n", "GPA 1.6258 0.694 2.343 0.019 0.266 2.986\n", "TUCE 0.0517 0.084 0.617 0.537 -0.113 0.216\n", "PSI 1.4263 0.595 2.397 0.017 0.260 2.593\n", "==============================================================================\n" ] } ], "source": [ "sm_probit_manual = MyProbit(endog, exog).fit()\n", "print(sm_probit_manual.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare your Probit implementation to ``statsmodels``' \"canned\" implementation:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.190090Z", "iopub.status.busy": "2022-11-02T17:02:03.188948Z", "iopub.status.idle": "2022-11-02T17:02:03.202540Z", "shell.execute_reply": "2022-11-02T17:02:03.201962Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.400588\n", " Iterations 6\n" ] } ], "source": [ "sm_probit_canned = sm.Probit(endog, exog).fit()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.206960Z", "iopub.status.busy": "2022-11-02T17:02:03.205833Z", "iopub.status.idle": "2022-11-02T17:02:03.212802Z", "shell.execute_reply": "2022-11-02T17:02:03.212238Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "const -7.452320\n", "GPA 1.625810\n", "TUCE 0.051729\n", "PSI 1.426332\n", "dtype: float64\n", "[-7.45233176 1.62580888 0.05172971 1.42631954]\n" ] } ], "source": [ "print(sm_probit_canned.params)\n", "print(sm_probit_manual.params)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.217126Z", "iopub.status.busy": "2022-11-02T17:02:03.216030Z", "iopub.status.idle": "2022-11-02T17:02:03.233707Z", "shell.execute_reply": "2022-11-02T17:02:03.233133Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " const GPA TUCE PSI\n", "const 6.464166 -1.169668 -0.101173 -0.594792\n", "GPA -1.169668 0.481473 -0.018914 0.105439\n", "TUCE -0.101173 -0.018914 0.007038 0.002472\n", "PSI -0.594792 0.105439 0.002472 0.354070\n", "[[ 6.46416769e+00 -1.16966616e+00 -1.01173181e-01 -5.94788997e-01]\n", " [-1.16966616e+00 4.81472112e-01 -1.89134585e-02 1.05438225e-01]\n", " [-1.01173181e-01 -1.89134585e-02 7.03758394e-03 2.47189243e-03]\n", " [-5.94788997e-01 1.05438225e-01 2.47189243e-03 3.54069512e-01]]\n" ] } ], "source": [ "print(sm_probit_canned.cov_params())\n", "print(sm_probit_manual.cov_params())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the ``GenericMaximumLikelihood`` class provides automatic differentiation, so we did not have to provide Hessian or Score functions in order to calculate the covariance estimates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Example 2: Negative Binomial Regression for Count Data\n", "\n", "Consider a negative binomial regression model for count data with\n", "log-likelihood (type NB-2) function expressed as:\n", "\n", "$$\n", " \\mathcal{L}(\\beta_j; y, \\alpha) = \\sum_{i=1}^n y_i ln \n", " \\left ( \\frac{\\alpha exp(X_i'\\beta)}{1+\\alpha exp(X_i'\\beta)} \\right ) -\n", " \\frac{1}{\\alpha} ln(1+\\alpha exp(X_i'\\beta)) + ln \\Gamma (y_i + 1/\\alpha) - ln \\Gamma (y_i+1) - ln \\Gamma (1/\\alpha)\n", "$$\n", "\n", "with a matrix of regressors $X$, a vector of coefficients $\\beta$,\n", "and the negative binomial heterogeneity parameter $\\alpha$. \n", "\n", "Using the ``nbinom`` distribution from ``scipy``, we can write this likelihood\n", "simply as:\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.237198Z", "iopub.status.busy": "2022-11-02T17:02:03.236979Z", "iopub.status.idle": "2022-11-02T17:02:03.241086Z", "shell.execute_reply": "2022-11-02T17:02:03.240509Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import nbinom" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.244214Z", "iopub.status.busy": "2022-11-02T17:02:03.243802Z", "iopub.status.idle": "2022-11-02T17:02:03.250459Z", "shell.execute_reply": "2022-11-02T17:02:03.249883Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "def _ll_nb2(y, X, beta, alph):\n", " mu = np.exp(np.dot(X, beta))\n", " size = 1/alph\n", " prob = size/(size+mu)\n", " ll = nbinom.logpmf(y, size, prob)\n", " return ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New Model Class\n", "\n", "We create a new model class which inherits from ``GenericLikelihoodModel``:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.253668Z", "iopub.status.busy": "2022-11-02T17:02:03.253245Z", "iopub.status.idle": "2022-11-02T17:02:03.257763Z", "shell.execute_reply": "2022-11-02T17:02:03.257232Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from statsmodels.base.model import GenericLikelihoodModel" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.260887Z", "iopub.status.busy": "2022-11-02T17:02:03.260465Z", "iopub.status.idle": "2022-11-02T17:02:03.270129Z", "shell.execute_reply": "2022-11-02T17:02:03.269576Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "class NBin(GenericLikelihoodModel):\n", " def __init__(self, endog, exog, **kwds):\n", " super(NBin, self).__init__(endog, exog, **kwds)\n", " \n", " def nloglikeobs(self, params):\n", " alph = params[-1]\n", " beta = params[:-1]\n", " ll = _ll_nb2(self.endog, self.exog, beta, alph)\n", " return -ll \n", " \n", " def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):\n", " # we have one additional parameter and we need to add it for summary\n", " self.exog_names.append('alpha')\n", " if start_params == None:\n", " # Reasonable starting values\n", " start_params = np.append(np.zeros(self.exog.shape[1]), .5)\n", " # intercept\n", " start_params[-2] = np.log(self.endog.mean())\n", " return super(NBin, self).fit(start_params=start_params, \n", " maxiter=maxiter, maxfun=maxfun, \n", " **kwds) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two important things to notice: \n", "\n", "+ ``nloglikeobs``: This function should return one evaluation of the negative log-likelihood function per observation in your dataset (i.e. rows of the endog/X matrix). \n", "+ ``start_params``: A one-dimensional array of starting values needs to be provided. The size of this array determines the number of parameters that will be used in optimization.\n", " \n", "That's it! You're done!\n", "\n", "### Usage Example\n", "\n", "The [Medpar](https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/doc/COUNT/medpar.html)\n", "dataset is hosted in CSV format at the [Rdatasets repository](https://raw.githubusercontent.com/vincentarelbundock/Rdatasets). We use the ``read_csv``\n", "function from the [Pandas library](https://pandas.pydata.org) to load the data\n", "in memory. We then print the first few columns: \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.275370Z", "iopub.status.busy": "2022-11-02T17:02:03.273799Z", "iopub.status.idle": "2022-11-02T17:02:03.279672Z", "shell.execute_reply": "2022-11-02T17:02:03.279143Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:02:03.283729Z", "iopub.status.busy": "2022-11-02T17:02:03.282887Z", "iopub.status.idle": "2022-11-02T17:02:03.542700Z", "shell.execute_reply": "2022-11-02T17:02:03.541997Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", " | los | \n", "hmo | \n", "white | \n", "died | \n", "age80 | \n", "type | \n", "type1 | \n", "type2 | \n", "type3 | \n", "provnum | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "4 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
1 | \n", "9 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
2 | \n", "3 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
3 | \n", "9 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "
4 | \n", "1 | \n", "0 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "30001 | \n", "