{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formulas: Fitting models using R-style formulas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n", "\n", "* [Patsy formula language description](http://patsy.readthedocs.org/)\n", "\n", "## Loading modules and functions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.040422Z", "iopub.status.busy": "2022-11-02T17:10:38.039251Z", "iopub.status.idle": "2022-11-02T17:10:38.936251Z", "shell.execute_reply": "2022-11-02T17:10:38.935530Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import numpy as np # noqa:F401 needed in namespace for patsy\n", "import statsmodels.api as sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import convention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import explicitly from statsmodels.formula.api" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.943198Z", "iopub.status.busy": "2022-11-02T17:10:38.942834Z", "iopub.status.idle": "2022-11-02T17:10:38.947622Z", "shell.execute_reply": "2022-11-02T17:10:38.946886Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from statsmodels.formula.api import ols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can just use the `formula` namespace of the main `statsmodels.api`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.951350Z", "iopub.status.busy": "2022-11-02T17:10:38.951106Z", "iopub.status.idle": "2022-11-02T17:10:38.963087Z", "shell.execute_reply": "2022-11-02T17:10:38.962288Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ ">" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sm.formula.ols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can use the following convention" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.967187Z", "iopub.status.busy": "2022-11-02T17:10:38.966951Z", "iopub.status.idle": "2022-11-02T17:10:38.971020Z", "shell.execute_reply": "2022-11-02T17:10:38.970329Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These names are just a convenient way to get access to each model's `from_formula` classmethod. 
See, for instance" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.974683Z", "iopub.status.busy": "2022-11-02T17:10:38.974451Z", "iopub.status.idle": "2022-11-02T17:10:38.981152Z", "shell.execute_reply": "2022-11-02T17:10:38.980467Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ ">" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sm.OLS.from_formula" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the lower case models accept ``formula`` and ``data`` arguments, whereas upper case ones take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string which describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](https://pandas.pydata.org/) data frame or any other data structure that defines a ``__getitem__`` for variable names like a structured array or a dictionary of variables. \n", "\n", "``dir(sm.formula)`` will print a list of available models. \n", "\n", "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## OLS regression using formulas\n", "\n", "To begin, we fit the linear model described on the [Getting Started](./regression_diagnostics.html) page. Download the data, subset columns, and list-wise delete to remove missing observations:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:38.984932Z", "iopub.status.busy": "2022-11-02T17:10:38.984695Z", "iopub.status.idle": "2022-11-02T17:10:39.036875Z", "shell.execute_reply": "2022-11-02T17:10:39.036106Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.041046Z", "iopub.status.busy": "2022-11-02T17:10:39.040780Z", "iopub.status.idle": "2022-11-02T17:10:39.053489Z", "shell.execute_reply": "2022-11-02T17:10:39.052952Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LotteryLiteracyWealthRegion
0413773E
1385122N
2661361C
3804676E
4796983E
\n", "
" ], "text/plain": [ " Lottery Literacy Wealth Region\n", "0 41 37 73 E\n", "1 38 51 22 N\n", "2 66 13 61 C\n", "3 80 46 76 E\n", "4 79 69 83 E" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = dta.data[[\"Lottery\", \"Literacy\", \"Wealth\", \"Region\"]].dropna()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the model:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.058066Z", "iopub.status.busy": "2022-11-02T17:10:39.056923Z", "iopub.status.idle": "2022-11-02T17:10:39.076825Z", "shell.execute_reply": "2022-11-02T17:10:39.076257Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: Lottery R-squared: 0.338\n", "Model: OLS Adj. R-squared: 0.287\n", "Method: Least Squares F-statistic: 6.636\n", "Date: Wed, 02 Nov 2022 Prob (F-statistic): 1.07e-05\n", "Time: 17:10:39 Log-Likelihood: -375.30\n", "No. Observations: 85 AIC: 764.6\n", "Df Residuals: 78 BIC: 781.7\n", "Df Model: 6 \n", "Covariance Type: nonrobust \n", "===============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-------------------------------------------------------------------------------\n", "Intercept 38.6517 9.456 4.087 0.000 19.826 57.478\n", "Region[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938\n", "Region[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419\n", "Region[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943\n", "Region[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235\n", "Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232\n", "Wealth 0.4515 0.103 4.390 0.000 0.247 0.656\n", "==============================================================================\n", "Omnibus: 3.049 Durbin-Watson: 1.785\n", "Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694\n", "Skew: -0.340 Prob(JB): 0.260\n", "Kurtosis: 2.454 Cond. No. 371.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "mod = ols(formula=\"Lottery ~ Literacy + Wealth + Region\", data=df)\n", "res = mod.fit()\n", "print(res.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables\n", "\n", "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. 
`patsy`'s default is also to include an intercept, so we automatically dropped one of the *Region* categories.\n", "\n", "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.081450Z", "iopub.status.busy": "2022-11-02T17:10:39.080333Z", "iopub.status.idle": "2022-11-02T17:10:39.093464Z", "shell.execute_reply": "2022-11-02T17:10:39.092914Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept 38.651655\n", "C(Region)[T.E] -15.427785\n", "C(Region)[T.N] -10.016961\n", "C(Region)[T.S] -4.548257\n", "C(Region)[T.W] -10.091276\n", "Literacy -0.185819\n", "Wealth 0.451475\n", "dtype: float64\n" ] } ], "source": [ "res = ols(formula=\"Lottery ~ Literacy + Wealth + C(Region)\", data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Patsy's more advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](./contrasts.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators\n", "\n", "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix. \n", "\n", "## Removing variables\n", "\n", "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.098130Z", "iopub.status.busy": "2022-11-02T17:10:39.096984Z", "iopub.status.idle": "2022-11-02T17:10:39.114378Z", "shell.execute_reply": "2022-11-02T17:10:39.113621Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C(Region)[C] 38.651655\n", "C(Region)[E] 23.223870\n", "C(Region)[N] 28.634694\n", "C(Region)[S] 34.103399\n", "C(Region)[W] 28.560379\n", "Literacy -0.185819\n", "Wealth 0.451475\n", "dtype: float64\n" ] } ], "source": [ "res = ols(formula=\"Lottery ~ Literacy + Wealth + C(Region) -1 \", data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiplicative interactions\n", "\n", "\":\" adds a new column to the design matrix with the interaction of the other two columns. \"*\" will also include the individual columns that were multiplied together:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.118366Z", "iopub.status.busy": "2022-11-02T17:10:39.117875Z", "iopub.status.idle": "2022-11-02T17:10:39.136379Z", "shell.execute_reply": "2022-11-02T17:10:39.135817Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Literacy:Wealth 0.018176\n", "dtype: float64 \n", "\n", "Literacy 0.427386\n", "Wealth 1.080987\n", "Literacy:Wealth -0.013609\n", "dtype: float64\n" ] } ], "source": [ "res1 = ols(formula=\"Lottery ~ Literacy : Wealth - 1\", data=df).fit()\n", "res2 = ols(formula=\"Lottery ~ Literacy * Wealth - 1\", data=df).fit()\n", "print(res1.params, \"\\n\")\n", "print(res2.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many other things are possible with operators. 
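For example, the ``I()`` function protects arithmetic inside a formula, so that ``+`` means numerical addition rather than adding a column to the design matrix (a minimal sketch, reusing ``df`` and ``ols`` from above; ``res_sum`` is just an illustrative name):\n",
"\n",
"```python\n",
"# Treat Literacy + Wealth as a single combined regressor via I()\n",
"res_sum = ols(formula=\"Lottery ~ I(Literacy + Wealth)\", data=df).fit()\n",
"print(res_sum.params)\n",
"```\n",
"\n",
"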
Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions\n", "\n", "You can apply vectorized functions to the variables in your model: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.140799Z", "iopub.status.busy": "2022-11-02T17:10:39.140416Z", "iopub.status.idle": "2022-11-02T17:10:39.151472Z", "shell.execute_reply": "2022-11-02T17:10:39.150831Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept 115.609119\n", "np.log(Literacy) -20.393959\n", "dtype: float64\n" ] } ], "source": [ "res = smf.ols(formula=\"Lottery ~ np.log(Literacy)\", data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a custom function:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.155816Z", "iopub.status.busy": "2022-11-02T17:10:39.154733Z", "iopub.status.idle": "2022-11-02T17:10:39.165962Z", "shell.execute_reply": "2022-11-02T17:10:39.165391Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept 136.003079\n", "log_plus_1(Literacy) -20.393959\n", "dtype: float64\n" ] } ], "source": [ "def log_plus_1(x):\n", " return np.log(x) + 1.0\n", "\n", "\n", "res = smf.ols(formula=\"Lottery ~ log_plus_1(Literacy)\", data=df).fit()\n", "print(res.params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any function that is in the calling namespace is available to the formula." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using formulas with models that do not (yet) support them\n", "\n", "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n", "can then be fed to the fitting function as `endog` and `exog` arguments. 
\n", "\n", "To generate ``numpy`` arrays: " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.170332Z", "iopub.status.busy": "2022-11-02T17:10:39.169233Z", "iopub.status.idle": "2022-11-02T17:10:39.179717Z", "shell.execute_reply": "2022-11-02T17:10:39.179163Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[41.]\n", " [38.]\n", " [66.]\n", " [80.]\n", " [79.]]\n", "[[1.000e+00 3.700e+01 7.300e+01 2.701e+03]\n", " [1.000e+00 5.100e+01 2.200e+01 1.122e+03]\n", " [1.000e+00 1.300e+01 6.100e+01 7.930e+02]\n", " [1.000e+00 4.600e+01 7.600e+01 3.496e+03]\n", " [1.000e+00 6.900e+01 8.300e+01 5.727e+03]]\n" ] } ], "source": [ "import patsy\n", "\n", "f = \"Lottery ~ Literacy * Wealth\"\n", "y, X = patsy.dmatrices(f, df, return_type=\"matrix\")\n", "print(y[:5])\n", "print(X[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate pandas data frames: " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.183949Z", "iopub.status.busy": "2022-11-02T17:10:39.182875Z", "iopub.status.idle": "2022-11-02T17:10:39.196615Z", "shell.execute_reply": "2022-11-02T17:10:39.196065Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Lottery\n", "0 41.0\n", "1 38.0\n", "2 66.0\n", "3 80.0\n", "4 79.0\n", " Intercept Literacy Wealth Literacy:Wealth\n", "0 1.0 37.0 73.0 2701.0\n", "1 1.0 51.0 22.0 1122.0\n", "2 1.0 13.0 61.0 793.0\n", "3 1.0 46.0 76.0 3496.0\n", "4 1.0 69.0 83.0 5727.0\n" ] } ], "source": [ "f = \"Lottery ~ Literacy * Wealth\"\n", "y, X = patsy.dmatrices(f, df, return_type=\"dataframe\")\n", "print(y[:5])\n", "print(X[:5])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-11-02T17:10:39.200865Z", "iopub.status.busy": "2022-11-02T17:10:39.199770Z", "iopub.status.idle": "2022-11-02T17:10:39.212102Z", "shell.execute_reply": "2022-11-02T17:10:39.211548Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: Lottery R-squared: 0.309\n", "Model: OLS Adj. R-squared: 0.283\n", "Method: Least Squares F-statistic: 12.06\n", "Date: Wed, 02 Nov 2022 Prob (F-statistic): 1.32e-06\n", "Time: 17:10:39 Log-Likelihood: -377.13\n", "No. Observations: 85 AIC: 762.3\n", "Df Residuals: 81 BIC: 772.0\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "Intercept 38.6348 15.825 2.441 0.017 7.149 70.121\n", "Literacy -0.3522 0.334 -1.056 0.294 -1.016 0.312\n", "Wealth 0.4364 0.283 1.544 0.126 -0.126 0.999\n", "Literacy:Wealth -0.0005 0.006 -0.085 0.933 -0.013 0.012\n", "==============================================================================\n", "Omnibus: 4.447 Durbin-Watson: 1.953\n", "Prob(Omnibus): 0.108 Jarque-Bera (JB): 3.228\n", "Skew: -0.332 Prob(JB): 0.199\n", "Kurtosis: 2.314 Cond. No. 
1.40e+04\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.4e+04. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n" ] } ], "source": [ "print(sm.OLS(y, X).fit().summary())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }