.. _formula_examples:
Fitting models using R-style formulas
=====================================
Since version 0.5.0, ``statsmodels`` allows users to fit statistical
models using R-style formulas. Internally, ``statsmodels`` uses the
`patsy `_ package to convert formulas and
data to the matrices that are used in model fitting. The formula
framework is quite powerful; this tutorial only scratches the surface. A
full description of the formula language can be found in the ``patsy``
docs:
- `Patsy formula language description `_
Loading modules and functions
-----------------------------
.. ipython:: python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas
Notice that we called ``statsmodels.formula.api`` in addition to the usual
``statsmodels.api``. In fact, ``statsmodels.api`` is used here only to load
the dataset. The ``formula.api`` hosts many of the same
functions found in ``api`` (e.g. OLS, GLM), but it also holds lower case
counterparts for most of these models. In general, lower case models
accept ``formula`` and ``df`` arguments, whereas upper case ones take
``endog`` and ``exog`` design matrices. ``formula`` accepts a string
which describes the model in terms of a ``patsy`` formula. ``df`` takes
a `pandas `_ data frame.
``dir(smf)`` will print a list of available models.
Formula-compatible models have the following generic call signature:
``(formula, data, subset=None, *args, **kwargs)``
OLS regression using formulas
-----------------------------
To begin, we fit the linear model described on the `Getting
Started `_ page. Download the data, subset columns,
and list-wise delete to remove missing observations:
.. ipython:: python
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
Fit the model:
.. ipython:: python
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
Categorical variables
---------------------
Looking at the summary printed above, notice that ``patsy`` determined
that elements of *Region* were text strings, so it treated *Region* as a
categorical variable. ``patsy``'s default is also to include an
intercept, so we automatically dropped one of the *Region* categories.
If *Region* had been an integer variable that we wanted to treat
explicitly as categorical, we could have done so by using the ``C()``
operator:
.. ipython:: python
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
Examples more advanced features ``patsy``'s categorical variables
function can be found here: `Patsy: Contrast Coding Systems for
categorical variables `_
Operators
---------
We have already seen that "~" separates the left-hand side of the model
from the right-hand side, and that "+" adds new columns to the design
matrix.
Removing variables
~~~~~~~~~~~~~~~~~~
The "-" sign can be used to remove columns/variables. For instance, we
can remove the intercept from a model by:
.. ipython:: python
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
print(res.params)
Multiplicative interactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~
":" adds a new column to the design matrix with the product of the other
two columns. "\*" will also include the individual columns that were
multiplied together:
.. ipython:: python
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params)
print(res2.params)
Many other things are possible with operators. Please consult the `patsy
docs `_ to learn
more.
Functions
---------
You can apply vectorized functions to the variables in your model:
.. ipython:: python
res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
print(res.params)
Define a custom function:
.. ipython:: python
def log_plus_1(x):
return np.log(x) + 1.0
res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
print(res.params)
.. _patsy-namespaces:
Namespaces
----------
Notice that all of the above examples use the calling namespace to look for the functions to apply. The namespace used can be controlled via the ``eval_env`` keyword. For example, you may want to give a custom namespace using the :class:`patsy:patsy.EvalEnvironment` or you may want to use a "clean" namespace, which we provide by passing ``eval_func=-1``. The default is to use the caller's namespace. This can have (un)expected consequences, if, for example, someone has a variable names ``C`` in the user namespace or in their data structure passed to ``patsy``, and ``C`` is used in the formula to handle a categorical variable. See the `Patsy API Reference `_ for more information.
Using formulas with models that do not (yet) support them
---------------------------------------------------------
Even if a given ``statsmodels`` function does not support formulas, you
can still use ``patsy``'s formula language to produce design matrices.
Those matrices can then be fed to the fitting function as ``endog`` and
``exog`` arguments.
To generate ``numpy`` arrays:
.. ipython:: python
import patsy
f = 'Lottery ~ Literacy * Wealth'
y, X = patsy.dmatrices(f, df, return_type='matrix')
print(y[:5])
print(X[:5])
``y`` and ``X`` would be instances of ``patsy.DesignMatrix`` which is a subclass of ``numpy.ndarray``.
To generate pandas data frames:
.. ipython:: python
f = 'Lottery ~ Literacy * Wealth'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
print(y[:5])
print(X[:5])
.. ipython:: python
print(sm.OLS(y, X).fit().summary())