statsmodels.regression.mixed_linear_model.MixedLM.from_formula

method

classmethod MixedLM.from_formula(formula, data, re_formula=None, vc_formula=None, subset=None, use_sparse=False, missing='none', *args, **kwargs)[source]

Create a Model from a formula and dataframe.

Parameters
formulastr or generic Formula object

The formula specifying the model

dataarray-like

The data for the model. See Notes.

re_formulastring

A one-sided formula defining the variance structure of the model. The default gives a random intercept for each group.

vc_formuladict-like

Formulas describing variance components. vc_formula[vc] is the formula for the component with variance parameter named vc. The formula is processed into a matrix, and the columns of this matrix are linearly combined with independent random coefficients having mean zero and a common variance.

subsetarray-like

An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame

missingstring

Either ‘none’ or ‘drop’

argsextra arguments

These are passed to the model

kwargsextra keyword arguments

These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment set eval_env=-1.

Returns
modelModel instance

Notes

data must define __getitem__ with the keys in the formula terms args and kwargs are passed on to the model instantiation. E.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame.

If the variance component is intended to produce random intercepts for disjoint subsets of a group, specified by string labels or a categorical data value, always use ‘0 +’ in the formula so that no overall intercept is included.

If the variance components specify random slopes and you do not also want a random group-level intercept in the model, then use ‘0 +’ in the formula to exclude the intercept.

The variance components formulas are processed separately for each group. If a variable is categorical the results will not be affected by whether the group labels are distinct or re-used over the top-level groups.

Examples

Suppose we have data from an educational study with students nested in classrooms nested in schools. The students take a test, and we want to relate the test scores to the students’ ages, while accounting for the effects of classrooms and schools. The school will be the top-level group, and the classroom is a nested group that is specified as a variance component. Note that the schools may have different number of classrooms, and the classroom labels may (but need not be) different across the schools.

>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age', vc_formula=vc,                                   re_formula='1', groups='school', data=data)

Now suppose we also have a previous test score called ‘pretest’. If we want the relationship between pretest scores and the current test to vary by classroom, we can specify a random slope for the pretest score

>>> vc = {'classroom': '0 + C(classroom)', 'pretest': '0 + pretest'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,                                   re_formula='1', groups='school', data=data)

The following model is almost equivalent to the previous one, but here the classroom random intercept and pretest slope may be correlated.

>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,                                   re_formula='1 + pretest', groups='school',                                   data=data)