statsmodels.regression.mixed_linear_model.MixedLM.from_formula
- classmethod MixedLM.from_formula(formula, data, re_formula=None, vc_formula=None, subset=None, use_sparse=False, missing='none', *args, **kwargs)
Create a Model from a formula and dataframe.
- Parameters:
- formula : str or generic Formula object
The formula specifying the model.
- data : array_like
The data for the model. See Notes.
- re_formula : str
A one-sided formula defining the variance structure of the model. The default gives a random intercept for each group.
- vc_formula : dict-like
Formulas describing variance components. vc_formula[vc] is the formula for the component with variance parameter named vc. The formula is processed into a matrix, and the columns of this matrix are linearly combined with independent random coefficients having mean zero and a common variance.
- subset : array_like
An array-like object of booleans, integers, or index values that indicate the subset of data to use in the model. Assumes data is a pandas.DataFrame.
- missing : str
Either ‘none’ or ‘drop’.
- args : extra arguments
These are passed to the model.
- kwargs : extra keyword arguments
These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.
- Returns:
- model
Model instance
Notes
data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation.
If the variance component is intended to produce random intercepts for disjoint subsets of a group, specified by string labels or a categorical data value, always use ‘0 +’ in the formula so that no overall intercept is included.
If the variance components specify random slopes and you do not also want a random group-level intercept in the model, then use ‘0 +’ in the formula to exclude the intercept.
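The effect of ‘0 +’ can be inspected by building the design matrices directly with patsy. This is only an illustrative sketch: the small DataFrame and its column values are made up for the example, and it calls patsy.dmatrix on its own rather than going through MixedLM.

```python
import pandas as pd
import patsy

# Hypothetical toy data: three classrooms, two students each.
df = pd.DataFrame({
    "classroom": ["a", "a", "b", "b", "c", "c"],
    "pretest": [48.0, 52.0, 50.0, 55.0, 47.0, 51.0],
})

# Without '0 +', patsy adds an intercept and drops one classroom level.
with_intercept = patsy.dmatrix("C(classroom)", df)
print(with_intercept.design_info.column_names)
# ['Intercept', 'C(classroom)[T.b]', 'C(classroom)[T.c]']

# With '0 +', every classroom gets its own indicator column.
no_intercept = patsy.dmatrix("0 + C(classroom)", df)
print(no_intercept.design_info.column_names)
# ['C(classroom)[a]', 'C(classroom)[b]', 'C(classroom)[c]']

# For a slope, '0 +' leaves only the covariate column.
slope_only = patsy.dmatrix("0 + pretest", df)
print(slope_only.design_info.column_names)
# ['pretest']
```

With ‘0 + C(classroom)’ each column corresponds to exactly one classroom, which is the layout needed for one random intercept per classroom.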
The variance components formulas are processed separately for each group. If a variable is categorical, the results will not be affected by whether its labels are distinct or re-used across the top-level groups.
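The following sketch shows why re-used labels are harmless when a formula is evaluated one group at a time. It imitates per-group processing by calling patsy.dmatrix on each school's rows; the helper per_school_design and the toy data are hypothetical and are not part of statsmodels.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical data: classroom labels "c1"/"c2" are re-used in both schools.
reused = pd.DataFrame({
    "school": ["s1"] * 4 + ["s2"] * 4,
    "classroom": ["c1", "c1", "c2", "c2", "c1", "c1", "c2", "c2"],
})
# Same structure, but classroom labels made unique across schools.
distinct = reused.assign(classroom=reused["school"] + "_" + reused["classroom"])


def per_school_design(frame):
    """Build the '0 + C(classroom)' matrix separately for each school."""
    return {
        school: np.asarray(patsy.dmatrix("0 + C(classroom)", rows))
        for school, rows in frame.groupby("school")
    }


a = per_school_design(reused)
b = per_school_design(distinct)

# Evaluated per school, both labelings give identical indicator matrices.
same_per_group = all(np.array_equal(a[s], b[s]) for s in a)

# Evaluated on the pooled data, the two labelings differ in column count.
pooled_reused = np.asarray(patsy.dmatrix("0 + C(classroom)", reused))
pooled_distinct = np.asarray(patsy.dmatrix("0 + C(classroom)", distinct))
print(same_per_group, pooled_reused.shape, pooled_distinct.shape)
```

Within each school the two labelings produce the same two indicator columns, so the labeling choice only matters if the formula is evaluated on the pooled data.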
Examples
Suppose we have data from an educational study with students nested in classrooms nested in schools. The students take a test, and we want to relate the test scores to the students’ ages, while accounting for the effects of classrooms and schools. The school will be the top-level group, and the classroom is a nested group that is specified as a variance component. Note that the schools may have different numbers of classrooms, and the classroom labels may be (but need not be) different across the schools.
>>> from statsmodels.regression.mixed_linear_model import MixedLM
>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age', vc_formula=vc,
...                      re_formula='1', groups='school', data=data)
Now suppose we also have a previous test score called ‘pretest’. If we want the relationship between pretest scores and the current test to vary by school (the top-level group), we can specify a random slope for the pretest score as an additional variance component:
>>> vc = {'classroom': '0 + C(classroom)', 'pretest': '0 + pretest'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,
...                      re_formula='1', groups='school', data=data)
The following model is almost equivalent to the previous one, but here the school random intercept and the pretest slope, both specified in re_formula, may be correlated.
>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,
...                      re_formula='1 + pretest', groups='school', data=data)
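The examples above assume a data object that is not shown. The sketch below builds a small synthetic DataFrame so that the three models can be constructed and compared. The data-generating values are invented for illustration. The attributes k_fe, k_re, k_re2, k_vc and n_groups are read from the constructed MixedLM instances as they exist in current statsmodels releases, so treat them as an assumption about your installed version.

```python
import numpy as np
import pandas as pd
from statsmodels.regression.mixed_linear_model import MixedLM

# Synthetic data (hypothetical): 6 schools x 3 classrooms x 10 students.
rng = np.random.default_rng(0)
rows = []
for school in range(6):
    school_effect = rng.normal()
    for classroom in range(3):
        class_effect = rng.normal()
        for _ in range(10):
            age = rng.uniform(10, 12)
            pretest = rng.normal(50, 10)
            score = (20 + 2 * age + 0.5 * pretest
                     + school_effect + class_effect + rng.normal())
            rows.append((f"s{school}", f"c{classroom}", age, pretest, score))
data = pd.DataFrame(
    rows, columns=["school", "classroom", "age", "pretest", "test_score"])

# Model 1: school random intercept plus a classroom variance component.
vc = {"classroom": "0 + C(classroom)"}
m1 = MixedLM.from_formula("test_score ~ age", vc_formula=vc,
                          re_formula="1", groups="school", data=data)

# Model 2: adds an independent pretest slope as a second variance component.
vc2 = {"classroom": "0 + C(classroom)", "pretest": "0 + pretest"}
m2 = MixedLM.from_formula("test_score ~ age + pretest", vc_formula=vc2,
                          re_formula="1", groups="school", data=data)

# Model 3: the pretest slope moves into re_formula, so it may covary with
# the intercept (two variances plus one covariance, hence k_re2 == 3).
m3 = MixedLM.from_formula("test_score ~ age + pretest", vc_formula=vc,
                          re_formula="1 + pretest", groups="school", data=data)

for name, m in [("m1", m1), ("m2", m2), ("m3", m3)]:
    print(name, m.n_groups, m.k_fe, m.k_re, m.k_re2, m.k_vc)
```

Calling fit() on any of these models returns a results object whose summary() can be printed. With data this small, the optimizer may emit convergence warnings.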