statsmodels.genmod.generalized_linear_model.GLM

class statsmodels.genmod.generalized_linear_model.GLM(endog, exog, family=None, offset=None, exposure=None, freq_weights=None, var_weights=None, missing='none', **kwargs)

Generalized Linear Models

GLM inherits from statsmodels.base.model.LikelihoodModel

Parameters:
endog : array_like

Endogenous response variable. This array can be 1d or 2d. Binomial family models accept a 2d array with two columns; if supplied, each row is expected to be [success, failure].

exog : array_like

A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user (models specified using a formula include an intercept by default). See statsmodels.tools.add_constant.

family : family class instance

The default is Gaussian. To specify the binomial distribution, use family = sm.families.Binomial(). Each family can take a link instance as an argument. See statsmodels.genmod.families.family for more information.

offset : array_like or None

An offset to be included in the model. If provided, must be an array whose length is the number of rows in exog.

exposure : array_like or None

Log(exposure) will be added to the linear prediction in the model. Exposure is only valid if the log link is used. If provided, it must be an array with the same length as endog.

freq_weights : array_like

1d array of frequency weights. The default is None. If None, the weights are replaced with an array of ones with the same length as endog. WARNING: Using weights is not verified yet for all possible options and results; see Notes.

var_weights : array_like

1d array of variance (analytic) weights. The default is None. If None, the weights are replaced with an array of ones with the same length as endog. WARNING: Using weights is not verified yet for all possible options and results; see Notes.

missing : str

Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
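As a minimal sketch of how exog, offset, and exposure interact, the snippet below fits a Poisson model twice on synthetic data: once with exposure and once with offset=log(exposure). The data, seed, and variable names are assumptions made for illustration only; with the log link the two specifications should give the same coefficients.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data (illustrative assumption, not taken from the documentation).
rng = np.random.default_rng(0)
nobs = 200
x = rng.normal(size=nobs)
exposure = rng.uniform(0.5, 2.0, size=nobs)  # e.g. observation time
y = rng.poisson(exposure * np.exp(0.3 + 0.5 * x))

# GLM does not add an intercept for array input, so add it explicitly.
exog = sm.add_constant(x)

# Poisson uses the log link by default, so exposure is valid here.
res_exposure = sm.GLM(y, exog, family=sm.families.Poisson(),
                      exposure=exposure).fit()

# Equivalent specification: log(exposure) entered as an offset.
res_offset = sm.GLM(y, exog, family=sm.families.Poisson(),
                    offset=np.log(exposure)).fit()

params_match = np.allclose(res_exposure.params, res_offset.params)
```

Using exposure rather than a hand-computed offset keeps the logarithm inside the model, which can make later calls such as predict easier to read.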

Attributes:
df_model : float

Model degrees of freedom is equal to p - 1, where p is the number of regressors. Note that the intercept is not reported as a degree of freedom.

df_resid : float

Residual degrees of freedom is equal to the number of observations n minus the number of regressors p.

endog : ndarray

See Notes. Note that endog is a reference to the data so that if data is already an array and it is changed, then endog changes as well.

exposure : array_like

Include ln(exposure) in model with coefficient constrained to 1. Can only be used if the link is the logarithm function.

exog : ndarray

See Notes. Note that exog is a reference to the data so that if data is already an array and it is changed, then exog changes as well.

freq_weights : ndarray

See Notes. Note that freq_weights is a reference to the data so that if data is already an array and it is changed, then freq_weights changes as well.

var_weights : ndarray

See Notes. Note that var_weights is a reference to the data so that if data is already an array and it is changed, then var_weights changes as well.

iteration : int

The number of iterations that fit has run. Initialized at 0.

family : family class instance

The distribution family of the model. Can be any family in statsmodels.families. Default is Gaussian.

mu : ndarray

The mean response of the transformed variable. mu is the value of the inverse of the link function at lin_pred, where lin_pred is the linear predicted value of the WLS fit of the transformed variable. mu is only available after fit is called. See statsmodels.families.family.fitted of the distribution family for more information.

n_trials : ndarray

See Notes. Note that n_trials is a reference to the data so that if data is already an array and it is changed, then n_trials changes as well. n_trials is the number of binomial trials and only available with that distribution. See statsmodels.families.Binomial for more information.

normalized_cov_params : ndarray

The p x p normalized covariance of the design / exogenous data. This is approximately equal to (X.T X)^(-1).

offset : array_like

Include offset in model with coefficient constrained to 1.

scale : float

The estimate of the scale / dispersion of the model fit. Only available after fit is called. See GLM.fit and GLM.estimate_scale for more information.

scaletype : str

The scaling used for fitting the model. This is only available after fit is called. The default is None. See GLM.fit for more information.

weights : ndarray

The value of the weights after the last iteration of fit. Only available after fit is called. See statsmodels.families.family for the specific distribution weighting functions.
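To make a few of these attributes concrete, here is a small sketch that fits a Binomial model with a two-column [success, failure] response and then inspects df_model, df_resid, n_trials, and the fitted mean. The simulated data and all variable names are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic grouped binomial data (illustrative assumption).
rng = np.random.default_rng(1)
nobs = 50
x = rng.normal(size=nobs)
trials = rng.integers(5, 20, size=nobs)
prob = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x)))
success = rng.binomial(trials, prob)

# Two-column response: each row is [success, failure].
endog = np.column_stack([success, trials - success])
exog = sm.add_constant(x)  # columns: constant, x

model = sm.GLM(endog, exog, family=sm.families.Binomial())
results = model.fit()

df_model = model.df_model    # regressors excluding the intercept
df_resid = model.df_resid    # observations minus regressors
n_trials = model.n_trials    # success + failure for each row
fitted_mean = results.mu     # fitted success probabilities
```

With one regressor plus a constant and 50 rows, df_model and df_resid follow directly from the definitions above (p - 1 and n - p).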

Notes

Note: The PerfectSeparationError exception has been converted to a PerfectSeparationWarning, so perfect separation or perfect prediction does not raise an exception by default (changed in version 0.14).

Only the following combinations make sense for family and link:

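============= ===== === ===== ====== ======= === ==== ====== ====== ====
Family        ident log logit probit cloglog pow opow nbinom loglog logc
============= ===== === ===== ====== ======= === ==== ====== ====== ====
Gaussian      x     x   x     x      x       x   x    x      x
inv Gaussian  x     x                        x
binomial      x     x   x     x      x       x   x           x      x
Poisson       x     x                        x
neg binomial  x     x                        x        x
gamma         x     x                        x
Tweedie       x     x                        x
============= ===== === ===== ====== ======= === ==== ====== ====== ====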

Not all of these link functions are currently available.

Endog and exog are references so that if the data they refer to are already arrays and these arrays are changed, endog and exog will change.

statsmodels supports two separate definitions of weights: frequency weights and variance weights.

Frequency weights produce the same results as repeating observations by the frequencies (if those are integers). Frequency weights will keep the number of observations consistent, but the degrees of freedom will change to reflect the new weights.

Variance weights (referred to in other packages as analytic weights) are used when endog represents an average or mean. This relies on the assumption that the inverse variance scales proportionally to the weight: an observation that is deemed more credible should have less variance and therefore more weight. For the Poisson family, which assumes that occurrences scale proportionally with time, a natural practice would be to use the amount of time as the variance weight and set endog to be a rate (occurrences per period of time). Similarly, using a compound Poisson family, namely Tweedie, makes a similar assumption about the rate (or frequency) of occurrences having variance proportional to time.
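The Poisson rate case can be sketched as follows, assuming synthetic counts observed over varying lengths of time (data and names are illustrative). Modeling the rate with time as the variance weight should reproduce the coefficients of the count model that uses time as exposure, because both lead to the same estimating equations under the log link.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic counts observed over varying time spans (illustrative assumption).
rng = np.random.default_rng(5)
nobs = 150
x = rng.normal(size=nobs)
t = rng.uniform(0.5, 5.0, size=nobs)           # observation time
counts = rng.poisson(t * np.exp(0.2 + 0.5 * x))
rate = counts / t                               # occurrences per unit time
exog = sm.add_constant(x)

# Rate as response, time as variance weight.
res_rate = sm.GLM(rate, exog, family=sm.families.Poisson(),
                  var_weights=t).fit()

# Count as response, time as exposure.
res_count = sm.GLM(counts, exog, family=sm.families.Poisson(),
                   exposure=t).fit()

params_match = np.allclose(res_rate.params, res_count.params,
                           rtol=1e-5, atol=1e-6)
```

Statistics based on the log-likelihood of the rate model should be treated with caution, since with variance weights they rest on a quasi-likelihood interpretation.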

Both frequency and variance weights are verified for all basic results with nonrobust or heteroscedasticity robust cov_type. Other robust covariance types have not yet been verified, and at least the small sample correction is currently not based on the correct total frequency count.

Currently, residuals are not weighted by frequency, although they may incorporate n_trials (for Binomial) and var_weights:

============= ========================
Residual Type Applicable weights
============= ========================
Anscombe      var_weights
Deviance      var_weights
Pearson       var_weights and n_trials
Response      n_trials
Working       n_trials
============= ========================

WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such as AIC or likelihood ratio tests, are not appropriate.

Examples

>>> import statsmodels.api as sm
>>> data = sm.datasets.scotland.load()
>>> data.exog = sm.add_constant(data.exog)

Instantiate a gamma family model with the default link function.

>>> gamma_model = sm.GLM(data.endog, data.exog,
...                      family=sm.families.Gamma())
>>> gamma_results = gamma_model.fit()
>>> gamma_results.params
array([-0.01776527,  0.00004962,  0.00203442, -0.00007181,  0.00011185,
       -0.00000015, -0.00051868, -0.00000243])
>>> gamma_results.scale
0.0035842831734919055
>>> gamma_results.deviance
0.087388516416999198
>>> gamma_results.pearson_chi2
0.086022796163805704
>>> gamma_results.llf
-83.017202161073527
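A model can also be built from a formula, in which case an intercept is included by default. The sketch below uses synthetic data in a pandas DataFrame (the data and column names are assumptions for illustration) and compares the formula fit with the equivalent array-based fit.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic count data in a DataFrame (illustrative assumption).
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = rng.poisson(np.exp(0.1 + 0.4 * df["x"].to_numpy()))

# Formula interface: the intercept is added automatically.
res_formula = sm.GLM.from_formula(
    "y ~ x", data=df, family=sm.families.Poisson()
).fit()

# Array interface: the constant has to be added by hand.
res_array = sm.GLM(
    df["y"], sm.add_constant(df[["x"]]), family=sm.families.Poisson()
).fit()

formula_names = list(res_formula.params.index)
params_match = np.allclose(res_formula.params.to_numpy(),
                           res_array.params.to_numpy())
```

The two fits differ only in how the intercept column is named in the results.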

Methods

estimate_scale(mu)

Estimate the dispersion/scale.

estimate_tweedie_power(mu[, method, low, high])

Tweedie specific function to estimate scale and the variance parameter.

fit([start_params, maxiter, method, tol, ...])

Fits a generalized linear model for a given family.

fit_constrained(constraints[, start_params])

Fit the model subject to linear equality constraints.

fit_regularized([method, alpha, ...])

Return a regularized fit to a linear regression model.

from_formula(formula, data[, subset, drop_cols])

Create a Model from a formula and dataframe.

get_distribution(params[, scale, exog, ...])

Return an instance of the predictive distribution.

hessian(params[, scale, observed])

Hessian, second derivative of the loglikelihood function.

hessian_factor(params[, scale, observed])

Weights for calculating the Hessian.

information(params[, scale])

Fisher information matrix.

initialize()

Initialize a generalized linear model.

loglike(params[, scale])

Evaluate the log-likelihood for a generalized linear model.

loglike_mu(mu[, scale])

Evaluate the log-likelihood for a generalized linear model.

predict(params[, exog, exposure, offset, ...])

Return predicted values for a design matrix.

score(params[, scale])

Score, first derivative of the loglikelihood function.

score_factor(params[, scale])

Weights for score for each observation.

score_obs(params[, scale])

Score, first derivative of the loglikelihood for each observation.

score_test(params_constrained[, ...])

Score test for restrictions or for omitted variables.

Properties

endog_names

Names of endogenous variables.

exog_names

Names of exogenous variables.

exposure_name

Name of the exposure variable if available.

freq_weights_name

Name of the freq weights variable if available.

offset_name

Name of the offset variable if available.

var_weights_name

Name of var weights variable if available.


Last update: Nov 14, 2024