Generalized Least Squares¶
[1]:
import numpy as np
import statsmodels.api as sm
The Longley dataset is a time series dataset:
[2]:
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
print(data.exog.head())
const GNPDEFL GNP UNEMP ARMED POP YEAR
0 1.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 1.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 1.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 1.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 1.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
Let’s assume that the data is heteroskedastic and that we know the nature of the heteroskedasticity. We can then define sigma
and use it to give us a GLS model
First we will obtain the residuals from an OLS fit
[3]:
ols_resid = sm.OLS(data.endog, data.exog).fit().resid
Assume that the error terms follow an AR(1) process with a trend:
\(\epsilon_i = \beta_0 + \rho\epsilon_{i-1} + \eta_i\)
where \(\eta \sim N(0,\Sigma^2)\)
and that \(\rho\) is simply the correlation of the residual a consistent estimator for rho is to regress the residuals on the lagged residuals
[4]:
resid_fit = sm.OLS(
np.asarray(ols_resid)[1:], sm.add_constant(np.asarray(ols_resid)[:-1])
).fit()
print(resid_fit.tvalues[1])
print(resid_fit.pvalues[1])
-1.4390229839793167
0.17378444788655237
While we do not have strong evidence that the errors follow an AR(1) process we continue
[5]:
rho = resid_fit.params[1]
As we know, an AR(1) process means that near-neighbors have a stronger relation so we can give this structure by using a toeplitz matrix
[6]:
from scipy.linalg import toeplitz
toeplitz(range(5))
[6]:
array([[0, 1, 2, 3, 4],
[1, 0, 1, 2, 3],
[2, 1, 0, 1, 2],
[3, 2, 1, 0, 1],
[4, 3, 2, 1, 0]])
[7]:
order = toeplitz(range(len(ols_resid)))
so that our error covariance structure is actually rho**order which defines an autocorrelation structure
[8]:
sigma = rho ** order
gls_model = sm.GLS(data.endog, data.exog, sigma=sigma)
gls_results = gls_model.fit()
Of course, the exact rho in this instance is not known so it it might make more sense to use feasible gls, which currently only has experimental support.
We can use the GLSAR model with one lag, to get to a similar result:
[9]:
glsar_model = sm.GLSAR(data.endog, data.exog, 1)
glsar_results = glsar_model.iterative_fit(1)
print(glsar_results.summary())
GLSAR Regression Results
==============================================================================
Dep. Variable: TOTEMP R-squared: 0.996
Model: GLSAR Adj. R-squared: 0.992
Method: Least Squares F-statistic: 295.2
Date: Wed, 02 Nov 2022 Prob (F-statistic): 6.09e-09
Time: 17:06:55 Log-Likelihood: -102.04
No. Observations: 15 AIC: 218.1
Df Residuals: 8 BIC: 223.0
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3.468e+06 8.72e+05 -3.979 0.004 -5.48e+06 -1.46e+06
GNPDEFL 34.5568 84.734 0.408 0.694 -160.840 229.953
GNP -0.0343 0.033 -1.047 0.326 -0.110 0.041
UNEMP -1.9621 0.481 -4.083 0.004 -3.070 -0.854
ARMED -1.0020 0.211 -4.740 0.001 -1.489 -0.515
POP -0.0978 0.225 -0.435 0.675 -0.616 0.421
YEAR 1823.1829 445.829 4.089 0.003 795.100 2851.266
==============================================================================
Omnibus: 1.960 Durbin-Watson: 2.554
Prob(Omnibus): 0.375 Jarque-Bera (JB): 1.423
Skew: 0.713 Prob(JB): 0.491
Kurtosis: 2.508 Cond. No. 4.80e+09
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.8e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
/opt/hostedtoolcache/Python/3.10.8/x64/lib/python3.10/site-packages/scipy/stats/stats.py:1541: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=15
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Comparing gls and glsar results, we see that there are some small differences in the parameter estimates and the resulting standard errors of the parameter estimate. This might be do to the numerical differences in the algorithm, e.g. the treatment of initial conditions, because of the small number of observations in the longley dataset.
[10]:
print(gls_results.params)
print(glsar_results.params)
print(gls_results.bse)
print(glsar_results.bse)
const -3.797855e+06
GNPDEFL -1.276565e+01
GNP -3.800132e-02
UNEMP -2.186949e+00
ARMED -1.151776e+00
POP -6.805356e-02
YEAR 1.993953e+03
dtype: float64
const -3.467961e+06
GNPDEFL 3.455678e+01
GNP -3.434101e-02
UNEMP -1.962144e+00
ARMED -1.001973e+00
POP -9.780460e-02
YEAR 1.823183e+03
dtype: float64
const 670688.699308
GNPDEFL 69.430807
GNP 0.026248
UNEMP 0.382393
ARMED 0.165253
POP 0.176428
YEAR 342.634628
dtype: float64
const 871584.051696
GNPDEFL 84.733715
GNP 0.032803
UNEMP 0.480545
ARMED 0.211384
POP 0.224774
YEAR 445.828748
dtype: float64