Robust Linear Models¶

[1]:

%matplotlib inline

[2]:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

Estimation¶

Load data:

[3]:

data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)

Huber’s T norm with the (default) median absolute deviation scaling

[4]:

huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(
    hub_results.summary(
        yname="y", xname=["var_%d" % i for i in range(len(hub_results.params))]
    )
)

const       -41.026498
AIRFLOW       0.829384
WATERTEMP     0.926066
ACIDCONC     -0.127847
dtype: float64
const        9.791899
AIRFLOW      0.111005
WATERTEMP    0.302930
ACIDCONC     0.128650
dtype: float64
                    Robust linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT
Scale Est.:                       mad
Cov Type:                          H1
Date:                Thu, 14 Nov 2024
Time:                        16:50:19
No. Iterations:                    19
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000     -60.218     -21.835
var_1          0.8294      0.111      7.472      0.000       0.612       1.047
var_2          0.9261      0.303      3.057      0.002       0.332       1.520
var_3         -0.1278      0.129     -0.994      0.320      -0.380       0.124
==============================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .

Huber’s T norm with ‘H2’ covariance matrix

[5]:

hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)

const       -41.026498
AIRFLOW       0.829384
WATERTEMP     0.926066
ACIDCONC     -0.127847
dtype: float64
const        9.089504
AIRFLOW      0.119460
WATERTEMP    0.322355
ACIDCONC     0.117963
dtype: float64

Andrew’s Wave norm with Huber’s Proposal 2 scaling and ‘H3’ covariance matrix

[6]:

andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print("Parameters: ", andrew_results.params)

Parameters:  const       -40.881796
AIRFLOW       0.792761
WATERTEMP     1.048576
ACIDCONC     -0.133609
dtype: float64

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM¶

Artificial data with outliers:

[7]:

nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1 - 5) ** 2))
X = sm.add_constant(X)
sig = 0.3  # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig * 1.0 * np.random.normal(size=nsample)
y2[[39, 41, 43, 45, 48]] -= 5  # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth¶

Note that the quadratic term in OLS regression will capture outlier effects.

[8]:

res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())

[ 4.98695717  0.51539962 -0.0120323 ]
[0.45576134 0.0703634  0.00622607]
[ 4.6861498   4.94362385  5.19708881  5.44654467  5.69199144  5.93342912
  6.1708577   6.40427718  6.63368758  6.85908887  7.08048107  7.29786418
  7.5112382   7.72060311  7.92595894  8.12730567  8.3246433   8.51797184
  8.70729129  8.89260164  9.0739029   9.25119506  9.42447812  9.5937521
  9.75901698  9.92027276 10.07751945 10.23075704 10.37998554 10.52520495
 10.66641526 10.80361647 10.9368086  11.06599162 11.19116556 11.31233039
 11.42948614 11.54263279 11.65177034 11.7568988  11.85801816 11.95512843
 12.04822961 12.13732169 12.22240468 12.30347857 12.38054337 12.45359907
 12.52264568 12.58768319]

Estimate RLM:

[9]:

resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)

[ 4.91799019e+00  5.01952043e-01 -1.83708902e-03]
[0.12220137 0.01886624 0.00166937]

Draw a plot to compare OLS estimates to the robust estimates:

[10]:

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, "o", label="data")
ax.plot(x1, y_true2, "b-", label="True")
pred_ols = res.get_prediction()
iv_l = pred_ols.summary_frame()["obs_ci_lower"]
iv_u = pred_ols.summary_frame()["obs_ci_upper"]

ax.plot(x1, res.fittedvalues, "r-", label="OLS")
ax.plot(x1, iv_u, "r--")
ax.plot(x1, iv_l, "r--")
ax.plot(x1, resrlm.fittedvalues, "g.-", label="RLM")
ax.legend(loc="best")

[10]:

<matplotlib.legend.Legend at 0x7f1aab03c640>

../../../_images/examples_notebooks_generated_robust_models_0_18_1.png

Example 2: linear function with linear truth¶

Fit a new OLS model using only the linear term and the constant:

[11]:

X2 = X[:, [0, 1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)

[5.47193233 0.39507667]
[0.3911391  0.03370213]

Estimate RLM:

[12]:

resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)

[4.97564182 0.48626427]
[0.09935901 0.00856117]

Draw a plot to compare OLS estimates to the robust estimates:

[13]:

pred_ols = res2.get_prediction()
iv_l = pred_ols.summary_frame()["obs_ci_lower"]
iv_u = pred_ols.summary_frame()["obs_ci_upper"]

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x1, y2, "o", label="data")
ax.plot(x1, y_true2, "b-", label="True")
ax.plot(x1, res2.fittedvalues, "r-", label="OLS")
ax.plot(x1, iv_u, "r--")
ax.plot(x1, iv_l, "r--")
ax.plot(x1, resrlm2.fittedvalues, "g.-", label="RLM")
legend = ax.legend(loc="best")

../../../_images/examples_notebooks_generated_robust_models_0_24_0.png

Last update: Nov 14, 2024