Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
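Note that y is drawn with np.random.normal without a fixed seed, so the numbers shown below will differ from run to run. As an optional sketch, one way to make the example reproducible is to draw the noise from a seeded generator (the seed value here is arbitrary):

[ ]:
# Optional: reproducible noise via a seeded Generator (seed chosen arbitrarily)
rng = np.random.default_rng(12345)
y = y_true + sig * rng.standard_normal(nsample)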

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1064.
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           1.74e-42
Time:                        17:06:17   Log-Likelihood:                 4.4594
No. Observations:                  50   AIC:                           -0.9188
Df Residuals:                      46   BIC:                             6.729
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9488      0.079     62.924      0.000       4.791       5.107
x1             0.5124      0.012     42.246      0.000       0.488       0.537
x2             0.6093      0.048     12.778      0.000       0.513       0.705
x3            -0.0216      0.001    -20.250      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        6.272   Durbin-Watson:                   2.752
Prob(Omnibus):                  0.043   Jarque-Bera (JB):                5.171
Skew:                           0.723   Prob(JB):                       0.0754
Kurtosis:                       3.625   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
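The quantities reported in the summary are also available programmatically on the results object; for example, a few commonly used attributes and methods:

[ ]:
print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.rsquared)    # R-squared
print(olsres.conf_int())  # 95% confidence intervals for the coefficients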

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.40969647  4.94512482  5.43363276  5.84201378  6.14904543  6.34897612
  6.45247004  6.48485512  6.48196167  6.48423561  6.53009328  6.64960978
  6.85957769  7.16074809  7.53770719  7.96140876  8.39394633  8.79478112
  9.12740246  9.36532639  9.49644688  9.52502482  9.470988    9.36665699
  9.25143498  9.16533394  9.14240078  9.20512354  9.36073664  9.60003232
  9.89887343 10.22215909 10.52959169 10.78229764 10.94921423 11.01219312
 10.96898204 10.83359489 10.63400924 10.40756972 10.19485242 10.03299831
  9.94961111  9.95822208 10.05606687 10.22453968 10.43225015 10.64018073
 10.80809711 10.90115863]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.87960032 10.69855331 10.38191219  9.98412967  9.57688468  9.23153285
  9.00163619  8.90984922  8.94237236  9.05233055]
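Point predictions can be supplemented with uncertainty estimates. A minimal sketch using get_prediction, whose summary_frame contains the predicted mean together with confidence intervals for the mean (mean_ci_*) and prediction intervals for new observations (obs_ci_*):

[ ]:
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))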

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fc0436f7010>
[Figure: scatter of the data with the true curve and the OLS in-/out-of-sample prediction]
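The prediction intervals from get_prediction can also be drawn as a shaded band over the out-of-sample region; a sketch reusing the column names returned by summary_frame:

[ ]:
frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1n, ynewpred, "r", label="OLS prediction")
ax.fill_between(
    x1n,
    frame["obs_ci_lower"],
    frame["obs_ci_upper"],
    color="r",
    alpha=0.2,
    label="95% prediction interval",
)
ax.legend(loc="best")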

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any expansion magic from using **2 in the formula.

[9]:
res.params
[9]:
Intercept           4.948842
x1                  0.512418
np.sin(x1)          0.609303
I((x1 - 5) ** 2)   -0.021566
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.879600
1    10.698553
2    10.381912
3     9.984130
4     9.576885
5     9.231533
6     9.001636
7     8.909849
8     8.942372
9     9.052331
dtype: float64
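The same formula-based transform is applied by get_prediction, so prediction intervals for the new points can be obtained by passing the untransformed variable, e.g.:

[ ]:
print(res.get_prediction(exog=dict(x1=x1n)).summary_frame(alpha=0.05))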