Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     776.8
Date:                Mon, 20 Jan 2025   Prob (F-statistic):           2.15e-39
Time:                        16:20:17   Log-Likelihood:                -2.1655
No. Observations:                  50   AIC:                             12.33
Df Residuals:                      46   BIC:                             19.98
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1020      0.090     56.821      0.000       4.921       5.283
x1             0.4891      0.014     35.318      0.000       0.461       0.517
x2             0.4569      0.054      8.393      0.000       0.347       0.566
x3            -0.0197      0.001    -16.181      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        2.816   Durbin-Watson:                   2.018
Prob(Omnibus):                  0.245   Jarque-Bera (JB):                1.860
Skew:                           0.304   Prob(JB):                        0.395
Kurtosis:                       3.724   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.61014615  5.06814923  5.48980127  5.85020169  6.13343636  6.33519232
  6.46346636  6.53725115  6.58341485  6.63228668  6.71267396  6.84712916
  7.04824454  7.31658334  7.64058718  7.99847512  8.36182217  8.70022948
  8.98631872  9.20023033  9.33288634  9.38748187  9.3789602   9.33155761
  9.27482142  9.23875548  9.24889059  9.32208989  9.46377826  9.66705107
  9.91380856 10.17772958 10.42859609 10.63725791 10.78042199 10.8444794
 10.82774125 10.74071613 10.60438365 10.44674748 10.29823446 10.18669556
 10.13283001 10.14678453 10.2264862  10.35798307 10.5177364  10.67648804
 10.80406743 10.87434854]

Create a new sample of explanatory variables Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.85720127 10.71870713 10.47678398 10.17226437  9.85889827  9.59019325
  9.40631393  9.32424828  9.33364839  9.39936401]
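
The results object can also report the uncertainty of these out-of-sample predictions. A minimal sketch using get_prediction (the alpha value is an illustrative choice, not part of the original example):

pred = olsres.get_prediction(Xnew)
# summary_frame contains the predicted mean, its standard error,
# and the 95% confidence and prediction intervals for each new point
print(pred.summary_frame(alpha=0.05))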

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f34ebb05990>
[Figure: scatter of the data, the true curve, and the OLS in- and out-of-sample prediction]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
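
To see that the I() term really is plain Python arithmetic, one can build the design matrix directly and compare its last column with the hand-computed values. A small sketch, assuming patsy (the formula engine statsmodels uses here) is available:

from patsy import dmatrix

dm = dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data)
# the I() column should match the (x1 - 5) ** 2 values used for the array-based model
print(np.allclose(np.asarray(dm)[:, 3], (x1 - 5) ** 2))  # True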

[9]:
res.params
[9]:
Intercept           5.101978
x1                  0.489087
np.sin(x1)          0.456899
I((x1 - 5) ** 2)   -0.019673
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.857201
1    10.718707
2    10.476784
3    10.172264
4     9.858898
5     9.590193
6     9.406314
7     9.324248
8     9.333648
9     9.399364
dtype: float64
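
As a quick consistency check (a small sketch reusing objects defined above), the formula-based predictions should agree with the array-based out-of-sample predictions computed earlier:

# both models share the same design matrix, so the predictions coincide
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True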

Last update: Jan 20, 2025