Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     944.8
Date:                Wed, 19 Feb 2025   Prob (F-statistic):           2.59e-41
Time:                        12:34:51   Log-Likelihood:                0.43447
No. Observations:                  50   AIC:                             7.131
Df Residuals:                      46   BIC:                             14.78
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8884      0.085     57.348      0.000       4.717       5.060
x1             0.5043      0.013     38.357      0.000       0.478       0.531
x2             0.5682      0.052     10.995      0.000       0.464       0.672
x3            -0.0197      0.001    -17.050      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        1.003   Durbin-Watson:                   1.887
Prob(Omnibus):                  0.606   Jarque-Bera (JB):                0.391
Skew:                           0.158   Prob(JB):                        0.822
Kurtosis:                       3.297   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
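
The individual statistics shown in the summary are also available directly on the results object. A minimal sketch, assuming the olsres fit above (standard attributes of statsmodels OLS results; the printed values will vary from run to run because the data are randomly generated):

# Hedged sketch: pull individual statistics out of the fitted results object.
print(olsres.params)      # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)         # coefficient standard errors
print(olsres.conf_int())  # 95% confidence intervals (default alpha=0.05)
print(olsres.rsquared)    # R-squared reported in the summary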

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39636399  4.90477057  5.36956434  5.75977805  6.05562037  6.25172755
  6.35804466  6.39819165  6.40558268  6.41793624  6.47107812  6.59305532
  6.79952804  7.09119671  7.45368697  7.85991126  8.27451931  8.65970601
  8.98142275  9.21497149  9.34906254  9.38766948  9.34937628  9.26432401
  9.16925875  9.10149434  9.09278175  9.16409256  9.32217336  9.55843731
  9.8503749  10.16525225 10.46548916 10.71483346 10.88431709 10.95701553
 10.93082839 10.81882456 10.64709498 10.45046599 10.26677747 10.13066583
 10.06787338 10.0910189  10.19752447 10.37003893 10.57928872 10.78888736
 10.9613137  11.06407653]
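
As a side note, the same in-sample values are available without re-passing X: predict() with no argument uses the original design matrix, and fittedvalues stores the in-sample fit. A quick sketch of the equivalence, assuming the objects defined above:

# Hedged check: equivalent routes to the in-sample predictions.
print(np.allclose(olsres.predict(), ypred))     # predict() defaults to the original exog
print(np.allclose(olsres.fittedvalues, ypred))  # fittedvalues holds the in-sample fit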

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.06373926 10.9148837  10.63979315 10.28924835  9.93009466  9.62887592
  9.43554225  9.37122049  9.42304162  9.54729155]
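
Point predictions alone do not convey uncertainty. For confidence and prediction intervals around the new observations, the results object provides get_prediction; the sketch below assumes the usual summary_frame column names (mean, obs_ci_lower, obs_ci_upper):

# Hedged sketch: 95% interval estimates for the out-of-sample observations.
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])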

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f1688057280>
[Figure: Data, True, and OLS prediction]

Predicting with Formulas

Using formulas can make both estimation and prediction much easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any formula expansion magic from using **2; the expression inside I() is evaluated as plain Python arithmetic.

[9]:
res.params
[9]:
Intercept           4.888378
x1                  0.504254
np.sin(x1)          0.568216
I((x1 - 5) ** 2)   -0.019681
dtype: float64
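
Because I() evaluates its argument as plain Python arithmetic, the formula-based fit should reproduce the array-based fit from above up to floating-point noise. A quick hedged check against the olsres results computed earlier:

# Hedged check: formula-based fit vs. manually constructed design matrix.
print(np.allclose(res.params.values, olsres.params))  # same coefficients, different labels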

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.063739
1    10.914884
2    10.639793
3    10.289248
4     9.930095
5     9.628876
6     9.435542
7     9.371220
8     9.423042
9     9.547292
dtype: float64
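
These values should match the array-based out-of-sample predictions computed earlier, since the formula machinery rebuilds the same transformed columns from x1n. A hedged sanity check:

# Hedged check: formula-based vs. array-based out-of-sample predictions.
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))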
