Prediction (out of sample)
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.984
Model: OLS Adj. R-squared: 0.983
Method: Least Squares F-statistic: 940.1
Date: Mon, 23 Dec 2024 Prob (F-statistic): 2.90e-41
Time: 12:03:47 Log-Likelihood: 2.5334
No. Observations: 50 AIC: 2.933
Df Residuals: 46 BIC: 10.58
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1287 0.082 62.747 0.000 4.964 5.293
x1 0.4760 0.013 37.759 0.000 0.451 0.501
x2 0.4514 0.050 9.109 0.000 0.352 0.551
x3 -0.0180 0.001 -16.305 0.000 -0.020 -0.016
==============================================================================
Omnibus: 0.495 Durbin-Watson: 1.511
Prob(Omnibus): 0.781 Jarque-Bera (JB): 0.594
Skew: 0.213 Prob(JB): 0.743
Kurtosis: 2.677 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.67759786 5.12169963 5.5303511 5.87895125 6.15177743 6.34456858
6.46522525 6.53251164 6.57297262 6.61657253 6.6917721 6.82085247
7.01625439 7.27853432 7.59627295 7.94795139 8.30548665 8.6388456
8.92097931 9.13226715 9.26374037 9.31855567 9.31147652 9.26644761
9.21266077 9.17975888 9.19296556 9.26894106 9.41304501 9.61845584
9.86729154 10.13354766 10.38736972 10.59995823 10.74830041 10.81895104
10.81024156 10.73255413 10.60661595 10.46009377 10.32304835 10.22299558
10.18038567 10.20524324 10.29552032 10.43743281 10.60772519 10.77749131
10.91692359 11.00021035]
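Calling predict with no arguments reuses the design matrix from estimation, so the in-sample predictions coincide with the fitted values. A minimal check, not part of the original notebook:

# In-sample predictions equal the stored fitted values when no new exog is passed.
assert np.allclose(olsres.predict(), olsres.fittedvalues)
assert np.allclose(ypred, olsres.fittedvalues)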
Create a new sample of explanatory variables Xnew, predict and plot
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.00086861 10.88233532 10.66231276 10.38114221 10.09192696 9.84753081
9.68763516 9.6290236 9.66147272 9.75025514]
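predict returns point estimates only. For interval estimates, the results object also provides get_prediction, whose summary_frame includes confidence and prediction intervals; a minimal sketch (output not shown here):

# Sketch: interval estimates for the out-of-sample points.
# summary_frame contains mean, mean_ci_lower/upper and obs_ci_lower/upper columns.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))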
Plot comparison
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f86724d90f0>
Predicting with Formulas
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform; i.e., we do not want any expansion magic from using **2 inside the formula.
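The reason is that the formula engine (patsy) interprets ** as term expansion rather than arithmetic. A small illustration of the difference, assuming patsy is available; the demo names a and b are arbitrary:

# Sketch: ** inside a formula expands terms; I() protects arithmetic.
from patsy import dmatrix
demo = {"a": x1, "b": np.sin(x1)}
print(dmatrix("(a + b)**2", demo).design_info.column_names)
# ['Intercept', 'a', 'b', 'a:b']   (term expansion, no squaring)
print(dmatrix("I((a - 5)**2)", demo).design_info.column_names)
# ['Intercept', 'I((a - 5) ** 2)'] (arithmetic, protected by I())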
[9]:
res.params
[9]:
Intercept 5.128744
x1 0.475984
np.sin(x1) 0.451402
I((x1 - 5) ** 2) -0.018046
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are built automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.000869
1 10.882335
2 10.662313
3 10.381142
4 10.091927
5 9.847531
6 9.687635
7 9.629024
8 9.661473
9 9.750255
dtype: float64
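As a quick consistency check (not in the original notebook), the formula-based predictions should agree with the matrix-based out-of-sample predictions computed earlier:

# Sketch: the formula interface rebuilds the same design matrix for x1n.
assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)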