Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50   # number of observations
sig = 0.25     # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)  # prepend a column of ones for the intercept
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
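
Note that the outputs shown in this example were generated without a fixed seed, so your numbers will differ from run to run. To make them repeatable, seed NumPy's generator before drawing the noise (a minimal sketch; the seed value 12345 is arbitrary):

np.random.seed(12345)  # any fixed seed makes the noise draws repeatable
y = y_true + sig * np.random.normal(size=nsample)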

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1288.
Date:                Tue, 27 Nov 2018   Prob (F-statistic):           2.29e-44
Time:                        16:38:36   Log-Likelihood:                 8.4025
No. Observations:                  50   AIC:                            -8.805
Df Residuals:                      46   BIC:                            -1.157
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9121      0.073     67.581      0.000       4.766       5.058
x1             0.4991      0.011     44.521      0.000       0.477       0.522
x2             0.5597      0.044     12.702      0.000       0.471       0.648
x3            -0.0192      0.001    -19.541      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        0.738   Durbin-Watson:                   2.139
Prob(Omnibus):                  0.691   Jarque-Bera (JB):                0.559
Skew:                           0.255   Prob(JB):                        0.756
Kurtosis:                       2.915   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
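
The printed summary is convenient for reading, but the same quantities are also available as attributes on the results object. A few examples (these attribute names are part of statsmodels' RegressionResults API):

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # their standard errors
print(olsres.conf_int())  # 95% confidence intervals by default
print(olsres.rsquared)    # R-squared of the fit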

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.43125888  4.93242782  5.39068647  5.77553005  6.06746278  6.26120101
  6.3665413   6.40675087  6.41474486  6.42767847  6.48084246  6.60186499
  6.80617239  7.09445464  7.45255201  7.85378151  8.26332075  8.64392924
  8.96206683  9.19340431  9.32682062  9.36622998  9.32993883  9.24763811
  9.15552515  9.09035651  9.08340903  9.15534108  9.3127985   9.54732268
  9.83674028 10.14880626 10.44650164 10.69411565 10.86311283 10.93682141
 10.91317238 10.80503944 10.6381239  10.446732   10.26813852 10.13646262
 10.07706215 10.10236748 10.20983914 10.38238512 10.59116903 10.80034778
 10.97296058 11.07700142]
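
For the in-sample case, predict(X) simply reproduces the fitted values that are stored on the results object; a quick sanity check:

# in-sample predictions coincide with the stored fitted values
assert np.allclose(ypred, olsres.fittedvalues)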

Create a new sample of explanatory variables, Xnew; predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.08018447 10.93714407 10.66983069 10.32826674  9.97829929  9.68547844
  9.49900834  9.43970031  9.49487754  9.62147875]
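
Point predictions alone carry no notion of uncertainty. If you also need confidence or prediction intervals for the new observations, statsmodels offers get_prediction (a sketch; alpha=0.05 gives 95% intervals, and the column names follow summary_frame's conventions):

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, mean_ci_*, obs_ci_* columns
print(frame)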

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
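
Building on the interval frame sketched above, the out-of-sample prediction band could be shaded onto the same axes (assuming frame from the get_prediction sketch is available):

# shade the 95% prediction interval over the out-of-sample range
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");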

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any expansion magic from using **2.

In [8]:
res.params
Out[8]:
Intercept           4.912084
x1                  0.499067
np.sin(x1)          0.559730
I((x1 - 5) ** 2)   -0.019233
dtype: float64
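
These are the same estimates as from the array interface above; only the labeling differs. A quick check (res.params is a pandas Series, olsres.params a plain array):

# formula and array interfaces fit the identical model
assert np.allclose(res.params.values, olsres.params)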

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    11.080184
1    10.937144
2    10.669831
3    10.328267
4     9.978299
5     9.685478
6     9.499008
7     9.439700
8     9.494878
9     9.621479
dtype: float64
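
As a final check, the formula-based out-of-sample predictions match the ones computed from Xnew by hand:

assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)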