Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))  # linear, sinusoidal, and quadratic terms
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]  # true coefficients: const, x1, sin(x1), (x1-5)**2
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
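
Because y includes a random draw, the printed numbers below change from run to run. For reproducible output you can seed NumPy's generator before drawing the noise; a minimal sketch (the seed value is arbitrary):

np.random.seed(12345)  # fix the seed so the noise draw is reproducible
y = y_true + sig * np.random.normal(size=nsample)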

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     717.2
Date:                Tue, 27 Nov 2018   Prob (F-statistic):           1.30e-38
Time:                        16:24:23   Log-Likelihood:                -4.1574
No. Observations:                  50   AIC:                             16.31
Df Residuals:                      46   BIC:                             23.96
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0452      0.093     53.994      0.000       4.857       5.233
x1             0.4853      0.014     33.674      0.000       0.456       0.514
x2             0.4527      0.057      7.991      0.000       0.339       0.567
x3            -0.0192      0.001    -15.187      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.262   Durbin-Watson:                   2.611
Prob(Omnibus):                  0.877   Jarque-Bera (JB):                0.400
Skew:                          -0.150   Prob(JB):                        0.819
Kurtosis:                       2.680   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
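
Beyond the printed summary, the fitted results object exposes the estimates programmatically through the statsmodels results API; for example:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit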

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.56482744   5.01781442   5.43487684   5.79134312   6.07144551
   6.2709107    6.39766195   6.47051724   6.51609761   6.56445334
   6.64412683   6.77746309   6.97693859   7.24311138   7.56452939
   7.91961201   8.28019568   8.61616111   8.90038181   9.11318084
   9.24556374   9.30069625   9.2933842    9.24764096   9.19274211
   9.15841554   9.16995725   9.24407527   9.3861446    9.58932411
   9.83568049  10.09913469  10.3497466   10.55863412  10.70271844
  10.76851579  10.75435281  10.67064146  10.53816838  10.38467968
  10.24032236  10.13269125  10.0822952   10.0991876   10.18131461
  10.31485258  10.47647928  10.63720565  10.7671386   10.8403924 ]
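
Since we also know y_true, we can check the quality of the in-sample fit directly; a quick NumPy sketch (the residual RMSE should be close to sig = 0.25):

print(np.sqrt(np.mean((y - ypred)**2)))  # root-mean-square of the residuals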

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.82796178  10.69543862  10.46057598  10.16383085   9.85845887
   9.59747549   9.42067591   9.34489188   9.35987065   9.43078525]
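
Point predictions can be supplemented with interval estimates. statsmodels results objects provide get_prediction, whose summary_frame includes confidence and prediction intervals; a short sketch (alpha=0.05 gives 95% intervals):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # columns: mean, mean_se, mean_ci_*, obs_ci_*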

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2.
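
To confirm that I() just evaluates the enclosed expression numerically, we can build the design matrix with patsy (which the formula interface uses under the hood) and compare the resulting column to the raw NumPy computation; a minimal sketch:

import patsy

dm = patsy.dmatrix("I((x1 - 5)**2)", data)  # intercept plus the transformed column
print(np.allclose(np.asarray(dm)[:, 1], (x1 - 5)**2))  # True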

In [8]:
res.params
Out[8]:
Intercept           5.045222
x1                  0.485272
np.sin(x1)          0.452697
I((x1 - 5) ** 2)   -0.019216
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.827962
1    10.695439
2    10.460576
3    10.163831
4     9.858459
5     9.597475
6     9.420676
7     9.344892
8     9.359871
9     9.430785
dtype: float64
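
The same interval machinery is available through the formula interface: get_prediction accepts the same dict of new values as predict; a sketch mirroring the out-of-sample example above:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))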