Introduction to Predictive Models

IN2004B: Generation of Value with Data Analytics

Alan R. Vazquez

Department of Industrial Engineering

Load the libraries


Before we start, let’s import the data science libraries into Python.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Here, we use specific functions from the pandas, matplotlib, seaborn and sklearn libraries in Python.

Main data science problems


Regression Problems. The response is numerical. For example, a person’s income, the value of a house, or a patient’s blood pressure.

Classification Problems. The response is categorical and involves K different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).

The predictors (\(\boldsymbol{X}\)) can be numerical or categorical.


Regression problem


Goal: Find the best function \(f(\boldsymbol{X})\) of the predictors \(\boldsymbol{X} = (X_1, \ldots, X_p)\) that predicts the response \(Y\).

In mathematical terms, we want to establish the following relationship:

\[Y = f(\boldsymbol{X}) + \epsilon\]

  • Where \(\epsilon\) is a natural (random) error.

How to find the shape of \(f(\boldsymbol{X})\)?


Using training data.


How to evaluate the quality of the candidate function \(\hat{f}(\boldsymbol{X})\)?

Using validation data.


Moreover…


We can use test data for a final evaluation of the model.

Test data is data obtained from the same process that generated the training data.

Test data is independent of the training data.
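
For illustration, here is a minimal sketch of a three-way split using two successive calls to train_test_split(). The 60/20/20 proportions, the random_state, and the file name are assumptions for the example (the same dataset is used later in this lecture).

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical three-way split: 60% training, 20% validation, 20% test.
data = pd.read_excel('Advertising.xlsx')
X = data.filter(['TV'])
Y = data.filter(['Sales'])

# First, split off the test set (20% of the full data).
X_tmp, X_test, Y_tmp, Y_test = train_test_split(X, Y, test_size = 0.20,
                                                random_state = 1)

# Then split the remainder: 0.25 of the remaining 80% is 20% of the total.
X_train, X_valid, Y_train, Y_valid = train_test_split(X_tmp, Y_tmp,
                                                      test_size = 0.25,
                                                      random_state = 1)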


Linear regression model

A common candidate function for predicting a response is the linear regression model. It has the mathematical form:

\[\hat{Y}_i = \hat{f}(X_i) = \hat{\beta}_0 + \hat{\beta}_1 X_i.\]

  • Where \(i = 1, \ldots, n_t\) indexes the \(n_t\) observations in the training data.

  • \(\hat{Y}_i\) is the prediction of the actual value of the response \(Y_i\) associated with a predictor value equal to \(X_i\).

  • \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are called the estimated coefficients of the model.

The values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are obtained using the training dataset and the least squares method.

This method finds the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the error made by the model \(\hat{f}(X_i)\) when trying to predict the responses (\(Y_i\)) of the training dataset.


Technically, the least squares method finds the \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the following expression

\[(Y_1 - (\hat{\beta}_0 + \hat{\beta}_1 X_1 ))^2 + (Y_2 - (\hat{\beta}_0 + \hat{\beta}_1 X_2 ))^2 + \cdots + (Y_{n_t} - (\hat{\beta}_0 + \hat{\beta}_1 X_{n_t} ))^2 \]

For the \(n_t\) observations in the training data!
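
For a simple linear regression, the minimizers have a well-known closed form: \(\hat{\beta}_1 = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_i (X_i - \bar{X})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). Here is a minimal NumPy sketch with made-up training data:

import numpy as np

# Hypothetical training data; replace with your own arrays.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form least squares estimates.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()

# The sum of squared errors that these estimates minimize.
sse = np.sum((y - (beta0 + beta1 * x))**2)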

The idea in two dimensions

Example


We use the dataset “Advertising.xlsx” from Canvas.

  • TV: Money spent on TV ads for a product ($).
  • Sales: Sales generated from the product ($).
  • 200 markets
# Load the data into Python
Ads_data = pd.read_excel('Advertising.xlsx')


Ads_data.head()
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9



Now, let’s choose our predictor and response.

# Choose the predictor.
X_full = Ads_data.filter(['TV'])

# Set the response.
Y_full = Ads_data.filter(['Sales'])
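
The .filter() calls return one-column DataFrames. An equivalent, perhaps more familiar, idiom uses double-bracket indexing:

# Equivalent to the .filter() calls above.
X_full = Ads_data[['TV']]
Y_full = Ads_data[['Sales']]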

Create training and validation data


To evaluate a model’s performance on unobserved data, we split the current dataset into a training dataset and a validation dataset. To do this, we use train_test_split().

X_train, X_valid, Y_train, Y_valid = train_test_split(X_full, Y_full, 
                                                      test_size = 0.25,
                                                      random_state = 301655)

We use 75% of the data for training and the rest for validation.
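
As a quick sanity check, we can print the dimensions of the resulting objects. With 200 markets and test_size = 0.25, we expect 150 training and 50 validation observations:

print(X_train.shape, X_valid.shape)
# Expected output: (150, 1) (50, 1)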

Fit a linear regression model in Python


In Python, we use the LinearRegression() class and its .fit() method from scikit-learn to fit a linear regression model.

# 1. Create the linear regression model
LRmodel = LinearRegression()

# 2. Fit the model.
LRmodel.fit(X_train, Y_train)

The following command shows the estimated coefficients of the model.

print("Coefficients:", LRmodel.coef_)
Coefficients: [[0.05185463]]

We can also show the estimated intercept.

print("Intercept:", LRmodel.intercept_)
Intercept: [6.69303889]


The estimated model is thus

\[\hat{Y}_i = 6.69 + 0.052 X_i.\]
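
For example, a hypothetical market that spends $100 on TV ads is predicted to generate about \(6.69 + 0.052 \times 100 \approx 11.88\) in sales. We can verify this with the fitted model:

# Predict sales for a hypothetical market spending $100 on TV ads.
new_market = pd.DataFrame({'TV': [100]})
print(LRmodel.predict(new_market))  # approximately [[11.88]]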

Model assumptions


To use the regression model, the model errors \(e_i = Y_i - \hat{Y}_i\) obtained on the training data must meet three conditions:

  1. On average, they must be equal to 0.
  2. They must have the same dispersion or variability.
  3. They must be independent of each other.

These assumptions are evaluated using a graphical analysis of residuals (model errors).

In Python

For a residual analysis, we need to compute the residuals and predictions using the training dataset. Additionally, we place these objects together with the ID of the observations in a pandas dataframe for further processing.

# Fitted values.
fitted = LRmodel.predict(X_train) + 0*Y_train

# Residuals
residuals = Y_train - fitted

# Construct a pandas DataFrame
residual_data = pd.DataFrame()
residual_data["Fitted"] = fitted
residual_data["Residuals"] = residuals
residual_data["ID"] = residuals.index

A technical Python note

Note the somewhat unusual way of saving the model’s predictions on the training dataset: we added 0*Y_train.


# Fitted values.
fitted = LRmodel.predict(X_train) + 0*Y_train


The addition of 0*Y_train does not impact the predicted responses. Instead, it allows the fitted object to have the same indices as the residuals object in the previous section.

In this way, the two objects can be put together in a pandas DataFrame and visualized using seaborn.
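
An alternative sketch that achieves the same alignment, if you prefer to set the index explicitly, wraps the predictions in a pandas Series with Y_train’s index:

# Equivalent alternative: build a Series with the training index directly.
fitted = pd.Series(LRmodel.predict(X_train).ravel(), index = Y_train.index)
residuals = Y_train['Sales'] - fitted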

To validate assumptions 1 and 2, we use a “residuals versus fitted values” plot.

Code
# Residual vs Fitted Values Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data = residual_data, x = "Fitted", y = "Residuals")
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted (predicted) Values', fontsize = 14)
plt.ylabel('Residuals', fontsize = 14)
plt.show()

To validate assumption 3, we use the “residuals versus time” plot. Here, we use the ID as a time variable.

Code
# Residual vs ID Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data = residual_data, x = "ID", y = "Residuals")
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Time')
plt.xlabel('ID', fontsize = 14)
plt.ylabel('Residuals', fontsize = 14)
plt.show()

Prediction error

After estimating and validating the linear regression model, we can check the quality of its predictions on unobserved data. That is, on the data in the validation set.


One metric for this is the mean squared error (MSE\(_v\)):

\[\text{MSE}_v = \frac{(Y_1 - (\hat{\beta}_0 + \hat{\beta}_1 X_1 ))^2 + (Y_2 - (\hat{\beta}_0 + \hat{\beta}_1 X_2 ))^2 + \cdots + (Y_{n_v} - (\hat{\beta}_0 + \hat{\beta}_1 X_{n_v} ))^2}{n_v}\]

  • For the \(n_v\) observations in the validation data!

The smaller \(\text{MSE}_v\), the better the predictions.


In practice, we use the square root of the mean squared error:

\[\text{RMSE}_v = \sqrt{\text{MSE}_v}.\]

The advantage of \(\text{RMSE}_v\) is that it is in the same units as the response \(Y\), which facilitates its interpretation.

For example, if \(\text{RMSE}_v = 1\), then a prediction of \(\hat{Y} = 5\) will have an (average) error of about \(\pm 1\).

In Python


To evaluate the model’s performance, we use the validation dataset.


First, we predict the responses using the predictor matrix in X_valid and our trained model in LRmodel. To this end, we use the .predict() method.

Y_pred = LRmodel.predict(X_valid)


Next, we compute the mean squared error with the mean_squared_error() function. Recall that the responses from the validation dataset are in Y_valid, and the model predictions are in Y_pred.

mse = mean_squared_error(Y_valid, Y_pred)  # Mean Squared Error (MSE)
print(round(mse, 2))
15.49


To obtain the root mean squared error (RMSE), we simply take the square root of the MSE.

print(round(mse**(1/2), 2))
3.94
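
Depending on your scikit-learn version, the RMSE can also be computed directly; the root_mean_squared_error() function was added in scikit-learn 1.4:

# Requires scikit-learn >= 1.4.
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(Y_valid, Y_pred)
print(round(rmse, 2))  # 3.94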

Mini-Activity (solo mode)



  1. Consider the Advertising.xlsx dataset in Canvas.

  2. Use a model to predict Sales that includes the Radio predictor (money spent on radio ads for a product ($)). What is the \(\text{RMSE}_v\)?

  3. Now, use a model to predict Sales that includes two predictors: TV and Radio. What is the \(\text{RMSE}_v\)? (A starter sketch appears after this list.)

  4. Which model do you prefer?
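
As a starting point, here is a minimal sketch for the two-predictor model in step 3. It assumes the objects Ads_data and Y_full defined earlier; only the predictor matrix changes, and the RMSE values are left for you to compute.

# Two-predictor model: only the predictor matrix changes.
X_full2 = Ads_data.filter(['TV', 'Radio'])
X_train2, X_valid2, Y_train2, Y_valid2 = train_test_split(X_full2, Y_full,
                                                          test_size = 0.25,
                                                          random_state = 301655)

LRmodel2 = LinearRegression().fit(X_train2, Y_train2)
Y_pred2 = LRmodel2.predict(X_valid2)
print(mean_squared_error(Y_valid2, Y_pred2)**(1/2))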

Other candidate functions

The linear regression model is one of the most common models for predicting a response. It is simple and easy to calculate and interpret. However, it can be limited for complex problems.

To address such problems, there are other, more advanced candidate functions \(\hat{f}(\boldsymbol{X})\), such as:

  • Decision or regression trees.
  • Ensemble methods (bagging and random forests).
  • K-nearest neighbors.
