IN5148: Statistics and Data Science with Applications in Engineering
Department of Industrial Engineering
Before we start, let’s import the data science libraries into Python.
Here, we use specific functions from the pandas, matplotlib, seaborn and sklearn libraries in Python.
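A minimal import block along these lines (a sketch; the pd, plt, and sns aliases are common conventions, and the scikit-learn tools listed are the ones used later in this section):

```python
# Core data science libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn tools for model fitting, data splitting, and evaluation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score
```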
Supervised learning includes algorithms that learn by example. The user provides the supervised algorithm with a known dataset that includes inputs and their corresponding known outputs. The algorithm must find a method for getting from those inputs to the outputs.
While the user knows the correct answers to the problem, the algorithm identifies patterns in the data, learns from observations, and makes predictions.
The algorithm makes predictions that can be corrected by the user, and this process continues until the algorithm reaches a high level of accuracy and performance.
Regression Problems. The response is numerical. For example, a person’s income, the value of a house, or a patient’s blood pressure.
Classification Problems. The response is categorical and involves K different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).
The predictors (\(\boldsymbol{X}\)) can be numerical or categorical.
Goal: Find the best function \(f(\boldsymbol{X})\) of the predictors \(\boldsymbol{X} = (X_1, \ldots, X_p)\) that describes the response \(Y\).
In mathematical terms, we want to establish the following relationship:
\[Y = f(\boldsymbol{X}) + \epsilon\]
We estimate \(f\) using training data.
We evaluate the estimated model \(\hat{f}\) using validation data.
We can use test data for a final evaluation of the model.
Test data comes from the same process that generated the training data, but is independent of it.

A common candidate function for predicting a response is the linear regression model. It has the mathematical form:
\[\hat{Y}_i = \hat{f}(\boldsymbol{X}_i) = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}.\]
where \(i = 1, \ldots, n_t\) indexes the \(n_t\) training observations.
\(\hat{Y}_i\) is the prediction of the actual value of the response \(Y_i\) associated with values of \(p\) predictors denoted by \(\boldsymbol{X}_i = (X_{i1}, \ldots, X_{ip})\).
The values \(\hat{\beta}_0\), \(\hat{\beta}_1\), …, \(\hat{\beta}_p\) are the coefficients of the model.
The values of \(\hat{\beta}_0\), \(\hat{\beta}_1\), …, \(\hat{\beta}_p\) are obtained from the training data using the method of least squares.
This method finds the coefficient values that minimize the total squared error made by the model \(\hat{f}(\boldsymbol{X}_i)\) when predicting the responses in the training set:
\[RSS = \sum_{i=1}^{n_t} (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip} ))^2 \]
where \(RSS\) means residual sum of squares.
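To make the least-squares objective concrete, here is a small sketch (with made-up data and coefficient values, not from the Advertising example) that evaluates the RSS for a single-predictor model:

```python
import numpy as np

# Toy training data: one predictor and its observed responses
X = np.array([10.0, 20.0, 30.0, 40.0])
Y = np.array([7.2, 7.8, 8.1, 8.9])

# Candidate coefficients (hypothetical values, not the least-squares solution)
beta0, beta1 = 7.0, 0.05

# Residual sum of squares: total squared prediction error on the training set
Y_hat = beta0 + beta1 * X
RSS = np.sum((Y - Y_hat) ** 2)
print(RSS)
```

Least squares searches over all possible values of \((\hat{\beta}_0, \hat{\beta}_1)\) for the pair that makes this quantity as small as possible.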
We use the dataset called “Advertising.xlsx” in Canvas.
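A sketch for loading the file with pandas (this assumes the file sits in the working directory; pd.read_excel() needs the openpyxl package for .xlsx files):

```python
import pandas as pd

# Load the Advertising dataset from the Excel file
data = pd.read_excel("Advertising.xlsx")
data.head()
```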
Now, let’s choose our predictor and response. In the definition of X_full, the double brackets [[ ]] are important because they produce a pandas DataFrame as output (rather than a Series). This makes it easier to fit the linear regression model with scikit-learn.
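For example, assuming the dataset contains columns named TV (predictor) and Sales (response):

```python
# Double brackets return a DataFrame (2-D), as scikit-learn expects for X
X_full = data[["TV"]]

# Single brackets return a Series (1-D), which suffices for the response
Y_full = data["Sales"]
```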
To evaluate a model’s performance on unobserved data, we split the current dataset into a training dataset and a validation dataset. To do this, we use the scikit-learn train_test_split() function.
We use 75% of the data for training and the rest for validation.
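A sketch of the split (the Y_full name and the random_state seed, included for reproducibility, are our choices):

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the observations for validation
X_train, X_valid, Y_train, Y_valid = train_test_split(
    X_full, Y_full, train_size=0.75, random_state=0
)
```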
In Python, we use the LinearRegression() and fit() functions from scikit-learn to fit a linear regression model.
The following commands allow you to show the estimated coefficients of the model.
We can also show the estimated intercept.
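A sketch of these steps, using the LRmodel name referenced later in this section:

```python
from sklearn.linear_model import LinearRegression

# Fit the linear regression model on the training data
LRmodel = LinearRegression()
LRmodel.fit(X_train, Y_train)

# Estimated slope coefficient(s) and intercept
print(LRmodel.coef_)
print(LRmodel.intercept_)
```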
The estimated model is therefore
\[\hat{Y}_i = 6.69 + 0.051 X_i.\]
After estimating the linear regression model, we can check the quality of its predictions on unobserved data, that is, on the data in the validation set.
One metric for this is the mean squared prediction error on the validation set (\(\text{MSE}_v\)):
\[\text{MSE}_v = \frac{\sum_{i=1}^{n_v} (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}))^2}{n_v}\]
The smaller \(\text{MSE}_v\), the better the predictions.
In practice, the square root of the mean prediction error is used:
\[\text{RMSE}_v = \sqrt{\text{MSE}_v}.\]
The advantage of \(\text{RMSE}_v\) is that it is on the same scale as the response and can be interpreted as the average variability of a model prediction.
For example, if \(\text{RMSE}_v = 1\), then a prediction of \(\hat{Y} = 5\) is off by about \(\pm 1\) on average.
To evaluate the model’s performance, we use the validation dataset. Specifically, we use the predictor matrix stored in X_valid.
In Python, we make the prediction using the pre-trained LRmodel.
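A sketch of the prediction step:

```python
# Predict the responses for the predictors in the validation set
Y_pred = LRmodel.predict(X_valid)
```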
To evaluate the model, we use the function mean_squared_error() from scikit-learn. Recall that the responses from the validation dataset are in Y_valid, and the model predictions are in Y_pred.
To obtain the root mean squared error (RMSE), we use root_mean_squared_error() instead.
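Both metrics side by side (root_mean_squared_error() is available in scikit-learn 1.4+; in older versions, take the square root of the MSE instead):

```python
from sklearn.metrics import mean_squared_error, root_mean_squared_error

# Compare actual validation responses with the model's predictions
MSE_v = mean_squared_error(Y_valid, Y_pred)
RMSE_v = root_mean_squared_error(Y_valid, Y_pred)
print(MSE_v, RMSE_v)
```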
In the context of Data Science, \(R^2\) can be interpreted as the squared correlation between the actual responses and those predicted by the model.
The higher the correlation, the better the agreement between the predicted and actual responses.
We compute \(R^2\) in Python as follows:
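A minimal sketch using r2_score() from scikit-learn (equivalently, one could square the correlation between Y_valid and Y_pred, per the interpretation above):

```python
from sklearn.metrics import r2_score

# R^2 on the validation set: agreement between actual and predicted responses
R2_v = r2_score(Y_valid, Y_pred)
print(R2_v)
```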
Consider the Advertising.xlsx dataset in Canvas.
Use a model to predict Sales that includes the Radio predictor (money spent on radio ads for a product, in $). What is the \(\text{RMSE}_v\)?
Now, use a model to predict Sales that includes two predictors: TV and Radio. What is the \(\text{RMSE}_v\)?
Which model do you prefer?
By using a training dataset that is much smaller than our full data, the estimated model \(\hat{f}(\boldsymbol{X})\) will be worse than if we had trained on all of the data. That is, its predictions are more likely to be far from the actual values.
Thus, the validation MSE is likely to be larger than if we had (a) used the full dataset and (b) fit the correct model.
In other words, using less than all of the data yields an \(\text{MSE}_v\) that is not a good representation of the predictive performance of \(\hat{f}(\boldsymbol{X})\).
Basic Idea: Divide the training data into \(K\) equally-sized divisions or folds (\(K = 5\) here).

In turn, each fold \(k = 1, \ldots, K\) is held out as a validation set while the model \(\hat{f}^{(-k)}\) is fit on the remaining \(K - 1\) folds; the fold-based error estimate \(CV_k(\hat{f}^{(-k)})\) is then computed on the held-out fold.
We average these fold-based error estimates to yield an evaluation metric:
\[CV(\hat{f}) = \frac{1}{K} \sum^{K}_{k=1} CV_k (\hat{f}^{(-k)}).\]
Called the \(K\)-fold cross-validation estimate.
Here, we used \(K = 5\) but another popular choice is \(K = 10\).
In Python, we apply \(K\)-fold cross validation (CV) using the function cross_val_score() from scikit-learn. The argument cv sets the number of folds to use, and scoring sets the evaluation metric to compute on the folds.
Unfortunately, cross_val_score() outputs negative scores for error metrics: scikit-learn's convention is that higher scores are better, so metrics like the MSE are negated. We simply turn them positive by multiplying them by -1 or adding a - symbol.
After that, we average the values using .mean() to obtain the \(5\)-fold CV estimate.
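A sketch of the 5-fold CV estimate of the MSE, computed on the full dataset:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold CV with the (negated) MSE as the evaluation metric
scores = cross_val_score(
    LinearRegression(), X_full, Y_full,
    cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign and average to obtain the 5-fold CV estimate
CV_MSE = (-scores).mean()
print(CV_MSE)
```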
Note that we can compute a \(K\)-fold CV estimate for any evaluation metric including the \(R^2\). To this end, we set scoring = "r2".
We can also compute a \(K\)-fold CV estimate for the root mean squared error (RMSE).
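The same call with other scoring values gives CV estimates of \(R^2\) and the RMSE (the RMSE scorer is likewise negated, so we flip its sign; \(R^2\) needs no sign flip because higher is already better):

```python
# 5-fold CV estimate of R^2
CV_R2 = cross_val_score(
    LinearRegression(), X_full, Y_full, cv=5, scoring="r2"
).mean()

# 5-fold CV estimate of the RMSE (negated by convention)
CV_RMSE = -cross_val_score(
    LinearRegression(), X_full, Y_full,
    cv=5, scoring="neg_root_mean_squared_error"
).mean()

print(CV_R2, CV_RMSE)
```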

Tecnológico de Monterrey