Let’s import scikit-learn into Python together with the other relevant libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.feature_selection import VarianceThreshold
We will not use all the functions from the scikit-learn library. Instead, we will use specific functions from the submodules preprocessing, feature_selection, model_selection, and impute.
Transforming predictors
Categorical predictors
A categorical predictor takes on values that are categories or groups.
For example:
Type of school: Public or private.
Treatment: New or placebo.
Grade: Passed or not passed.
The categories can be represented by names, labels or even numbers. Their use in regression requires dummy variables, which are numeric.
Dummy variables
The traditional choice for a dummy variable is a binary variable, which can only take the values 0 and 1.
Initially, a categorical variable with \(k\) categories requires \(k\) dummy variables.
Example 1
A market analyst is studying quality characteristics of cars. Specifically, the analyst is investigating whether the miles per gallon (mpg) of a car can be predicted using:
\(X_1:\) cylinders. Number of cylinders, between 4 and 8.
\(X_2:\) displacement. Engine displacement (cu. inches).
\(X_3:\) horsepower. Engine horsepower.
\(X_4:\) weight. Vehicle weight (lbs.).
\(X_5:\) acceleration. Time to accelerate from 0 to 60 mph (sec.).
\(X_6:\) origin. Origin of car (American, European, Japanese).
The dataset is in the file “auto.xlsx”. Let’s read the data using pandas.
# Load the Excel file into a pandas DataFrame.
auto_data = pd.read_excel("auto.xlsx")

# Set categorical variables.
auto_data['origin'] = pd.Categorical(auto_data['origin'])
Training and validation datasets
For illustrative purposes, we assume that we want to predict the miles per gallon of a car using the six predictors, \(X_1, \ldots, X_6\).
We create the predictor matrix and response column.
# Set full matrix of predictors.
X_full_c = (auto_data
            .filter(['cylinders', 'displacement', 'horsepower',
                     'weight', 'acceleration', 'origin']))

# Set full matrix of responses.
Y_full_c = auto_data.filter(['mpg'])
We will use a validation dataset with 30% of the observations in auto_data. The other 70% will be in the training dataset.
# Split the dataset into training and validation.
X_train_c, X_valid_c, Y_train_c, Y_valid_c = train_test_split(
    X_full_c, Y_full_c, test_size=0.3, random_state=59227)
The dataset has missing values. In this example, we remove each row with at least one missing value.
# Combine the training predictors and response into one dataframe.
training_auto = pd.concat([X_train_c, Y_train_c], axis=1)

# Remove rows with missing values.
complete_Auto = training_auto.dropna()
Example 1 (cont.)
Categorical predictor: Origin of a car. Three categories: American, European and Japanese.
Initially, 3 dummy variables are required:
\[d_1 =
\begin{cases}
1 \text{ if the observation is from an American car}\\
0 \text{ otherwise}
\end{cases}\]

\[d_2 =
\begin{cases}
1 \text{ if the observation is from a European car}\\
0 \text{ otherwise}
\end{cases}\]

\[d_3 =
\begin{cases}
1 \text{ if the observation is from a Japanese car}\\
0 \text{ otherwise}
\end{cases}\]
The variable Origin would then be replaced by the three dummy variables:

| Origin (\(X\)) | \(d_1\) | \(d_2\) | \(d_3\) |
|---|---|---|---|
| American | 1 | 0 | 0 |
| American | 1 | 0 | 0 |
| European | 0 | 1 | 0 |
| European | 0 | 1 | 0 |
| American | 1 | 0 | 0 |
| Japanese | 0 | 0 | 1 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
A drawback
A drawback with the initial dummy variables is that they are linearly dependent. That is, \(d_1 + d_2 + d_3 = 1\).
Therefore, we can determine the value of \(d_1 = 1- d_2 - d_3.\)
Predictive models such as linear regression are sensitive to linear dependencies among predictors.
The solution is to drop one of the dummy variables, say, \(d_1\), from the data.
The variable Origin would then be replaced by the two remaining dummy variables:

| Origin (\(X\)) | \(d_2\) | \(d_3\) |
|---|---|---|
| American | 0 | 0 |
| American | 0 | 0 |
| European | 1 | 0 |
| European | 1 | 0 |
| American | 0 | 0 |
| Japanese | 0 | 1 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
Dummy variables in Python
We can get the dummy variables of a categorical variable using the function pd.get_dummies() from pandas.
The input of the function is the categorical variable.
The function has an extra argument called drop_first to drop the first dummy variable. It also has the argument dtype to return the values as integers rather than booleans.
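For instance, a minimal sketch of this applied to the origin variable in the auto training data (the name origin_dummies is just illustrative):

# Create dummy variables for origin, dropping the first one
# and returning 0/1 integers instead of booleans.
origin_dummies = pd.get_dummies(X_train_c['origin'], drop_first=True, dtype=int)

# Preview the result.
print(origin_dummies.head())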
A common problem with a predictor is that it may have a skewed distribution, that is, a distribution that concentrates many observations at the smaller or the larger values of the predictor.
For example, consider the distribution of the predictor DIS (weighted mean of distances to five Boston employment centers) in the “BostonHousing.xlsx” dataset.
Example 2
We revisit the “BostonHousing.xlsx” dataset that contains data of houses in Boston, Massachusetts.
The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms.
The response is the median value of owner-occupied homes in $1000s, contained in the column MEDV.
The predictors
CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River (‘Yes’ if tract bounds river; ‘No’ otherwise).
NOX: nitrogen oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built prior to 1940.
DIS: weighted mean of distances to five Boston employment centers.
RAD: index of accessibility to radial highways (‘Low’, ‘Medium’, ‘High’).
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
LSTAT: lower status of the population (percent).
Read the dataset
We read the dataset.
Boston_data = pd.read_excel('BostonHousing.xlsx')
The variables CHAS and RAD are categorical. Therefore, we turn them into categorical variables using the pd.Categorical() function from pandas, as sketched below.
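One way to do this, following the same pattern used earlier for the auto data:

# Set categorical variables.
Boston_data['CHAS'] = pd.Categorical(Boston_data['CHAS'])
Boston_data['RAD'] = pd.Categorical(Boston_data['RAD'])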
Before we continue, we generate a training dataset with 80% of the observations. To this end, let’s create the predictor matrix and response column.
# Set full matrix of predictors.
X_full_b = Boston_data.drop(columns=['MEDV'])

# Set full matrix of responses.
Y_full_b = Boston_data.filter(['MEDV'])

# Split the dataset into training and validation.
X_train_b, X_valid_b, Y_train_b, Y_valid_b = train_test_split(
    X_full_b, Y_full_b, test_size=0.2, random_state=59227)
A right-skewed distribution is characterized by the right tail (or the side with higher values) being longer and more extended than the left tail.
Code
plt.figure(figsize=(8, 3.5))           # Create space for figure.
sns.histplot(data=X_train_b, x='DIS')  # Create the histogram.
plt.title("Histogram of DIS")          # Plot title.
plt.xlabel("DIS")                      # X label.
plt.show()                             # Display the plot.
Problems with strong skewness
Strong (right or left) skewness in a predictor can:
Distort the relationship with the response.
Violate the assumptions of the linear regression model (e.g., linearity, homoscedasticity).
Lead to influential outliers or poor model fit.
To correct the skewness of a predictor’s distribution, we can transform the predictor using the Box-Cox transformation.
Box-Cox Transformation
The Box-Cox transformation uses a family of power transformations of a predictor \(X\) of the form \(X' = X^{\lambda}\), where \(\lambda\) is a parameter determined from the data. When \(\lambda = 0\), the transformation is \(X' = \ln(X)\), where \(\ln(\cdot)\) is the natural logarithm.
The Box-Cox transformation is the element of this family that results in a transformed variable that follows a normal distribution (approximately).
The variable \(X\) must be strictly positive!
In Python
We apply the Box-Cox transformation using the PowerTransformer class and its fit_transform method from the scikit-learn library.
In PowerTransformer, the argument standardize specifies whether we wish to standardize the resulting transformed predictor to have a mean of zero and a standard deviation of one. We do not standardize the predictor yet.
The result from fit_transform is a numpy array. However, since we are working with pandas dataframes, we must transform the output back to this format.
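A minimal sketch of the transformation and the conversion, assuming we transform the predictor DIS from the training data X_train_b (the names pt and DIS_transform_df match the code used further below):

# Set up the Box-Cox transformation without standardizing the result.
pt = PowerTransformer(method='box-cox', standardize=False)

# Fit and apply the transformation to DIS; the output is a numpy array.
DIS_array = pt.fit_transform(X_train_b[['DIS']])

# Convert the array back to a pandas DataFrame.
DIS_transform_df = pd.DataFrame(DIS_array, columns=['DIS_boxcox'],
                                index=X_train_b.index)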
plt.figure(figsize=(7, 4))             # Create space for figure.
sns.histplot(data=X_train_b, x='DIS')  # Create the histogram.
plt.title("Histogram of DIS")          # Plot title.
plt.xlabel("DIS")                      # X label.
plt.show()                             # Display the plot.
Box-Cox transformation
Code
plt.figure(figsize=(7, 4))                           # Create space for figure.
sns.histplot(data=DIS_transform_df, x='DIS_boxcox')  # Create the histogram.
plt.title("Histogram of transformed DIS")            # Plot title.
plt.xlabel("DIS (Box-Cox transformation)")           # X label.
plt.show()                                           # Display the plot.
Value of \(\lambda\)
Recall that the Box-Cox transformation finds the value of \(\lambda\) that makes the transformed predictor \(X^\lambda\) follow a normal distribution as closely as possible.
We can obtain the value of \(\lambda\) actually used in the transformation using the following code.
# Retrieve the lambda value.
print(pt.lambdas_)
[-0.12343575]
This means that the transformed predictor is \(X' = X^{-0.123}\).
Discussion
The Box-Cox transformation can:
Make the distribution of a predictor more symmetric.
Help to linearize the relationship with the response.
After transformation, the predictor \(X'\) must be used instead of the original predictor \(X\).
Reducing the number of predictors
Removing predictors
There are potential advantages to removing predictors prior to modeling:
Fewer predictors means decreased computational time and complexity.
If two predictors are highly correlated, they are essentially measuring the same underlying information, so removing one should not compromise the performance of the model.
Here, we will see two techniques to remove predictors.
Near-Zero variance predictors
A near-zero variance predictor variable is one that has only a handful of unique values that occur with very low frequencies.
If the predictor has a single unique value, then it is called a zero-variance predictor variable.
Since the values of this predictor variable do not vary or change at all, this predictor does not provide any information to the model and must be discarded.
Example 3
We consider a data set related to Glass identification. The data has 214 glass samples labeled as one of seven glass categories. There are nine predictors including the refractive index (RI) and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Let’s read the data set.
# Load Excel file (make sure the file is in your Colab).
glass_data = pd.read_excel('glass.xlsx')
Training and validation datasets
Let’s create a training and validation dataset. To this end, we will consider the variable Type as the response and the other numeric variables as the predictors. We will use 80% of the observations for the training dataset. The other 20% of the observations will go to the validation dataset.
# Set full matrix of predictors.
X_full_g = glass_data.drop(columns=['Type'])

# Set full matrix of responses.
Y_full_g = glass_data.filter(['Type'])

# Split the dataset into training and validation.
X_train_g, X_valid_g, Y_train_g, Y_valid_g = train_test_split(
    X_full_g, Y_full_g, test_size=0.2, random_state=59227)
In Python
We use the function VarianceThreshold() to set the threshold for determining a low sample variance. We also use its methods .fit() and .get_support().
# Set threshold.
selector = VarianceThreshold(threshold=0.01)

# Apply threshold.
selector.fit(X_train_g)

# Identify predictors with low variance.
low_variance_cols = X_train_g.columns[~selector.get_support()]

# Print the list of predictors.
print(f"Low variance columns: {low_variance_cols}")
After identifying predictors with low variability, we remove them because they provide little information to the model. To this end, we use the command below.
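That command might look like this (the name X_train_g_reduced is just illustrative):

# Drop the low-variance predictors from the training data.
X_train_g_reduced = X_train_g.drop(columns=low_variance_cols)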
Next, we examine the pairwise correlations among the numeric predictors in the auto training data. We plot the correlation matrix using the function heatmap() from seaborn. The argument annot shows the actual values of the pairwise correlations, and cmap sets a nice color theme.
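A sketch of this plot, assuming the numeric training predictors are stored in a dataframe named complete_sbAuto (the name used later in this section):

# Compute pairwise correlations among the numeric predictors.
corr_matrix = complete_sbAuto.corr(numeric_only=True)

# Plot the correlation matrix as an annotated heatmap.
plt.figure(figsize=(7, 5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation matrix of numeric predictors")
plt.show()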
The predictors cylinders and displacement are highly correlated. In fact, their correlation is 0.95.
In practice
We deal with collinearity by removing the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold, say, 0.75.
We can identify the variables that are highly correlated using quite complex code. However, here we will do it manually using the correlation heatmap.
# Dataset without highly correlated predictors.
reduced_auto = complete_sbAuto[['weight', 'acceleration']]
Standardization
Predictors with different units
Many predictive models have issues with numeric predictors measured in different units:
Methods such as K-nearest neighbors are based on the distance between observations. If the predictors are on different units or scales, then some predictors will have a larger weight for computing the distance.
Other methods such as LASSO use the variances of the predictors in their calculations. Predictors with different scales will have different variances and so, those with a higher variance will play a bigger role in the calculations.
In a nutshell, some predictors will have a higher impact on the model due to their units rather than the information they provide.
Standardization
Standardization refers to centering and scaling each numeric predictor individually. It puts every predictor on the same scale.
To center a predictor variable, the average predictor value is subtracted from all the values.
Therefore, the centered predictor has a zero mean (that is, its average value is zero).
To scale a predictor, each of its values is divided by its standard deviation.
Scaling coerces the values to have a common standard deviation of one.
In mathematical terms, we standardize a predictor \(X\) as
\[X' = \frac{X - \bar{X}}{s_X},\]
where \(\bar{X}\) is the sample mean and \(s_X\) is the sample standard deviation of the predictor.
We apply standardization to the five numerical predictors in the complete_sbAuto dataset.
complete_sbAuto.head()
|     | cylinders | displacement | horsepower | weight | acceleration |
|-----|-----------|--------------|------------|--------|--------------|
| 46  | 6 | 250.0 | 100 | 3282 | 15.0 |
| 72  | 8 | 307.0 | 130 | 4098 | 14.0 |
| 389 | 4 | 135.0 | 84  | 2295 | 11.6 |
| 145 | 4 | 90.0  | 75  | 2108 | 15.5 |
| 11  | 8 | 340.0 | 160 | 3609 | 8.0  |
Two predictors in original units
Consider the complete_sbAuto dataset created previously, and two points in a scatterplot of two of its predictors: \((175, 5140)\) and \((69, 1613)\).
The distance between these points is \(\sqrt{(69 - 175)^2 + (1613 - 5140)^2} = \sqrt{11236 + 12439729} = 3528.592\). Notice that the second coordinate dominates the distance simply because it is measured on a much larger scale.
Standardization in Python
To standardize numerical predictors, we use the StandardScaler() class. Moreover, we apply it to the variables using its fit_transform() method.
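A minimal sketch, assuming we standardize the numeric predictors in complete_sbAuto (the name complete_sbAuto_scaled is just illustrative):

# Set up the standardizer (zero mean, unit standard deviation).
scaler = StandardScaler()

# Standardize the numeric predictors; the result is a numpy array.
scaled_array = scaler.fit_transform(complete_sbAuto)

# Convert the array back to a pandas DataFrame with the original
# column names and row index.
complete_sbAuto_scaled = pd.DataFrame(scaled_array,
                                      columns=complete_sbAuto.columns,
                                      index=complete_sbAuto.index)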