Data pre-processing techniques generally refer to the addition, deletion, or transformation of data.
Pre-processing can make or break a model’s predictive ability.
For example, linear regression models (to be discussed later) are relatively insensitive to the characteristics of the predictor data, but advanced methods like K-nearest neighbors, principal component regression, and LASSO are not.
We will review some common strategies for processing predictors from the data, without considering how they might be related to the response.
In particular, we will review:
Dealing with missing values.
Transforming predictors.
Reducing the number of predictors.
Standardizing the units of the predictors.
scikit-learn library
scikit-learn is a robust and popular library for machine learning in Python
It provides simple, efficient tools for data mining and data analysis
It is built on top of libraries such as NumPy, SciPy, and Matplotlib
Let’s import scikit-learn into Python together with the other relevant libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.feature_selection import VarianceThreshold
We will not use all the functions from the scikit-learn library. Instead, we will use specific functions from the sub-libraries preprocessing, feature_selection, model_selection and impute.
Training, validation, and test datasets
Recall that …
In data science, we assume that
\[Y = f(\boldsymbol{X}) + \epsilon\]
where \(f(\boldsymbol{X})\) represents the true relationship between \(\boldsymbol{X} = (X_1, X_2, \ldots, X_p)\) and \(Y\).
\(f(\boldsymbol{X})\) is unknown and very complex!
Two datasets
Applying data science models requires two datasets:
Training data is data that we use to train or construct the estimated function \(\hat{f}(\boldsymbol{X})\).
Test data is data that we use to evaluate the predictive performance of \(\hat{f}(\boldsymbol{X})\) only.
A random sample of \(n\) observations.
Use it to construct \(\hat{f}(\boldsymbol{X})\).
Another random sample of \(n_t\) observations, which is independent of the training data.
Use it to evaluate \(\hat{f}(\boldsymbol{X})\).
Validation Dataset
In many practical situations, a test dataset is not available. To overcome this issue, we use a validation dataset.
Idea: Apply model to your validation dataset to mimic what will happen when you apply it to test dataset.
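A minimal, self-contained sketch of the idea with toy data (the actual split for Example 1 appears later in this section):
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data purely to illustrate holding out a validation set.
X = pd.DataFrame({'x1': range(10), 'x2': range(10, 20)})
Y = pd.Series(range(20, 30), name = 'y')

# 70% of the rows train the model; the held-out 30% mimic a test set.
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size = 0.3)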
Example 1
The file “BostonHousing.xlsx” contains data collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset covers 506 census housing tracts in the Boston area in the 1970s.
The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms.
The response is the median value of owner-occupied homes in $1000s, contained in the column MEDV.
The predictors
CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River (‘Yes’ if tract bounds river; ‘No’ otherwise).
NOX: nitrogen oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built prior to 1940.
DIS: weighted mean of distances to five Boston employment centers.
RAD: index of accessibility to radial highways (‘Low’, ‘Medium’, ‘High’).
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
LSTAT: lower status of the population (percent).
Read the dataset
We read the dataset and set the variables CHAS and RAD as categorical.
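A minimal sketch of this step, following the same pattern used later for the penguins data (the original code for this slide is not reproduced; Boston_data is the object name used in the code that follows):
# Load the Excel file into a pandas DataFrame.
Boston_data = pd.read_excel("BostonHousing.xlsx")
# Set categorical variables.
Boston_data['CHAS'] = pd.Categorical(Boston_data['CHAS'])
Boston_data['RAD'] = pd.Categorical(Boston_data['RAD'])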
We split the current dataset into a training and a validation dataset. To this end, we use the function train_test_split() from scikit-learn.
The function has three main inputs:
A pandas dataframe with the predictor columns only.
A pandas dataframe with the response column only.
The parameter test_size which sets the portion of the dataset that will go to the validation set.
Create the predictor matrix
We use the function .drop() from pandas. This function drops one or more columns from a data frame. Let’s drop the response column MEDV and store the result in X_full.
# Set full matrix of predictors.
X_full = Boston_data.drop(columns = ['MEDV'])
X_full.head(4)
   CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  LSTAT
0  0.00632  18.0   2.31  No    0.538  6.575  65.2  4.0900  Low  296     15.3   4.98
1  0.02731   0.0   7.07  No    0.469  6.421  78.9  4.9671  Low  242     17.8   9.14
2  0.02729   0.0   7.07  No    0.469  7.185  61.1  4.9671  Low  242     17.8   4.03
3  0.03237   0.0   2.18  No    0.458  6.998  45.8  6.0622  Low  222     18.7   2.94
Create the response column
We use the function .filter() from pandas to extract the column MEDV from the data frame. We store the result in Y_full.
# Set full matrix of responses.
Y_full = Boston_data.filter(['MEDV'])
Y_full.head(4)
   MEDV
0  24.0
1  21.6
2  34.7
3  33.4
Let’s partition the dataset
# Split the dataset into training and validation.
X_train, X_valid, Y_train, Y_valid = train_test_split(X_full, Y_full, test_size = 0.3)
By default, train_test_split() partitions the rows at random.
If you pass a categorical response to its stratify argument, it instead splits the data so that the distribution of the response is similar in the training and validation sets.
Usually, the proportion of the dataset that goes to the validation set is 20% or 30%.
The predictors and response in the training dataset are in the objects X_train and Y_train, respectively. We compile these objects into a single dataset using the function .concat() from pandas. The argument axis = 1 tells .concat() to concatenate the datasets column-wise, so that each row keeps its predictors and response together.
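A minimal sketch of this step (the name Boston_train is an assumption; any name works):
# Combine the training predictors and response into a single data frame.
Boston_train = pd.concat([X_train, Y_train], axis = 1)
Boston_train.head(4)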
After we have partitioned the data, we work on the training data to develop our predictive pipeline.
The pipeline has two main steps:
Data preprocessing.
Model development.
We will now discuss preprocessing techniques applied to the predictor columns in the training dataset.
Note that all preprocessing techniques will also be applied to the validation and test datasets to prepare them for your model!
Dealing with missing values
Missing values
In many cases, some predictors have no values for a given observation. It is important to understand why the values are missing.
There are four main types of missing data:
Structurally missing data is data that is missing for a logical reason or because it should not exist.
Missing completely at random assumes that the fact that the data is missing is unrelated to the other information in the data.
Missing at random assumes that we can predict the value that is missing based on the other available data.
Missing not at random assumes that there is a mechanism that generates the missing values, which may include observed and unobserved predictors.
For large data sets, removing observations with missing values is usually not a problem, provided the data are missing completely at random.
In smaller data sets, removing observations comes at a high price. To overcome this issue, we can use imputation methods, which estimate the missing values of a predictor using the values of the other predictors.
Here, we will introduce some simple methods for imputing missing values in categorical and numerical variables.
Example 2
Let’s use the penguins dataset available in the file “penguins.xlsx”.
# Load the Excel file into a pandas DataFrame.
penguins_data = pd.read_excel("penguins.xlsx")
# Set categorical variables.
penguins_data['sex'] = pd.Categorical(penguins_data['sex'])
penguins_data['species'] = pd.Categorical(penguins_data['species'])
penguins_data['island'] = pd.Categorical(penguins_data['island'])
Training and validation datasets
For illustrative purposes, we assume that we want to predict the species (in the column species) of a penguin using the predictors bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex.
We create the predictor matrix and response column.
# Set full matrix of predictors.
X_full_p = (penguins_data
            .filter(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm',
                     'body_mass_g', 'sex']))
# Set full matrix of responses.
Y_full_p = penguins_data.filter(['species'])
We use a validation dataset with 30% of the observations in penguins_data. The other 70% will be in the training dataset.
# Split the dataset into training and validation.
X_train_p, X_valid_p, Y_train_p, Y_valid_p = train_test_split(X_full_p, Y_full_p,
                                                              test_size = 0.3,
                                                              random_state = 59227)
In train_test_split(), we use the input random_state to set a random seed. Essentially, this allows you to obtain the same training and validation datasets every time you run the code. The usual advice is to set random_state to a large integer generated at random.
Since preprocessing techniques are meant for the predictors, we will work on the X_train_p data frame.
     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g   sex
31             37.2           18.1              178.0       3900.0  male
46             41.1           19.0              182.0       3425.0  male
195            49.6           15.0              216.0       4750.0  male
43             44.1           19.7              196.0       4400.0  male
196            50.5           15.9              222.0       5550.0  male
Let’s check if the dataset has missing observations using the function .info() from pandas.
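The check itself is a single call (its printed summary is not reproduced here):
# Show the column types and the counts of non-missing values.
X_train_p.info()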
In the output of the function, “non-null” refers to the number of entries in a column that have actual values. That is, the number of entries where there are not NaN.
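The discussion below refers to the complete cases only. A minimal sketch of that step, assuming the rows without any missing values are stored in complete_predictors (the original code is not shown):
# Keep only the rows with no missing values in any predictor column.
complete_predictors = X_train_p.dropna()
complete_predictors.info()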
The new data frame is complete because each column has 232 “non-null” values, which is the total number of rows in complete_predictors.
However, note that we have lost eight of the original observations in X_train_p!
Imputation using the mean
We can impute the missing values of a numeric variable using the mean or median of its available values. For example, consider the variable bill_length_mm that has two missing values.
In scikit-learn, we use the function SimpleImputer() to define the method of imputation of missing values.
Using SimpleImputer(), we set the method to impute missing values using the mean.
We also use the function fit_transform() to apply the imputation method to the variable.
# Imputation for numerical variables (using the mean).
num_imputer = SimpleImputer(strategy = 'mean')
# Replace the original variable with new version.
X_train_p['bill_length_mm'] = num_imputer.fit_transform(X_train_p[['bill_length_mm']])
After imputation, re-running X_train_p.info() shows that bill_length_mm no longer has any missing values.
To impute the missing values using the median, we simply set this method in SimpleImputer(). For example, let’s impute the missing values of bill_depth_mm.
# Imputation for numerical variables (using the median).
num_imputer = SimpleImputer(strategy = 'median')
# Replace the original variable with new version.
X_train_p['bill_depth_mm'] = num_imputer.fit_transform(X_train_p[['bill_depth_mm']])
# Show the information of the predictor.
X_train_p['bill_depth_mm'].info()
We use the sample mean when the data distribution is roughly symmetrical.
Pros: Simple and easy to implement.
Cons: Sensitive to outliers; may not be accurate for skewed distributions.
We use the sample median when the data is skewed (e.g., incomes, prices).
Pros: Less sensitive to outliers; robust for skewed distributions.
Cons: May reduce variability in the data (see the short comparison sketch below).
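To make the outlier point concrete, here is a small toy sketch (arbitrary values) contrasting the two strategies:
import numpy as np
import pandas as pd

# A skewed sample with one missing value and one large outlier.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

print(s.fillna(s.mean()))    # imputes 22.0, pulled upward by the outlier
print(s.fillna(s.median()))  # imputes 3.0, unaffected by the outlier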
Imputation method for a categorical variable
If a categorical variable has missing values, we can use the most frequent of the available values to replace the missing values. To this end, we use similar commands as before.
For example, let’s impute the missing values of sex using this strategy.
# Imputation for categorical variables (using the most frequent value).
cat_imputer = SimpleImputer(strategy = 'most_frequent')
# Apply imputation strategy for categorical variables.
X_train_p['sex'] = cat_imputer.fit_transform(X_train_p[['sex']]).ravel()
Let’s now have a look at the information of the dataset.
# Show the information of the dataset.
X_train_p.info()
The columns bill_length_mm, bill_depth_mm, and sex have 240 complete values.
Unfortunately, after applying cat_imputer to the dataset, the variable sex is stored as a plain object (string) column. To change it back to categorical, we use the function pd.Categorical again.
# Set the variable back to categorical.
X_train_p['sex'] = pd.Categorical(X_train_p['sex'])
X_train_p.info()
Multivariate imputation using K-nearest neighbours
K-nearest neighbours (KNN) imputation fills in the missing values of a predictor using the available values of that predictor together with the values of the other predictors.
For each missing value in a predictor \(X\):
Find the \(K\) most similar rows based on the other predictors.
Impute the missing value of \(X\) using the average of the available values of \(X\) in the \(K\) closest rows.
KNN captures local patterns better than individual predictor imputation using the mean or the median.
However, it only works for numeric predictors!
The distance between two rows is calculated using Euclidean distance.
KNN only considers rows without missing values in the predictor columns being used.
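A minimal toy sketch of the mechanism (arbitrary values): the missing entry in row 1 is filled with the average of the corresponding values in its two nearest complete rows.
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: row 1 is missing its first value.
X_toy = np.array([[1.0,    2.0, 3.0],
                  [np.nan, 2.1, 2.9],
                  [8.0,    9.0, 9.5],
                  [1.2,    2.2, 3.1]])

# Rows 0 and 3 are closest to row 1 (Euclidean distance on the observed
# columns), so the missing value becomes (1.0 + 1.2) / 2 = 1.1.
print(KNNImputer(n_neighbors = 2).fit_transform(X_toy))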
In Python
You can apply KNN imputation with the function KNNImputer(). Its main input is the number of nearest neighbours to use, which you set with the parameter n_neighbors.
# Set the KNN imputer using two neighbours.
KNN_imputer = KNNImputer(n_neighbors = 2)
# Select numeric predictors.
X_train_num_p = X_train_p.drop(columns = ['sex'])
# Apply imputer and store the result in a pandas data frame.
X_train_num_p = pd.DataFrame(KNN_imputer.fit_transform(X_train_num_p),
                             columns = X_train_num_p.columns)
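As noted earlier, the same preprocessing must also be applied to the validation data. A minimal sketch, assuming we reuse the KNN_imputer fitted on the training data via .transform() so that it is not refit on the validation set:
# Select the numeric predictors from the validation data.
X_valid_num_p = X_valid_p.drop(columns = ['sex'])
# Reuse the imputer fitted on the training data; do not refit it here.
X_valid_num_p = pd.DataFrame(KNN_imputer.transform(X_valid_num_p),
                             columns = X_valid_num_p.columns)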