Common mistakes during preprocessing using Scikit-Learn in Python

Several mistakes come up repeatedly when preprocessing data with scikit-learn in Python. Here are some examples with corresponding code snippets, using the iris dataset:

1. Fitting transformers separately on train and test data:

It’s crucial to fit preprocessing transformers (e.g., StandardScaler, MinMaxScaler) only on the training data and then apply the same fitted transformation to the test data. Fitting a separate transformer on the test data scales the two sets with different statistics, so the model is evaluated on features that are not on the scale it was trained on. Fitting on the full dataset before splitting is the related mistake that leaks test-set information into training.

In the incorrect approach, fit_transform() is called on both sets. In the correct approach, call fit_transform() only on the training set, so the transformer learns its parameters from the training data alone, and then call transform() on the test set.
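The two approaches can be sketched as follows. This is a minimal illustration rather than a definitive recipe; the 80/20 split, the random_state value, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Assumed split settings, chosen only for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Incorrect: a second scaler is fitted on the test set, so train and test
# are scaled with different means and standard deviations.
wrong_train = StandardScaler().fit_transform(X_train)
wrong_test = StandardScaler().fit_transform(X_test)

# Correct: fit once on the training set, then reuse the fitted scaler.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The test set is scaled with the training statistics, so its column
# means are generally close to, but not exactly, zero.
print(np.round(X_test_scaled.mean(axis=0), 3))
```

With the correct approach, the test features are expressed relative to the training mean and standard deviation, which is the scale the model was trained on.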

2. Forgetting to handle missing values before preprocessing:

Handle missing values before applying any other preprocessing transformation. A common way to do this is to fill them in using a strategy such as the mean, the median, or a constant value. For example, first impute the missing values with SimpleImputer and then scale the features with StandardScaler.
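A minimal sketch of imputing first and scaling second is shown here. The iris dataset has no missing values, so the example blanks out roughly 10% of the entries at random; that step, the seed, and the mean strategy are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption for illustration: inject missing values into about 10% of
# the entries, since iris itself is complete.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.2, random_state=42
)

# Step 1: fill missing values with the column means of the training set.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Step 2: scale the imputed features, again fitting on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

print(np.isnan(X_train_scaled).sum(), np.isnan(X_test_scaled).sum())
```

Both the imputer and the scaler are fitted on the training set only, consistent with the first point above.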

3. Applying preprocessing before feature selection:

A common mistake is to apply preprocessing first and only then select the most significant features. Preprocessing is then applied to irrelevant features as well, which wastes computational resources. The correct approach is to apply feature selection first and then preprocess only the selected features.
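Selecting features first and scaling afterwards can be sketched as follows. The choice of SelectKBest with f_classif and k=2 is an assumption for illustration; the ANOVA F-test it uses does not depend on the scale of each feature, so running it on unscaled data is safe here. Selection methods that are sensitive to scale may need a different ordering.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: keep the k features with the highest ANOVA F-scores,
# learned from the training set only (k=2 is an arbitrary choice).
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Step 2: scale only the selected features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

# Boolean mask of the columns that were kept.
print(selector.get_support())
```

The scaler now only has to learn and apply statistics for the two retained columns instead of all four.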