ColumnTransformer is a preprocessing tool in the Scikit-learn library in Python that allows you to apply different transformations to different columns of a dataset. It can be used to apply different preprocessing steps, such as:
- Scaling or Imputation, which is used for numerical columns.
- One-hot encoding, which is used for categorical columns.
- Feature selection, to select the most relevant and important features.
The ColumnTransformer class takes a list of transformers, where each transformer is applied to a specific subset of the input columns. The transformers can be any of the transformers provided by Scikit-learn, such as StandardScaler, OneHotEncoder, or PCA. ColumnTransformer applies each transformer to its assigned columns and then concatenates the transformed outputs side by side into a single result.
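As a rough illustration of the idea above, here is a minimal sketch on a tiny toy dataframe. The column names (height, city), the values, and the transformer names ("scale", "onehot") are made up for this example and are not part of the thread's dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "city": ["Paris", "Rome", "Paris"],
})

# Each entry is a (name, transformer, columns) tuple.
toy_ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["height"]),   # numerical column
        ("onehot", OneHotEncoder(), ["city"]),     # categorical column
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# One scaled column plus one column per city category, concatenated.
toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)
```

The output columns follow the order of the transformers list: the scaled height first, then one column per category of city.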
In this thread, you’ll learn how to use ColumnTransformer and apply preprocessing to categorical and numerical columns in the dataset.
Loading a sample dataframe:
To illustrate how ColumnTransformer works, a simple sample dataframe is imported. Running the code below displays the dataframe along with the features selected for the transformation.
In this sample dataframe, the columns Embarked and Sex are categorical, and the columns Age and Fare are numerical.
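The original loading code is not reproduced here, so the following is only a stand-in sketch. It builds a small dataframe by hand with the same four column names; the values are invented, chosen so that Embarked has three unique values, Sex has two, and Age has one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sample dataframe: the values are invented,
# only the column names and their kinds follow the text.
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],               # categorical, 3 unique values
    "Sex": ["male", "female", "female", "male"],    # categorical, 2 unique values
    "Age": [22.0, 38.0, np.nan, 35.0],              # numerical, one missing value
    "Fare": [7.25, 71.28, 7.92, 8.05],              # numerical, left untouched later
})

print(df)
print(df.dtypes)
```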
Applying ColumnTransformer to selected columns:
We will use Scikit-learn’s OneHotEncoder, a transformer that converts categorical variables into a set of binary dummy variables, and SimpleImputer(), a transformer that fills in missing values in a dataset using one of several strategies.
- OneHotEncoder() and SimpleImputer() objects are initialized after importing the libraries and then passed to ColumnTransformer as named transformers.
- OneHotEncoder() is applied to the Embarked and Sex columns, and SimpleImputer() is applied to the Age column.
- The argument remainder = "passthrough" indicates that all other columns (Fare) are passed through unchanged.
- The results are finally printed after applying the ColumnTransformer to the data.
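The steps above can be sketched as follows. This is not the thread's original code: the dataframe values and the transformer names ("onehot", "impute") are assumptions made for the example, and sparse_threshold=0 is added only to force a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data with the column names used in the text.
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

ct = ColumnTransformer(
    transformers=[
        # Transformer names are illustrative, not taken from the thread.
        ("onehot", OneHotEncoder(), ["Embarked", "Sex"]),
        ("impute", SimpleImputer(), ["Age"]),  # default strategy is "mean"
    ],
    remainder="passthrough",  # Fare is passed through unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

result = ct.fit_transform(df)
print(result.shape)
print(result)
```

With this toy data the result has seven columns: three for Embarked, two for Sex, one for the imputed Age, and one for Fare.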
There are now more columns than before in the result:
- The first three columns come from the Embarked column, since it has 3 unique values.
- Similarly, the next two columns come from the Sex column, since it has 2 unique values.
- The Age column previously had a missing value, which is now imputed using the mean strategy, the default strategy of SimpleImputer().
- The Fare column is unchanged.