This thread covers the different ways you can select columns when using the `ColumnTransformer` preprocessing class.
There are several techniques for this, but first, if you want to get familiar with what `ColumnTransformer` does
and how it is applied to categorical and numerical columns, go through the thread Applying ColumnTransformer to dataframe columns.
Loading a sample dataframe:
- Before exploring the methods, we first load a sample dataframe using a URL and select the features to which we will apply the `ColumnTransformer`.
- You can view the output of the code below to get an idea of what the features look like.
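The original dataset URL is not reproduced here, so the snippet below is only a stand-in: it builds a small dataframe in memory with hypothetical column names (`Sex`, `Embarked`, `Age`, `Fare`) and made-up values, chosen so that two columns are categorical and two are numerical.

```python
import pandas as pd

# Hypothetical stand-in for the thread's dataset, which is loaded from a URL
# that is not reproduced here. Column names and values are illustrative only.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Feature matrix that the ColumnTransformer examples operate on.
X = df[["Sex", "Embarked", "Age", "Fare"]]

print(X.head())
print(X.dtypes)
```

The first two columns are the categorical ones, which is the layout the later selection examples assume.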
1. Selecting using column names:
- The simplest way to select columns when applying a transformer is to pass a list of the column names you want to transform. This requires the input to be a pandas dataframe.
- This method works well when you want to apply a specific transformer to a small number of columns.
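As a minimal sketch of selecting by name, assuming the hypothetical sample columns `Sex`, `Embarked`, `Age` and `Fare` (illustrative values, not the thread's real data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Select the columns to encode by passing their names as a list.
ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",  # keep Age and Fare unchanged
)

Xt = ct.fit_transform(X)
print(Xt.shape)  # one-hot columns for Sex and Embarked, plus Age and Fare
```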
2. Selecting using column index positions:
- This is also an easy method: you pass a list of the integer index positions of the columns in your feature array.
- Like column names, it is convenient when you want to apply transformers to a small number of columns.
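A sketch of the same selection by position, again on hypothetical sample data. Positions `0` and `1` are assumed to be the categorical columns; integer positions also work when the input is a NumPy array rather than a dataframe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Select columns by integer position: 0 is Sex, 1 is Embarked.
ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), [0, 1])],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```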
3. Selecting using "slice()" function:
- In Python, `slice()` is a built-in function that creates a slice object, which represents a range of indices that can be used to slice a sequence such as a list, tuple, or string.
- In this method, the `slice(0, 2)` object specifies the range of columns from index 0 (inclusive) up to index 2 (exclusive), which means it selects the first two columns of the input data `X`.
- This method is useful if you want to apply a transformer to columns adjacent to each other, or to a certain range of columns in your array.
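The sketch below first shows what `slice(0, 2)` does on a plain list, then passes the same slice object to `ColumnTransformer`. The sample data and the name `column_names` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# A slice object behaves like the start:stop notation inside brackets.
column_names = list(X.columns)
first_two = column_names[slice(0, 2)]  # same as column_names[0:2]
print(first_two)

# The same slice selects the first two columns for the transformer.
ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), slice(0, 2))],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```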
4. Selecting using boolean mask:
- This method involves creating a list of boolean values (`True`/`False`), where `True` is placed at the index positions of the columns you want to include and `False` at the positions of the columns you don't want to include.
- This method becomes tedious if you are applying a transformer to a large number of columns, because you must specify a boolean value for every column in your array.
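A sketch of the boolean-mask approach on hypothetical sample data. The mask must have one entry per column, in column order; the variable name `mask` is only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# One boolean per column: True for Sex and Embarked, False for Age and Fare.
mask = [True, True, False, False]

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), mask)],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```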
5. Selecting using a pattern:
- In the code below, `make_column_selector()` is a function that creates a callable selector for the columns of a dataframe. It allows you to select columns based on a pattern that matches the column names.
- The `pattern` argument of `make_column_selector()` specifies the pattern to match against the column names. The pattern is a string that can contain regular expression syntax.
- In the given code, the pattern is `'E|S'`, which matches any column name that contains either the uppercase letter `E` or the uppercase letter `S` (the match is case-sensitive).
- This method is useful if there are a large number of columns in your array and you want to apply different transformers to different columns, since you can specify a different pattern for each group of columns.
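A sketch of pattern-based selection, assuming hypothetical column names where only `Sex` and `Embarked` contain an uppercase `E` or `S`. Calling the selector directly on the dataframe is a handy way to check which columns a pattern picks up before using it in a transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Regex match on column names: any name containing an uppercase E or S.
selector = make_column_selector(pattern="E|S")
selected = selector(X)  # the selector is a callable that returns column names
print(selected)

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), selector)],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```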
6. Selecting using "dtype_include":
- In the given code, the same `make_column_selector()` function is used to create a selector object for columns.
- But in this example, the `dtype_include` argument is used to select columns that have a specific data type.
- Since `OneHotEncoder()` is used for categorical columns, `dtype_include=object` is passed as an argument to select only columns with a dtype of `object`. The `object` dtype in pandas includes string data and other Python objects.
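A sketch of dtype-based selection on hypothetical sample data. The string columns are created with an explicit `object` dtype here, as an assumption, so the example does not depend on how a given pandas version infers the dtype of string data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
# The categorical columns are given the object dtype explicitly.
X = pd.DataFrame(
    {
        "Sex": pd.Series(["male", "female", "female", "male"], dtype=object),
        "Embarked": pd.Series(["S", "C", "Q", "S"], dtype=object),
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Select only the columns whose dtype is object.
selector = make_column_selector(dtype_include=object)
selected = selector(X)
print(selected)

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), selector)],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```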
7. Selecting using "dtype_exclude":
- The `make_column_selector()` function also has a `dtype_exclude` argument that is used to exclude columns that have a specific data type.
- In this case, `dtype_exclude='number'` is passed as an argument to exclude all columns with numeric data types such as `float64`, `int64`, etc.
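A sketch of exclusion by dtype on hypothetical sample data: every column that is not numeric is selected, which here leaves the two categorical columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Exclude every numeric column; whatever remains is selected.
selector = make_column_selector(dtype_exclude="number")
selected = selector(X)
print(selected)

ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), selector)],
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt.shape)
```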
The last two methods are extremely useful if you want to apply a certain transformer to all categorical columns and apply a different transformer to all numerical columns in the features array.
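As a sketch of that combination, the snippet below routes non-numeric columns to `OneHotEncoder` and numeric columns to `StandardScaler`. The choice of `StandardScaler` and the sample data are assumptions made for illustration; any numerical transformer could take its place.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample data; column names and values are assumptions.
X = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "Q", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# One selector per column group, so no column names are hard-coded.
categorical = make_column_selector(dtype_exclude="number")
numerical = make_column_selector(dtype_include="number")

ct = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(), categorical),  # all non-numeric columns
        ("scale", StandardScaler(), numerical),  # all numeric columns
    ]
)

Xt = ct.fit_transform(X)
print(categorical(X), numerical(X))
print(Xt.shape)
```

Because the selectors are evaluated when the transformer is fitted, the same `ColumnTransformer` keeps working if categorical or numerical columns are added to the dataframe later.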