Selecting Columns When Transforming Columns

This thread covers different ways to select columns when using ColumnTransformer, the scikit-learn class that applies transformers to subsets of columns. If you are not yet familiar with what ColumnTransformer does and how it is applied to categorical and numerical columns, first go through the thread Applying ColumnTransformer to dataframe columns.

Loading a sample dataframe:

  • Before exploring the methods, we first load a sample dataframe from a URL and select the features to which we will apply ColumnTransformer.
  • You can view the output of the code below to get an idea of what the features look like.
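
As a minimal stand-in for the loading step, the sketch below builds a small dataframe in memory instead of reading one from a URL. The column names (Sex, Embarked, Age, Fare) and all values are hypothetical and chosen only so that the later selection examples have something to work on.

```python
import pandas as pd

# Hypothetical sample data; the thread itself loads a dataframe from a URL.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Select the features that the transformers will be applied to.
X = df[["Sex", "Embarked", "Age", "Fare"]]

print(X.head())
print(X.dtypes)
```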

1. Selecting using column names:

  • The simplest way to select columns for a transformer is to pass a list of the column names you want it applied to. This works when the input is a pandas dataframe.
  • This method is convenient if you want to apply a specific transformer to a small number of columns.
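
A minimal sketch of selecting by name, assuming a hypothetical dataframe with the columns Sex, Embarked, Age and Fare. OneHotEncoder and remainder="passthrough" are illustrative choices, not necessarily what the original code used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# One-hot encode the two named columns; pass the other columns through unchanged.
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# 2 categories for Sex + 3 for Embarked + 2 passthrough columns = 7 columns.
print(Xt.shape)
```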

2. Selecting using column index positions:

  • Here you pass a list of the integer positions of the columns in your feature array instead of their names.
  • This method is also convenient if you want to apply transformers to a small number of columns.
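
The same selection can be sketched with positions instead of names. The data below is hypothetical; positions 0 and 1 correspond to its first two columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Integers are interpreted as column positions: 0 -> Sex, 1 -> Embarked.
ct = ColumnTransformer([("ohe", OneHotEncoder(), [0, 1])])
Xt = ct.fit_transform(X)

# Only the encoded columns are kept because remainder defaults to "drop".
print(Xt.shape)
```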

3. Selecting using "slice()" function:

  • In Python, slice() is a built-in that creates a slice object, which represents a range of indices that can be used to slice a sequence such as a list, tuple, or string.
  • In this method, the slice(0,2) object specifies the range of columns from the 0th index (inclusive) up to the 2nd index (exclusive), which means it selects the first two columns of the input data X.
  • This method is useful if you want to apply a transformer to columns adjacent to each other or if you want to apply the transformer to a certain range of columns in your array.
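
A short sketch of the slice(0, 2) selection described above, again on hypothetical data where the first two columns happen to be the categorical ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# slice(0, 2) selects positions 0 and 1 (the stop index is exclusive).
ct = ColumnTransformer([("ohe", OneHotEncoder(), slice(0, 2))])
Xt = ct.fit_transform(X)

print(Xt.shape)
```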

4. Selecting using boolean mask:

  • This method involves creating a list of boolean values (True / False) with one entry per column: True at the positions of the columns you want to include and False at the positions of the columns you want to leave out.
  • This method becomes tedious when your array has a large number of columns, because you must specify a boolean value for every column.
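
A boolean mask version might look like the sketch below. The data is hypothetical, and the mask has exactly one entry for each of its four columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# One boolean per column: include Sex and Embarked, leave out Age and Fare.
mask = [True, True, False, False]

ct = ColumnTransformer([("ohe", OneHotEncoder(), mask)])
Xt = ct.fit_transform(X)

print(Xt.shape)
```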

5. Selecting using a pattern:

  • In the code below, make_column_selector() is a function that creates a selector object, which picks columns from a dataframe when ColumnTransformer is fitted. It allows the selection of columns based on a pattern that matches the column names.
  • The pattern argument of make_column_selector() specifies the pattern to match against the column names. The pattern is a string that can contain regular expression syntax, and a column is selected if the pattern is found anywhere in its name.
  • In the given code, the pattern is 'E|S', which matches any column name that contains either an uppercase E or an uppercase S (the match is case-sensitive).
  • This method is useful if there are a large number of columns in your array and you want to apply different transformers to different columns, because you can specify a different pattern for each group of columns.
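
To see which columns a pattern picks up, the selector can be called directly on a dataframe. The sketch below uses hypothetical column names; with them, 'E|S' matches Sex and Embarked but not Age or Fare, because the lowercase letters do not count.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Select every column whose name contains an uppercase E or S.
selector = make_column_selector(pattern="E|S")
selected = selector(X)  # calling the selector returns the matching column names
print(selected)

ct = ColumnTransformer([("ohe", OneHotEncoder(), selector)])
Xt = ct.fit_transform(X)
print(Xt.shape)
```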

6. Selecting using "dtype_include":

  • In the given code, the same make_column_selector() function is used to create a selector object for columns.
  • But in this example, the dtype_include argument is used to select columns that have a specific data type.
  • Since OneHotEncoder() is used for categorical columns, dtype_include=object is passed as an argument to only select columns with a dtype of object. The object dtype in pandas includes string data and other Python objects.
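
A sketch of dtype-based selection on hypothetical data. The string columns are created with an explicit object dtype here, as an assumption to keep the example independent of how a given pandas version stores strings by default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; string columns are forced to dtype object.
X = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", "S", "Q"], dtype=object),
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Select only the columns whose dtype is object.
selector = make_column_selector(dtype_include=object)
selected = selector(X)
print(selected)

ct = ColumnTransformer([("ohe", OneHotEncoder(), selector)])
Xt = ct.fit_transform(X)
print(Xt.shape)
```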

7. Selecting using "dtype_exclude":

  • The make_column_selector() function also has a dtype_exclude argument that is used to exclude columns that have a specific data type.
  • In this case, dtype_exclude='number' is passed as an argument to exclude all columns with a numeric dtype, such as float64 or int64, so that only the non-numeric columns are selected.
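
The exclusion variant can be sketched the same way. On the hypothetical data below, excluding numeric dtypes leaves the two text columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Exclude every numeric column; whatever remains is selected.
selector = make_column_selector(dtype_exclude="number")
selected = selector(X)
print(selected)

ct = ColumnTransformer([("ohe", OneHotEncoder(), selector)])
Xt = ct.fit_transform(X)
print(Xt.shape)
```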

The last two methods are extremely useful if you want to apply a certain transformer to all categorical columns and apply a different transformer to all numerical columns in the features array.
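
As a rough illustration of that split, the sketch below sends non-numeric columns to OneHotEncoder and numeric columns to StandardScaler. The data is hypothetical and StandardScaler is only one possible choice of numerical transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample data.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

ct = ColumnTransformer([
    # Categorical branch: every column that is not numeric.
    ("cat", OneHotEncoder(), make_column_selector(dtype_exclude="number")),
    # Numerical branch: every column that is numeric.
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
])
Xt = ct.fit_transform(X)

# 5 one-hot columns followed by 2 scaled numeric columns.
print(Xt.shape)
```

Because the selectors are evaluated when the transformer is fitted, the same ColumnTransformer keeps working if columns are added or reordered, as long as their dtypes are what you expect.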