Encoding Categorical Features Using Sklearn

Most machine learning algorithms require input data in numerical format, yet real-world datasets often contain categorical features. Knowing the different techniques for handling them is essential for getting good model results.

One common way to encode categorical features into a numerical format is with scikit-learn, a popular Python library for machine learning. Scikit-learn provides several classes for encoding categorical features, including OneHotEncoder, OrdinalEncoder, and LabelEncoder, each with its own advantages and limitations. This thread focuses on some of the most common and efficient techniques, discussed below.

Before we explore the different functions you can use to encode categorical columns, we first load a simple sample dataframe. We select the Embarked column of this dataframe to be encoded by every function. The different encodings are applied to the same column, so you get a clear idea of what each function does by comparing the results.
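The original dataframe is not shown in this thread, so here is a minimal stand-in. The column names mirror the Titanic dataset, but the rows below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical sample dataframe; "Embarked" holds the port of
# embarkation (S, C, or Q), the column we will encode throughout.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Embarked": ["S", "C", "Q", "S", "C"],
})
print(df)
```

Any small dataframe with one categorical column works equally well for following along.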

1. Using "OneHotEncoder()" method:

  • OneHotEncoder is used to transform categorical features into a numerical format that can be used as input for machine learning algorithms. It creates a binary vector representation for each unique category in the original feature.
  • In the example code below, there are 3 unique values in the column Embarked, so the OneHotEncoder class creates 3 binary columns, one per value; a 1 indicates which category is present in each data sample.
  • The toarray() method simply converts the sparse matrix returned by fit_transform() into a dense 2D array that can be easily interpreted.
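A minimal sketch of the steps above, using the same made-up Embarked values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# OneHotEncoder expects a 2D input, hence df[["Embarked"]],
# and returns a sparse matrix by default.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Embarked"]])

# toarray() turns the sparse result into a dense 2D array.
dense = encoded.toarray()
print(encoder.categories_)  # learned categories, sorted alphabetically
print(dense)
```

Each row of `dense` has exactly one 1, in the column matching that row's category ('C', 'Q', or 'S').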

2. Using "OrdinalEncoder()" method:

  • The OrdinalEncoder class converts each unique category in a categorical feature into an integer. By default the categories are sorted alphabetically, though you can supply an explicit order through the categories parameter. Unlike OneHotEncoder, it does not create a separate binary column for each unique value.
  • The advantage of OrdinalEncoder is that it can preserve the ordinal relationship between categories when you supply their natural order, which is useful for features such as sizes or ratings. Additionally, OrdinalEncoder is computationally efficient and can handle large datasets with many categories.
  • However, it is important to note that the numerical values assigned to each category may not be meaningful in all cases, and may not be appropriate for certain machine-learning algorithms.
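A short sketch of OrdinalEncoder on the same made-up column. Note the single integer column it produces, in contrast to the binary columns of OneHotEncoder:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# Default ordering is alphabetical: C -> 0, Q -> 1, S -> 2.
# To impose a meaningful order instead, pass it explicitly, e.g.
# OrdinalEncoder(categories=[["C", "Q", "S"]]).
encoder = OrdinalEncoder()
df["Embarked_ordinal"] = encoder.fit_transform(df[["Embarked"]])
print(df)
```

The result replaces each category with a float code in a single column, which is what makes the arbitrary 0 < 1 < 2 ordering a risk for models that treat the numbers as magnitudes.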

3. Using "LabelEncoder()" method:

  • LabelEncoder is used to transform categorical labels into numerical labels. It assigns a unique integer value to each category in the input feature.
  • It differs from OrdinalEncoder in that it operates on a single 1D array at a time, not a 2D array of features; it is primarily intended for encoding the target variable (y) rather than input features.
  • The advantage of LabelEncoder is that it is a simple and computationally efficient method for encoding categorical labels. Note, however, that it assigns integers to categories in alphabetical order, so the resulting codes carry no inherent meaning unless that order happens to match the categories' natural order.
  • However, it is important to note that LabelEncoder should be used only when the categorical variable is ordinal or when there are only two categories. In cases where there are more than two categories, it is generally better to use OneHotEncoder or OrdinalEncoder to avoid assigning any arbitrary order to the categories.
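A minimal sketch of LabelEncoder on the same made-up column. The key difference from the previous examples is the 1D input:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# LabelEncoder expects a 1D array, so we pass the Series df["Embarked"]
# directly rather than the 2D selection df[["Embarked"]].
encoder = LabelEncoder()
df["Embarked_label"] = encoder.fit_transform(df["Embarked"])
print(encoder.classes_)  # classes sorted alphabetically
print(df)
```

The integer codes here match OrdinalEncoder's default output; the difference is the 1D interface and the intended use on target labels.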

If you want to learn how to apply all these classes and transformers much more efficiently on multiple columns at once, see the threads on:

  1. Applying ColumnTransformer to dataframe columns.
  2. Selecting columns when transforming columns.