Handling Unknown Categories Using OneHotEncoder

One of the biggest challenges in working with categorical data is dealing with unknown categories: categories that are absent from the training set but may appear in the testing set. One way to handle this is scikit-learn's OneHotEncoder, a commonly used transformer for encoding categorical variables. It converts categorical variables into numerical data that machine learning algorithms can work with. In this thread, we will explore how OneHotEncoder can be used to handle unknown categories in Python.

If you want to learn in detail how to apply OneHotEncoder and other transformers to your columns, have a look at these two threads:

  1. Applying ColumnTransformer to dataframe columns.
  2. Selecting columns when transforming columns.

Loading a sample dataframe:

Before using the method, we first create a simple dataframe with random unique values in the training set and some values in the testing set that do not appear in the training set. This helps you visualize the results and see how OneHotEncoder deals with unknown categories in the testing set.
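A minimal sketch of such a dataframe is given here. The column name col and the unseen value D come from the description in this thread; the remaining values (A, B, C) and the variable names train and test are assumptions made only for illustration.

```python
import pandas as pd

# Training set: every category appears once (values A, B, C are assumed).
train = pd.DataFrame({"col": ["A", "B", "C"]})

# Testing set: the last sample, D, does not occur in the training set.
test = pd.DataFrame({"col": ["A", "C", "D"]})

# Categories that the encoder will never see during fitting.
unseen = set(test["col"]) - set(train["col"])
print(unseen)
```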

Applying "OneHotEncoder" with "handle_unknown" argument:

  • After importing the OneHotEncoder class, we initialize an object with the handle_unknown parameter set to ignore, which tells the transformer to ignore values that were not present in the training set. With the default value, error, the transformer raises an error when it meets an unknown category.
  • The transformer is applied to the training set using fit_transform(), so OneHotEncoder learns the unique categories in col and then transforms the column based on those learned categories.
  • The testing set is only transformed, using transform(). We do not use fit_transform() here because we don't want the transformer to learn from the testing set.
  • Finally, the results are converted into arrays using the toarray() method. In the encoding of the testing set, the last sample, whose col value is D, is new to the transformer, so it is ignored and encoded as a row of all zeros.
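The steps above can be sketched as follows. This is an illustrative version, not necessarily the exact code used in this thread: the sample values A, B, C and the names train, test, and encoder are assumptions, while col, D, handle_unknown, fit_transform(), transform(), and toarray() come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumed sample data: D appears only in the testing set.
train = pd.DataFrame({"col": ["A", "B", "C"]})
test = pd.DataFrame({"col": ["A", "C", "D"]})

# handle_unknown="ignore" makes unseen categories encode as all zeros
# instead of raising an error (the default behaviour is "error").
encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training set and encode it.
# The default output is a sparse matrix, hence toarray().
train_encoded = encoder.fit_transform(train[["col"]]).toarray()

# Only transform the testing set; the encoder must not learn from it.
test_encoded = encoder.transform(test[["col"]]).toarray()

print(encoder.categories_)  # categories learned from the training set
print(train_encoded)
print(test_encoded)         # the last row (D) contains only zeros
```

Because an ignored category becomes a row of zeros, the encoded testing set keeps the same number of columns as the training set, so a model trained on the encoded training data can still consume it.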