One of the biggest challenges in working with categorical data is dealing with unknown categories. These are categories that are not present in the training set but may appear in the testing set. One way to handle this issue is to use scikit-learn's OneHotEncoder, a commonly used transformer for encoding categorical variables. It converts categorical variables into numerical data that machine learning algorithms can understand. In this thread, we will explore how OneHotEncoder can be used to handle unknown categories in Python.
If you want to learn in detail how to apply OneHotEncoder and other transformers to your columns, have a look at these two threads:
Loading a sample dataframe:
Before using the method, we first create a simple dataframe with random unique values in the training set and some values in the testing set that are not in the training set. This will help you visualize the results and see how OneHotEncoder deals with unknown categories in the testing set.
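As a minimal sketch of such a dataframe: the names train and test and the values A, B and C are assumptions chosen for illustration. Only the column name col and the unseen value D in the last testing sample come from the description in this thread.

```python
import pandas as pd

# Hypothetical sample data: the values A, B, C are illustrative assumptions.
train = pd.DataFrame({"col": ["A", "B", "C", "A", "B"]})

# The last testing sample, "D", never appears in the training set.
test = pd.DataFrame({"col": ["A", "C", "B", "D"]})

# Categories present in the testing set but missing from the training set.
unseen = set(test["col"]) - set(train["col"])
print(unseen)
```

Checking the set difference up front is an easy way to confirm which categories the encoder will treat as unknown.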
Applying "OneHotEncoder" with "handle_unknown" argument:
- After importing the OneHotEncoder class, we initialize an object with the handle_unknown parameter set to ignore, which tells the transformer to ignore values that are not present in the training set.
- The transformer is applied to the training set using fit_transform(), causing OneHotEncoder to learn the unique categories in col and then transform the column based on those learned categories.
- The testing set is only transformed, using transform(). We do not use fit_transform() here because we don't want the transformer to learn from the testing set.
- Finally, the results are converted into arrays using the toarray() function. In the encoding of the testing set, the last sample, whose col value D is new to the transformer, is ignored and encoded as 0 in every one-hot column.
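The steps above might look roughly like the following sketch. The dataframes are rebuilt here so the snippet runs on its own. Their names and the values A, B and C are assumptions, while col, D, handle_unknown, fit_transform(), transform() and toarray() come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data (assumed values); "D" appears only in the testing set.
train = pd.DataFrame({"col": ["A", "B", "C", "A", "B"]})
test = pd.DataFrame({"col": ["A", "C", "B", "D"]})

# handle_unknown="ignore" tells the encoder not to raise on unseen categories.
encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training set and encode it.
train_encoded = encoder.fit_transform(train[["col"]]).toarray()

# Only transform the testing set, so nothing is learned from it.
test_encoded = encoder.transform(test[["col"]]).toarray()

print(encoder.categories_)  # categories learned from the training set
print(test_encoded)         # the row for "D" contains only zeros
```

Note that the encoder is given train[["col"]] (a one-column dataframe) rather than train["col"], since OneHotEncoder expects two-dimensional input.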