Hey, I’m a bit confused about which encoder to use in scikit-learn for categorical features in tree-based models. Can you explain the differences between `OrdinalEncoder` and `OneHotEncoder`, and when it’s best to use each one? It would be really helpful if you could also provide some code examples to illustrate how to use them.
When working with tree-based models in scikit-learn, it can be beneficial to use `OrdinalEncoder` instead of `OneHotEncoder` for encoding categorical features.
`OrdinalEncoder` is a transformer that encodes each categorical feature as a single column of integer codes. By default the codes follow the alphabetical order of the categories, but you can pass the `categories` parameter to impose a meaningful order if one exists. This is a good fit for tree-based models: a tree splits on thresholds over those integer codes, so it can usually find good splits even when the ordering is arbitrary, and the feature space stays as compact as the original data.
In contrast, `OneHotEncoder` creates a binary column for each category, resulting in a much wider feature space. Tree-based models tend to handle this less well: each one-hot column answers only a single yes/no question, so the tree needs more splits (and more depth) to use the same information, training is slower, and the extra dimensions make it easier to overfit.
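To make the size difference concrete, here is a tiny sketch (the `colors` column and its values are made up for illustration) showing the output each encoder produces for one categorical column with three distinct categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with three distinct categories.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

ordinal = OrdinalEncoder().fit_transform(colors)
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.shape)    # (4, 1): a single column of integer codes
print(ordinal.ravel())  # [2. 1. 0. 1.] -- categories sorted alphabetically by default
print(onehot.shape)     # (4, 3): one binary column per category
```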
Here’s an example of using `OrdinalEncoder` instead of `OneHotEncoder`:
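(A minimal sketch: the toy DataFrame, the column names, and the choice of `DecisionTreeClassifier` are assumptions made just to show the two encoders side by side in a pipeline.)

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy dataset with two categorical features and a binary target.
X = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"] * 10,
    "size": ["S", "M", "L", "M", "S", "L"] * 10,
})
y = [0, 1, 1, 0, 0, 1] * 10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinal encoding: one integer column per categorical feature.
# handle_unknown / unknown_value keep the pipeline from failing on categories
# that only appear in the test split.
ordinal_tree = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    DecisionTreeClassifier(random_state=0),
)
ordinal_tree.fit(X_train, y_train)
print("Ordinal train accuracy:", ordinal_tree.score(X_train, y_train))
print("Ordinal test accuracy: ", ordinal_tree.score(X_test, y_test))

# One-hot encoding: one binary column per category (a wider feature space).
onehot_tree = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
onehot_tree.fit(X_train, y_train)
print("One-hot train accuracy:", onehot_tree.score(X_train, y_train))
print("One-hot test accuracy: ", onehot_tree.score(X_test, y_test))
```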
Note that evaluating the tree on the training data may not be a good indicator of the model’s performance on new, unseen data. It’s important to evaluate the model on a separate testing set to get a more accurate estimate of its performance.