Common pitfalls to avoid when working with tree-based models in Scikit-Learn

mubashir_rizvi · June 1, 2023, 1:55pm

When working with tree-based models such as decision trees and random forests in Python, a few mistakes come up again and again. This thread covers some of them, with example code for each.

1. Overfitting the model:

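Overfitting occurs when the model fits the training data too closely, noise included, and therefore performs poorly on unseen data.
One common mistake is not constraining how far the trees can grow, for example through max_depth or min_samples_leaf. In a Random Forest, adding more estimators does not usually cause overfitting by itself; the depth of the individual trees matters more.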
For example, compare two DecisionTreeClassifier models: one created without a max_depth, which can keep splitting until it memorizes the training set, and one limited to a depth of 3 (max_depth=3).
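
Here is a minimal sketch of that comparison. The synthetic dataset from make_classification, the split size, and the variable names are my own assumptions for illustration, so the exact scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (illustrative assumption, not a real dataset)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pitfall: no max_depth, so the tree grows until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained: the tree is limited to three levels of splits
constrained = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", unconstrained), ("max_depth=3", constrained)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"train acc={model.score(X_train, y_train):.3f}, "
        f"test acc={model.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting. max_depth=3 is just the value used in this thread; the right depth depends on the dataset and is best chosen with validation data.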

2. Using default hyperparameters without tuning:

Tree-based models have several hyperparameters, such as max_depth, min_samples_leaf, max_features and n_estimators, that can significantly affect their performance.
One common mistake is not tuning these hyperparameters and using the default values, which may not be optimal for the given dataset.
For example, a RandomForestRegressor can first be fit with its default hyperparameters and then compared with one whose hyperparameters have been tuned for the dataset, which is the better approach.
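
Here is a minimal sketch of the two approaches. I am assuming GridSearchCV as the tuning method (RandomizedSearchCV would work too), and the synthetic data and the small parameter grid are only illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (illustrative assumption)
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pitfall: relying on the defaults without checking alternatives
default_model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Better: search a small grid with cross-validation on the training set only
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

print("default test R^2:", default_model.score(X_test, y_test))
print("best params:", search.best_params_)
print("tuned test R^2:", search.best_estimator_.score(X_test, y_test))
```

Tuning is not guaranteed to beat the defaults on every dataset, but the search at least shows how sensitive the model is to these settings. Keep the test set out of the search so the final score stays an honest estimate.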