Common pitfalls to avoid when working with tree-based models in scikit-learn (Python)

When working with tree-based models like decision trees and random forests in Python, a few mistakes come up again and again. This thread covers some of them, with example code for better understanding.

1. Overfitting the model:

  • Overfitting occurs when the model learns the training data too well and performs poorly on unseen data.
  • One common mistake is leaving tree growth unconstrained, for example by not limiting the tree's depth (max_depth) or the minimum leaf size (min_samples_leaf). In a random forest, adding more estimators does not usually cause overfitting by itself; the depth of the individual trees matters more.
  • For example, compare two DecisionTreeClassifier models: one with no max_depth limit, which can keep splitting until it memorizes the training set, and one constrained to a depth of 3 (max_depth=3).
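
Here is a minimal sketch of that comparison. The dataset is synthetic and the noise level, split size, and random_state values are my own assumptions for illustration, not part of the original example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (an assumption for illustration):
# flip_y=0.2 assigns a random class to about 20% of the samples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pitfall: no depth limit, so the tree keeps splitting until its leaves are pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained alternative: at most 3 levels of splits.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", deep_tree), ("max_depth=3", shallow_tree)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:>13}: depth={model.get_depth()}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

The thing to look at is the gap between train and test accuracy for each model: a large gap on the unconstrained tree is the usual sign of overfitting. max_depth=3 is just the value used in this thread; in practice you would pick it by cross-validation.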

2. Using default hyperparameters without tuning:

  • Tree-based models have several hyperparameters that can significantly affect their performance.
  • One common mistake is to keep the default values without any tuning, even though they may not suit the given dataset.
  • For example, a RandomForestRegressor can first be fit with default hyperparameters as a baseline. The better approach is to tune the hyperparameters, for instance with a cross-validated search, and then compare the tuned model against that baseline.
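
A rough sketch of both approaches follows. The synthetic data, the small parameter grid, and the choice of GridSearchCV with 3-fold cross-validation are assumptions on my part; the thread does not say which tuning method or values were used:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(
    n_samples=500, n_features=10, n_informative=5,
    noise=20.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: default hyperparameters, no tuning.
baseline = RandomForestRegressor(random_state=0).fit(X_train, y_train)
baseline_r2 = baseline.score(X_test, y_test)

# Tuned: a small, hypothetical grid searched with 3-fold cross-validation.
param_grid = {
    "max_depth": [None, 8],
    "min_samples_leaf": [1, 5],
    "max_features": [1.0, 0.5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters.
tuned_r2 = search.best_estimator_.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"baseline R^2={baseline_r2:.3f}, tuned R^2={tuned_r2:.3f}")
```

Tuning is not guaranteed to beat the defaults on every dataset, so always compare against the baseline on held-out data. For larger grids, RandomizedSearchCV is a cheaper alternative to an exhaustive search.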