I am facing the problem of overfitting in Random Forest. I have tried tweaking the following parameters:
max_depth, min_samples_leaf, min_samples_split.
The train set has 2000 samples and the validation set has 500. I have tried setting min_samples_leaf and min_samples_split to values between 10 and 50, and even setting max_depth to 1. None of this helps.
For parameters like max_features and n_estimators, I think they are adequate and don’t have a large influence on the overfitting. What else can I do to reduce the overfitting in this case?
I suspect overfitting because the AUC score is 0.8 on the training set and 0.5 on the validation set.
Here are a few things you could try (ideally in this order):
- Are the classes in your data imbalanced? If they are, your train and val sets could have different distributions. This, rather than overfitting, could be the cause of the different AUC scores.
- Verify that there is indeed overfitting by using other metrics in addition to the AUC score. If the data is imbalanced then try metrics which are insensitive to this (this article might help).
- Don’t ignore parameters like max_features and n_estimators if you don’t have a proper reason for it.
- Set up proper cross-validation in whichever library you are using to find the optimal values for all the parameters together.
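The steps above could look roughly like this in scikit-learn. This is only a sketch under assumptions: the data is a synthetic stand-in generated with make_classification (about 10% positives, 2000 train and 500 validation rows to mirror the question), the grid values are arbitrary examples, and average precision is just one possible second metric to read next to AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Assumption: synthetic, imbalanced stand-in for the real data.
X, y = make_classification(n_samples=2500, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Stratified split: train and validation keep (almost) the same class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=500, stratify=y, random_state=0)
print("positive rate train/val:", y_train.mean(), y_val.mean())

# Search all parameters together, including max_features and n_estimators.
# The candidate values are illustrative, not recommendations.
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 20],
    "max_features": ["sqrt", 0.5],
    "n_estimators": [25, 100],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    return_train_score=True,
)
search.fit(X_train, y_train)

# Compare train-fold and held-out-fold AUC for the best setting;
# a large gap between the two points to overfitting.
best = search.best_index_
cv_train_auc = search.cv_results_["mean_train_score"][best]
cv_val_auc = search.best_score_
print("best params:", search.best_params_)
print("CV AUC train/held-out:", cv_train_auc, cv_val_auc)

# Score the untouched validation set with more than one metric.
val_scores = search.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
val_ap = average_precision_score(y_val, val_scores)
print("validation AUC:", val_auc, "average precision:", val_ap)
```

If the cross-validated AUC and the validation AUC agree but both sit well below the training AUC, the gap is more likely real overfitting; if they disagree with each other, the split itself deserves a closer look first.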