How can you choose a classifier based on a training set data size?

Choosing a classification algorithm in supervised machine learning is guided by the bias-variance tradeoff, and the size of the training set shifts where the best tradeoff lies.

  • Bias is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting).
  • Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended signal (overfitting).
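A minimal sketch of the two failure modes, assuming scikit-learn (not mentioned in the original): on data with a nonlinear boundary and some label noise, a depth-1 tree underfits (high bias) while an unpruned tree memorizes the noise (high variance), visible as a large train-test accuracy gap.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(600, 2))
# Circular decision boundary plus ~10% label noise
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 4).astype(int)
flip = rng.rand(len(y)) < 0.1
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# High bias: a single axis-aligned split cannot capture a circle (underfits)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
# High variance: an unpruned tree fits the training set, noise included (overfits)
full = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

print("stump train/test:", stump.score(X_tr, y_tr), stump.score(X_te, y_te))
print("full  train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
```

The stump scores poorly on both splits (bias), while the unpruned tree scores perfectly on training data but drops on held-out data (variance).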

So, under otherwise similar conditions, the expected behavior of some common classification algorithms is:

| Algorithm | Bias | Variance |
| --- | --- | --- |
| Naive Bayes | High | Low |
| Logistic Regression | Low | High |
| Decision Tree | Low | High |
| Bagging | Low | High, but lower than a single decision tree |
| Random Forest | Low | High, but lower than a decision tree and bagging |

This is what makes training set size the deciding factor. With a small training set, a high-bias/low-variance classifier such as Naive Bayes tends to generalize better, because it has too little data to overfit anyway and its strong assumptions act as regularization. As the training set grows, low-bias/high-variance classifiers such as logistic regression or random forests tend to win: they can model richer relationships, and the extra data keeps their variance in check.
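The effect of training set size can be checked empirically. This sketch (assuming scikit-learn and a synthetic dataset, both my additions) trains Gaussian Naive Bayes and logistic regression on a small and a large sample and compares held-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem; first half for training pools,
# second half held out for evaluation
X, y = make_classification(n_samples=20000, n_features=20,
                           n_informative=10, random_state=0)
X_test, y_test = X[10000:], y[10000:]

# Small training set: high-bias Naive Bayes is hard to beat here
nb_small = GaussianNB().fit(X[:50], y[:50])
lr_small = LogisticRegression(max_iter=1000).fit(X[:50], y[:50])

# Large training set: the lower-bias model can now use the extra data
nb_large = GaussianNB().fit(X[:10000], y[:10000])
lr_large = LogisticRegression(max_iter=1000).fit(X[:10000], y[:10000])

for name, model in [("NB  n=50", nb_small), ("LR  n=50", lr_small),
                    ("NB  n=10k", nb_large), ("LR  n=10k", lr_large)]:
    print(name, round(model.score(X_test, y_test), 3))
```

The exact numbers depend on the dataset, but the typical pattern matches the table: Naive Bayes reaches its (biased) plateau quickly, while logistic regression keeps improving as the training set grows.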