What is the best possible way to split data into training and validation sets?

I have mostly seen people use an 80-20 split. However, while reading some neural networks literature, I have come across much larger training proportions, with splits of 90-10 or even 95-5.

Does the split size vary from one case to another? Does it depend on the amount of data available? Or on the amount of noise in the data?

I have also heard about a ‘validation’ set. Is it the same as the test set? If not, what’s the difference?

We know that we fit the model on the training data and then use it to make predictions on the test data. In doing so, the following can happen:

  • Over-fitting or under-fitting the model
    We want to avoid both, because each hurts the model’s ability to generalize to new data.
    A sensible train/test split, together with cross-validation, helps to avoid these problems.
  • Train/Test Split
    It is usually recommended to split the data into train/test ratios of 80/20 or 70/30 (see the sketch after this list).
  • Drawbacks
    A single split may not be representative. What if one subset of our data contains only people from a certain province/state, only employees at a certain income level, or only one gender? Such a skewed split encourages over-fitting, even though that is exactly what we are trying to avoid! This is where cross-validation comes to the rescue!
  • Cross-Validation
    It is very similar to a train/test split, but applied to more subsets. In K-fold cross-validation we split the data into k subsets (folds). We train on k-1 folds and evaluate on the remaining fold, and repeat this k times so that each fold serves as the held-out fold exactly once. We then average the scores across the folds to get a more reliable performance estimate before finalizing the model and testing it against the held-out test set (a sketch follows this list).
    Hope this helps!
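
For concreteness, here is a minimal sketch of the train/test split described above, assuming scikit-learn is available; the toy dataset and the 80/20 ratio are purely illustrative:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Toy data: 1000 samples, 5 features (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # Hold out 20% of the data for testing (the 80/20 split).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression().fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))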
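
And a minimal sketch of K-fold cross-validation, again assuming scikit-learn; cross_val_score fits on k-1 folds and scores on the held-out fold, once per fold:

    import numpy as np
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LogisticRegression

    # Same toy data as above (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # 5-fold CV: each fold serves as the held-out fold exactly once.
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())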

One thing you need to look out for in classification datasets is the relative number of examples for each class. If your dataset is almost perfectly balanced (a similar number of examples for each class), then random sampling for splitting will be fine. However, if the classes are imbalanced (fraud detection data, medical diagnosis datasets, etc.), then a simple random split may not produce representative train and test sets.

You should use stratified sampling to keep the class ratios intact. This can be combined with resampling strategies like bootstrap sampling to create representative samples even from smaller datasets.
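
A minimal sketch of a stratified split, assuming scikit-learn; passing the labels to the stratify parameter of train_test_split preserves the class ratio in both subsets (the roughly 10%-positive imbalance here is made up for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Imbalanced toy labels: roughly 10% positives (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (rng.random(1000) < 0.10).astype(int)

    # stratify=y keeps the ~90/10 class ratio in both train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    print("train positive rate:", y_train.mean())
    print("test positive rate:", y_test.mean())

For cross-validation on imbalanced data, StratifiedKFold plays the same role, keeping the class ratio intact within every fold.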