Choosing an appropriate Train-Test Split Size

What is the best possible way to split data into training and testing sets?
I have mostly seen people use an 80-20 split; however, in some neural network literature I have come across much larger training sets, with 90-10 or even 95-5 splits.
Does the split size vary from one case to another? Does it depend on the amount of data available, or on the amount of noise in the data?
I have also heard about a ‘validation’ set. Is it the same as the test set? If not, what’s the difference?

Using k-fold cross-validation is considered a good approach.

The general procedure is as follows (a minimal code sketch is given after the list):

  1. Shuffle the dataset randomly.
  2. Split the dataset into k groups.
  3. For each unique group:
    • Take the group as the hold-out (test) data set
    • Take the remaining groups as the training data set
    • Fit a model on the training set and evaluate it on the test set
    • Retain the evaluation score and discard the model
  4. Summarize the skill of the model using the sample of model evaluation scores.
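A minimal sketch of these steps, assuming scikit-learn is available (the dataset and classifier below are placeholders, not part of the procedure itself):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer      # placeholder dataset
from sklearn.linear_model import LogisticRegression  # placeholder model
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)

# Steps 1-2: shuffle and split the data into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Step 3: one group is the hold-out (test) set, the rest form the training set
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # retain the score, discard the model

# Step 4: summarize the skill of the model across the k evaluation scores
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

If you don’t need the per-fold details, `cross_val_score` wraps this loop in a single call.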

There is no single answer to how the data should be split; among other things, it depends on the number of instances available.

With a very small test set relative to the training set, your performance estimate becomes unreliable, and repeatedly tuning against it risks overfitting to the test data. If the overall data set has a very large number of instances, perhaps in the range of hundreds of thousands, then the exact split ratio won’t matter much. However, with a small data set of perhaps a few hundred rows, it is preferable to use cross-validation, since no single split will give you a reliable estimate.
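For a single split, the ratio is just a parameter. A minimal sketch using scikit-learn’s `train_test_split` (the dataset is a placeholder and the 0.2 test fraction is simply the conventional starting point, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer   # placeholder dataset
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data for testing; with hundreds of thousands of rows the
# exact fraction (0.2 vs 0.1) matters much less than with a few hundred rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
print(X_train.shape, X_test.shape)
```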

I wouldn’t lump the ‘validation’ set in with the test set. The validation set is a further split taken from your already partitioned training data. The point of a validation set is that you should not look at the test set at all while developing and evaluating your model. Looking at the test set and using it for evaluation at an early stage would give you a peek at its patterns, which is exactly what you want to avoid if the test set is to remain an unbiased check against overfitting.
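One common way to set this up (a sketch, assuming you have enough data for a single held-out test set) is to carve the validation set out of the training portion with a second split, leaving the test set untouched until the final evaluation:

```python
from sklearn.datasets import load_breast_cancer   # placeholder dataset
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split: put the test set aside and do not look at it again until the end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: carve a validation set out of the remaining training data.
# All tuning and early evaluation happens against X_val / y_val only.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 64% / 16% / 20% of the data
```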

In this problem there are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic has greater variance. Broadly speaking, you should divide the data so that neither variance is too high, and this has more to do with the absolute number of instances in each split than with the percentages.

If you have a total of 100 instances, you’re probably stuck with cross-validation, as no single split will give you sufficiently low variance in your estimates. If you have 100,000 instances, it doesn’t really matter whether you choose an 80:20 or a 90:10 split (indeed, you may choose to use less training data if your method is particularly computationally intensive).

Assuming you have enough data to use a proper held-out test set (rather than cross-validation), the following is an instructive way to get a handle on the variances (a code sketch follows the list):

  1. Split your data into training and testing sets (80/20 is indeed a good starting point).
  2. Split the training data into training and validation sets (again, 80/20 is a fair split).
  3. Subsample random selections of your training data, train the classifier on each, and record the performance on the validation set.
  4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe the performance on the validation data, then do the same with 40%, 60%, and 80%. You should see both higher performance with more data and lower variance across the different random samples.
  5. To get a handle on the variance due to the size of the test data, perform the same procedure in reverse: train on all of your training data, then randomly sample a percentage of your validation data a number of times and observe performance. You should find that the mean performance on small samples of your validation data is roughly the same as on all of the validation data, but that the variance is much higher with smaller numbers of test samples.
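A rough sketch of steps 3-5, again assuming scikit-learn; the fractions, the repeat count, and the classifier are illustrative choices rather than anything prescribed above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer      # placeholder dataset
from sklearn.linear_model import LogisticRegression  # placeholder classifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Steps 1-2: train/test split, then train/validation split (80/20 each).
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=0)

# Steps 3-4: vary the amount of training data, repeating each fraction 10 times.
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
        model = LogisticRegression(max_iter=5000).fit(X_train[idx], y_train[idx])
        scores.append(model.score(X_val, y_val))
    print(f"train frac {frac:.1f}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")

# Step 5: train once on all training data, then vary the amount of validation data.
full_model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_val), size=int(frac * len(X_val)), replace=False)
        scores.append(full_model.score(X_val[idx], y_val[idx]))
    print(f"val frac {frac:.1f}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```

With more training data the mean validation score should rise and its spread should shrink; with smaller validation subsamples the mean score stays roughly constant but its spread grows.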