It is the dataset you use to train your system. The system learns from the information it gets in the training set.
But in the end, you want to measure the system’s performance, and you cannot do this on the same training set: your system is optimized on that training set, so performance there may not reflect performance in a real-world scenario. What if your system learned spurious relations that hold only in the training set? This is called overfitting, and from the training set alone you cannot tell. You need a validation set to evaluate your system without this doubt.
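As a toy illustration of this (not part of the original answer), consider a 1-nearest-neighbour classifier on data with purely random labels: it memorizes the training set, so training accuracy is perfect, while validation accuracy on held-out points hovers around chance. All the names and data here are hypothetical.

```python
import random

rng = random.Random(42)

# Toy data: 1-D points with labels assigned purely at random,
# so there is no real pattern for a model to learn.
points = [(rng.random(), rng.randint(0, 1)) for _ in range(80)]
train, val = points[:60], points[60:]

def predict_1nn(x, memory):
    """1-nearest-neighbour: return the label of the closest stored point."""
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def accuracy(data, memory):
    return sum(predict_1nn(x, memory) == y for x, y in data) / len(data)

print(accuracy(train, train))  # 1.0 -- the model simply memorized the training set
print(accuracy(val, train))    # far lower -- no real pattern to generalize
```

The perfect training score says nothing about real performance; only the held-out score reveals the overfitting.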
So far, I have explained training and validation sets. Then what is the test set? Imagine that you trained your system and measured its performance on a validation set, but you didn’t like the result. So you tuned your system (hyperparameters, feature extraction methods, whatever) to get better performance, then measured performance on your validation set again to check whether it improved. You repeat this until you reach an optimal system for your task. Now you have superior performance on the validation set. Well done!
But there is a problem. You did all this tuning and optimization based on performance on your validation set. Maybe these tweaks work well for your validation set, but are they necessarily generalizable? Some may have worked just by chance. Those optimizations are also part of your training: when you tune your system based on validation performance, you are effectively using the validation set as a training set, so it is no longer valid for reporting performance. Therefore, after all this tuning, you need to report the performance of your final system on a completely unseen, independent test set.
To summarize: train your system on the training set, optimize everything (including feature extraction and the training procedure) by checking performance on the validation set, and finally, when you are done, report your system’s performance on an independent test set.
As a final note, if you do no tuning based on validation performance, then you do not need a test set; you can report performance directly on the validation set.
At the beginning of a project, a data scientist divides up all the examples into three subsets: the training set, the validation set, and the test set. Common ratios used are:
- 70% train, 15% val, 15% test
- 80% train, 10% val, 10% test
- 60% train, 20% val, 20% test
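A 70/15/15 split can be sketched as follows, assuming the examples sit in a plain Python list (the function name and fractions are illustrative, not from the original):

```python
import random

def train_val_test_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the examples and split them into train/val/test subsets."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the examples are ordered (e.g. by class or by date), a naive slice would put systematically different examples in each subset.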
The steps are as follows:
- Randomly initialize each model
- Train each model on the training set
- Evaluate each trained model’s performance on the validation set
- Choose the model with the best validation set performance
- Evaluate this chosen model on the test set.
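The steps above can be sketched end to end. As a stand-in for "each model", this example fits polynomials of different degrees to a toy regression task (the degrees, data, and helper names are all hypothetical choices for illustration): each candidate is trained on the training set, scored on the validation set, and only the winner ever touches the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y = sin(x) plus noise.
x = rng.uniform(-3, 3, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)

# 70/15/15 split of the example indices.
idx = rng.permutation(300)
tr, va, te = idx[:210], idx[210:255], idx[255:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given examples."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Candidate "models": polynomials of different degrees.
candidates = {}
for degree in [1, 3, 5, 9]:
    coeffs = np.polyfit(x[tr], y[tr], degree)                  # train on the training set
    candidates[degree] = (coeffs, mse(coeffs, x[va], y[va]))   # score on the validation set

# Choose the model with the best (lowest) validation error.
best_degree = min(candidates, key=lambda d: candidates[d][1])
best_coeffs, best_val_mse = candidates[best_degree]

# Only the final, chosen model is evaluated on the test set.
print(best_degree, round(mse(best_coeffs, x[te], y[te]), 3))
```

Note that the test error is computed exactly once, for the chosen model; evaluating every candidate on the test set would turn it into a second validation set.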