When exploring the handwritten digits dataset, which ships with scikit-learn, there are several common mistakes that people make. Here are some of those mistakes and how to avoid them:
1. Not splitting the dataset into training and testing sets:
A common mistake is to skip the train/test split and then train and evaluate on the complete dataset. The resulting score is overly optimistic because it hides overfitting: the model is graded on samples it has already seen. The correct approach is to hold out a test set that the model never sees during training and to report performance on that set only.
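The split described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the 80/20 ratio, the `random_state` value, and the choice of `LogisticRegression` as the classifier are all assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold out 20% of the samples (assumed ratio); stratify keeps the
# class proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Any classifier would do here; logistic regression is just an example.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# The training score is usually higher than the held-out score.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The gap between the two printed numbers is the part that evaluating on the full dataset would hide.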
2. Not normalizing or scaling the feature values:
Scaling puts the feature values on a similar scale, which can improve the performance of many machine learning algorithms. scikit-learn's MinMaxScaler rescales each feature of the digits dataset to a specified range, [0, 1] by default.
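A minimal sketch of min-max scaling on the digits features might look like this. Fitting the scaler on the training split only, and the split parameters themselves, are assumptions of the example; they are there so that no information from the test set leaks into preprocessing.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)  # raw pixel intensities lie in 0..16
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the per-feature min and max from the training data only,
# then apply the same transformation to the test data.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("raw range:   ", X_train.min(), X_train.max())
print("scaled range:", X_train_scaled.min(), X_train_scaled.max())
```

Because the scaler is fitted on the training split, scaled test values can occasionally fall slightly outside [0, 1]; that is expected.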
3. Not considering class imbalance:
When class distributions are imbalanced, one class can dominate the others and bias the model toward it, so the imbalance needs to be addressed. An example of class imbalance would be a dataset with 90% of its samples in class A and 10% in class B.
One remedy is to oversample the minority class, which increases its sample count until the distributions are balanced.
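The oversampling idea can be sketched with `sklearn.utils.resample`. The digits dataset itself is roughly balanced across its ten classes, so this example assumes a hypothetical binary task ("is the digit a zero?") to obtain a split of about 10% versus 90%. Resampling only the training split is also an assumption of the sketch, made so that duplicated samples never reach the test set.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = load_digits(return_X_y=True)

# Hypothetical binary task: 1 for the digit zero, 0 for everything else.
y_binary = (y == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, stratify=y_binary, random_state=42
)

X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]

# Draw minority rows with replacement until they match the majority count.
X_minority_upsampled = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=42
)

X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_upsampled), dtype=int),
])

print("class counts before:", np.bincount(y_train))
print("class counts after: ", np.bincount(y_balanced))
```

Random oversampling only duplicates existing rows; other techniques exist, but this is the simplest instance of the approach described above.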
4. Not visualizing the data:
Visualizing the data provides insight into the characteristics of the digits dataset and helps in understanding the patterns it contains. The matplotlib library can render the handwritten digits as images to show what they look like.
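One way to look at the digits is to plot the first few 8x8 images with their labels. The grid size, the colormap, and the output filename `digits_grid.png` are arbitrary choices made for this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()  # digits.images holds the samples as 8x8 arrays

# Show the first ten samples in a 2x5 grid, each titled with its label.
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap="gray_r")
    ax.set_title(f"label: {label}")
    ax.axis("off")

fig.tight_layout()
fig.savefig("digits_grid.png")  # assumed filename; use plt.show() interactively
```

Plotting samples this way makes it easy to spot digits that are ambiguous even to a human reader.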
5. Not using appropriate evaluation metrics:
Focusing on a single metric is another common mistake when working with the digits dataset. To avoid it, compute several metrics and compare what each one reports.
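Comparing several metrics could be sketched as below. The classifier, the split parameters, and the particular metrics chosen (accuracy, macro-averaged F1, a confusion matrix, and a per-class report) are assumptions for illustration; other metrics may suit a given task better.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy is a single summary number.
accuracy = accuracy_score(y_test, y_pred)
# Macro F1 averages the per-class F1 scores, weighting every class equally.
macro_f1 = f1_score(y_test, y_pred, average="macro")
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
print(cm)
print(classification_report(y_test, y_pred))
```

The confusion matrix and per-class report show which digits are mistaken for which, detail that a single accuracy figure cannot convey.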