Common mistakes that occur when working with the Iris dataset in Python

sabih · May 26, 2023, 6:49pm

The “Iris” dataset is popular in machine learning and is often used for classification tasks. Although it is a toy dataset, there are some common mistakes people make when working with it. Here are some of those mistakes along with example code:

1. Incorrectly creating a dataframe of the built-in dataset from sklearn:

The iris dataset is usually loaded from sklearn, where it ships as a built-in dataset. A common mistake when importing it this way is struggling to get it into a dataframe format.

By default the library does not return the data as a dataframe, so you have to convert it into one to analyze and work on it more easily. The conversion combines the feature values, the feature names, and the target species into a single pandas dataframe.
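The conversion described above can be sketched as follows. This is a minimal example, assuming pandas and scikit-learn are installed; the variable names `iris` and `df` and the column name `species` are my own choices, not part of the library.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns a dict-like Bunch object, not a dataframe
iris = load_iris()

# Build the dataframe from the feature matrix and the feature names
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer target codes (0, 1, 2) to readable species names
# ("species" is a column name chosen for this example)
df["species"] = iris.target_names[iris.target]

print(df.head())
```

Recent scikit-learn versions (0.23 and later) also accept `load_iris(as_frame=True)`, in which case the returned object's `frame` attribute is already a dataframe.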

2. Not examining and exploring the dataset before analysis:

Neglecting to explore the dataset before your analysis can lead to incorrect assumptions or modeling choices. It’s essential to understand the structure and characteristics of the dataset before starting to work on it.

Useful first checks include the data types of the columns, the number of null values, summary statistics, and the distribution of the target species.
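A short sketch of these checks, assuming the dataset has been converted to a pandas dataframe `df` with a `species` column (both names are assumptions made for this example):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the dataframe so this snippet runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Data type of each column
print(df.dtypes)

# Number of null values per column
null_counts = df.isnull().sum()
print(null_counts)

# Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
print(df.describe())

# How many rows belong to each species
species_counts = df["species"].value_counts()
print(species_counts)
```

For the copy bundled with sklearn, this should report no missing values and 50 rows for each of the three species, so the classes are balanced.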

3. Not labeling the axes when visualizing features:

A common mistake when visualizing features of the dataset is forgetting to give the plot a title and to label the axes. This confuses readers, who will have difficulty figuring out what the plot conveys.

The solution is to label the axes and give the plot a title whenever you create a visualization of the iris dataset. For example, a scatter plot of sepal length against sepal width should name both axes, including the unit (cm), and carry a descriptive title.
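Here is one way such a labeled scatter plot might look, as a sketch assuming matplotlib is installed. The label texts, the legend, and the output file name `iris_sepal_scatter.png` are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # columns: sepal length, sepal width, petal length, petal width (cm)

fig, ax = plt.subplots()

# Draw one group of points per species so the legend can tell them apart
for code, name in enumerate(iris.target_names):
    mask = iris.target == code
    ax.scatter(X[mask, 0], X[mask, 1], label=name)

# Axis labels with units, plus a descriptive title
ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.set_title("Iris: sepal length vs. sepal width")
ax.legend(title="Species")

# Write the figure to a file (hypothetical name); use plt.show() in an interactive session
fig.savefig("iris_sepal_scatter.png")
```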

4. Not choosing appropriate visualizations:

Another frequent mistake is choosing the wrong visualization technique for the data, which produces misleading results and gives a distorted picture of the dataset. To avoid this, know which plots suit which type of data and use them accordingly.

A typical example is using a line plot to visualize the relationship between petal length and petal width. This is incorrect because a line plot connects the points in sequence and so implies an ordering (such as time) between observations, whereas the iris rows are independent measurements of two continuous variables. The correct choice in this case is a scatter plot.
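The difference can be seen by drawing both versions side by side. This is a sketch assuming matplotlib is installed; the panel titles and the output file name `iris_petal_line_vs_scatter.png` are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Misleading: plot() joins the points in row order, which carries no meaning here
ax_line.plot(petal_length, petal_width)
ax_line.set_title("Line plot (misleading)")

# Appropriate: each flower is drawn as one independent point
ax_scatter.scatter(petal_length, petal_width)
ax_scatter.set_title("Scatter plot (appropriate)")

for ax in (ax_line, ax_scatter):
    ax.set_xlabel("Petal length (cm)")
    ax.set_ylabel("Petal width (cm)")

fig.tight_layout()

# Write the figure to a file (hypothetical name); use plt.show() in an interactive session
fig.savefig("iris_petal_line_vs_scatter.png")
```

The line version zigzags between consecutive rows, while the scatter version shows how the two measurements vary together.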