When working with the Titanic dataset in Python, which records whether each passenger aboard the Titanic survived, there are several mistakes people commonly make. Here are some of them, with example code for each:
1. Not exploring and examining the data:
It is crucial to get to know your data before you analyze or visualize it. Exploring it first shows you how the data is structured: which columns exist, what types they hold, and where values are missing. Here is example code that uses several methods to examine the Titanic dataset:
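A minimal sketch of such an exploration with pandas. The inline rows are hypothetical sample values (not real passenger records) using Kaggle-style column names, so the snippet runs without a download; the `train.csv` filename in the comment is likewise an assumption.

```python
import pandas as pd

# Hypothetical sample rows with Kaggle-style column names.
# In practice you would load the real file instead, e.g.
# df = pd.read_csv("train.csv")  # assumed filename
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 54.0, None],
    "Fare": [7.25, 71.28, 7.92, 13.0, 51.86, 8.05],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

print(df.head())       # first few rows
print(df.shape)        # (number of rows, number of columns)
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

# Missing values per column
missing = df.isnull().sum()
print(missing)

# Category frequencies, including missing entries
print(df["Embarked"].value_counts(dropna=False))
```

Checking `missing` at this stage tells you which columns need attention before plotting.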
2. Not using appropriate visualizations:
When creating visualizations, it is important to choose a plot type that suits the features or variables you want to plot. An unsuitable plot type can be misleading and can make the plot harder to interpret.
The example code below visualizes the features `Embarked` and `Fare`. A scatter plot is not used because `Embarked` is a categorical variable, and when you plot a categorical feature against a numerical one, it is best to go with either a `barplot` or a `boxplot`.
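A minimal sketch of a box plot for this pair of features, assuming seaborn and matplotlib are installed. The rows are hypothetical sample values standing in for the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample values; replace with the real Titanic DataFrame.
df = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S", "Q", "C"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 30.0, 13.0, 7.75, 83.16],
})

# Categorical feature on the x-axis, numerical feature on the y-axis:
# one box per embarkation port showing the spread of fares.
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=df, x="Embarked", y="Fare", ax=ax)
ax.set_title("Fare distribution by port of embarkation")
plt.show()
```

Swapping `sns.boxplot` for `sns.barplot` would instead show the mean fare per port, which hides the spread but is easier to read at a glance.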
3. Not handling missing values before visualization:
A visualization can only reflect the data it is given. If missing values are not addressed before plotting, the result may be incomplete or misleading, because ignored missing values can introduce bias and distort the interpretation of the data.
The example code below creates a `countplot` for the `Embarked` feature after handling the missing values in it.
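A minimal sketch of this step, assuming seaborn and matplotlib are installed. Filling the gaps with the most frequent port is one common choice and is an assumption here, not necessarily the original approach; dropping the affected rows is another option. The rows are hypothetical sample values.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample values with two missing entries.
df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", "S", None, "C"]})

# Check how many values are missing before plotting.
print(df["Embarked"].isnull().sum())

# Assumed strategy: fill missing entries with the most frequent value.
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

# Count of passengers per embarkation port, now with no gaps.
fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=df, x="Embarked", ax=ax)
ax.set_title("Passengers per port of embarkation")
plt.show()
```

Filling with the mode inflates the count of the most common category, so it is worth noting on the plot or in the text how many values were imputed.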
4. Not properly labeling the plot and axes:
Labeling the plot and its axes appropriately gives you and others a clear understanding of what the plot is about and what information it conveys. A plot without a title or axis labels is harder to analyze and easier to misinterpret.
The example code below creates a scatter plot between `Age` and `Fare` and labels the plot appropriately.
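A minimal sketch of a labeled scatter plot, assuming matplotlib is installed. The rows are hypothetical sample values, and the title and label wording is only a suggestion.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values; replace with the real Titanic DataFrame.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.08, 11.13, 30.07],
})

# Both features are numerical, so a scatter plot is a reasonable choice.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["Age"], df["Fare"], alpha=0.7)

# A title and axis labels tell the reader what is being compared.
ax.set_title("Ticket fare versus passenger age")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Fare")
plt.show()
```

Including units in a label, as with the age axis here, removes one more source of ambiguity for the reader.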