How to do k-means clustering with the Titanic dataset in R?

datasciencedojo · October 18, 2017, 4:31pm

Installing and loading the libraries:

install.packages(c("ggplot2", "caret", "rpart", "rpart.plot"))   # one-time install
library(ggplot2)      # plotting
library(caret)        # dummyVars()
library(rpart)        # decision trees
library(rpart.plot)   # prp() tree plots

Cleaning the missing data

1. Read the CSV file and store it as a data frame.
2. Replace missing values in the Embarked column with the mode.

# The file path is an assumption; point it at your copy of the Titanic CSV.
titanic <- read.csv("titanic.csv",
                    stringsAsFactors = FALSE)     #(1)
titanic$Embarked[titanic$Embarked == ""] <- "S"    #(2)
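The "S" above is hard-coded as the most frequent port. A minimal sketch of deriving that mode from the data instead, on a small made-up vector (the `embarked` values and the `mode.value` name are illustrative and not part of the original post):

```r
# Toy stand-in for titanic$Embarked; "" marks a missing value.
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Count the non-missing values and take the most frequent one.
counts <- table(embarked[embarked != ""])
mode.value <- names(counts)[which.max(counts)]

# Fill the blanks with the mode.
embarked[embarked == ""] <- mode.value
embarked
```

On the real data the same three lines would operate on titanic$Embarked, which avoids having to look up the mode by hand.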

Engineering new features:

3. Create a new feature called FamilySize.
4. Make a new feature to track which Age values are missing:

titanic$FamilySize <- 1 + titanic$SibSp + titanic$Parch    #(3) 
titanic$AgeMissing <- ifelse(is.na(titanic$Age),       #(4)
                             "Y", "N")

Setting up all the factors on the data:

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$AgeMissing <- as.factor(titanic$AgeMissing)

Using a very naive (i.e., don’t use this in Production) model for imputing missing ages:

titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, 
                                          na.rm = TRUE)
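The order of the two steps matters: AgeMissing has to be built before the ages are imputed, because afterwards is.na() no longer finds anything to flag. A small sketch on made-up ages (the `age` and `age.missing` vectors are hypothetical):

```r
age <- c(22, NA, 38, NA, 26)

# Flag the gaps first...
age.missing <- ifelse(is.na(age), "Y", "N")

# ...then fill them with the median of the observed ages.
age[is.na(age)] <- median(age, na.rm = TRUE)

age
age.missing
```

After imputation the filled-in rows look like ordinary observations, so the flag is the only record of which values were originally missing.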

Defining the subset of features that we will use:

features <- c("Pclass", "Sex", "Age",
              "SibSp", "Parch", "Fare", "Embarked",
              "FamilySize", "AgeMissing")

Using the mighty caret package to convert factors to dummy variables:

dummy.vars <- dummyVars(~ ., titanic[, features])
titanic.dummy <- predict(dummy.vars, titanic[, features])
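To see what the dummy-variable step produces, here is a minimal sketch on a made-up data frame. It uses base R's model.matrix() rather than caret's dummyVars() so that it runs without extra packages; the `toy` data frame is hypothetical, and the column names differ slightly from caret's output.

```r
toy <- data.frame(Sex = factor(c("male", "female", "female")),
                  Age = c(22, 38, 26))

# "- 1" drops the intercept, so the factor gets one 0/1 column per level.
toy.dummy <- model.matrix(~ Sex + Age - 1, data = toy)
toy.dummy
```

The result is an all-numeric matrix, which is what kmeans() needs, since it cannot work on factor columns directly.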

Normalizing the titanic.dummy variable for k-means clustering:

titanic.dummy <- scale(titanic.dummy)
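k-means measures similarity with Euclidean distance, so a column with a large range (such as Fare) would otherwise dominate columns with a small one. A quick sketch, on made-up numbers, of what scale() does to each column (the matrix `m` is illustrative only):

```r
m <- cbind(Age  = c(22, 38, 26, 35),
           Fare = c(7.25, 71.28, 7.93, 53.10))

# Subtract each column's mean and divide by its standard deviation.
m.scaled <- scale(m)

round(colMeans(m.scaled), 10)   # column means are 0 (up to rounding)
apply(m.scaled, 2, sd)          # column standard deviations are 1
```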

Establishing a vector to store the within-cluster sum of squares for each number of clusters:

clusters.sum.squares <- rep(0.0, 14)

Setting up the candidate numbers of clusters (2 through 15):

cluster.params <- 2:15

Trying the different numbers of clusters:

Set the seed for reproducibility.
Run k-means for each candidate number of clusters and record the total within-cluster sum of squares.

set.seed(893247)
for (i in cluster.params) {
 kmeans.temp <- kmeans(titanic.dummy, centers = i)
 clusters.sum.squares[i - 1] <- sum(kmeans.temp$withinss)
}
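Two details of kmeans() are worth knowing when reading the loop above: the fitted object already carries the summed value as tot.withinss, and the nstart argument reruns the algorithm from several random starts and keeps the best result. A minimal sketch on made-up data (the matrix `x` and the choice of nstart = 10 are assumptions for illustration, not settings from the original post):

```r
set.seed(42)

# Two well-separated made-up blobs of 20 points each.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# nstart = 10 keeps the best of 10 random initialisations.
fit <- kmeans(x, centers = 2, nstart = 10)

# tot.withinss is the same quantity the loop computes by hand.
all.equal(fit$tot.withinss, sum(fit$withinss))
```

With a single random start, as in the loop, the curve in the scree plot can be a little bumpy because some runs settle in a poorer local optimum.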

Take a look at our sum of squares.

clusters.sum.squares

Plot our scree plot using the mighty ggplot2.

ggplot(NULL, aes(x = cluster.params, y = clusters.sum.squares)) +
  theme_bw() +
  geom_point() +
  geom_line() +
  labs(x = "Number of Clusters",
       y = "Cluster Sum of Squared Distances",
       title = "Titanic Training Data Scree Plot")

[Image: Titanic Training Data Scree Plot]

Clustering the data with k = 4, the value chosen from the elbow in the scree plot:

titanic.kmeans <- kmeans(titanic.dummy, centers = 4)

Adding cluster assignments to our data frame:

titanic$Cluster <- as.factor(titanic.kmeans$cluster)

Visualizing survivability by cluster assignment:

ggplot(titanic, aes(x = Cluster, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(x = "Cluster Assignment",
       y = "Passenger Count",
       title = "Titanic Training Survivability by Cluster")
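The bar chart shows counts, which makes clusters of different sizes hard to compare. A small sketch of turning the same cross-tabulation into per-cluster survival rates, using made-up labels (the `cluster` and `survived` vectors stand in for titanic$Cluster and titanic$Survived and are not real results):

```r
# Made-up cluster labels and outcomes (0 = died, 1 = survived).
cluster  <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3))
survived <- factor(c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1))

counts <- table(Cluster = cluster, Survived = survived)

# margin = 1 normalises within each row, so every cluster sums to 1.
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

On the real data, the same proportions can be drawn by passing position = "fill" to geom_bar().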

[Image: Titanic Training Survivability by Cluster]

Building a single rpart decision tree:

9. Add the Cluster feature to the list of features.
10. Create a single rpart decision tree.
11. Plot the single rpart decision tree.

features <- c(features, "Cluster")        #(9)
titanic.rpart <- rpart(Cluster ~ ., data = titanic[, features])    #(10)
prp(titanic.rpart, type = 1)     #(11)
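The tree above predicts Cluster from the other features, which gives a readable description of what separates the clusters. One way to judge how faithful that description is: check how often the tree reproduces the k-means assignment. A minimal sketch on R's built-in iris data (the dataset, the choice of 3 clusters, and nstart = 10 are assumptions for illustration, not part of the original post):

```r
library(rpart)

set.seed(1)

# Cluster the four numeric iris columns after scaling.
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 10)

toy <- data.frame(iris[, 1:4], Cluster = as.factor(fit$cluster))

# Classification tree that predicts the cluster label from the features.
tree <- rpart(Cluster ~ ., data = toy)

# Share of rows whose cluster the tree reproduces.
pred <- predict(tree, toy, type = "class")
mean(pred == toy$Cluster)
```

A share close to 1 suggests the splits shown by prp() capture most of the cluster structure; a low share means the tree is only a rough summary.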

[Image: rpart decision tree for the cluster assignments]

avivas · April 3, 2018, 7:30pm

Hi, I can't find the function ise in titanic$AgeMissing <- ise(is.na(titanic$Age), "Y", "N"). Thanks

Ciera_Martinez · May 2, 2018, 2:06pm

It is supposed to be ifelse(is.na(titanic$Age), "Y", "N").