Installing and loading the libraries:
install.packages(c("ggplot2", "caret", "rpart", "rpart.plot")) #(1)
library(ggplot2) #(2)
library(caret) #(3)
library(rpart) #(4)
library(rpart.plot) #(5)
Cleaning the missing data:
1. Read the CSV file and store it as a data frame.
2. Replace missing values in the Embarked column with the mode, "S".
titanic <- read.csv("titanic.csv", # file name is an assumption; use the path to your CSV
                    stringsAsFactors = FALSE) #(1)
titanic$Embarked[titanic$Embarked == ""] <- "S" #(2)
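Rather than hard-coding "S", the mode can be computed from the column itself. The following is a minimal sketch on a made-up vector (the values and the names embarked and embarked.mode are illustrative assumptions, not the real data):

```r
# Toy stand-in for the Embarked column (hypothetical values, not the Titanic data)
embarked <- c("S", "C", "", "S", "Q", "S", "", "C")

# Count the non-empty values and pick the most frequent one
counts <- table(embarked[embarked != ""])
embarked.mode <- names(counts)[which.max(counts)]

# Fill the blanks with the computed mode
embarked[embarked == ""] <- embarked.mode
embarked.mode
```

On the real data frame the same idea would apply to titanic$Embarked before the replacement step.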
Engineering new features:
3. Create a new feature called FamilySize (the passenger plus any siblings, spouses, parents, and children aboard).
4. Make a new feature to track which Age values are missing:
titanic$FamilySize <- 1 + titanic$SibSp + titanic$Parch #(3)
titanic$AgeMissing <- ifelse(is.na(titanic$Age), #(4)
                             "Y", "N")
Setting up all the factors on the data:
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$AgeMissing <- as.factor(titanic$AgeMissing)
Imputing missing ages with the overall median, a very naive model (i.e., don't use this in production):
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age,
                                          na.rm = TRUE)
Defining the subset of features that we will use:
features <- c("Pclass", "Sex", "Age",
              "SibSp", "Parch", "Fare", "Embarked",
              "FamilySize", "AgeMissing")
Using the mighty caret package to convert factors to dummy variables:
dummy.vars <- dummyVars(~ ., titanic[, features])
titanic.dummy <- predict(dummy.vars, titanic[, features])
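To see what a dummy-variable encoding looks like, here is a small sketch that uses base R's model.matrix as a stand-in for caret's dummyVars (a swapped-in technique, chosen only so the example needs no extra packages). The toy data frame is made up for illustration:

```r
# Toy data frame with a single factor column (hypothetical values)
toy <- data.frame(Sex = factor(c("male", "female", "female")))

# "- 1" drops the intercept so every factor level gets its own 0/1 column
toy.dummy <- model.matrix(~ Sex - 1, data = toy)
toy.dummy
```

Each row has exactly one 1 across the Sex columns, which is the numeric form k-means needs, since it cannot work with factors directly.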
Normalizing the titanic.dummy variable for k-means clustering:
titanic.dummy <- scale(titanic.dummy)
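k-means measures similarity with Euclidean distance, so a column with a wide range (such as Fare) would otherwise dominate columns with a narrow range. A quick sketch of what scale does, on a made-up matrix (the names m and m.scaled are illustrative):

```r
# Toy matrix with columns on very different scales (hypothetical values)
m <- cbind(age = c(20, 30, 40), fare = c(10, 100, 1000))

# scale() subtracts each column's mean and divides by its standard deviation
m.scaled <- scale(m)

colMeans(m.scaled)       # each column mean is (numerically) zero
apply(m.scaled, 2, sd)   # each column standard deviation is one
```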
Establishing a vector to store the sum of squares from each k-means run:
clusters.sum.squares <- rep(0.0, 14)
Setting up the candidate numbers of clusters (2 through 15):
cluster.params <- 2:15
Trying the different parameters:
- Set the seed for reproducibility.
- Run k-means with each candidate number of clusters…
set.seed(893247)
for (i in cluster.params) {
  kmeans.temp <- kmeans(titanic.dummy, centers = i)
  clusters.sum.squares[i - 1] <- sum(kmeans.temp$withinss)
}
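The loop sums the per-cluster within sums of squares by hand. As a sketch on synthetic data (the toy matrix and the nstart value are assumptions for illustration), the kmeans result also exposes the same total as tot.withinss, and nstart asks for several random starts and keeps the best one:

```r
set.seed(42)

# Two well-separated synthetic blobs in two dimensions (made-up data)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# nstart = 10 runs k-means from 10 random starts and keeps the best fit
fit <- kmeans(toy, centers = 2, nstart = 10)

# The hand-computed total matches the value stored on the fit
all.equal(sum(fit$withinss), fit$tot.withinss)
```

Using several starts can make the scree plot less sensitive to an unlucky initialization, at the cost of extra computation.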
Take a look at our sum of squares.
clusters.sum.squares
Plot our scree plot using the mighty ggplot2.
ggplot(NULL, aes(x = cluster.params, y = clusters.sum.squares)) +
  theme_bw() +
  geom_point() +
  geom_line() +
  labs(x = "Number of Clusters",
       y = "Cluster Sum of Squared Distances",
       title = "Titanic Training Data Scree Plot")
Clustering the data using the number of clusters suggested by the elbow in the scree plot (4 here):
titanic.kmeans <- kmeans(titanic.dummy, centers = 4)
Adding cluster assignments to our data frame:
titanic$Cluster <- as.factor(titanic.kmeans$cluster)
Visualizing survivability by cluster assignment:
ggplot(titanic, aes(x = Cluster, fill = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(x = "Cluster Assignment",
       y = "Passenger Count",
       title = "Titanic Training Survivability by Cluster")
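The bar chart shows counts, which makes clusters of different sizes hard to compare. One possible complement, sketched here on a made-up data frame (toy, counts, and rates are illustrative names and values), is a table of survival proportions within each cluster:

```r
# Toy cluster assignments and outcomes (hypothetical, not the Titanic data)
toy <- data.frame(
  Cluster  = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3)),
  Survived = factor(c(0, 0, 1, 1, 1, 0, 1, 0, 0))
)

# Cross-tabulate, then convert each row (cluster) to proportions
counts <- table(toy$Cluster, toy$Survived)
rates  <- prop.table(counts, margin = 1)
round(rates, 2)
```

With the real data, the same two calls would take titanic$Cluster and titanic$Survived.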
Building a single rpart decision tree:
- Add the Cluster feature to the list of features.
- Create a single rpart decision tree that predicts Cluster from the other features.
- Plot the rpart decision tree.
features <- c(features, "Cluster") #(9)
titanic.rpart <- rpart(Cluster ~ ., data = titanic[, features]) #(10)
prp(titanic.rpart, type = 1) #(11)