The following code identifies which variables matter most for predicting survival in the Titanic dataset.
Reading the Titanic dataset and converting the Survived variable into a factor:
1. Read the Titanic file. Make sure the working directory is set to the folder that contains titanic.csv.
2. Convert the Survived variable into a factor column, so that the random forest treats the problem as classification rather than regression.
titanic <- read.csv(
file = "titanic.csv",
stringsAsFactors = FALSE
) #(1)
titanic$Survived <- as.factor(titanic$Survived) #(2)
Discretizing categories:
Converting each Pclass value into its own 0/1 indicator variable:
titanic$pclass_one <- 0
titanic$pclass_two <- 0
titanic$pclass_three <- 0
titanic[titanic$Pclass==1,"pclass_one"] <- 1
titanic[titanic$Pclass==2,"pclass_two"] <- 1
titanic[titanic$Pclass==3,"pclass_three"] <- 1
Converting each value of the Embarked variable into its own 0/1 indicator variable:
titanic$embarked_q <- 0
titanic$embarked_s <- 0
titanic$embarked_c <- 0
titanic[titanic$Embarked=="Q","embarked_q"] <- 1
titanic[titanic$Embarked=="S","embarked_s"] <- 1
titanic[titanic$Embarked=="C","embarked_c"] <- 1
Converting the values of the Sex variable into 0/1 indicator variables:
titanic$sex_m <- 0
titanic$sex_f <- 0
titanic[titanic$Sex=="male","sex_m"] <- 1
titanic[titanic$Sex=="female","sex_f"] <- 1
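The indicator columns above can also be built more compactly. The following is a minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not the real data). It shows two base R options: coercing a logical comparison to integer, and letting `model.matrix()` generate one column per level.

```r
# Toy data frame standing in for the Titanic data (values are made up).
toy <- data.frame(
  Pclass   = c(1, 3, 2, 3),
  Embarked = c("S", "C", "Q", "S"),
  Sex      = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# A logical comparison coerced to integer gives the 0/1 indicator in one step.
toy$pclass_one <- as.integer(toy$Pclass == 1)
toy$sex_f      <- as.integer(toy$Sex == "female")

# model.matrix() builds one indicator column per level of Embarked.
# The "- 1" drops the intercept so that every level gets its own column.
embarked_dummies <- model.matrix(~ Embarked - 1, data = toy)
print(embarked_dummies)
```

One difference to keep in mind: with `as.integer()`, a missing value in the source column yields NA in the indicator rather than 0, so missing values may need to be handled first.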
Filling missing values of Age with a constant (28):
titanic[is.na(titanic$Age),"Age"] <- 28
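The constant 28 is presumably a typical age for this data, such as the median. Rather than hard-coding it, the fill value can be computed from the observed ages. A minimal sketch on a made-up vector (the `ages` values are assumptions for illustration):

```r
# Toy ages with gaps (made-up values, not the real data).
ages <- c(22, NA, 38, 26, NA, 35)

# Median of the observed (non-missing) values only.
age_median <- median(ages, na.rm = TRUE)

# Replace the missing entries with that median.
ages[is.na(ages)] <- age_median
print(ages)
```

Applied to the real data, the same pattern would be `titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)`.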
Building the random forest:
1. Install the randomForest package if you have not installed it already.
2. Load the package.
3. Create the list of features for the random forest.
4. Create the random forest model.
5. Use varImpPlot to draw a dot chart of the variable importance calculated by the random forest.
install.packages("randomForest") #(1)
library(randomForest) #(2)
features <- c(
"Survived", "Age", "SibSp", "Parch", "Fare",
"pclass_one", "pclass_two", "pclass_three",
"embarked_q", "embarked_s", "embarked_c",
"sex_m", "sex_f"
) #(3)
titanic.forest <- randomForest(Survived~., data = titanic[,features], importance=TRUE) #(4)
varImpPlot(titanic.forest) #(5)
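The importance scores behind the plot can also be read as numbers. The sketch below uses R's built-in iris data as a stand-in, so `toy.forest` is a hypothetical model for illustration only; the same calls should apply to titanic.forest. It assumes the randomForest package is installed, and it sets a seed because random forests are randomized and results otherwise vary between runs.

```r
library(randomForest)  # assumes the package is already installed

set.seed(42)  # fix the random number generator so runs are repeatable

# iris is only a stand-in here; substitute the Titanic model in practice.
toy.forest <- randomForest(Species ~ ., data = iris, importance = TRUE)

# importance() returns a matrix with one row per feature.
# type = 1 selects the permutation-based mean decrease in accuracy.
imp <- importance(toy.forest, type = 1)

# Sort features from most to least important.
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(imp_sorted)
```

Features near the top of the sorted matrix are the ones whose permutation hurts accuracy the most, which is the same ranking shown in the left panel of varImpPlot.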