Building a decision tree model using rpart and the Titanic dataset:
You can get the required dataset here.
Note: Set your working directory to the folder that contains titanic.csv (for example with setwd()), so that read.csv can find the file.
1. Reading file into dataframe:
titanic <- read.csv(file = "titanic.csv",
                    stringsAsFactors = FALSE)
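Before cleaning, it can help to count what is actually missing in each column. The following is a minimal sketch on a small made-up data frame (`toy` and its rows are hypothetical, not taken from the real file); the same calls should work on `titanic` once it is loaded.

```r
# Hypothetical stand-in for the data frame returned by read.csv()
toy <- data.frame(
  Name     = c("Braund, Mr. Owen", "Rice, Master. Eugene", "Heikkinen, Miss. Laina"),
  Age      = c(22, NA, 26),
  Embarked = c("S", "Q", ""),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na.counts <- colSums(is.na(toy))

# Number of empty strings in each column
# (missing text fields are read in as "" rather than NA by default)
empty.counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na.counts)
print(empty.counts)
```

In this toy data, Age has one NA and Embarked has one empty string, which mirrors the two kinds of gaps handled in step 2.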
2. Cleaning missing values:
a. Cleaning embarked:
titanic[titanic$Embarked=="","Embarked"] <- "S"
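The value "S" is hard-coded above. One way to sanity-check a choice like this is to tabulate the column and fill the gaps with the most frequent non-empty value. A sketch on made-up values (the `embarked` vector is hypothetical, for illustration only):

```r
# Hypothetical embarkation codes with two empty entries
embarked <- c("S", "C", "S", "", "Q", "S", "")

# Frequency of each non-empty value
counts <- table(embarked[embarked != ""])

# Name of the most frequent value
most.common <- names(counts)[which.max(counts)]

# Replace the empty strings with that value
embarked[embarked == ""] <- most.common
print(embarked)
```

On the real data, `table(titanic$Embarked)` gives the same kind of frequency count.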
b. Cleaning age:
Finding rows whose names contain the title "Master":
masters <- grep(pattern = "Master\\.",
                x = titanic$Name,
                ignore.case = TRUE)
c. Calculating median age for masters:
median.masters <- median(
  titanic[masters, "Age"],
  na.rm = TRUE)
d. Engineering a masters column:
titanic$IsMaster <- FALSE
titanic[masters, "IsMaster"] <- TRUE
is.master <- titanic$IsMaster==TRUE
age.missing <- is.na(titanic$Age)
e. Filling in missing values of age:
titanic[is.master & age.missing, "Age"] <- median.masters
f. Cleaning remaining age values:
median.age <- median(
  titanic[!is.master, "Age"],
  na.rm = TRUE)
age.missing <- is.na(titanic$Age)
titanic[age.missing, "Age"] <- median.age
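The age-cleaning steps above amount to filling each missing age with the median of its group (masters or everyone else). The sketch below replays that idea on a small made-up data frame; the names and ages are hypothetical, and `grepl` is used in place of `grep` so that the match is a logical vector from the start.

```r
# Hypothetical passengers; two ages are missing
toy <- data.frame(
  Name = c("A, Master. X", "B, Master. Y", "C, Mr. Z",
           "D, Mrs. W", "E, Master. V", "F, Miss. U"),
  Age  = c(4, 6, 40, NA, NA, 30),
  stringsAsFactors = FALSE
)

# TRUE for rows whose name carries the title "Master."
is.master   <- grepl("Master\\.", toy$Name, ignore.case = TRUE)
age.missing <- is.na(toy$Age)

# Median age within each group, ignoring the missing values
median.masters <- median(toy$Age[is.master],  na.rm = TRUE)
median.others  <- median(toy$Age[!is.master], na.rm = TRUE)

# Fill each gap with the median of its own group
toy$Age[is.master  & age.missing] <- median.masters
toy$Age[!is.master & age.missing] <- median.others

print(toy)
```

Filling the masters first matters in the original steps: once their gaps are closed, the second pass only touches the remaining passengers.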
3. Casting variables into factors:
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
titanic$IsMaster <- as.factor(titanic$IsMaster)
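The five assignments above can also be written as one call over a vector of column names. This is only an alternative sketch on a made-up data frame (`toy` and `factor.cols` are hypothetical names), not a change to the steps above:

```r
# Hypothetical data frame with a mix of categorical and numeric columns
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 2),
  Sex      = c("male", "female", "female"),
  Age      = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categories
factor.cols <- c("Survived", "Pclass", "Sex")

# Convert all of them in one step; Age stays numeric
toy[factor.cols] <- lapply(toy[factor.cols], as.factor)

str(toy)
```

Casting Survived to a factor is what makes rpart fit a classification tree rather than a regression tree when no method is given.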
4. Splitting data into train and test data:
bag <- nrow(titanic)
train.indices <- sample(seq_len(bag), floor(bag * 0.7))
titanic.train <- titanic[train.indices,]
titanic.test <- titanic[-train.indices,]
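Because `sample` draws at random, the split above changes on every run. A common way to make it repeatable is to call `set.seed` first. A minimal sketch, assuming an arbitrary seed of 42 and a made-up row count of 10:

```r
n <- 10  # hypothetical number of rows

# Fixing the seed makes the random draw repeatable
set.seed(42)
train.a <- sample(seq_len(n), floor(n * 0.7))

# Same seed, same draw
set.seed(42)
train.b <- sample(seq_len(n), floor(n * 0.7))

print(identical(train.a, train.b))

# The remaining rows form the test set
test.idx <- setdiff(seq_len(n), train.a)
print(test.idx)
```

The seed value itself carries no meaning; any fixed integer gives a reproducible split.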
5. Creating a vector of feature names (Survived is included because it is the response in the model formula):
features <- c("Survived", "Pclass", "Sex",
              "Age", "SibSp", "Parch", "Fare",
              "Embarked", "IsMaster")
6. Creating a machine learning model with rpart:
library(rpart)
titanic.tree <- rpart(
  formula = Survived ~ .,
  data = titanic.train[, features]
)
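Once a tree is fitted, it can be printed and drawn to see which splits were chosen. The sketch below uses the `kyphosis` data that ships with rpart purely as a stand-in, so that it runs without titanic.csv; `fit` is a hypothetical name, and the same calls should apply to `titanic.tree`.

```r
library(rpart)

# Stand-in classification tree on rpart's bundled kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class")

# Text view of the splits
print(fit)

# Complexity-parameter table, useful when deciding whether to prune
printcp(fit)

# Draw the tree and label its nodes
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)
```

Setting `method = "class"` explicitly is optional when the response is already a factor, as Survived is after step 3.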
7. Inspecting the model and making predictions:
summary(titanic.tree)
predictions <- predict(
  titanic.tree, newdata = titanic.test,
  type = "class")
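The predictions can then be compared with the known outcomes in the test set. A minimal sketch of a confusion matrix and accuracy, using made-up label vectors (`actual` and `predicted` are hypothetical, not results from the Titanic data):

```r
# Hypothetical true and predicted class labels
actual    <- factor(c(0, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 0, 0, 1, 1), levels = c(0, 1))

# Rows are the true classes, columns are the predicted classes
confusion <- table(Actual = actual, Predicted = predicted)
print(confusion)

# Share of cases on the diagonal, i.e. predicted correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

With the objects from this walkthrough, the equivalent call would be `table(Actual = titanic.test$Survived, Predicted = predictions)`.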