Code sample illustrating how to build a model, create predictions, and write out a CSV suitable for submission to Kaggle
Read in the Titanic training dataset:
NOTE - Set your working directory to the correct location.
1. Read the train.csv file with stringsAsFactors set to FALSE.
titanic <- read.csv("train.csv", stringsAsFactors = FALSE) #(1)
Subset data for a simple model based on only Sex:
2. Select only the Survived and Sex columns.
titanic.simple <- titanic[, c("Survived", "Sex")] #(2)
Set up factors (categorical variables):
3. Convert the Survived variable to a factor.
4. Convert the Sex variable to a factor as well.
titanic.simple$Survived <- as.factor(titanic.simple$Survived) #(3)
titanic.simple$Sex <- as.factor(titanic.simple$Sex) #(4)
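The effect of steps 3 and 4 can be checked on a small stand-in data frame. This is a minimal sketch: the `toy` data frame and its four rows are made up for illustration and are not taken from train.csv.

```r
# Made-up stand-in for the two columns kept in step 2 (values are invented).
toy <- data.frame(Survived = c(0, 1, 1, 0),
                  Sex = c("male", "female", "female", "male"),
                  stringsAsFactors = FALSE)

# Same conversions as steps 3 and 4.
toy$Survived <- as.factor(toy$Survived)
toy$Sex <- as.factor(toy$Sex)

# Inspect the result: both columns are now factors.
str(toy)
levels(toy$Survived)  # "0" "1"
levels(toy$Sex)       # "female" "male" (levels are sorted alphabetically)
```

Because Survived is a factor, rpart treats the problem as classification rather than regression.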
Build an rpart decision tree:
5. Install the rpart.plot package if it is not already installed.
6. Load the rpart package.
7. Load the rpart.plot package.
if (!requireNamespace("rpart.plot", quietly = TRUE)) install.packages("rpart.plot") #(5)
library(rpart) #(6)
library(rpart.plot) #(7)
Set a seed so that everyone gets the same model, then train:
8. Set a seed for reproducibility (rpart's internal cross-validation uses random numbers).
9. Create the simple machine learning model with rpart.
set.seed(4786) #(8)
simple.tree <- rpart(Survived ~ ., data = titanic.simple) #(9)
10. Draw a readable plot of the tree with prp.
prp(simple.tree) #(10)
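Besides plotting the tree, it can help to check how well it fits the data it was trained on. The following is a minimal sketch of a confusion matrix and training accuracy. The `toy` data frame is invented for illustration (40 rows, since rpart's default minsplit of 20 would prevent any split on a very small sample). With the real data, `titanic.simple` and `simple.tree` would take the place of `toy` and `toy.tree`.

```r
library(rpart)

# Made-up stand-in for titanic.simple: 40 rows, values invented for illustration.
toy <- data.frame(
  Survived = factor(c(rep(1, 15), rep(0, 5), rep(1, 4), rep(0, 16))),
  Sex = factor(rep(c("female", "male"), each = 20))
)

set.seed(4786)
toy.tree <- rpart(Survived ~ ., data = toy)

# Predict on the training rows and cross-tabulate against the known labels.
toy.preds <- predict(toy.tree, toy, type = "class")
confusion <- table(Actual = toy$Survived, Predicted = toy.preds)
print(confusion)

# Share of rows on the diagonal, i.e. rows classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Accuracy measured on the training rows tends to be optimistic, so treat it as a sanity check rather than an estimate of the Kaggle score.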
Working with the test data:
11. Read the test.csv file with stringsAsFactors set to FALSE.
12. Convert the Sex variable to a factor.
titanic.test <- read.csv("test.csv", stringsAsFactors = FALSE) #(11)
titanic.test$Sex <- as.factor(titanic.test$Sex) #(12)
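Step 12 gives the same factor levels as the training data as long as every Sex value seen in training also occurs in the test file. A more defensive variant, sketched below under that assumption, reuses the training levels explicitly. The vectors `train_sex` and `test_sex_raw` are hypothetical stand-ins, not values from the Kaggle files.

```r
# Hypothetical stand-ins: a training factor and a raw test column
# in which only one of the two values happens to occur.
train_sex <- as.factor(c("male", "female", "female", "male"))
test_sex_raw <- c("male", "male")

# Plain as.factor() only knows the values present in the test column.
naive <- as.factor(test_sex_raw)
levels(naive)    # "male"

# Passing the training levels keeps both factors consistent.
aligned <- factor(test_sex_raw, levels = levels(train_sex))
levels(aligned)  # "female" "male"
```

With the objects from this sample, the equivalent call would be factor(titanic.test$Sex, levels = levels(titanic.simple$Sex)).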
Create predictions:
13. Create predictions with the predict function.
preds <- predict(simple.tree, titanic.test, type = "class") #(13)
Preparing for submission:
14. Create a data frame for submission.
15. Write out a CSV file suitable for Kaggle submission.
submission <- data.frame(PassengerId = titanic.test$PassengerId,
                         Survived = preds) #(14)
write.csv(submission, file = "MySubmission.csv", row.names = FALSE) #(15)
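Before uploading, the written file can be read back and checked. This is a minimal sketch using a made-up four-row submission written to a temporary file. The IDs and predictions in `toy_submission` are invented, and with the real data the file to read back would be "MySubmission.csv".

```r
# Made-up stand-in for the submission data frame (IDs and predictions are invented).
toy_submission <- data.frame(
  PassengerId = 892:895,
  Survived = factor(c("0", "1", "0", "1"), levels = c("0", "1"))
)

# Write to a temporary file so nothing in the working directory is overwritten.
out <- tempfile(fileext = ".csv")
write.csv(toy_submission, file = out, row.names = FALSE)

# Read the file back and check its shape and contents.
check <- read.csv(out)
stopifnot(identical(names(check), c("PassengerId", "Survived")))
stopifnot(nrow(check) == nrow(toy_submission))
stopifnot(all(check$Survived %in% c(0, 1)))
stopifnot(!anyNA(check))
```

If any stopifnot call fails, R stops with an error naming the failed condition, which points to the column or row count that needs fixing.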