Note: Get your train and test data from Titanic - Machine Learning from Disaster | Kaggle.
Set your working directory properly.
Next, read the data and store it in the train and test variables.
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)
Combine the data to make data cleaning easier:
survived <- train$Survived
data.combined <- rbind(train[, -2], test)
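A minimal sketch of the combine-then-split pattern on toy data frames. The toy.* names and all values are invented for illustration. The label column is dropped before rbind so both pieces have identical columns, and row positions are what later allow the data to be split back out.

```r
# Toy stand-ins for train and test; values are invented for illustration.
toy.train <- data.frame(PassengerId = 1:3,
                        Survived = c(0, 1, 1),
                        Fare = c(7.25, 71.28, 7.92))
toy.test <- data.frame(PassengerId = 4:5,
                       Fare = c(7.83, NA))

# Keep the labels aside, then drop the Survived column (column 2) so the
# two data frames have the same columns and can be stacked.
toy.survived <- toy.train$Survived
toy.combined <- rbind(toy.train[, -2], toy.test)

# rbind keeps row order, so the first nrow(toy.train) rows are the training rows.
n.train <- nrow(toy.train)
toy.train.again <- toy.combined[1:n.train, ]
toy.test.again <- toy.combined[(n.train + 1):nrow(toy.combined), ]
```

Because the split relies on row positions, nothing between the combine and the split should reorder or drop rows.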
Transform some variables to factors:
data.combined$Pclass <- as.factor(data.combined$Pclass)
data.combined$Sex <- as.factor(data.combined$Sex)
Test data is missing a fare, so clean it up.
Summarize the Fare values of records similar to the one with the missing fare:
similar.fares <- train[train$Pclass == "3" & train$Sex == "male" &
                       train$SibSp == 0 & train$Parch == 0 &
                       !is.na(train$Age) & train$Age >= 50,]
summary(similar.fares$Fare)
Replace the missing fare with the median:
data.combined$Fare[is.na(data.combined$Fare)] <- 7.75
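The value 7.75 is typed in by hand after reading the summary. A hedged sketch of the same idea on invented data, computing the median in code instead; toy.similar, toy.fares and fill.value are hypothetical names.

```r
# Invented fares for passengers resembling the record with the missing fare.
toy.similar <- data.frame(Fare = c(7.05, 7.75, 7.75, 8.05, 26.00))
toy.fares <- c(12.50, NA, 30.00)

# The median is not pulled upward by the single large fare in the group.
fill.value <- median(toy.similar$Fare, na.rm = TRUE)

# Overwrite only the missing entries; known fares are left alone.
toy.fares[is.na(toy.fares)] <- fill.value
```

Computing the value keeps the imputation in step with the data if the filter for "similar" records ever changes.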
Create a feature for family size:
data.combined$FamilySize <- 1 + data.combined$SibSp + data.combined$Parch
Replace the missing values of Embarked with the most common value:
data.combined$Embarked[data.combined$Embarked == ""] <- "S"
Make a factor.
data.combined$Embarked <- as.factor(data.combined$Embarked)
Use the mighty stringr package to pull out titles with a regular expression (regex):
library(stringr)
data.combined$Title <- str_extract(data.combined$Name, "[a-zA-Z]+\\.")
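To see what the pattern picks out, here is a small sketch using base R's regexpr and regmatches as a stand-in for str_extract, so it runs without stringr. The toy.* names are hypothetical; the sample strings follow the "Surname, Title. Given names" layout of the Name column.

```r
# Sample names in the "Surname, Title. Given names" layout.
toy.names <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Heikkinen, Miss. Laina")

# "[a-zA-Z]+\\." matches a run of letters immediately followed by a literal period.
# The surname is followed by a comma, so the first match is the title.
# regexpr() locates the first match in each string; regmatches() extracts it.
toy.titles <- regmatches(toy.names, regexpr("[a-zA-Z]+\\.", toy.names))
toy.titles
```

One difference worth knowing: regmatches drops strings with no match, while str_extract returns NA for them and so keeps the result aligned with the input rows.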
Implement logic to collapse the titles, starting with adult males, who should all have the title “Mr.”.
data.combined$Title[data.combined$Sex == "male" &
data.combined$Age >= 16] <- "Mr."
Collapse male children to the title “Master.”.
# Use < 16 so that 16-year-olds keep the adult title and match the cutoff used for girls.
data.combined$Title[data.combined$Sex == "male" &
                    data.combined$Age < 16] <- "Master."
Collapse female children to Title of “Girl.”.
data.combined$Title[data.combined$Sex == "female" &
data.combined$Age < 16] <- "Girl."
Collapse titles and map to “Miss.”.
table(data.combined$Title[data.combined$Sex == "female"])
data.combined$Title[data.combined$Title == "Ms." |
data.combined$Title == "Mlle."] <- "Miss."
Collapse titles and map to “Mrs.”.
table(data.combined$Title[data.combined$Sex == "female"])
data.combined$Title[data.combined$Title == "Countess." |
data.combined$Title == "Dona." |
data.combined$Title == "Lady." |
data.combined$Title == "Mme." |
data.combined$Title == "Dr."] <- "Mrs."
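The chains of == joined by | can also be written with %in%, which tests membership in a set. A minimal sketch on an invented vector of titles (toy.titles is a hypothetical name, and only a subset of the titles above is used):

```r
toy.titles <- c("Ms.", "Mlle.", "Mme.", "Lady.", "Miss.", "Mrs.")

# %in% returns TRUE for every element found in the right-hand set,
# so one line replaces a chain of == comparisons joined by |.
toy.titles[toy.titles %in% c("Ms.", "Mlle.")] <- "Miss."
toy.titles[toy.titles %in% c("Countess.", "Dona.", "Lady.", "Mme.")] <- "Mrs."

table(toy.titles)
```

Unlike ==, %in% never returns NA, which also makes it safer as an index when the vector contains missing values.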
Make a factor:
data.combined$Title <- as.factor(data.combined$Title)
Before we impute missing ages, add a tracking feature:
data.combined$MissingAge <- as.factor(ifelse(is.na(data.combined$Age),
                                             "Y", "N"))
Leverage a mighty Random Forest to impute missing ages.
Set up the training data:
library(randomForest)
ignore.features <- c(1, 3, 8, 10)  # PassengerId, Name, Ticket, Cabin
age.train <- data.combined[1:891,]
age.train <- age.train[!is.na(age.train$Age), -ignore.features]
Set seed for reproducibility:
set.seed(12345)  # assumed value; any fixed integer gives repeatable runs
rf.age <- randomForest(Age ~ ., data = age.train,
                       importance = TRUE)
What’s the MAE of this model?
mean(abs(rf.age$predicted - age.train$Age))
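For a randomForest regression fit, the predicted component holds out-of-bag predictions, so the figure above is not simply training error. The arithmetic of the mean absolute error itself, sketched on invented numbers (toy.* names are hypothetical):

```r
# Invented actual and predicted ages.
toy.actual <- c(22, 38, 26, 35)
toy.predicted <- c(25, 33, 26, 39)

# Absolute errors are 3, 5, 0 and 4; their mean is the MAE.
toy.mae <- mean(abs(toy.predicted - toy.actual))
toy.mae
```

The MAE is in the same unit as the target, so here it reads directly as "off by this many years on average".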
Impute missing ages and overwrite the NAs in the combined data frame:
data.combined$Age[is.na(data.combined$Age)] <-
  predict(rf.age, data.combined[is.na(data.combined$Age),])
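The fit-on-known, predict-on-missing pattern can be sketched without extra packages. This example swaps in a plain linear model (lm) for the random forest purely so that it runs in base R; the data and the toy.* and missing.age names are invented.

```r
# Invented data with two missing ages.
toy <- data.frame(Age = c(4, 30, NA, 40, NA, 8),
                  FamilySize = c(4, 1, 1, 2, 5, 5),
                  Fare = c(20, 8, 7, 50, 30, 25))

# Remember which rows are missing before anything is overwritten.
missing.age <- is.na(toy$Age)

# Fit only on rows where Age is known (lm stands in for randomForest here).
toy.fit <- lm(Age ~ FamilySize + Fare, data = toy[!missing.age, ])

# Predict for the rows where Age is missing and fill in just those rows.
toy$Age[missing.age] <- predict(toy.fit, newdata = toy[missing.age, ])
```

Storing the missing-value mask first avoids recomputing is.na on a column that is being modified in the same statement.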
Double-check our work; no NAs should remain in Age.
summary(data.combined$Age)
Split data back out.
train <- data.combined[1:891,]
test <- data.combined[892:1309,]
Use the mighty caret package to cross-validate a model built from our features with the mighty glmnet package.
library(caret)
Set seed to ensure reproducibility between runs:
set.seed(12345)  # assumed value; any fixed integer gives repeatable runs
Set up caret to perform 10-fold cross validation repeated 3 times:
caret.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
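With number = 10 and repeats = 3, every candidate model is fitted and scored 30 times. caret builds its folds internally; the base R sketch below only illustrates the bookkeeping, with invented sizes and hypothetical names (fold.ids, held.out, n.resamples).

```r
set.seed(1)  # arbitrary seed so the toy split is repeatable

n <- 20        # toy number of rows
k <- 10        # folds per repeat
repeats <- 3   # number of times the whole k-fold split is redrawn

# One column per repeat: each row gets a fold number from 1 to k,
# shuffled independently for every repeat.
fold.ids <- replicate(repeats, sample(rep(1:k, length.out = n)))

# Rows held out when fold 1 of repeat 1 is the validation set.
held.out <- which(fold.ids[, 1] == 1)

# Total number of fit-and-score rounds per candidate model.
n.resamples <- k * repeats
```

Repeating the split averages out the luck of any single partition, at the cost of three times the computation.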
Leverage caret to create dummy variables (i.e., one-hot encoding), as glmnet will not work with factor variables.
dummy.vars <- dummyVars(~ ., data = train[, -ignore.features])
train.dummy <- predict(dummy.vars, train)
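To show what one-hot encoding produces, here is a small sketch using base R's model.matrix as a stand-in for dummyVars, so it runs without caret. The data and the toy.* names are invented, and the column names differ from caret's (which look like Sex.female).

```r
# Invented data with one factor and one numeric column.
toy <- data.frame(Sex = factor(c("male", "female", "female")),
                  Fare = c(7.25, 71.28, 7.92))

# "- 1" removes the intercept, so the factor gets one 0/1 column per level.
# Numeric columns pass through unchanged.
toy.dummy <- model.matrix(~ . - 1, data = toy)
colnames(toy.dummy)
```

A caveat of this stand-in: with several factors, model.matrix gives a full set of columns only to the first one, whereas dummyVars by default produces one column per level for every factor.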
Use caret to train a model with the mighty glmnet package as a binary (i.e., logistic) regression model. The code below illustrates the following awesomeness:
1 - Instead of using the formula interface, explicitly set the predictors and the class label.
2 - Preprocess our numeric data by centering and scaling (i.e., standardizing) the numeric columns.
# caret.tune is a placeholder name for the fitted model object.
caret.tune <- train(x = train.dummy,
                    y = as.factor(survived),
                    method = "glmnet",
                    preProcess = c("center", "scale"),
                    trControl = caret.control,
                    tuneLength = 9)
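Centering and scaling means subtracting a column's mean and dividing by its standard deviation. A minimal sketch of that arithmetic on invented fares (toy.* names are hypothetical), checked against base R's scale():

```r
# Invented fares on very different magnitudes.
toy.fare <- c(7.25, 71.28, 7.92, 53.10)

# Center (subtract the mean), then scale (divide by the standard deviation).
toy.scaled <- (toy.fare - mean(toy.fare)) / sd(toy.fare)

# scale() performs the same computation and returns a one-column matrix.
all.equal(as.numeric(scale(toy.fare)), toy.scaled)
```

Standardizing matters for penalized models such as glmnet because the penalty acts on coefficient size, which depends on the unit of each predictor.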
Display the results of the cross-validation run. Around 81.5% mean accuracy was achieved at the best alpha blend and a lambda of 0.013275.
caret.tune  # placeholder name for the object returned by train()
What is the standard deviation?
# caret.tune is a placeholder name for the object returned by train().
cat(paste("\nCross validation standard deviation:",
          sd(caret.tune$resample$Accuracy), "\n", sep = " "))
What are the model coefficients of the final, best model?
# caret.tune is a placeholder name for the object returned by train().
coefficients <- coef(caret.tune$finalModel, caret.tune$bestTune$lambda)
Transform the test data frame to dummy variables.
test.dummy <- predict(dummy.vars, test)
Make predictions.
# caret.tune is a placeholder name for the object returned by train().
preds <- predict(caret.tune, test.dummy, type = "raw")