How to do text analytics with glmnet and caret packages?

Note: Get your train and test data from Titanic - Machine Learning from Disaster | Kaggle.
Set your working directory properly.

Next up, read your data and store it in the train and test variables:

train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

Combine the data to make data cleaning easier:

survived <- train$Survived
data.combined <- rbind(train[, -2], test)
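
As a quick sanity check, the combined frame should contain the 891 training rows plus the 418 test rows:

nrow(data.combined)  # 1309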

Transform some variables to factors:

data.combined$Pclass <- as.factor(data.combined$Pclass)
data.combined$Sex <- as.factor(data.combined$Sex)

The test data is missing a Fare value. Locate the offending record so we can clean it up:

data.combined[is.na(data.combined$Fare),]

Summarize the Fare values of passengers similar to the missing record:

similar.fares <- train[train$Pclass == "3" & train$Sex == "male" &
                       train$SibSp == 0 & train$Parch == 0 &
                       !is.na(train$Age) & train$Age >= 50,]
summary(similar.fares$Fare)

Replace the missing fare with the median of those similar fares:

data.combined$Fare[is.na(data.combined$Fare)] <- 7.75
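
Alternatively, rather than hard-coding 7.75, you can compute the median from the similar.fares frame built above (an equivalent replacement for the line above):

data.combined$Fare[is.na(data.combined$Fare)] <-
  median(similar.fares$Fare, na.rm = TRUE)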

Create a feature for family size:

data.combined$FamilySize <- 1 + data.combined$SibSp + data.combined$Parch
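
Optionally, inspect the distribution of the new feature:

table(data.combined$FamilySize)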

Replace the missing values of Embarked with the most common value:

data.combined$Embarked[data.combined$Embarked == ""] <- "S"
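
A quick table confirms that no empty strings remain and that "S" dominates:

table(data.combined$Embarked)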

Make a factor:

data.combined$Embarked <- as.factor(data.combined$Embarked)

Use the mighty stringr package to pull out titles:

install.packages("stringr")
library(stringr)

Use a regular expression (regex) to extract each passenger's title from the Name column:

data.combined$Title <- str_extract(data.combined$Name, "[a-zA-Z]+\\.")
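
Take a look at the raw titles before collapsing them:

table(data.combined$Title)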

Implement logic to collapse titles, starting with adult males, who should have the title of “Mr.”. Note the !is.na() guard here and in the next two assignments: Age still has missing values at this point, and R does not allow NAs in subscripted assignments.

data.combined$Title[!is.na(data.combined$Age) &
                    data.combined$Sex == "male" &
                    data.combined$Age >= 16] <- "Mr."

Collapse male children to the title of “Master.”. Use a strict < 16 here so the 16-year-olds just labeled “Mr.” are not overwritten.

data.combined$Title[!is.na(data.combined$Age) &
                    data.combined$Sex == "male" &
                    data.combined$Age < 16] <- "Master."

Collapse female children to the title of “Girl.”.

data.combined$Title[!is.na(data.combined$Age) &
                    data.combined$Sex == "female" &
                    data.combined$Age < 16] <- "Girl."

Collapse equivalent titles and map them to “Miss.”.

table(data.combined$Title[data.combined$Sex == "female"])

data.combined$Title[data.combined$Title == "Ms." |
                    data.combined$Title == "Mlle."] <- "Miss."

Collapse the remaining female titles and map them to “Mrs.”. The Sex guard ensures that any male “Dr.” with a missing age is not mislabeled.

table(data.combined$Title[data.combined$Sex == "female"])

data.combined$Title[data.combined$Sex == "female" &
                    data.combined$Title %in% c("Countess.", "Dona.", "Lady.",
                                               "Mme.", "Dr.")] <- "Mrs."
table(data.combined$Title)

Make a factor:

data.combined$Title <- as.factor(data.combined$Title)

Before we impute missing ages, add a tracking feature:

data.combined$MissingAge <- as.factor(ifelse(is.na(data.combined$Age),
                                             "Y", "N"))

Leverage a mighty Random Forest to impute missing ages:

install.packages("randomForest")
library(randomForest)

Set up the training data, ignoring columns that aren't useful predictors:

# Columns 1, 3, 8, and 10 are PassengerId, Name, Ticket, and Cabin
ignore.features <- c(1, 3, 8, 10)
age.train <- data.combined[1:891,]
age.train <- age.train[!is.na(age.train$Age), -ignore.features]

Set seed for reproducibility:

set.seed(86482)
rf.age <- randomForest(Age ~ ., data = age.train, 
                       importance = TRUE)
rf.age

What’s the MAE of this model?

mean(abs(rf.age$predicted - age.train$Age))
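
For context, compare that against a naive baseline that always predicts the mean age (the random forest should beat this comfortably):

mean(abs(mean(age.train$Age) - age.train$Age))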

Impute the missing ages, overwriting the NAs in the combined data frame:

data.combined$Age[is.na(data.combined$Age)] <- 
  predict(rf.age, data.combined[is.na(data.combined$Age),])

Double-check our work.

summary(data.combined$Age)

Split data back out.

train <- data.combined[1:891,]
test <- data.combined[892:1309,]

Use the mighty caret package to perform cross validation of our features, with the mighty glmnet package doing the modeling. Install both packages and load caret first:

install.packages("caret")
install.packages("glmnet")
library(caret)

Set the seed to ensure reproducibility between runs:

set.seed(12345)

Set up caret to perform 10-fold cross validation repeated 3 times:

caret.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)

Leverage caret to create dummy variables (i.e., one-hot encoding), as glmnet will not work with factor variables:

dummy.vars <- dummyVars(~ ., data = train[, -ignore.features])
train.dummy <- predict(dummy.vars, train)
View(train.dummy)
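
Optionally, check how many columns the one-hot encoding produced:

dim(train.dummy)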

Use caret to train the mighty glmnet package as a binary classification (i.e., logistic regression) model. The code below illustrates the following awesomeness:

1 - Instead of using the formula interface, explicitly set the predictors and the class label.
2 - Pre-process our numeric data by centering and scaling (i.e., standardizing) the numeric columns.

glmnet.cv <- train(x = train.dummy,
                   y = as.factor(survived),
                   method = "glmnet",
                   preProcess = c("center", "scale"),
                   trControl = caret.control,
                   tuneLength = 9)

Display the results of the cross validation run. Around 81.5% mean accuracy is achieved with the best-tuned alpha blend and a lambda of 0.013275:

glmnet.cv
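
To pull out just the winning tuning parameters:

glmnet.cv$bestTune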

What is the standard deviation of the cross validation accuracy?

cat(paste("\nCross validation standard deviation:",  
          sd(glmnet.cv$resample$Accuracy), "\n", sep = " "))

What are the model coefficients of the final, best model?

coefficients <- coef(glmnet.cv$finalModel, glmnet.cv$bestTune$lambda)
coefficients

Transform the test data frame to dummy variables.

test.dummy <- predict(dummy.vars, test)

Make predictions.

preds <- predict(glmnet.cv, test.dummy, type = "raw")
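
Finally, if you want to submit to Kaggle, here is a minimal sketch of building the standard submission file (assuming the usual PassengerId/Survived column format):

# Kaggle expects exactly two columns: PassengerId and Survived
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived = preds)
write.csv(submission, "submission.csv", row.names = FALSE)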