How to do text analytics with the glmnet and caret packages?

Note: Get the train and test data from the Kaggle competition "Titanic: Machine Learning from Disaster".
Set your working directory to the folder that holds the downloaded files.

Next, read the data and store it in the train and test variables:

train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

Combine the data to make data cleaning easier:

survived <- train$Survived
data.combined <- rbind(train[, -2], test)
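To make the combine step concrete, here is a minimal sketch on toy data. The data frames and their values are made up for illustration and are not real Titanic rows. It relies on Survived being the second column of the Kaggle training file, which is why train[, -2] drops the label.

```r
# Toy stand-ins for the Kaggle files (made-up values)
toy.train <- data.frame(PassengerId = 1:3,
                        Survived = c(0, 1, 1),
                        Pclass = c(3, 1, 3),
                        Fare = c(7.25, 71.28, 7.92))
toy.test <- data.frame(PassengerId = 4:5,
                       Pclass = c(3, 2),
                       Fare = c(NA, 13.00))

# Keep the label aside, then stack the predictors of both sets
toy.survived <- toy.train$Survived
toy.combined <- rbind(toy.train[, -2], toy.test)

# The first nrow(toy.train) rows are training rows, the rest are test rows
n.train <- nrow(toy.train)
toy.train.clean <- toy.combined[1:n.train, ]
toy.test.clean <- toy.combined[(n.train + 1):nrow(toy.combined), ]
```

Because rbind keeps row order, the same index ranges can be used later to split the cleaned data back into train and test.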

Transform some variables to factors:

data.combined$Pclass <- as.factor(data.combined$Pclass)
data.combined$Sex <- as.factor(data.combined$Sex)

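The test data is missing one fare. Find the record:

data.combined[is.na(data.combined$Fare),]

Summarize the Fare values of training records similar to the missing one:

similar.fares <- train[train$Pclass == "3" & train$Sex == "male" &
                       train$SibSp == 0 & train$Parch == 0 &
                       !is.na(train$Age) & train$Age >= 50,]
summary(similar.fares$Fare)

Replace the missing fare with the median of those records:

data.combined$Fare[is.na(data.combined$Fare)] <- 7.75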
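Instead of typing the median by hand, it can be computed and assigned in one go. The sketch below uses a made-up toy data frame (the column values are assumptions, not Titanic records) to show the pattern.

```r
# Toy data: one record has a missing fare (values are made up)
toy <- data.frame(Pclass = c(3, 3, 3, 1, 3),
                  Sex = c("male", "male", "male", "female", "male"),
                  Fare = c(7.75, 8.05, 7.25, 80.00, NA))

# Records that resemble the incomplete one and have a known fare
similar <- toy[toy$Pclass == 3 & toy$Sex == "male" & !is.na(toy$Fare), ]

# Median fare of the similar records
fare.median <- median(similar$Fare)

# Overwrite only the missing entries
toy$Fare[is.na(toy$Fare)] <- fare.median
```

Computing the value keeps the imputation in sync with the data if the filter for "similar" records ever changes.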

Create a feature for family size:

data.combined$FamilySize <- 1 + data.combined$SibSp + data.combined$Parch

Replace the missing values of Embarked with the most common value:

data.combined$Embarked[data.combined$Embarked == ""] <- "S"

Make a factor.

data.combined$Embarked <- as.factor(data.combined$Embarked)

Use the mighty stringr package to pull out titles:

library(stringr)

Use a regular expression (regex) to extract the title from each name:

data.combined$Title <- str_extract(data.combined$Name, "[a-zA-Z]+\\.")
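To see what the pattern picks up, here is a small sketch that uses base R's regexpr() and regmatches() in place of stringr, so it runs without extra packages. The three strings follow the format of the Name column.

```r
# Names in the "Surname, Title. Given names" format of the Name column
toy.names <- c("Braund, Mr. Owen Harris",
               "Heikkinen, Miss. Laina",
               "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# regexpr() locates the first run of letters followed by a period;
# regmatches() extracts the matched text
toy.titles <- regmatches(toy.names, regexpr("[a-zA-Z]+\\.", toy.names))
print(toy.titles)
```

One difference to keep in mind: regmatches() silently drops strings without a match, while str_extract() returns NA for them, so str_extract() is the safer choice when the result must line up row by row with the data frame.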

Implement logic to collapse titles, starting with the rule that adult males get the title “Mr.”. Some ages are still missing at this point; those rows compare as NA, are skipped by the assignment, and keep their extracted title.

data.combined$Title[data.combined$Sex == "male" &
                    data.combined$Age >= 16] <- "Mr."

Collapse male children to the title “Master.”:

data.combined$Title[data.combined$Sex == "male" &
                    data.combined$Age < 16] <- "Master."

Collapse female children to the title “Girl.”:

data.combined$Title[data.combined$Sex == "female" &
                    data.combined$Age < 16] <- "Girl."

Collapse titles and map to “Miss.”.

table(data.combined$Title[data.combined$Sex == "female"])

data.combined$Title[data.combined$Title == "Ms." |
                    data.combined$Title == "Mlle."] <- "Miss."

Collapse titles and map to “Mrs.”.

table(data.combined$Title[data.combined$Sex == "female"])

# Restrict to females so a male "Dr." with a missing age is not relabeled "Mrs."
data.combined$Title[data.combined$Sex == "female" &
                    data.combined$Title %in% c("Countess.", "Dona.", "Lady.",
                                               "Mme.", "Dr.")] <- "Mrs."

Make a factor:

data.combined$Title <- as.factor(data.combined$Title)

Before we impute missing ages, add a tracking feature:

data.combined$MissingAge <- as.factor(ifelse(is.na(data.combined$Age),
                                             "Y", "N"))

Leverage a mighty Random Forest to impute missing ages:

library(randomForest)

Set up the training data:

ignore.features <- c(1, 3, 8, 10)  # PassengerId, Name, Ticket, Cabin
age.train <- data.combined[1:891,]
age.train <- age.train[!is.na(age.train$Age), -ignore.features]

Set the seed for reproducibility, then fit the model:

set.seed(12345)  # any fixed integer works
rf.age <- randomForest(Age ~ ., data = age.train,
                       importance = TRUE)

What’s the MAE of this model?

mean(abs(rf.age$predicted - age.train$Age))
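MAE (mean absolute error) is the average size of the prediction errors, in the same unit as the target (years, here). For a randomForest regression the predicted component holds out-of-bag predictions, so the figure is less optimistic than an in-sample error would be. A tiny worked example with made-up numbers:

```r
# Made-up actual and predicted ages
actual <- c(22, 38, 26, 35)
predicted <- c(25, 34, 26, 40)

# Absolute error per passenger: 3, 4, 0, 5
abs.errors <- abs(predicted - actual)

# Mean absolute error: (3 + 4 + 0 + 5) / 4 = 3
mae <- mean(abs.errors)
print(mae)
```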

Impute the missing ages and overwrite the NAs in the combined data frame:

data.combined$Age[is.na(data.combined$Age)] <-
  predict(rf.age, data.combined[is.na(data.combined$Age),])
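The pattern here is: fit a model on rows where Age is known, then predict Age for the rows where it is missing. The sketch below shows that pattern on made-up toy data. Note that it swaps in a plain linear model (lm) for the random forest, purely to stay dependency-free; it is not the model used in this walkthrough.

```r
# Toy data with two missing ages (values are made up)
toy <- data.frame(Age = c(22, 38, NA, 35, NA, 54),
                  Fare = c(7.25, 71.28, 7.92, 53.10, 8.46, 51.86),
                  SibSp = c(1, 1, 0, 1, 0, 0))

# Logical mask of the rows to impute
missing.age <- is.na(toy$Age)

# Fit on complete rows only (lm stands in for randomForest here)
age.fit <- lm(Age ~ Fare + SibSp, data = toy[!missing.age, ])

# Predict for the incomplete rows and overwrite the NAs
toy$Age[missing.age] <- predict(age.fit, newdata = toy[missing.age, ])
```

Storing the mask before the assignment avoids recomputing is.na() and makes it easy to inspect the imputed rows afterwards.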

Double-check our work:

summary(data.combined$Age)

Split data back out.

train <- data.combined[1:891,]
test <- data.combined[892:1309,]

Use the mighty caret package to perform cross-validation with our features using the mighty glmnet package:

library(caret)
library(glmnet)

Set the seed to ensure reproducibility between runs:

set.seed(12345)  # any fixed integer works

Set up caret to perform 10-fold cross validation repeated 3 times:

caret.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)
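For intuition, 10-fold cross-validation repeated 3 times means 30 train/validate rounds per candidate setting. The base R sketch below builds such fold assignments by hand. It is a simplified illustration, not caret's internal code: caret additionally balances the folds on the outcome, which this sketch does not attempt.

```r
set.seed(1)  # arbitrary seed, only so the toy split is repeatable

n.rows <- 891     # size of the Kaggle training set
n.folds <- 10
n.repeats <- 3

# For each repeat, shuffle the fold labels 1..10 across all rows
fold.ids <- replicate(n.repeats,
                      sample(rep(1:n.folds, length.out = n.rows)))

dim(fold.ids)         # one column of fold labels per repeat
n.folds * n.repeats   # total number of train/validate rounds
table(fold.ids[, 1])  # fold sizes in the first repeat
```

In each round, the rows of one fold are held out for validation and the model is trained on the remaining nine folds.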

Leverage caret to create dummy variables (i.e., one-hot encoding), as glmnet needs a numeric matrix and will not work with factor variables:

dummy.vars <- dummyVars(~ ., data = train[, -ignore.features])
train.dummy <- predict(dummy.vars, train)
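To show what one-hot encoding produces, here is a minimal sketch on made-up data. It uses base R's model.matrix() rather than caret's dummyVars(), so it runs without extra packages; the idea is the same.

```r
# Toy data: one factor and one numeric column (values are made up)
toy <- data.frame(Embarked = factor(c("S", "C", "Q", "S")),
                  Fare = c(7.25, 71.28, 7.75, 8.05))

# "- 1" drops the intercept so the factor keeps a column for every level
toy.dummy <- model.matrix(~ Embarked + Fare - 1, data = toy)

colnames(toy.dummy)
toy.dummy
```

Each row has exactly one 1 among the Embarked columns. One caveat: with several factors, model.matrix() fully expands only the first one, whereas dummyVars() by default creates a column for every level of every factor.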

Use caret to train a glmnet model as a binary (i.e., logistic) regression model. The code below illustrates the following awesomeness:

1 - Instead of using the formula interface, explicitly set the predictors and the class label.
2 - Preprocess our numeric data by centering and scaling (i.e., standardizing) the numeric columns.

# glmnet.model is a name chosen for this walkthrough
glmnet.model <- train(x = train.dummy,
                      y = as.factor(survived),
                      method = "glmnet",
                      preProcess = c("center", "scale"),
                      trControl = caret.control,
                      tuneLength = 9)
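Centering and scaling means subtracting each column's mean and dividing by its standard deviation, so all predictors end up on a comparable scale before the penalty is applied. A small base R sketch with scale() on made-up numbers shows the effect; caret does the equivalent internally when preProcess is set.

```r
# Toy numeric predictors on very different scales (values are made up)
toy <- cbind(Age = c(22, 38, 26, 35),
             Fare = c(7.25, 71.28, 7.92, 53.10))

# Subtract the column means, then divide by the column standard deviations
toy.scaled <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(toy.scaled), 10)  # column means are now zero
apply(toy.scaled, 2, sd)         # column standard deviations are now one
```

The means and standard deviations should come from the training data only and then be reused on new data; passing preProcess to train() takes care of that when predicting with the fitted caret model.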

Display the results of the cross-validation run (glmnet.model is the caret model fitted above). Around 81.5% mean accuracy is achieved at a lambda of 0.013275; the alpha blend chosen alongside it is listed in bestTune.

glmnet.model
glmnet.model$bestTune

What is the standard deviation?

cat(paste("\nCross validation standard deviation:",
          sd(glmnet.model$resample$Accuracy), "\n", sep = " "))

What are the model coefficients of the final, best model?

coefficients <- coef(glmnet.model$finalModel, glmnet.model$bestTune$lambda)
coefficients

Transform the test data frame to dummy variables:

test.dummy <- predict(dummy.vars, test)

Make predictions:

preds <- predict(glmnet.model, test.dummy, type = "raw")