How to calculate variable importance for the Titanic dataset?

datasciencedojo · October 18, 2017, 12:08am

The following code computes and plots variable importance for the Titanic dataset using a random forest.

Installing and loading packages:

1. Install the required packages if you have not already installed them.
2. Load the caret and randomForest libraries.

install.packages(c("caret","randomForest"))   #(1)
library(caret)     #(2)
library(randomForest)  #(2)

Reading the file and storing it as a data frame:

titanic.raw <- read.csv("titanic.csv", stringsAsFactors = FALSE)

Selecting Relevant Features:

3. Create the list of relevant features.
4. Select only those features from the dataset.

features <- c(
    "Survived",
    "Pclass",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked"
)                       #(3)

titanic <- titanic.raw[,features]              #(4)

Converting some features into factor columns:

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)

One-hot encoding the factor columns with the dummyVars function:

dummy_vars <- dummyVars(~., data = titanic)
titanic <- data.frame(predict(dummy_vars, newdata = titanic))
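
To see what this encoding step produces, here is a minimal sketch on a made-up three-row data frame (the `toy` data and its values are assumptions for illustration, not rows from titanic.csv). Each factor level becomes its own 0/1 column, while numeric columns pass through unchanged.

```r
library(caret)

# Hypothetical toy data: one factor column and one numeric column.
toy <- data.frame(
  Sex  = factor(c("male", "female", "female")),
  Fare = c(7.25, 71.28, 8.05)
)

# Build the encoder from the data, then apply it to the same data.
dv <- dummyVars(~ ., data = toy)
encoded <- data.frame(predict(dv, newdata = toy))

# Inspect the resulting columns: one per factor level, plus the numeric column.
print(colnames(encoded))
print(encoded)
```

Because the forest is later fitted on the encoded columns, the importance plot reports each level (for example Sex.female and Sex.male) as a separate variable rather than one score for Sex.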

Converting the Survived column into a factor, so that randomForest fits a classification model rather than a regression:

titanic$Survived <- as.factor(titanic$Survived)

Filling in missing Age values with the median:

titanic[is.na(titanic$Age),"Age"] <- median(titanic$Age, na.rm = TRUE)
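
The same median imputation can be checked on a small made-up vector. This is only a sketch: the `toy` data frame and its ages are assumptions for illustration. Counting NAs per column first is a cheap way to confirm which columns need this treatment.

```r
# Hypothetical toy data with two missing ages.
toy <- data.frame(Age = c(22, NA, 38, 26, NA))

# Count missing values per column before imputing.
n_missing <- colSums(is.na(toy))
print(n_missing)

# Median of the observed ages only (NAs are ignored).
age_median <- median(toy$Age, na.rm = TRUE)

# Replace every NA in Age with that median.
toy$Age[is.na(toy$Age)] <- age_median
print(toy$Age)
```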

Fitting the random forest:

titanic.forest <- randomForest(Survived~., data=titanic, importance=TRUE)

Displaying the importance of each variable:

varImpPlot(titanic.forest)
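
If you want the numbers behind the plot, importance() returns them as a matrix. The sketch below uses synthetic data (the `toy` data frame, with a `signal` column that determines the outcome and an unrelated `noise` column, is an assumption for illustration) so it runs without titanic.csv.

```r
library(randomForest)

set.seed(42)  # make the synthetic data and the forest reproducible

# Synthetic data: `signal` drives the outcome, `noise` is unrelated to it.
n <- 200
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- as.factor(ifelse(toy$signal > 0, "yes", "no"))

# importance = TRUE also computes the permutation-based accuracy measure.
toy.forest <- randomForest(y ~ ., data = toy, importance = TRUE)

# For a classification forest the matrix includes the columns
# MeanDecreaseAccuracy and MeanDecreaseGini, the two panels of varImpPlot.
imp <- importance(toy.forest)

# Sort variables from most to least important by Gini decrease.
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
print(imp_sorted)
```

On the Titanic model, the same call would be importance(titanic.forest). The two measures can rank variables differently, so it is worth looking at both.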


toobamukhtar · April 15, 2019, 6:24pm

toobamukhtar · April 16, 2019, 12:36pm

Feature selection methods can help identify which variables have the most impact on the dependent variable. Several libraries compute variable (feature) importance, including the randomForest and caret packages used above. You can also use hypothesis testing, for example a chi-squared test for a categorical feature, to check whether a particular feature X is associated with the dependent variable Y.
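
As a rough sketch of the hypothesis-testing route, the code below runs a chi-squared test of independence between sex and survival. Note the assumption: it uses R's built-in `Titanic` contingency table from the datasets package, which is a different data source from the titanic.csv file in the first post, so the counts will not match that file.

```r
# Built-in 4-way contingency table: Class x Sex x Age x Survived.
# Collapse it to a 2 x 2 table of Sex by Survived.
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))
print(sex_by_survival)

# Null hypothesis: survival is independent of sex.
test <- chisq.test(sex_by_survival)
print(test$p.value)

# A small p-value is evidence against independence.
is_associated <- test$p.value < 0.05
print(is_associated)
```

A significant result only says the feature and the outcome are associated. It does not rank features against each other the way the random forest importance scores do.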