How to calculate variable importance for the Titanic dataset?

The following code snippets compute the importance of each variable in the dataset using a random forest.

Installing and loading packages:

1. Install the required packages if you have not installed them already.
2. Load the caret and randomForest libraries.

install.packages(c("caret","randomForest"))   #(1)
library(caret)     #(2)
library(randomForest)  #(2)

Read the file and store it as a data frame:

titanic.raw <- read.csv("titanic.csv", stringsAsFactors = FALSE)

Selecting Relevant Features:

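3. Create the list of relevant features.
4. Select only the relevant features from the dataset.

# At minimum, the columns used in the steps below
features <- c("Survived", "Pclass", "Sex", "Age", "Embarked")   #(3)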

titanic <- titanic.raw[,features]              #(4)

Converting some features into factor columns:

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)

One-hot encoding the factor columns with the dummyVars function:

dummy_vars <- dummyVars(~., data = titanic)
titanic <- data.frame(predict(dummy_vars, newdata = titanic))
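
To see what this step produces, here is a minimal sketch on a made-up three-row data frame. The `toy` name and its values are illustrative assumptions, not part of the Titanic file. With its default settings, dummyVars expands each factor into one 0/1 column per level and passes numeric columns through unchanged.

```r
library(caret)

# Hypothetical toy data, used only to illustrate the encoding
toy <- data.frame(
  Sex = factor(c("male", "female", "female")),
  Age = c(22, 38, 26)
)

# Build the encoder from all columns, then apply it to the same data
toy_dummies <- dummyVars(~ ., data = toy)
encoded <- data.frame(predict(toy_dummies, newdata = toy))

# Sex is replaced by the indicator columns Sex.female and Sex.male; Age is kept
print(encoded)
```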

Converting the Survived value into a factor column:

titanic$Survived <- as.factor(titanic$Survived)

Cleaning missing values (missing ages are replaced with the median age):

titanic[is.na(titanic$Age), "Age"] <- median(titanic$Age, na.rm = TRUE)


Fitting the random forest model:

titanic.forest <- randomForest(Survived ~ ., data = titanic, importance = TRUE)

Displaying the importance of each variable:
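
One way to display it is with the randomForest package's importance() and varImpPlot() functions. The sketch below is self-contained, so it fits a forest on a small made-up data frame. `toy`, `toy.forest`, and all values are hypothetical stand-ins for the prepared `titanic` data and `titanic.forest`.

```r
library(randomForest)

# Hypothetical toy data standing in for the prepared Titanic data frame
set.seed(42)
n <- 200
toy <- data.frame(
  Sex.male = rbinom(n, 1, 0.6),
  Age      = round(runif(n, 1, 70)),
  Pclass.3 = rbinom(n, 1, 0.5)
)
# Made-up outcome: survival probability is 0.2 for men and 0.8 for women
toy$Survived <- as.factor(rbinom(n, 1, ifelse(toy$Sex.male == 1, 0.2, 0.8)))

# importance = TRUE also computes the permutation-based measure
toy.forest <- randomForest(Survived ~ ., data = toy, importance = TRUE)

# Matrix with one row per predictor, including the columns
# MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(toy.forest)
print(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ])

# Dot chart of the same measures
varImpPlot(toy.forest)
```

With the model fitted earlier, the equivalent calls would be importance(titanic.forest) and varImpPlot(titanic.forest).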



Feature importance methods such as the random forest above help identify which variables have the most impact on the dependent variable. Several libraries can calculate variable (feature) importance, including caret and randomForest. You can also use hypothesis testing to check whether a particular feature X has a statistically significant association with the dependent variable Y.
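
As a minimal sketch of the hypothesis-testing approach, a chi-squared test of independence can check whether a categorical feature is associated with a categorical outcome. The data below are made up for illustration, and the names `sex`, `survived`, and `counts` are hypothetical. The same call should work on a table built from the real columns.

```r
# Hypothetical toy data: sex and survival for 200 made-up passengers
set.seed(1)
n <- 200
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
# Made-up survival probabilities: 0.75 for women, 0.20 for men
survived <- factor(rbinom(n, 1, ifelse(sex == "female", 0.75, 0.20)))

# Contingency table of the feature X (sex) against the outcome Y (survived)
counts <- table(sex, survived)

# Null hypothesis: sex and survival are independent
result <- chisq.test(counts)

# A small p-value suggests the feature is associated with the outcome
print(result$p.value)
```

For a numeric feature such as Age, a two-sample test such as t.test() plays a similar role.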