The following code snippet shows how to compute and plot variable importance for the Titanic dataset with a random forest.
Installing and loading packages:
1. Install the required packages if they are not already installed.
2. Load the caret and randomForest libraries.
install.packages(c("caret","randomForest")) #(1)
library(caret) #(2)
library(randomForest) #(2)
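Step 1 can be made conditional so that nothing is reinstalled on every run. A minimal sketch, assuming a helper named `missing_packages` (a hypothetical name, not part of caret or randomForest):

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("caret", "randomForest"))
if (length(to_install) > 0) {
  install.packages(to_install)  # only runs when something is missing
}
```

`requireNamespace` returns FALSE instead of raising an error when a package is absent, which is why it is used for the check.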
Reading the file and storing it as a data frame:
titanic.raw <- read.csv("titanic.csv", stringsAsFactors = FALSE)
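To see what `stringsAsFactors = FALSE` changes, the sketch below reads a small inline CSV instead of `titanic.csv`. The three rows are made up purely for illustration.

```r
# Made-up sample rows, not taken from the real dataset.
csv_text <- "Name,Sex,Age
Passenger A,male,35
Passenger B,female,58
Passenger C,male,NA"

as_text    <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$Sex)     # "character": strings are kept as-is
class(as_factors$Sex)  # "factor": strings are converted on read
```

Keeping strings as characters at read time leaves the choice of which columns become factors to an explicit later step.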
Selecting Relevant Features:
3. Create the vector of relevant feature names.
4. Select only those columns from the dataset.
features <- c(
"Survived",
"Pclass",
"Sex",
"Age",
"SibSp",
"Parch",
"Fare",
"Embarked"
) #(3)
titanic <- titanic.raw[,features] #(4)
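Selecting columns by name stops with an "undefined columns selected" error if any name is absent from the data frame. A small sketch of checking first, using a made-up `toy` data frame and a `wanted` vector that are assumptions for illustration only:

```r
# Toy data: "Fare" is deliberately missing, "Ticket" is deliberately extra.
toy <- data.frame(Survived = c(0, 1), Age = c(22, 38), Ticket = c("A1", "B2"))
wanted <- c("Survived", "Age", "Fare")

absent <- setdiff(wanted, names(toy))  # requested but not present
absent                                 # "Fare"

# Keep only the requested columns that actually exist.
subset_toy <- toy[, intersect(wanted, names(toy)), drop = FALSE]
names(subset_toy)                      # "Survived" "Age"
```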
Converting the categorical features into factor columns:
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
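A column such as Pclass is stored as numbers but names a category, not a quantity. The sketch below, on a made-up vector, shows what `as.factor` does to such values:

```r
pclass <- c(3, 1, 3, 2, 1)          # made-up values for illustration
pclass_factor <- as.factor(pclass)

levels(pclass_factor)               # "1" "2" "3"
is.numeric(pclass_factor)           # FALSE: no longer treated as a quantity
table(pclass_factor)                # count of rows per class
```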
One-hot encoding the factor columns with caret's dummyVars function:
dummy_vars <- dummyVars(~., data = titanic)
titanic <- data.frame(predict(dummy_vars, newdata = titanic))
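The effect of `dummyVars` is easier to see on a tiny example. A minimal sketch, assuming caret is installed; the `toy` data frame is made up:

```r
library(caret)

toy <- data.frame(
  Age = c(22, 38, 26),
  Sex = factor(c("male", "female", "female"))
)

encoder <- dummyVars(~ ., data = toy)
encoded <- data.frame(predict(encoder, newdata = toy))

names(encoded)      # "Age" "Sex.female" "Sex.male"
encoded$Sex.female  # 0 1 1
```

Numeric columns pass through unchanged, while each factor level becomes its own 0/1 column. With the default `fullRank = FALSE`, every level gets a column, so none is dropped as a reference level.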
Converting Survived into a factor column so that randomForest performs classification rather than regression:
titanic$Survived <- as.factor(titanic$Survived)
Filling missing Age values with the median age:
titanic[is.na(titanic$Age),"Age"] <- median(titanic$Age, na.rm = TRUE)
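By default `randomForest` stops when the data contain missing values, so the gaps have to be filled or the rows dropped first. The median imputation used here can be traced on a made-up vector:

```r
ages <- c(22, NA, 26, 35, NA)        # made-up values for illustration

sum(is.na(ages))                     # 2 missing values
fill <- median(ages, na.rm = TRUE)   # median of 22, 26, 35 is 26
ages[is.na(ages)] <- fill

ages                                 # 22 26 26 35 26
```

Running `colSums(is.na(titanic))` on the real data frame is one way to check whether any other column still contains missing values.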
Fitting the random forest model:
titanic.forest <- randomForest(Survived~., data=titanic, importance=TRUE)
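A random forest is built from random samples, so results vary between runs unless a seed is fixed. The sketch below uses R's built-in `iris` data instead of the Titanic file so that it runs on its own; the seed value 42 is arbitrary.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, chosen only for this sketch
fit_a <- randomForest(Species ~ ., data = iris, importance = TRUE)

set.seed(42)  # same seed again
fit_b <- randomForest(Species ~ ., data = iris, importance = TRUE)

fit_a$type                                   # "classification" (factor response)
identical(fit_a$predicted, fit_b$predicted)  # TRUE: same seed, same forest
```

The `type` field is a quick way to confirm that the response was a factor and the model is a classifier.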
Displaying the importance of each variable:
varImpPlot(titanic.forest)
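The numbers behind the plot are available from `importance()`. A minimal sketch, again on the built-in `iris` data so that it is self-contained (the seed is arbitrary):

```r
library(randomForest)

set.seed(1)  # arbitrary seed for this sketch
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(fit)  # matrix with one row per predictor
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

# Show the two summary measures that varImpPlot draws.
round(ranked[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")], 2)
```

For the model above, `importance(titanic.forest)` would give the corresponding table, with one row per encoded column.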