How to do Text Analytics on Titanic Dataset with R?

datasciencedojo · October 18, 2017, 4:59pm

Installing and loading quanteda library

1. Install the quanteda package if you have not installed it already.
2. Load the quanteda package.
3. Read the train data with stringsAsFactors set to FALSE. Before R 4.0, read.csv converted strings to factors by default.
4. Read the test data, also with stringsAsFactors set to FALSE.

install.packages("quanteda")   #(1)
library(quanteda)   #(2)
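train <- read.csv("train.csv", stringsAsFactors = FALSE)  #(3) path assumed: Kaggle's train.csv in the working directory
test <- read.csv("test.csv", stringsAsFactors = FALSE)   #(4) path assumed: Kaggle's test.csv in the working directory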

Combining train and test data for cleaning

5. First, store the Survived column in a variable named survived.
6. Combine the train and test data for cleaning, after excluding the Survived column (column 2) from the train data.

survived <- train$Survived    #(5)
data.combined <- rbind(train[, -2], test)   #(6)

Transforming some variables to factors

7. Convert the Pclass variable into a factor.
8. Convert the Sex variable into a factor.

data.combined$Pclass <- as.factor(data.combined$Pclass)  #(7)
data.combined$Sex <- as.factor(data.combined$Sex)   #(8)

Using the quanteda package

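9. Use the quanteda package to create n-grams (of length 1 to 3) from the characters of the passenger tickets. This call uses the quanteda API that was current when this post was written; later releases moved tokenizing and n-gram creation to tokens() and tokens_ngrams().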

ticket.ngrams <- dfm(data.combined$Ticket, what = "character",
                     remove_numbers = FALSE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_hyphens = TRUE,
                     ngrams = 1:3)   #(9)
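
To make the idea of character n-grams concrete, here is a minimal base-R sketch that does not depend on quanteda. The helper char_ngrams() is hypothetical (it is not part of quanteda), the "_" separator is an illustrative choice, and the output is not guaranteed to match quanteda's feature names exactly.

```r
# Hypothetical helper (not a quanteda function): character n-grams of one string.
char_ngrams <- function(x, n = 1:3) {
  # Keep letters and digits only, loosely mirroring the punctuation/symbol removal.
  chars <- strsplit(gsub("[^[:alnum:]]", "", x), "")[[1]]
  out <- character(0)
  for (k in n) {
    # Only build k-grams when the string is long enough.
    if (length(chars) >= k) {
      starts <- seq_len(length(chars) - k + 1)
      grams <- vapply(starts,
                      function(i) paste(chars[i:(i + k - 1)], collapse = "_"),
                      character(1))
      out <- c(out, grams)
    }
  }
  out
}

# Unigrams and bigrams of a made-up ticket string.
example.ngrams <- char_ngrams("A/5 21", n = 1:2)
print(example.ngrams)
```

Each ticket becomes a bag of short character sequences, so tickets that share a prefix or a run of digits end up sharing features.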

Using TF-IDF to enhance n-gram features

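10. Apply TF-IDF weighting to the n-gram counts, so that n-grams shared by most tickets count for less than rare ones. (Newer quanteda releases name this function dfm_tfidf().)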

ticket.tfidf <- tfidf(ticket.ngrams)  #(10)
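
As a rough illustration of what the weighting does, the sketch below computes TF-IDF by hand on a toy count matrix in base R. The weighting scheme (raw count times a base-10 log of the inverse document frequency) is an assumption made for illustration, and the counts are made up.

```r
# Toy document-feature counts: 3 tickets (rows) x 3 n-grams (columns). Made-up values.
counts <- matrix(c(1, 0, 2,
                   1, 1, 0,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("t1", "t2", "t3"), c("a", "p", "5")))

# Document frequency: in how many tickets each n-gram occurs.
doc.freq <- colSums(counts > 0)

# Inverse document frequency with a base-10 log (assumed weighting scheme).
idf <- log10(nrow(counts) / doc.freq)

# Multiply every row of counts by the per-column IDF weights.
toy.tfidf <- sweep(counts, 2, idf, "*")
print(toy.tfidf)
```

The n-gram "a" appears in every ticket, so its weight drops to zero, while the rarer n-grams keep a positive weight.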

Using the cosine similarity between tickets

11. Use textstat_simil() to calculate the cosine similarity between all tickets and store the result in cosine.sim.
12. Use the View() function to inspect cosine.sim.

cosine.sim <- as.matrix(textstat_simil(ticket.tfidf,
                                       method = "cosine",
                                       margin = "documents"))   #(11)
View(cosine.sim)  #(12)
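
For readers who want to see what the similarity matrix contains, here is a small base-R sketch of pairwise cosine similarity between the rows of a matrix. The helper cosine_rows() is hypothetical and the toy values are made up; a row of all zeros would yield NaN, which this sketch does not handle.

```r
# Hypothetical helper: pairwise cosine similarity between the rows of a numeric matrix.
cosine_rows <- function(m) {
  norms <- sqrt(rowSums(m^2))           # Euclidean length of each row
  (m %*% t(m)) / (norms %o% norms)      # dot products divided by products of lengths
}

# Three made-up "tickets" described by three features.
toy <- rbind(t1 = c(1, 2, 0),
             t2 = c(2, 4, 0),
             t3 = c(0, 0, 3))

toy.sim <- cosine_rows(toy)
print(round(toy.sim, 3))
```

Rows t1 and t2 point in the same direction, so their similarity is 1, while t3 shares no features with them and scores 0.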

Converting TF-IDF values into a data frame

13. To use the TF-IDF values as features, convert them to a data frame so they can later be joined to the rest of the data with cbind().
14. You can also check the dimensions of the new data frame.

ticket.tfidf.df <- as.data.frame(ticket.tfidf)  #(13)
dim(ticket.tfidf.df)  #(14)

Fixing the column format

14. However, the column names are not in a good format; print the first ten to see.
15. Fix up the column names on the data frame by prefixing them with "Ticket_".
16. Now you can see the column names are in a good format.

names(ticket.tfidf.df)[1:10]    #(14)
names(ticket.tfidf.df) <- paste("Ticket_",
                                names(ticket.tfidf.df),
                                sep = "")     #(15)
names(ticket.tfidf.df)[1:10]     #(16)

Combining the columns

17. Combine the columns of the two data frames.
18. Check the dimensions of data.combined.
19. Get the first 20 column names of data.combined.

data.combined <- cbind(data.combined, ticket.tfidf.df)     #(17)
dim(data.combined)   #(18)
names(data.combined)[1:20]      #(19)

Transform some variables to factors

20. Convert Pclass in data.combined into a factor.
21. Convert the Sex variable in data.combined into a factor.

data.combined$Pclass <- as.factor(data.combined$Pclass)  # (20)
data.combined$Sex <- as.factor(data.combined$Sex)      # (21)

Splitting data back out

22. Select the first 891 rows of data.combined and store them in the train variable.
23. Re-create the Survived column in train from the saved survived values, as a factor.
24. Select the remaining rows of data.combined and store them in the test variable.

train <- data.combined[1:891,]    #(22)
train$Survived <- as.factor(survived)   #(23)
test <- data.combined[892:1309,]  #(24)
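
The row indices above are specific to the Titanic files (891 training rows, 1309 rows in total). As a sketch of the same combine-then-split pattern without hard-coded numbers, the example below stores the training row count before combining. All toy.* names, n.train, and the values are made up for illustration.

```r
# Toy stand-ins for the two files (made-up values, not real Titanic rows).
toy.train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1),
                        Ticket = c("A1", "B2", "C3"),
                        stringsAsFactors = FALSE)
toy.test <- data.frame(PassengerId = 4:5, Ticket = c("D4", "E5"),
                       stringsAsFactors = FALSE)

# Remember the label and the number of training rows before combining.
toy.survived <- toy.train$Survived
n.train <- nrow(toy.train)

# Drop the label by name (rather than by position), then stack the two sets.
toy.combined <- rbind(toy.train[, names(toy.train) != "Survived"], toy.test)

# Split back out using the stored row count.
toy.train.out <- toy.combined[seq_len(n.train), ]
toy.train.out$Survived <- as.factor(toy.survived)
toy.test.out <- toy.combined[-seq_len(n.train), ]

print(dim(toy.train.out))
print(dim(toy.test.out))
```

Storing the count up front keeps the split correct if the input files change size.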