1. Install the quanteda package if you have not already installed it.
2. Load the quanteda package.
3. Read the training data with stringsAsFactors set to FALSE. In R versions before 4.0.0, read.csv() converts strings to factors by default.
4. Read the test data, also with stringsAsFactors set to FALSE.
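install.packages("quanteda") #(1)
library(quanteda) #(2)
# Assumption: the data files are saved as train.csv and test.csv in the working directory
train <- read.csv("train.csv", stringsAsFactors = FALSE) #(3)
test <- read.csv("test.csv", stringsAsFactors = FALSE) #(4)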
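To see what the stringsAsFactors setting changes, here is a small sketch that reads the same CSV text both ways. The csv_text values are made up for illustration and stand in for the real file.

```r
# Toy CSV text standing in for the real file (hypothetical values)
csv_text <- "Name,Ticket\nAlice,A/5 21171\nBob,PC 17599"

# Keep strings as plain character vectors
as_chars <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert strings to factors (the default before R 4.0.0)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chars$Ticket)    # "character"
class(as_factors$Ticket)  # "factor"
```

Character columns are what text functions expect, which is why the Ticket column is kept as strings here.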
5. Store the Survived column of the training data in the survived variable.
6. Combine the training and test data for cleaning, after excluding the Survived column (the second column) from the training data.
survived <- train$Survived #(5)
data.combined <- rbind(train[, -2], test) #(6)
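The label column has to be dropped before combining because rbind() needs both data frames to have the same columns. A minimal sketch with made-up toy data frames (toy_train, toy_test and their values are assumptions for illustration):

```r
# Toy stand-ins for the training and test sets (hypothetical values)
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0, 1), Pclass = c(3, 1))
toy_test <- data.frame(PassengerId = 3:4, Pclass = c(2, 3))

# Save the labels before removing them
toy_survived <- toy_train$Survived

# Locate the label column by name rather than by position
label_col <- which(names(toy_train) == "Survived")

# With the label column removed, both frames have the same columns
toy_combined <- rbind(toy_train[, -label_col], toy_test)
dim(toy_combined)  # 4 2
```

Looking the column up by name gives the same result as the positional index while Survived is the second column, and keeps working if the column order changes.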
7. Convert the Pclass variable into a factor.
8. Convert the Sex variable into a factor.
data.combined$Pclass <- as.factor(data.combined$Pclass) #(7)
data.combined$Sex <- as.factor(data.combined$Sex) #(8)
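As a quick illustration of what the conversion does, the sketch below turns a numeric class code into a factor. The pclass values are made up.

```r
# Numeric class codes (hypothetical values)
pclass <- c(3, 1, 3, 2)

# as.factor() treats each distinct value as a category
pclass_factor <- as.factor(pclass)

levels(pclass_factor)      # "1" "2" "3"
is.numeric(pclass_factor)  # FALSE
```

After the conversion, modelling functions treat the variable as categorical rather than as a quantity on a numeric scale.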
9. Use the quanteda package to create character n-grams (of length 1 to 3) from the passenger Ticket values.
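# Older quanteda interface; newer releases build n-grams with tokens() and tokens_ngrams() before calling dfm()
ticket.ngrams <- dfm(data.combined$Ticket, what = "character",
                     remove_numbers = FALSE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_hyphens = TRUE,
                     ngrams = 1:3) #(9)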
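To make the idea of character n-grams concrete, here is a base R sketch. The helper char_ngrams() is hypothetical and only approximates what quanteda does (for example, it does not lowercase the text); it is meant to show the shape of the features, not to reproduce the package's tokenizer.

```r
# Hypothetical helper: character n-grams of the sizes in `n` for one string
char_ngrams <- function(x, n = 1:3) {
  # Drop punctuation and whitespace, then split into single characters
  chars <- strsplit(gsub("[[:punct:][:space:]]", "", x), "")[[1]]
  unlist(lapply(n, function(k) {
    if (length(chars) < k) return(character(0))
    starts <- seq_len(length(chars) - k + 1)
    # Join k consecutive characters with "_"
    vapply(starts,
           function(i) paste(chars[i:(i + k - 1)], collapse = "_"),
           character(1))
  }))
}

char_ngrams("A/5 2")
# "A" "5" "2" "A_5" "5_2" "A_5_2"
```

Each distinct n-gram becomes one column of the document-feature matrix, and each ticket becomes one row of counts.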
10. Apply TF-IDF weighting to the n-gram features so that n-grams shared by many tickets count for less.
ticket.tfidf <- tfidf(ticket.ngrams) #(10)
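The weighting can be worked through by hand on a tiny count matrix. This sketch assumes the common scheme of raw counts multiplied by log10(N / document frequency); the counts matrix and its values are made up, and the exact scheme used by the package may differ between versions.

```r
# Toy count matrix: rows are tickets, columns are n-gram features (hypothetical)
counts <- matrix(c(1, 1, 0,
                   1, 0, 2,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("t1", "t2", "t3"), c("a", "b", "c")))

# Number of tickets in which each feature occurs: a = 3, b = 1, c = 2
doc_freq <- colSums(counts > 0)

# Inverse document frequency, base 10
idf <- log10(nrow(counts) / doc_freq)

# Multiply each column of counts by its idf
tfidf_toy <- sweep(counts, 2, idf, `*`)

# "a" occurs in every ticket, so its idf is log10(3 / 3) = 0
tfidf_toy[, "a"]  # all zeros
```

A feature that appears in every ticket carries no information for telling tickets apart, so its weight drops to zero, while rare features keep a larger weight.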
11. Compute the cosine similarity between all Tickets and store it in cosine.sim.
12. Use the View() function to inspect cosine.sim.
cosine.sim <- as.matrix(textstat_simil(ticket.tfidf, method = "cosine", margin = "documents")) #(11)
View(cosine.sim) #(12)
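For intuition about what the similarity matrix contains, cosine similarity can be computed directly in base R. The matrix m below is made up; each row plays the role of one ticket's feature vector.

```r
# Toy feature matrix: one row per ticket (hypothetical values)
m <- matrix(c(1, 0, 1,
              2, 0, 2,
              0, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("t1", "t2", "t3"), NULL))

# Euclidean length of each row
norms <- sqrt(rowSums(m^2))

# Dot products between all pairs of rows, divided by the product of their lengths
cos_sim <- (m %*% t(m)) / (norms %o% norms)

cos_sim["t1", "t2"]  # 1: same direction, different length
cos_sim["t1", "t3"]  # 0: no features in common
```

The result is a square matrix with one row and one column per ticket, so for the full data set it has as many rows and columns as there are passengers.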
13. To use the TF-IDF values as features, convert them to a data frame so that they can later be attached with the cbind() function.
14. Check the dimensions of the new data frame.
ticket.tfidf.df <- as.data.frame(ticket.tfidf) #(13)
dim(ticket.tfidf.df) #(14)
14. However, the column names are not in a good format.
15. Fix the column names of the data frame by adding a "Ticket_" prefix.
16. The column names are now in a good format.
names(ticket.tfidf.df)[1:10] #(14)
names(ticket.tfidf.df) <- paste("Ticket_", names(ticket.tfidf.df), sep = "") #(15)
names(ticket.tfidf.df)[1:10] #(16)
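One reason the prefix helps is that n-gram features such as "1" are not syntactically valid R column names. A small sketch with a made-up data frame (feat and its columns are assumptions for illustration):

```r
# Toy feature frame whose column names mimic raw n-grams (hypothetical)
feat <- data.frame(`1` = c(0, 1), `a_5` = c(1, 0), check.names = FALSE)
names(feat)  # "1" "a_5"

# Add the prefix; paste0() is shorthand for paste(..., sep = "")
names(feat) <- paste0("Ticket_", names(feat))
names(feat)  # "Ticket_1" "Ticket_a_5"

# make.names() leaves valid names unchanged, so this confirms validity
all(names(feat) == make.names(names(feat)))  # TRUE
```

The prefix also makes it obvious, after the data frames are combined, which columns came from the Ticket text.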
17. Combine the columns of the two data frames.
18. Check the dimensions of data.combined.
19. Get the first 20 column names of data.combined.
data.combined <- cbind(data.combined, ticket.tfidf.df) #(17)
dim(data.combined) #(18)
names(data.combined)[1:20] #(19)
20. Convert Pclass in data.combined into a factor.
21. Convert the Sex variable in data.combined into a factor.
data.combined$Pclass <- as.factor(data.combined$Pclass) #(20)
data.combined$Sex <- as.factor(data.combined$Sex) #(21)
22. Select the first 891 rows of data.combined and store them in the train variable.
23. Create the Survived column in train and assign the saved survived values to it as a factor.
24. Select the remaining rows (892 to 1309) of data.combined and store them in the test variable.
train <- data.combined[1:891,] #(22)
train$Survived <- as.factor(survived) #(23)
test <- data.combined[892:1309,] #(24)
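The row numbers above are specific to this data set. As a possible variation, the split point can be derived from the number of saved labels instead of being typed in. The sketch below uses made-up toy data; toy_combined, toy_survived and n_train are hypothetical names.

```r
# Toy combined data: the first rows are training rows, the rest are test rows
toy_combined <- data.frame(id = 1:5, feature = c(0.1, 0.4, 0.0, 0.7, 0.2))

# Labels exist only for the training rows (hypothetical values)
toy_survived <- c(0, 1, 1)
n_train <- length(toy_survived)

# Training rows come first and get the label column back
toy_train <- toy_combined[seq_len(n_train), ]
toy_train$Survived <- as.factor(toy_survived)

# Everything after the training rows is the test set
toy_test <- toy_combined[(n_train + 1):nrow(toy_combined), ]

nrow(toy_train)  # 3
nrow(toy_test)   # 2
```

This relies on the same assumption as the hard-coded version: the training rows were placed before the test rows when the two sets were combined.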