Pre-process the Quora training data (e.g., stop word removal, lowercasing, stemming, etc.) with the goal of creating a topic model with 100 topics. A potential feature strategy (sketched in R after this list) is:
1 - For each of the 100 topics, find the top 100 training questions in terms of
similarity to that topic.
2 - For each question in the training set, find the average cosine similarity with the
top 100 questions for each of the 100 topics.
3 - For each question in the test set, find the average cosine similarity with the top
100 training questions for each of the 100 topics.
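The following is a minimal sketch of that strategy, assuming the objects built in the steps below (question.tokens, num.topics, and the trained topic.model); the dense matrix conversion is for readability only, and a real run over the full Quora corpus should keep the document-feature matrix sparse:
question.dfm <- dfm(question.tokens)
doc.topics.m <- mallet.doc.topics(topic.model, smoothed = TRUE,
                                  normalized = TRUE)

# L2-normalize rows so a dot product between rows equals cosine similarity.
m <- as.matrix(question.dfm) # dense for clarity; keep sparse at scale
m <- m / sqrt(rowSums(m^2) + 1e-12)

topic.features <- matrix(0, nrow = nrow(m), ncol = num.topics)
for (t in 1:num.topics) {
  # Step 1 - the 100 training questions most associated with topic t.
  top.idx <- order(doc.topics.m[, t], decreasing = TRUE)[1:100]

  # Step 2 - average cosine similarity of every question with those 100.
  # With unit-length rows, this is m times the mean of the top 100 rows.
  topic.features[, t] <- as.vector(m %*% colMeans(m[top.idx, , drop = FALSE]))
}
Step 3 applies the same computation to the test questions, after aligning their document-feature matrix to the training vocabulary (e.g., with quanteda's dfm_match).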
Installing and loading the libraries:
1. Install the packages if you have not already installed them.
2. Load the libraries required for topic modelling.
install.packages(c("quanteda", "mallet", "wordcloud")) #1
library(quanteda)
library(mallet)
library(wordcloud)
Read training data:
3. Read the CSV file and store it as a data frame.
train <- read.csv("train.csv", stringsAsFactors = FALSE) #3
Transform training data:
4. Select the relevant columns from the train data (column 1 is the question pair id, columns 2 and 4 are qid1 and question1, and columns 3 and 5 are qid2 and question2).
train.q1 <- train[, c(1, 2, 4)] # id, qid1, question1
train.q2 <- train[, c(1, 3, 5)] # id, qid2, question2
5. Name the selected columns:
names(train.q1) <- c("mapping", "qid", "question")
names(train.q2) <- c("mapping", "qid", "question")
6. Combine the train.q1 and train.q2 data frames.
train.questions <- rbind(train.q1, train.q2)
Clean up unused variables to free up memory:
rm(train.q1)
rm(train.q2)
gc()
Use the quanteda package to pre-process the string data:
# Argument names follow the current quanteda API (older releases used
# removePunct, removeNumbers, etc., and removeFeatures()).
question.tokens <- tokens(train.questions$question, remove_punct = TRUE,
                          remove_numbers = TRUE, remove_symbols = TRUE,
                          split_hyphens = TRUE)
question.tokens <- tokens_tolower(question.tokens)
question.tokens <- tokens_remove(question.tokens, stopwords("english"))
question.tokens <- tokens_wordstem(question.tokens)
train.questions$question <- unlist(lapply(as.list(question.tokens),
                                          paste, collapse = " "))
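As a quick sanity check, compare a raw question with its pre-processed form; for instance, a question such as "What is the best way to learn programming?" should reduce to something like "best way learn program" after lowercasing, stopword removal, and stemming (illustrative example; the index below is arbitrary):
train[1, "question1"]           # raw question text
train.questions$question[1]     # pre-processed question text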
Write out the quanteda stop words to a file on disk for use by mallet. mallet expects one word per line, so writeLines is used here rather than write.csv, which would prepend a header row:
writeLines(stopwords("english"), "stopwords.txt")
Leverage the mallet package for topic modelling. Initialize mallet with the pre-processed question text:
mallet.instances <- mallet.import(as.character(train.questions$qid),
train.questions$question,
"stopwords.txt")
Create a topic trainer object:
num.topics <- 100
num.iter <- 1000
topic.model <- MalletLDA(num.topics = num.topics)
Load trainer with pre-processed text data:
topic.model$loadDocuments(mallet.instances)
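Optionally, enable mallet's hyperparameter optimization before training so the topic-proportion prior (alpha) is re-estimated during sampling. The values below (optimize every 20 iterations after a 50-iteration burn-in) mirror the mallet package's own examples and are illustrative, not tuned:
topic.model$setAlphaOptimization(20, 50)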
Build the topic model:
NOTE - This took 65 minutes to run on my workstation.
topic.model$train(num.iter)
Get the topic-word weights (a matrix with one row per topic and one column per vocabulary token):
topic.words.m <- mallet.topic.words(topic.model, smoothed = TRUE,
                                    normalized = TRUE)
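Before generating all 100 word clouds, it can help to spot-check a single topic in text form; the weights passed in are the corresponding row of topic.words.m:
# Print the ten highest-weighted words for the first topic.
mallet.top.words(topic.model, topic.words.m[1, ], 10)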
Visualize the top 100 words of each topic as a wordcloud saved to disk as a .png file:
for (i in 1:num.topics) {
  file.name <- paste0("TopicModel_", i, "_", num.iter, "Iter.png")
  png(file.name)

  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words.m[i, ],
                                      100)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            scale = c(4, .8), rot.per = 0, random.order = FALSE,
            colors = brewer.pal(8, "Dark2"))
  dev.off()
}