Pre-process the Quora training data (e.g., stop word removal, lowercasing, stemming, etc.) with the goal of creating a topic model with 100 topics. A potential feature strategy (sketched in R after this list) is:
1 - For each of the 100 topics, find the top 100 training questions in terms of
similarity to that topic.
2 - For each question in the training set, find the average cosine similarity with the
top 100 questions for each of the 100 topics.
3 - For each question in the test set, find the average cosine similarity with the top
100 training questions for each of the 100 topics.
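The following is a minimal sketch of that strategy, assuming the objects built in the steps below (question.tokens, num.topics, and the trained topic.model); the dense matrix conversion is for readability only, and a real run over the full Quora corpus should keep the document-feature matrix sparse:
question.dfm <- dfm(question.tokens)
doc.topics.m <- mallet.doc.topics(topic.model, smoothed = TRUE,
                                  normalized = TRUE)

# L2-normalize rows so a dot product between rows equals cosine similarity.
m <- as.matrix(question.dfm) # dense for clarity; keep sparse at scale
m <- m / sqrt(rowSums(m^2) + 1e-12)

topic.features <- matrix(0, nrow = nrow(m), ncol = num.topics)
for (t in 1:num.topics) {
  # Step 1 - the 100 training questions most associated with topic t.
  top.idx <- order(doc.topics.m[, t], decreasing = TRUE)[1:100]

  # Step 2 - average cosine similarity of every question with those 100.
  # With unit-length rows, this is m times the mean of the top 100 rows.
  topic.features[, t] <- as.vector(m %*% colMeans(m[top.idx, , drop = FALSE]))
}
Step 3 applies the same computation to the test questions, after aligning their document-feature matrix to the training vocabulary (e.g., with quanteda's dfm_match).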
Installing and loading the libraries:
1. Install the packages if you have not already installed them.
2. Load the libraries required for topic modelling.
install.packages(c("quanteda", "mallet", "wordcloud")) #1
library(quanteda)
library(mallet)
library(wordcloud)
Read training data:
3. Read the CSV file and store it as a data frame.
train <- read.csv("train.csv", stringsAsFactors = FALSE) #3
Transform training data:
4. Select the relevant columns from the train data (column 1 is the question pair id, columns 2 and 4 are qid1 and question1, and columns 3 and 5 are qid2 and question2).
train.q1 <- train[, c(1, 2, 4)] # id, qid1, question1
train.q2 <- train[, c(1, 3, 5)] # id, qid2, question2
5. Name the selected columns:
names(train.q1) <- c("mapping", "qid", "question")
names(train.q2) <- c("mapping", "qid", "question")
6. Combine the train.q1 and train.q2 data frames.
train.questions <- rbind(train.q1, train.q2)
Clean up unused variables to free up memory:
rm(train.q1)
rm(train.q2)
gc()
Use the quanteda package to pre-process the string data:
# Argument names follow the current quanteda API (older releases used
# removePunct, removeNumbers, etc., and removeFeatures()).
question.tokens <- tokens(train.questions$question, remove_punct = TRUE,
                          remove_numbers = TRUE, remove_symbols = TRUE,
                          split_hyphens = TRUE)
question.tokens <- tokens_tolower(question.tokens)
question.tokens <- tokens_remove(question.tokens, stopwords("english"))
question.tokens <- tokens_wordstem(question.tokens)
train.questions$question <- unlist(lapply(as.list(question.tokens),
                                          paste, collapse = " "))
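As a quick sanity check, compare a raw question with its pre-processed form; for instance, a question such as "What is the best way to learn programming?" should reduce to something like "best way learn program" after lowercasing, stopword removal, and stemming (illustrative example; the index below is arbitrary):
train[1, "question1"]           # raw question text
train.questions$question[1]     # pre-processed question text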
Write out the quanteda stop words to a file on disk for use by mallet. mallet expects one word per line, so writeLines is used here rather than write.csv, which would prepend a header row:
writeLines(stopwords("english"), "stopwords.txt")
Leverage the mallet package for topic modelling. Initialize mallet with the pre-processed question text:
mallet.instances <- mallet.import(as.character(train.questions$qid),
train.questions$question,
"stopwords.txt")
Create a topic trainer object:
num.topics <- 100
num.iter <- 1000
topic.model <- MalletLDA(num.topics = num.topics)
Load trainer with pre-processed text data:
topic.model$loadDocuments(mallet.instances)
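Optionally, enable mallet's hyperparameter optimization before training so the topic-proportion prior (alpha) is re-estimated during sampling. The values below (optimize every 20 iterations after a 50-iteration burn-in) mirror the mallet package's own examples and are illustrative, not tuned:
topic.model$setAlphaOptimization(20, 50)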
Build the topic model:
NOTE - This took 65 minutes to run on my workstation.
topic.model$train(num.iter)
Get the topic-word weights (a matrix with one row per topic and one column per vocabulary token):
topic.words.m <- mallet.topic.words(topic.model, smoothed = TRUE,
                                    normalized = TRUE)
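Before generating all 100 word clouds, it can help to spot-check a single topic in text form; the weights passed in are the corresponding row of topic.words.m:
# Print the ten highest-weighted words for the first topic.
mallet.top.words(topic.model, topic.words.m[1, ], 10)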
Visualize the top 100 words of each topic as a wordcloud saved to disk as a .png file:
for (i in 1:num.topics) {
  file.name <- paste0("TopicModel_", i, "_", num.iter, "Iter.png")
  png(file.name)

  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words.m[i, ],
                                      100)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            scale = c(4, .8), rot.per = 0, random.order = FALSE,
            colors = brewer.pal(8, "Dark2"))
  dev.off()
}