Building Our First Model - Introduction to Text Analytics with R

Originally published at:

We are now ready to build our first model in RStudio. To do that, we cover:

– Correcting column names derived from tokenization to ensure smooth model training.
– Using caret to set up stratified cross-validation.
– Using the doSNOW package to accelerate caret machine learning training by using multiple CPUs in parallel.
– Using caret to train single decision trees on text features and tune the trained model for optimal accuracy.
– Evaluating the results of the cross-validation process.
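
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the toy data frame, the object names (train.tokens.df, cv.folds, cv.cntrl, rpart.cv), the seed, the 10-fold/3-repeat setup, the 2-worker cluster, and tuneLength = 7 are all assumptions chosen for the example. It assumes the caret, doSNOW, and rpart packages are installed.

```r
library(caret)    # createMultiFolds(), trainControl(), train()
library(doSNOW)   # registerDoSNOW(); attaches snow, which provides makeCluster()

set.seed(32984)   # arbitrary seed, for reproducibility only

# Hypothetical stand-in for a tokenized document-term data frame.
# Token-derived column names such as "free!" or "2nd" are not valid R names.
n <- 200
label <- factor(sample(c("ham", "spam"), n, replace = TRUE, prob = c(0.8, 0.2)))
is.spam <- label == "spam"
train.tokens.df <- data.frame(
  Label      = label,
  `free!`    = rpois(n, ifelse(is.spam, 2, 0.2)),
  `2nd`      = rpois(n, 0.5),
  `call now` = rpois(n, ifelse(is.spam, 1.5, 0.3)),
  check.names = FALSE
)

# 1. Correct the column names so the formula interface can handle them.
names(train.tokens.df) <- make.names(names(train.tokens.df))

# 2. Stratified folds: createMultiFolds() samples within each class of Label,
#    so every fold keeps roughly the same ham/spam ratio.
cv.folds <- createMultiFolds(train.tokens.df$Label, k = 10, times = 3)
cv.cntrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         index = cv.folds)

# 3. Register a socket cluster so caret can run resamples in parallel.
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# 4. Train single decision trees, trying several complexity parameter values.
rpart.cv <- train(Label ~ ., data = train.tokens.df, method = "rpart",
                  trControl = cv.cntrl, tuneLength = 7)

stopCluster(cl)

# 5. Inspect the cross-validated accuracy for each candidate tuning value.
print(rpart.cv)
```

Printing the fitted object lists the resampled accuracy and kappa for each complexity parameter value tried, along with the value caret selected for the final model.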

The Kaggle dataset can be found here

The data and R code used in this series are available here

About the Series

This data science tutorial introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data is far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:

– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
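
To give a feel for two of the items above, the bag-of-words model and cosine similarity between documents, here is a minimal base-R sketch. The three example documents and the helper name cosine are made up for illustration, and the whitespace tokenization is deliberately naive compared with what a dedicated text analytics package would do.

```r
# Three tiny, hypothetical documents.
docs <- c(doc1 = "free prize call now",
          doc2 = "call me now please",
          doc3 = "free prize free prize")

# Naive tokenization: lower-case, then split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Bag-of-words: one row per document, one column per vocabulary term,
# each cell holding the number of times the term occurs in the document.
vocab <- sort(unique(unlist(tokens)))
bow <- t(vapply(tokens,
                function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(bow) <- vocab

# Cosine similarity: dot product divided by the product of vector lengths.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(bow["doc1", ], bow["doc2", ])  # share "call" and "now": 0.5
cosine(bow["doc2", ], bow["doc3", ])  # no shared terms: 0
```

Because cosine similarity depends only on the angle between the count vectors, doc3 (which just repeats "free prize") is compared with the other documents by its mix of terms rather than by its length.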