Data Pipelines - Text Analytics with R

Originally published at:

In this next installment of our introduction to text analytics series, Data Pipelines, we cover:

– Exploration of textual data for pre-processing “gotchas”
– Using the quanteda package for text analytics
– Creation of a prototypical text analytics pre-processing pipeline, including (but not limited to): tokenization, lowercasing, stop word removal, and stemming.
– Creation of a document-frequency matrix used to train machine learning models
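The pre-processing steps listed above can be sketched with quanteda roughly as follows. The sample documents here are placeholders, not from the Kaggle dataset; the function names (`tokens`, `tokens_tolower`, `tokens_remove`, `tokens_wordstem`, `dfm`) are real quanteda functions, though the exact pipeline in the series may differ:

```r
library(quanteda)

# Placeholder corpus standing in for the Kaggle data
texts <- c(d1 = "Text analytics pipelines require careful pre-processing.",
           d2 = "Pre-processing includes tokenization, stop word removal, and stemming.")

# Tokenize, stripping punctuation
toks <- tokens(texts, remove_punct = TRUE)

# Lowercase all tokens
toks <- tokens_tolower(toks)

# Remove common English stop words
toks <- tokens_remove(toks, stopwords("english"))

# Stem the remaining tokens
toks <- tokens_wordstem(toks)

# Build the document-frequency matrix for model training
dfmat <- dfm(toks)
```

The resulting `dfmat` object is a sparse matrix of documents by features, which can be converted to a data frame for use with machine learning packages such as caret.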

The Kaggle dataset can be found here

The data and R code used in this series are available here