Cosine Similarity - Introduction to Text Analytics with R

Originally published at:

Cosine Similarity includes specific coverage of:

– How cosine similarity is used to measure similarity between documents in vector space.
– The mathematics behind cosine similarity.
– Using cosine similarity in text analytics feature engineering.
– Evaluation of the effectiveness of the cosine similarity feature.
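
As a quick illustration of the mathematics listed above: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The base R sketch below is not the code from the video; the function name `cosine_similarity` and the toy term-count vectors are assumptions made up for this example.

```r
# Cosine similarity between two numeric vectors of equal length.
# (Hypothetical helper for illustration, not taken from the series' code.)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy term counts over a made-up vocabulary: "free", "call", "now", "hello"
doc1 <- c(2, 1, 1, 0)
doc2 <- c(4, 2, 2, 0)  # same term proportions as doc1, twice as long
doc3 <- c(0, 0, 0, 3)  # shares no terms with doc1

sim_same <- cosine_similarity(doc1, doc2)  # approximately 1: same direction
sim_none <- cosine_similarity(doc1, doc3)  # 0: no terms in common
print(c(sim_same, sim_none))
```

Because the measure depends only on direction, a long document and a short one with the same mix of terms score as nearly identical, which is one reason it suits documents of varying length.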

The data and R code used in this series are available here

Watch the whole series on text analytics here

Next up is Part 11 of this video series here

About the Series

This data science tutorial is an Introduction to Text Analytics with R. As exemplified by the popularity of blogging and social media, textual data is far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:

– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
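
To make a few of these techniques concrete, here is a minimal base R sketch that tokenizes three short documents, builds a bag-of-words document-term matrix, and computes the cosine similarity between every pair of documents. The example documents and all variable names are assumptions invented for illustration, and the sketch deliberately uses only base R, so it may differ from the packages and code used in the series.

```r
# Three made-up documents (illustrative only).
docs <- c("Free entry now", "Call now for free entry", "See you at lunch")

# Tokenization: lowercase, then split on runs of non-letters.
tokens <- strsplit(tolower(docs), "[^a-z]+")
tokens <- lapply(tokens, function(tok) tok[tok != ""])

# Bag-of-words: count each vocabulary term in each document.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))

# Pairwise cosine similarity: dot products divided by products of row norms.
norms <- sqrt(rowSums(dtm^2))
sim <- tcrossprod(dtm) / outer(norms, norms)

print(round(sim, 3))
```

Each row of `sim` could then be summarized, for example as a document's mean similarity to a group of known documents, and used as an engineered feature. That summary step is only a suggestion here and is not shown in the sketch.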