5- How to improve classification on a data set of short texts?

My data mostly consists of short tweets or comments (350-400 characters long). I used a bag-of-words model with Naive Bayes classification. As a result, I get a lot of misclassified cases like the ones below:

  • He sucked on a lemon early morning to get rid of hangover.

  • That movie sucked big time.

Now the problem is that during sentiment classification both are classified as Negative just because of the word “sucked”.

Similarly, during document classification both are classified into movies due to the presence of the word “sucked”. I have a huge number of misclassified instances and have no idea how to improve the accuracy.

The bag-of-words model, as the name suggests, takes the words in a sentence and puts them in a bag. This means their relative ordering is discarded.

The bag-of-words model is short-sighted, and Naive Bayes is just counting.

So you’re essentially counting the number of times each word, without regard to the words around it, is associated with a certain sentiment.
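To make this concrete, here is a minimal sketch using only the Python standard library. The `bag_of_words` helper and its simple regex tokenizer are assumptions for illustration, not what any particular library does internally.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

lemon = bag_of_words("He sucked on a lemon early morning to get rid of hangover.")
movie = bag_of_words("That movie sucked big time.")

# Both bags hold the same feature "sucked" with the same count;
# nothing in the bag records which words stood next to it.
print(lemon["sucked"], movie["sucked"])

# Word order is discarded: reordering the words yields an identical bag.
print(bag_of_words("that movie sucked") == bag_of_words("sucked that movie"))
```

A classifier that only sees these counts has no way to tell the two uses of "sucked" apart.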

To address your problem, you need to use a bigram or trigram model, after removing the stop words. Then your model can learn a different association for "sucked" depending on the words that appear next to it.
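As a rough sketch of the idea, the snippet below removes stop words and then builds bigrams from the two example sentences. The `ngrams` helper and the tiny `STOP_WORDS` set are hypothetical and only for illustration; a real stop-word list would be much longer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"he", "on", "a", "to", "of", "that", "the"}

def ngrams(text, n=2):
    """Tokenize, drop stop words, then join every run of n adjacent tokens."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

lemon = ngrams("He sucked on a lemon early morning to get rid of hangover.")
movie = ngrams("That movie sucked big time.")

print(lemon)  # includes "sucked lemon"
print(movie)  # includes "movie sucked" and "sucked big"

# The two sentences no longer share any feature built around "sucked".
print(set(lemon) & set(movie))
```

If you are using scikit-learn, `CountVectorizer` accepts an `ngram_range` parameter (for example `ngram_range=(1, 2)` for unigrams plus bigrams) and a `stop_words` option, so you likely do not need to write this by hand.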

It might also help to use a different classifier than one that relies on basic counting, but this depends on the data.