Do I need to standardize text data when doing text classification?

toobamukhtar · March 2, 2019, 12:20pm

I am developing a spam filter using scikit-learn. Here are the steps I follow:

X_data = ["This is spam", "This is Ham", "This is another spam"]

Matrix = CountVectorizer(X_data). Matrix will contain the count of each word in every document, so Matrix[i][j] gives the count of word j in document i.

Matrix_idf = TfidfVectorizer(Matrix). It will normalize the score.

Matrix_idf_select = SelectKBest(Matrix_idf, 300). It will reduce the matrix to the 300 best-scoring columns.

MultinomialNB.train(Matrix_idf_select)

My question is: do I need to perform normalization or standardization at any of the above four steps? If yes, after which step, and why?

Rabeez · April 3, 2019, 10:10am

You say that

[tfidf] will normalize score.

This isn’t fully correct, since the TF-IDF score does more than normalize the word counts: it also down-weights words that appear in many documents. You can look at its Wikipedia page for details and formulas.

As for normalization or standardization, there are two points where you could do this.

1. Right after CountVectorizer, you can normalize the counts to create a representation that is invariant to document length. This would be done instead of the TF-IDF representation.
2. The other option is to normalize the vectors after the TfidfVectorizer, which won’t be as useful since the TF-IDF output is already normalized: scikit-learn applies L2 normalization to each document vector by default (norm='l2').
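
As a minimal sketch of these two options, assuming scikit-learn and NumPy are installed, the snippet below builds both representations from the three example documents in the question. It uses TfidfTransformer because that class accepts a count matrix, whereas TfidfVectorizer expects raw text. The variable names are just illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

docs = ["This is spam", "This is Ham", "This is another spam"]

# Raw word counts: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs)

# Option 1: divide each row by its total count, so every row sums to 1
# regardless of how long the document is.
norm_counts = normalize(counts, norm="l1", axis=1)

# Option 2: TF-IDF on the same counts. norm="l2" is the scikit-learn default,
# so each row already comes out with unit Euclidean length.
tfidf = TfidfTransformer().fit_transform(counts)

row_sums = np.asarray(norm_counts.sum(axis=1)).ravel()
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print("L1 row sums of normalized counts:", row_sums)
print("L2 row norms of TF-IDF vectors:", row_norms)
```

Because the TF-IDF rows are already unit length, normalizing them a second time with the same norm would leave them unchanged.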

You should compare both the normalized-count and TF-IDF representations and measure which performs better for your combination of model and data.
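
One way to run that comparison is with cross-validation, roughly like the sketch below. The documents and labels are a made-up toy set, chi2 is an assumed scoring function for SelectKBest (the question does not say which one is used), and k=5 is chosen only because the toy vocabulary is tiny. On real data you would keep your own corpus and k=300.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy data (1 = spam, 0 = ham); replace with your own corpus.
docs = [
    "win free money now",
    "free prize claim now",
    "cheap loans win cash",
    "claim your free cash prize",
    "meeting at noon tomorrow",
    "see you at lunch",
    "project report due tomorrow",
    "lunch meeting moved to noon",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_pipeline(weighting):
    """Counts -> weighting step -> feature selection -> Naive Bayes."""
    return make_pipeline(
        CountVectorizer(),
        weighting,
        SelectKBest(chi2, k=5),  # assumed scorer; small k for the toy vocabulary
        MultinomialNB(),
    )


# The only difference between the two candidates is the weighting step.
candidates = {
    "normalized counts": Normalizer(norm="l1"),
    "tf-idf": TfidfTransformer(),
}

results = {}
for name, weighting in candidates.items():
    scores = cross_val_score(build_pipeline(weighting), docs, labels, cv=2)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.2f}")
```

Putting every step inside the pipeline means the vocabulary, IDF weights, and selected features are refit on each training fold, so the held-out fold does not leak into them. With only eight documents the scores mean nothing in themselves; the point is the structure of the comparison.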