I am developing a spam filter using scikit-learn. Here are the steps I follow:
1. Start with the raw documents: XData = ["This is spam", "This is Ham", "This is another spam"]
2. CountVectorizer(XData). The resulting Matrix holds the count of each word in every document, so Matrix[i][j] is the count of word j in document i.
3. TfidfTransformer(Matrix). This reweights the counts by TF-IDF and normalizes the scores, giving Matrix_IdfX.
4. SelectKBest(Matrix_IdfX, 300). This reduces the matrix to the 300 best-scoring columns.
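For concreteness, the steps above might look roughly like this in scikit-learn. This is only a sketch under some assumptions that are not in my description: the labels `y` are made up for the three example documents, `chi2` is just one possible score function for SelectKBest (which needs a score function and labels), and `k` is capped at the vocabulary size because the toy data has far fewer than 300 words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

x_data = ["This is spam", "This is Ham", "This is another spam"]
y = [1, 0, 1]  # assumed labels for illustration: 1 = spam, 0 = ham

# Step 2: raw word counts, counts[i, j] = count of word j in document i
counts = CountVectorizer().fit_transform(x_data)

# Step 3: TF-IDF weighting; with the default norm="l2" each row is
# scaled to unit Euclidean length
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Step 4: keep the k best-scoring columns (300 on a real corpus;
# capped here because the toy vocabulary is tiny)
k = min(300, tfidf.shape[1])
selected = SelectKBest(chi2, k=k).fit_transform(tfidf, y)

print(counts.shape, row_norms, selected.shape)
```

Printing `row_norms` is only there to make visible what scaling, if any, the TF-IDF step has already applied to each document.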
My question: do I need to perform normalization or standardization at any of the four steps above? If so, after which step, and why?