I’m working on a medical dataset which contains doctors’ notes, prescriptions and reports.
So far, I’ve tried using scikit-learn’s ‘english’ stopwords list in TfidfVectorizer.
While analysing the text, I have found that there are a lot of medical domain-related stopwords. How can I remove these stopwords? Is there a systematic approach to curate a domain-driven stopwords list?
The term *stopwords* usually refers to words that are of low importance to the NLP task and that occur very often. Since you're saying that the words you want removed are domain-related, I'm going to assume that they do not contribute any meaningful information to your machine learning model.
A simple way to identify these words is to look at the most frequently occurring terms and check whether candidate stopwords are among them. You could rank words by raw term frequency, but in this case sorting by low inverse document frequency (the IDF factor of TF-IDF) is more useful, because a low IDF reveals words that appear across most of the documents rather than words that merely appear many times in a few of them.
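Here is a minimal sketch of that ranking with scikit-learn (the `notes` list is made up purely for illustration). After fitting, `TfidfVectorizer` exposes the learned IDF values through its `idf_` attribute, so sorting by it in ascending order puts the terms that occur in the most documents first:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for your notes/reports.
notes = [
    "patient presents with chest pain, aspirin prescribed",
    "patient reports no pain at follow-up, report attached",
    "prescription renewed, patient remains stable",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(notes)

# idf_ is smallest for terms that appear in the most documents,
# so sorting it ascending surfaces stopword candidates.
terms = vectorizer.get_feature_names_out()
candidates = terms[np.argsort(vectorizer.idf_)][:20]
print(candidates)
```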
You will still have to go through the sorted list and manually decide whether to place each word in the stopwords pile, since these are domain-specific terms and a pre-existing list (like the one in scikit-learn) probably does not exist. Your best bet is to consult domain experts to get a head start on curating the list, then combine it with the built-in one, as in the sketch below.
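Once you have a curated list, you can merge it with scikit-learn's built-in `'english'` list and pass the union to the vectorizer. The `domain_stopwords` set below is a made-up placeholder for whatever the experts sign off on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Placeholder domain stopwords; replace with your expert-curated list.
domain_stopwords = {"patient", "doctor", "hospital", "prescribed", "mg"}

# Merge with the built-in 'english' list and hand the union to the vectorizer.
combined_stopwords = list(ENGLISH_STOP_WORDS.union(domain_stopwords))
vectorizer = TfidfVectorizer(stop_words=combined_stopwords)
```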
As the other answer suggested, one approach is to use the TF-IDF score. Words which occur in most of the documents will be of little help in telling one document apart from another. But words which occur very frequently (high TF, or term frequency) in only a few documents (high IDF, or inverse document frequency) are likely to be more useful for distinguishing documents, so those should be kept rather than treated as stopwords.
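If you want to automate the first pass, `TfidfVectorizer`'s `max_df` parameter drops terms that occur in more than a given fraction of documents, which is essentially the low-IDF filter described above (`min_df` does the same for very rare terms). The threshold below is illustrative, not a recommendation, and the corpus is again a toy example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "patient presents with chest pain, aspirin prescribed",
    "patient reports no pain at follow-up, report attached",
    "prescription renewed, patient remains stable",
]

# Drop terms appearing in more than 70% of documents; keep everything else.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
vectorizer.fit(notes)

# stop_words_ holds the terms removed by max_df/min_df, so you can
# review them before adding any to a curated list.
print(sorted(vectorizer.stop_words_))
```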