In one data set I have only a few outliers, while in the other I have a lot of them. I have to run regression analysis on both. How should I treat those outlier values? What are the most common ways to treat outliers?
It is not necessary to remove the outliers from the data every single time. It depends on the type of problem you are trying to solve. Removing outliers can also throw away information, so in a regression analysis you should first check how much these values actually affect the dependent variable.
To treat outliers, we need to combine business understanding with an understanding of the data. For example, if you are dealing with the age of people and you see a value age = 200 (in years), the error most likely happened because the data was collected incorrectly, or because the person entered their age in months. Depending on which you think is more likely, you would either remove the value (in the first case) or replace it with 200/12 ≈ 16.7 years (in the second).
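One way to check that impact is to fit the regression twice, once with and once without the suspect points, and compare the coefficients. Here is a minimal sketch using a hand-written simple least-squares fit; the data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: y = 2x, plus one suspicious point at x = 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 60]

slope_all, intercept_all = fit_line(xs, ys)
slope_clean, intercept_clean = fit_line(xs[:-1], ys[:-1])

print(f"with outlier:    slope = {slope_all:.2f}")
print(f"without outlier: slope = {slope_clean:.2f}")
```

If the two fits barely differ, the outlier is probably harmless for your purposes. If they differ a lot, it deserves a closer look before you decide to keep or drop it.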
The following ways can be used to remove the outlier values from data:
- You can remove the top 1% and bottom 1% of values
- Create a box plot. You’ll get Q1, Q2 and Q3. Data points above Q3 + 1.5 × IQR or below Q1 – 1.5 × IQR are considered outliers, where IQR = Q3 – Q1 is the interquartile range.
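Both rules can be sketched with Python's standard library alone. The function names and the sample `ages` list are illustrative assumptions, and note that different quantile conventions (here `method="inclusive"`) can give slightly different cut-offs on small samples.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the box-plot rule."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

def trim_percentiles(values, pct=1):
    """Drop values below the pct-th or above the (100 - pct)-th percentile."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[pct - 1], cuts[100 - pct - 1]
    return [v for v in values if low <= v <= high]

# Hypothetical sample with one implausible age.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 200]

lower, upper, outliers = iqr_outliers(ages)
trimmed = trim_percentiles(ages)

print("fences:", lower, upper)
print("flagged by IQR rule:", outliers)
print("after 1% trimming:", trimmed)
```

Keep in mind that percentile trimming always removes something from each tail, even when the extreme values are perfectly legitimate, whereas the IQR rule only flags points beyond the fences.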
I think it depends on the context. As a data scientist, you need to combine your understanding of statistical distributions with your business goals, and bring a skeptical, pragmatic attitude to the problem.
The first thing I’d ask myself is whether the outlier values suggest problems with data acquisition that I may not have considered yet. If the data look roughly like they follow a certain type of statistical distribution, but the dataset contains outliers that look extremely weird and are very unlikely to be part of that distribution, that could suggest there are bugs in how the data were acquired. In this case, you may want to look further into where the data acquisition went wrong before moving on. If you decide to ignore the cause for the sake of pragmatism, you could simply remove those values.
Let’s say you don’t have the above problem, but you have some outlier values that look like they could be naturally occurring, say within 2–4 SDs, and you aren’t sure what to do about them. It may be that the statistical technique, or other downstream steps, will have no problem with these naturally occurring outliers, in which case you can leave them in. Finally, if your dataset is very small, you also need to be careful about just removing items with extreme z-scores, because these are much more likely to occur in the case of small sample sizes.
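For reference, here is a minimal sketch of flagging points by z-score. The function name, the thresholds and the sample data are assumptions chosen for illustration, not a recommended default.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical small sample with one implausible age.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 200]

flagged_strict = z_score_outliers(ages, threshold=3.0)
flagged_loose = z_score_outliers(ages, threshold=2.5)

print("flagged at |z| > 3:  ", flagged_strict)
print("flagged at |z| > 2.5:", flagged_loose)
```

In a sample this small, the extreme value inflates both the mean and the standard deviation it is measured against, so even age = 200 does not reach |z| > 3 here. That is one more reason to treat z-score cut-offs with caution on small datasets.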
One case where outliers could be a problem is a parametric statistical test, where a few extreme values can throw everything off. Instead of just removing the outliers, consider a non-parametric method that is less susceptible to them, such as bootstrapping.
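As a rough illustration, here is a percentile bootstrap confidence interval for the median, written with the standard library only. The function name, the number of resamples and the data are assumptions. The statistic matters too: a bootstrap of the mean would still be pulled around by the outlier, which is why this sketch uses the median.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical small sample with one implausible age.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 200]

low, high = bootstrap_ci(ages)
print(f"median = {statistics.median(ages)}, 95% CI = ({low}, {high})")
```

No value has to be deleted here: the outlier stays in the data, but it has little influence on the interval because the median ignores how far out the extreme point lies.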