FAQ: Should I drop the outliers from my analysis?
Outliers are sample observations that are either much larger or much smaller than the other observations. Imagine Jane, the general manager of a chain of convenience food stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the convenience stores she manages.
What do you notice about the data?
Vanessa pointed out to Jane the presence of outliers in the data from store 4 on days 5 and 9. Vanessa recommended that Jane check the accuracy of the data: are the outliers due to recording or measurement error? If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these days.
Should we remove the outliers?
Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do.
Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value. However, outliers can also be real observations, and sometimes these are the most interesting ones!
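As a minimal sketch of replacing a confirmed error with a missing value, the Python snippet below uses invented numbers. The list `sales`, the value 9999.0, and the position `error_index` are all hypothetical, and using NaN as the missing-value marker is one common convention rather than the only option.

```python
import math

# Hypothetical daily sales for one store. The 9999.0 stands in for a value
# that was confirmed to be a recording error and could not be recovered.
sales = [120.0, 135.0, 128.0, 9999.0, 131.0]
error_index = 3  # position of the confirmed error (assumed to be known)

# Replace the bad value with a missing-value marker instead of deleting it,
# so the record of that day is kept.
cleaned = list(sales)
cleaned[error_index] = math.nan

# Summaries then have to skip the missing value explicitly.
observed = [x for x in cleaned if not math.isnan(x)]
mean_observed = sum(observed) / len(observed)
print(len(cleaned), len(observed), mean_observed)
```

Keeping the position in the data (rather than deleting it) makes it visible later that a value was recorded but judged unusable.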
What should we do if we need to keep the outlier?
- Transform the data: If the data are not normally distributed, we can try a transformation to bring them closer to a normal distribution. For example, if the data set has some high-valued outliers (i.e., is right-skewed), the log transformation will “pull” the high values in. This often works well for count data.
- Try a different model/analysis: Different analyses make different distributional assumptions, and you should pick one that is appropriate for your data. Alternatively, it may be possible to model the outliers using an appropriate explanatory variable.
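To illustrate how a log transformation pulls high values in, here is a small Python sketch. The numbers in `values` are invented for illustration, and the choice of base 10 is arbitrary (any base has the same qualitative effect). Note that the logarithm is only defined for strictly positive values, so data containing zeros would need separate handling.

```python
import math

# Hypothetical right-skewed data: four similar values and one very large one.
# All values are strictly positive, as the logarithm requires.
values = [10.0, 12.0, 15.0, 11.0, 1000.0]

# Apply the log transformation (base 10 chosen only for readability).
logged = [math.log10(v) for v in values]

# Compare how far the largest value sits from the smallest on each scale.
ratio_raw = max(values) / min(values)
ratio_log = max(logged) / min(logged)
print([round(x, 2) for x in logged])
print(ratio_raw, ratio_log)
```

On the raw scale the largest value is 100 times the smallest; on the log scale it is only about 3 times the smallest, so it no longer dominates the rest of the data to the same degree.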
In the example above, Vanessa suggested that since the mean for store 4 is highly influenced by the outliers, the median, another measure of central tendency, is a more appropriate statistic for summarizing the daily sales at each store.
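The effect Vanessa describes can be sketched in Python with invented figures. The list `sales` below is hypothetical and is not Jane's actual data; it simply mimics one store with unusually high sales on days 5 and 9.

```python
import statistics

# Hypothetical daily sales for one store over 10 days. Days 5 and 9 are
# unusually high. These figures are invented for illustration only.
sales = [200, 210, 195, 205, 900, 198, 202, 207, 950, 203]

mean_all = statistics.mean(sales)      # pulled upward by the two high days
median_all = statistics.median(sales)  # barely affected by them

# For comparison only (not a recommendation to drop the outliers):
# the mean of the other eight days.
typical = [s for day, s in enumerate(sales, start=1) if day not in (5, 9)]
mean_typical = statistics.mean(typical)

print(mean_all, median_all, mean_typical)
```

With these made-up numbers the mean is 347 while the median is 204, which is close to the 202.5 average of the eight ordinary days. The median summarizes a typical day without any value having to be removed.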