Growing up, I was an outlier. I was short, chubby, wore red glasses, listened to music by artists unknown to my friends, did not practice Buddhism, and was nerdy. I felt that I was different from everyone else – and not in a particularly good way. I just did not fit into the mainstream. I was a complete outlier from the stereotypical image of Asian teen girls.
Statisticians must consider outliers in their analyses. Statistically, outliers are those few data points that deviate from the majority of the population or study sample. They are so different from the rest of the data that they stand out if you plot the results.
For example, in the chart below, the majority of the data points are centered around a diagonal line. However, there are two data points in the bottom right corner that are clearly not in the same space as the rest of the data. Even without running a regression diagnostic, a visual inspection of the graph suggests that these two points might be classified as statistical outliers.
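To make the idea concrete, here is a minimal sketch of one way such a check could be automated. The data are made up for illustration (eight points near a diagonal line plus two in the bottom right corner), and the method is an assumption on my part rather than the diagnostic behind the chart: it fits a line using medians (a Theil-Sen style fit) so that the odd points cannot drag the line toward themselves, then flags any point whose residual is unusually large. The names robust_line and flag_outliers and the cutoff of 3 are hypothetical choices.

```python
from itertools import combinations
from statistics import median

# Hypothetical data: eight points near the line y = x, plus two points
# in the bottom right corner (large x, small y).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 1.0, 1.5]


def robust_line(xs, ys):
    """Theil-Sen style fit: median of pairwise slopes, then median intercept."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


def flag_outliers(xs, ys, cutoff=3.0):
    """Return indices of points whose residual is far from the typical residual."""
    slope, intercept = robust_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = median(residuals)
    # Median absolute deviation, rescaled so it is comparable to a
    # standard deviation when the residuals are roughly normal.
    scale = 1.4826 * median(abs(r - center) for r in residuals)
    return [i for i, r in enumerate(residuals) if abs(r - center) > cutoff * scale]


flagged = flag_outliers(xs, ys)
print([(xs[i], ys[i]) for i in flagged])
```

A flag like this is only a prompt to look closer. Whether a flagged point should stay in the analysis is still a judgment call.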
Statistical outliers are influential data points that can make drawing inferences from a statistical analysis difficult, or even misleading. The presence of outliers in any statistical analysis, particularly in a criterion-related validation study with a small sample size, can be troublesome. Should you keep outliers in the analysis? Or should you simply mark them as outliers and exclude those data points from the sample? The answer is not always easy.
A Lottery Winner Proves the Importance of Outlier Analysis
Imagine you are conducting a study to investigate whether or not earnings are related to education level. You decide to survey how much money a person has in the bank and the highest degree they have completed. You hypothesize that there is a positive correlation between the two: as the education level goes up, the amount of money a person has increases correspondingly. Put another way, someone with a postgraduate degree would likely have more money than someone with a high school diploma. You collect the data and are excited to run a correlational analysis to test your hypothesis.
Cut to the analysis: you are stunned to find out the relationship is not as expected! There is no significant and positive relationship. Actually, it’s a negative relationship. As you comb through the data, you notice something odd – an individual without a high school diploma has over $25 million in net worth. As you look closer, you find out he is a lottery winner.
A-ha! The moment of truth – you remove this individual from the sample and re-run the correlation – a significant and positive relationship emerges.
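The flip in the sign of the correlation is easy to reproduce. The sketch below uses invented numbers, not data from any real survey: education is coded from 1 (no high school diploma) to 5 (doctorate), savings are in thousands of dollars, and the last respondent is the hypothetical lottery winner. The pearson_r helper is written out by hand so the example needs nothing beyond the standard library. It shows only the size and direction of the correlation, not a significance test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical survey rows: (education level, savings in thousands of dollars).
# Education coding is an assumption: 1 = no high school diploma ... 5 = doctorate.
survey = [
    (2, 20), (2, 25), (3, 40), (3, 45),
    (4, 60), (4, 70), (5, 90),
    (1, 25_000),  # the lottery winner: no diploma, $25 million
]

education = [row[0] for row in survey]
savings = [row[1] for row in survey]
r_with = pearson_r(education, savings)

# Re-run the correlation with the lottery winner (the last row) excluded.
r_without = pearson_r(education[:-1], savings[:-1])

print(f"with lottery winner:    r = {r_with:.2f}")
print(f"without lottery winner: r = {r_without:.2f}")
```

With these made-up numbers, a single respondent is enough to turn a strong positive correlation into a negative one.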
Granted, this scenario seems extreme. What are the chances your study sample would include a lottery winner? We all know it is a low-probability event. However, it would be a mistake to underestimate how often statisticians run into this issue. In fact, most statistics textbooks devote a chapter to the steps, procedures, and judgment necessary to identify, examine, and determine what to do with outliers. Making inferences from any statistical analysis without a thorough outlier analysis would be irresponsible and borderline unethical. The conclusion from any data analysis is only as good as the data points it includes.
As the hypothetical example above illustrates, examining statistical outliers before data analysis is critical, and validating the employee assessments used in hiring processes is just as important. The secret of the trade (or perhaps just the less-spoken-of truth) is that real-life data is messy! So, the next time you hear a statistician or I/O Psychologist talk about outlier analysis, you'll know how important it is.