Data Story · Statistics

Outliers, are they totally BAD?

I’ve always considered myself different from every other person, and I’m sure there are others who would claim the same. Talk and claims are cheap, but evidence is key. The question that usually comes up is, “What makes you different?” Only if you can give concrete reasons why your claim is true can you be seen as being different.

Proving to someone that you’re quite special might seem easy, but it can be difficult and daunting when you’re faced with the task of picking out the special from the rest among a very large number of options (ask the judges in talent shows… they’ll have a lot to share). In statistics and data science, these “special or exceptional” people can generally be called outliers.

outlier detection

Now, what are outliers in data? An outlier is generally an observation that deviates from the rest enough to raise eyebrows and prompt reactions like “you don’t belong here!” and “you’re significantly different from the rest”. Outliers can be observations that are incorrectly measured or recorded, relatively extreme values, contaminants (e.g., when you buy beans and find some corn bits in them) or legitimate but surprising data values. Outliers are not totally bad; in fact, they can tell you a lot. For instance, assume you have access to the demographics of women receiving antenatal care (ANC) at a particular clinic and you discover in the database a woman aged 9. Your first instinct is usually that it is a mistake, and you’ll probably want to put a “1” before the 9, but closer investigation may reveal this to be a legitimate data instance. On the other hand, if the age was 999, you would have to exclude the observation, or correct it if possible, because it would distort the analysis and give misleading results.

Example: Let’s assume further that you want to calculate the average age of the women taking ANC at the clinic and there are just 11 of them: 21, 35, 37, 27, 19, 40, 17, 34, 26, 18 and 999. The average age of women receiving ANC will be about 116 years. This is ridiculously misleading, given that ten of the eleven women are aged 40 or younger.
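To see how much that single value distorts things, here is a small Python sketch using only the standard library. The variable names are my own, and dropping the 999 by hand is just for illustration. It also prints the median, which is a common choice when you suspect extreme values because it barely moves.

```python
from statistics import mean, median

# Ages of the 11 women from the example, including the impossible 999.
ages = [21, 35, 37, 27, 19, 40, 17, 34, 26, 18, 999]

# For illustration only: drop the obviously wrong value by hand.
clean_ages = [age for age in ages if age != 999]

mean_with = mean(ages)            # pulled far upwards by 999
mean_without = mean(clean_ages)   # average of the other ten ages
median_with = median(ages)        # middle value, hardly affected
median_without = median(clean_ages)

print(f"mean with 999:      {mean_with:.1f}")
print(f"mean without 999:   {mean_without:.1f}")
print(f"median with 999:    {median_with}")
print(f"median without 999: {median_without}")
```

The mean drops from about 116 to 27.4 once the 999 is excluded, while the median only shifts from 27 to 26.5.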

The same way you would look at the image above and differentiate the green apples from the red ones using colour as the basis of separation, there are techniques and algorithms that can do the same with a given dataset. Outlier detection involves finding these kinds of observations in data. There are several methods to achieve this; you can look them up here. Applications include intrusion detection systems, fraud detection, medical diagnostics, law enforcement, earth science, sports, etc.
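As a taste of what such a method can look like, here is a minimal sketch of one common rule of thumb, the interquartile range (IQR) rule. This is my choice for illustration and only one of the several methods; the function name `iqr_outliers` and the multiplier `k=1.5` are assumptions, not something fixed by this post. It flags any value lying more than `k` times the IQR below the first quartile or above the third quartile.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences.

    Hypothetical helper for illustration; k=1.5 is a conventional
    default, not a universal rule.
    """
    # Quartiles of the data (Python 3.8+).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Ages from the ANC example above.
ages = [21, 35, 37, 27, 19, 40, 17, 34, 26, 18, 999]
flagged = iqr_outliers(ages)
print(flagged)
```

On the ANC ages, only the 999 is flagged. A flagged value is a candidate for a closer look, not automatically an error, so whether to correct, exclude or keep it is still your call.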

Happy reading.

If you spot a typo or any error in the post, please let me know in the comment section – so I can fix it.
