Outlier Detection Using Python - Anand Sudhanaboina

Before writing code I would like to emphasize the difference between anomaly and a outlier:

Outlier: Legitimate data point that’s far away from the mean or median in a distribution.
Anomaly: Illegitimate data point that’s generated by a different process than whatever generated the rest of the data.

Outlier detection varies between single dataset and multiple datasets. In single dataset outlier detection we figure out the outliers within the dataset. We can do this by using two methods, Median Absolute Deviation (MAD) and Standard deviation (SD). Though MAD and SD give different results they are intended to do the same work. I’m not explaining the mathematical expressions as you can find them from wikipedia.

Let’s consider a sample dataset:

dataset

I’ve written a Python script using numpy library, this script calculates both MAD and SD:

from __future__ import division
import numpy

# Sample Dataset
x = [10, 9, 13, 14, 15,8, 9, 10, 11, 12, 9, 0, 8, 8, 25,9,11,10]

# Median absolute deviation
def mad(data, axis=None):
    return numpy.mean(numpy.abs(data - numpy.mean(data, axis)), axis)
_mad = numpy.abs(x - numpy.median(x)) / mad(x)

# Standard deviation
_sd = numpy.abs(x - numpy.mean(x)) / numpy.std(x)

print _mad
print _sd

Let’s visualize the output:

visualize

It’s clear that we’ve detected a spike if there is a change in the dataset. After comparing results of several datasets I would like to mention MAD is more sensitive when compared to SD, but more computing intensive. I’ve experimented the same code with 1 M data points, SD performed near to 2x when compared with MAD.

mad-sd

Multiple dataset outlier detection: In this we figure out anomaly in different datasets when compared with target dataset. For example, say you have data of your web site traffic on hourly basis for 10 days including today, and you would like to figure out if there is an outlier in today’s data when compared with other 9 days data. I’ve done this using Mahalanobis distance algorithm and implemented using Python with numpy.

Let’s consider sample dataset:

multiple-dataset

The highlighted path is the target dataset.Let’s feed this to the algorithm:

from __future__ import division
import numpy as np

# Base dataset
dataset = np.array(
        [
          [9,9,10,11,12,13,14,15,16,17,18,19,18,17,11,10,8,7,8],
          [8,6,10,13,12,11,12,12,13,14,1,16,20,21,19,18,11,5,5],
        ])

# target: dataset to be compared
target = [0,0,0,0,10,9,15,11,15,17,13,14,18,17,14,22,11,5,5]

# Avg of SD of each dataset
dataset_std = dataset.std()

# Avg of arrays in dataset
dataset_sum_avg = np.array([0] * len(dataset[0])) # Create a empty dataset
for data in dataset:
    dataset_sum_avg = dataset_sum_avg + ( data / len(dataset)) # Add up all datapoints of dataset

# Substract the target dataset with avg of datapoints sum and divide by SD
data = np.abs(target - dataset_sum_avg) / dataset_std

print data

This gives us the outliers, on visualizing the result we get:

If you have a look on the data we fed into the algorithm it’s clear that we are able to detect the outliers for today’s input when compared to other x days.

Feel free to explore are a few other algorithms Cosine similarity, Sørensen–Dice coefficient, Jaccard index, SimRank and others.

Comments