Anomaly Detection: Justin Behnke

I examined a few different methods for identifying outliers given a data set.

I initially intended to use the z-score method, but many sources state that it is unreliable unless the data set is known beforehand to follow a Gaussian or Gaussian-like distribution.
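
For context, here is a minimal sketch of what a z-score check might look like. It is not code from any of the sources I read: the helper name getZScoreOutliers and the default threshold of 3 (a common convention) are assumptions made for illustration. The small sample at the end shows one way the method can mislead, since a single extreme value inflates both the mean and the standard deviation.

```javascript
// Hypothetical helper: flags values whose z-score magnitude exceeds a threshold.
// The default threshold of 3 is a common convention, assumed here for illustration.
function getZScoreOutliers(data, threshold = 3) {
  const mean = data.reduce((sum, x) => sum + x, 0) / data.length
  // Population variance (divides by n, not n - 1)
  const variance = data.reduce((sum, x) => sum + Math.pow(x - mean, 2), 0) / data.length
  const standardDeviation = Math.sqrt(variance)

  // If every value is identical there is no spread, so nothing can be flagged
  if (standardDeviation === 0) {
    return []
  }
  return data.filter(x => Math.abs(x - mean) / standardDeviation > threshold)
}

// Made-up sample: the mean is 22 and the standard deviation is about 39,
// so the z-score of 100 is roughly 2 and it is not flagged
const smallSample = [1, 2, 3, 4, 100]
console.log(getZScoreOutliers(smallSample)) // logs an empty array
```

With only five values, 100 is not reported even though it is clearly extreme, which is part of why I looked for an alternative.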

Interquartile Range Method

Khan Academy: Judging outliers in a dataset

As discussed in the video above, this method classifies a value as an outlier if it meets either of the following conditions:

outlier < Q1 - [1.5 * (Q3-Q1)] OR outlier > Q3 + [1.5 * (Q3-Q1)]

Where Q1 and Q3 are the 25th and 75th percentiles, respectively, and Q3 - Q1 is the interquartile range (IQR).
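
To make the arithmetic concrete, here is a small worked example of the rule. The quartile values (10 and 20) are made up for illustration, and isOutlier is a hypothetical helper, not part of any library.

```javascript
// Illustrative quartile values, chosen only to make the arithmetic easy to follow
const q1 = 10
const q3 = 20

const interQuartileRange = q3 - q1                   // 20 - 10 = 10
const lowerLimit = q1 - (1.5 * interQuartileRange)   // 10 - 15 = -5
const upperLimit = q3 + (1.5 * interQuartileRange)   // 20 + 15 = 35

// Hypothetical helper applying the rule to a single value
function isOutlier(value) {
  return value < lowerLimit || value > upperLimit
}

console.log(isOutlier(-6))  // true: below the lower limit of -5
console.log(isOutlier(35))  // false: the comparison is strict, so 35 itself is not flagged
console.log(isOutlier(36))  // true: above the upper limit of 35
```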

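The benefit of this approach is that it does not assume the data follows a Gaussian (or any other particular) distribution, so it can be used to identify outliers in a data set regardless of how the values are distributed.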

Here is a possible implementation in ECMAScript 2015 (JavaScript):

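function getPercentile(data, percentile) {
  // Sort a copy: Array.prototype.sort sorts in place and would reorder the caller's array
  const sortedData = [...data].sort((a, b) => a - b)
  const index = (percentile / 100) * sortedData.length

  // Whole-number index: average the two values on either side of the cut
  if (Math.floor(index) === index) {
    return (sortedData[index - 1] + sortedData[index]) / 2
  }
  // Otherwise round the index down and take the value at that 0-based position
  return sortedData[Math.floor(index)]
}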

function getOutliers(data) {
  const q1 = getPercentile(data, 25)
  const q3 = getPercentile(data, 75)

  const interQuartileRange = q3 - q1

  const lowerLimit = q1 - (1.5 * interQuartileRange)
  const upperLimit = q3 + (1.5 * interQuartileRange)

  return data.filter(dataPoint => dataPoint < lowerLimit || dataPoint > upperLimit)
}
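
As a quick sanity check, here is how the functions might be used on a small data set. The readings are made up, and the two functions are repeated in condensed form (sorting a copy of the input) only so that the snippet runs on its own.

```javascript
// Condensed copies of the functions above, repeated so this snippet runs on its own
function getPercentile(data, percentile) {
  const sortedData = [...data].sort((a, b) => a - b) // sort a copy, not the input
  const index = (percentile / 100) * sortedData.length
  return Math.floor(index) === index
    ? (sortedData[index - 1] + sortedData[index]) / 2
    : sortedData[Math.floor(index)]
}

function getOutliers(data) {
  const q1 = getPercentile(data, 25)
  const q3 = getPercentile(data, 75)
  const interQuartileRange = q3 - q1
  const lowerLimit = q1 - (1.5 * interQuartileRange)
  const upperLimit = q3 + (1.5 * interQuartileRange)
  return data.filter(dataPoint => dataPoint < lowerLimit || dataPoint > upperLimit)
}

// Made-up sample readings with one obviously extreme value
const readings = [12, 3, 5, 7, 8, 9, 10, 40]

// Sorted: [3, 5, 7, 8, 9, 10, 12, 40]
// Q1 = (5 + 7) / 2 = 6, Q3 = (10 + 12) / 2 = 11, IQR = 5
// Limits: 6 - 7.5 = -1.5 and 11 + 7.5 = 18.5
console.log(getOutliers(readings)) // logs an array containing only 40
```

Only 40 falls outside the limits of -1.5 and 18.5, and the readings array keeps its original order because the percentile helper sorts a copy.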