Anomaly Detection
I examined a few different methods for identifying outliers in a data set.
I initially intended to use the z-score method but found many sources stating that it is unreliable unless the data set is known beforehand to follow a Gaussian or Gaussian-like distribution.
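For reference, a z-score check might look like the minimal sketch below. The function name getZScoreOutliers and the default threshold of 3 are assumptions made for illustration (3 is a common convention, not something the sources specify), and the sketch uses the population standard deviation.

```javascript
// Hypothetical helper: flags values whose z-score magnitude exceeds a threshold.
// The threshold default of 3 is an assumed convention, not taken from the sources.
function getZScoreOutliers(data, threshold = 3) {
  const mean = data.reduce((sum, x) => sum + x, 0) / data.length
  // Population variance: average squared distance from the mean
  const variance = data.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / data.length
  const standardDeviation = Math.sqrt(variance)
  // All values identical: no z-score can be computed, so nothing is flagged
  if (standardDeviation === 0) {
    return []
  }
  return data.filter(x => Math.abs((x - mean) / standardDeviation) > threshold)
}

console.log(getZScoreOutliers([1, 2, 3, 4, 1000])) // → []
```

In this small made-up sample the value 1000 is not flagged: it inflates both the mean and the standard deviation, so its own z-score stays just under 2. This is one way the method can mislead when the data are not Gaussian-like.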
Interquartile Range Method
Khan Academy: Judging outliers in a dataset
As discussed in the video above, this method defines an outlier as either of the following:
outlier < Q1 - [1.5 * (Q3-Q1)]
OR outlier > Q3 + [1.5 * (Q3-Q1)]
Where Q1 and Q3 are the 25th and 75th percentiles, respectively, and Q3 - Q1 is the interquartile range.
The benefit of this approach is that it relies on quartiles rather than the mean and standard deviation, so it can be used to identify outliers in a data set without assuming a particular distribution.
Here is a possible implementation in ECMAScript 2015 (JavaScript):
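// Assumes 0 < percentile < 100; 0 or 100 would read past the ends of the array
function getPercentile(data, percentile) {
  // Sort a copy: Array.prototype.sort sorts in place and would reorder the caller's array
  const sortedData = [...data].sort((a, b) => a - b)
  const index = (percentile / 100) * sortedData.length
  // When the index is a whole number, average the two neighbouring values
  if (Math.floor(index) === index) {
    return (sortedData[index - 1] + sortedData[index]) / 2
  }
  // Otherwise take the value at the index rounded down
  return sortedData[Math.floor(index)]
}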
function getOutliers(data) {
  const q1 = getPercentile(data, 25)
  const q3 = getPercentile(data, 75)
  const interQuartileRange = q3 - q1
  // Values more than 1.5 * IQR below Q1 or above Q3 count as outliers
  const lowerLimit = q1 - (1.5 * interQuartileRange)
  const upperLimit = q3 + (1.5 * interQuartileRange)
  return data.filter(dataPoint => dataPoint < lowerLimit || dataPoint > upperLimit)
}
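To see the numbers the method produces, here is a self-contained sketch that follows the same percentile rule as above but also returns the quartiles and limits. The names percentileOf and describeOutliers, and the sample data, are made up for illustration.

```javascript
// Hypothetical variant of the functions above that also reports the limits.
// Expects an already sorted array and 0 < percentile < 100.
function percentileOf(sortedData, percentile) {
  const index = (percentile / 100) * sortedData.length
  return Number.isInteger(index)
    ? (sortedData[index - 1] + sortedData[index]) / 2
    : sortedData[Math.floor(index)]
}

function describeOutliers(data) {
  // Sort a copy once so the input array keeps its original order
  const sortedData = [...data].sort((a, b) => a - b)
  const q1 = percentileOf(sortedData, 25)
  const q3 = percentileOf(sortedData, 75)
  const interQuartileRange = q3 - q1
  const lowerLimit = q1 - 1.5 * interQuartileRange
  const upperLimit = q3 + 1.5 * interQuartileRange
  const outliers = data.filter(x => x < lowerLimit || x > upperLimit)
  return { q1, q3, lowerLimit, upperLimit, outliers }
}

const sample = [2, 4, 4, 5, 6, 7, 8, 40]
console.log(describeOutliers(sample))
// → { q1: 4, q3: 7.5, lowerLimit: -1.25, upperLimit: 12.75, outliers: [ 40 ] }
```

Here Q1 = 4 and Q3 = 7.5, so the interquartile range is 3.5 and the limits are 4 - 5.25 = -1.25 and 7.5 + 5.25 = 12.75. Only 40 falls outside them.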