Outlier detection in PyOD using KNN

Are outliers the real issue?

An outlier can be a real pain when analyzing your data.

But are outliers always the problem? To answer that, we first need to detect the outliers in the dataset and then find out what causes them.

First, let’s see what an outlier is.

An outlier is any data point that differs greatly from the rest of the observations in a dataset.

Outliers can drastically affect the results of our analysis and statistical modeling, so it is important to detect them.
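To make that impact concrete, here is a minimal sketch with made-up numbers (the values are purely illustrative, not taken from any real dataset). It shows how a single extreme value pulls the mean far away from the bulk of the data, while the median barely moves.

```python
import numpy as np

# Five ordinary observations (illustrative values only)
values = np.array([10, 12, 11, 13, 12])

# The same observations plus one extreme value
with_outlier = np.append(values, 95)

mean_clean = values.mean()
mean_outlier = with_outlier.mean()
median_clean = np.median(values)
median_outlier = np.median(with_outlier)

print("mean without outlier:  ", mean_clean)      # 11.6
print("mean with outlier:     ", mean_outlier)    # 25.5
print("median without outlier:", median_clean)    # 12.0
print("median with outlier:   ", median_outlier)  # 12.0
```

One extreme point more than doubles the mean here, which is why summary statistics and models built on them can be misleading when outliers go unnoticed.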

Let’s look at an example. We'll use the PyOD library with the K-Nearest Neighbors (KNN) algorithm to detect outliers and then visualize them.

PyOD is a Python library that provides access to a large collection of outlier detection algorithms (more than 20) for multivariate data.
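Before turning to PyOD itself, the idea behind KNN-based detection can be sketched in plain NumPy: a point whose k-th nearest neighbor is far away sits in a sparse region and gets a high outlier score. This is only an illustration of the principle, not PyOD's implementation. The data, the planted outlier at (8, 8), and the choice k = 5 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points from a standard normal cloud, plus one planted outlier (index 50)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[8.0, 8.0]]])

k = 5  # number of neighbors (an assumed value for this sketch)

# Pairwise Euclidean distances between all points
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# After sorting each row, column 0 is the distance to the point itself (0),
# so column k holds the distance to the k-th nearest neighbor.
kth_dist = np.sort(dist, axis=1)[:, k]

# The point with the largest k-th neighbor distance is the most outlying one
most_outlying = int(np.argmax(kth_dist))
print("Most outlying point index:", most_outlying)
print("Its score:", kth_dist[most_outlying])
```

The brute-force distance matrix is fine for a few hundred points; a library implementation would normally use a neighbor search structure instead.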

Necessary libraries:

> numpy

> matplotlib.pyplot

> pyod.models.knn

> pyod.utils.data (for generate_data)

Data Generation: The “generate_data” function creates a dataset with a specified proportion of outliers. You can replace this with your actual dataset.

Initialize the KNN Model with the specified contamination (proportion of outliers).

Train the KNN model on the training data using the fit() method.

Use the predict() method to predict outliers in the test data.

The decision_function() method returns outlier scores for each data point.

Create a scatter plot to visualize the outliers.

Outlier scores determine the color of each point in the plot.

So are these outliers bad?

The answer is:

Outliers are not necessarily a bad thing. They are just observations that do not follow the same pattern as the others.

In fact, an outlier can be very interesting for science. For example, if one rat in a biological experiment survives while all the others die, it would be very interesting to understand why.

This could lead to new scientific discoveries. So, it is important to detect outliers.

That said, if the aim of the analysis is to explain the overall pattern in a population, then removing the outliers and repeating the analysis without them is a good idea, since they can distort the results and their interpretation.

For example, a single outlier could lead to rejecting the normality hypothesis.
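As a rough illustration of that last point, the sketch below runs a Shapiro-Wilk normality test from SciPy on a normally distributed sample, then again after appending one extreme value. The choice of test, the sample size, the seed, and the outlier value of 15 are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 draws from a standard normal distribution
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# The same sample with one extreme value appended
sample_with_outlier = np.append(sample, 15.0)

# Shapiro-Wilk test: a small p-value is evidence against normality
_, p_clean = stats.shapiro(sample)
_, p_outlier = stats.shapiro(sample_with_outlier)

print(f"p-value without outlier: {p_clean:.4f}")
print(f"p-value with outlier:    {p_outlier:.2e}")
```

With the outlier included the p-value collapses to nearly zero, so the test rejects normality even though 100 of the 101 values were drawn from a normal distribution.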
