Finding outliers in the Exploring Ebay Car Sales Data mission

Greetings everyone. In the sub-mission called “Exploring the Odometer and Price Columns”, one of the tasks is:

  • If you find there are outliers, remove them and write a markdown paragraph explaining your decision.

The question is - how does one find those outliers?

Susan writes many good articles.
The definition of an outlier changes over time and with the goal of the analysis.
That goal could be simply understanding the data, estimating parameters, or building accurate prediction models.
Outliers can be found in 1D, 2D, 3D, and beyond. Usually people look for 1D outliers first, checking for values at the edges that sit too far from the mean, where a human-imposed threshold defines “far”. If the threshold is instead part of a modelling pipeline such as anomaly/novelty detection, it can be tuned to optimize a downstream modelling metric, such as the % of total points identified as outliers (you can think of it as a modelling hyperparameter). Besides values that sit far from the dense region, impossible values, such as negatives where only positive values make sense, can also be treated as 1D outliers.
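Here is a minimal sketch of the 1D case. The prices are made up for illustration and are not taken from the actual dataset. It uses the quartile-based (IQR) fence rather than distance from the mean, because a single extreme value inflates the mean and standard deviation; the multiplier `k = 1.5` is a common convention and is exactly the kind of human-imposed threshold described above.

```python
import pandas as pd

# Hypothetical price values, invented for this example.
prices = pd.Series([0, 500, 1200, 1500, 1800, 2500, 3200, 4000, 99999999])

# Implausible values: treating a price of 0 or less as impossible is an assumption.
impossible = prices <= 0

# IQR fences: anything further than k * IQR outside the quartiles counts as "far".
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
k = 1.5  # human-imposed threshold, adjust to taste
too_far = (prices < q1 - k * iqr) | (prices > q3 + k * iqr)

outliers = prices[impossible | too_far]
cleaned = prices[~(impossible | too_far)]

print(outliers.tolist())
print(len(cleaned), "rows kept")
```

Whatever threshold you pick, it is worth printing the rows you are about to drop and checking that they really look wrong before removing them.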

For 2D, an outlier is usually a point that does not follow the linear or curved pattern in a scatterplot. Such a point can sit in the middle of the range on both axes and so stay hidden if you only plot a 1D histogram of each axis. The same goes for 3D patterns hidden when squashed to 2D.
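To illustrate the 2D case, here is a small sketch with invented numbers: one point breaks a straight-line pattern while staying inside the range of both axes. Fitting a line and flagging large residuals is only one possible way to detect it, and the cutoff of 5 times the median absolute deviation is an arbitrary choice made for this example.

```python
import numpy as np

# Invented data: y follows 2 * x, except for one point in the middle.
x = np.arange(1, 11, dtype=float)
y = 2.0 * x
y[4] = 18.0  # at x = 5 the pattern predicts 10; 18 is still within y's range of 2..20

# Fit a straight line and measure how far each point is from it.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Robust scale of the residuals: median absolute deviation (MAD).
centered = np.abs(residuals - np.median(residuals))
mad = np.median(centered)
flagged = np.where(centered > 5 * mad)[0]  # cutoff of 5 * MAD is an assumption

print(flagged)  # index positions of points that break the pattern
```

A histogram of `x` or of `y` alone would show nothing unusual here, since both values of the flagged point are ordinary on their own.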

But 3D is the most we can plot, so in higher dimensions you either use dimensionality-reduction techniques like PCA/t-SNE to compress the data down to 3D or fewer and look for outliers there, or keep all the dimensions and define outliers with mathematical quantities and your own thresholds.

You may also need to consider groupby analysis. As with any aggregation such as averaging, a global threshold may be less meaningful than a per-group threshold (compare adaptive thresholding in image analysis). Groupby also lets you compare groups such as product lines or marketing campaigns. For example, two lines that you know have a cause-and-effect relationship should stay strongly correlated; if that correlation suddenly breaks at some point in time, that point is an outlier.
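A per-group threshold could look like the sketch below. The `brand` and `price` columns and their values are hypothetical, and the IQR fence is just one possible rule. The point is that 9000 is unremarkable across all cars but clearly unusual within its own brand.

```python
import pandas as pd

# Hypothetical data: a cheap brand "a" and an expensive brand "b".
df = pd.DataFrame({
    "brand": ["a"] * 5 + ["b"] * 5,
    "price": [1000, 1100, 1200, 1300, 9000,
              40000, 42000, 44000, 46000, 48000],
})

# Global IQR fences, computed over all rows at once.
g_q1, g_q3 = df["price"].quantile([0.25, 0.75])
g_iqr = g_q3 - g_q1
df["global_outlier"] = (df["price"] < g_q1 - 1.5 * g_iqr) | (df["price"] > g_q3 + 1.5 * g_iqr)

# Per-group IQR fences: compute quartiles per brand, then map them back to each row.
grouped = df.groupby("brand")["price"]
q1 = df["brand"].map(grouped.quantile(0.25))
q3 = df["brand"].map(grouped.quantile(0.75))
iqr = q3 - q1
df["group_outlier"] = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)

print(df)
```

With these numbers the global rule flags nothing, while the per-brand rule flags only the 9000 row.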

Besides global/groupby outliers, you can consider differences from local neighbors. In space-time analysis, the neighbors could be neighbors in space, in time, or in space-time. You define the distance metric.
You can also look at differencing/dynamic changes using integration/differentiation for EDA. An outlier in an acceleration-time plot may not be as obvious in the velocity-time plot.
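As a tiny illustration of differencing, assuming evenly spaced time steps and invented velocity readings: the velocity series keeps rising and looks fairly smooth, but its first difference (a discrete stand-in for acceleration) has one value that clearly stands out.

```python
import numpy as np

# Invented velocity readings at evenly spaced time steps.
velocity = np.array([0.0, 1, 2, 3, 4, 9, 10, 11, 12])

# First difference approximates acceleration between consecutive readings.
acceleration = np.diff(velocity)

# The step that deviates most from the typical (median) acceleration.
spike = int(np.argmax(np.abs(acceleration - np.median(acceleration))))

print(acceleration)
print("largest deviation at step", spike)
```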

Besides calculus, you can apply mathematical/statistical transforms such as a log transform to make the distribution more symmetrical and easier to visualize. We don’t judge distances well when points are too spread out, or the plot simply doesn’t show them well: very skewed points stretch the axis range and squeeze the remaining points too close together.
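The effect of a log transform on the axis range can be checked with a few invented numbers. The sketch below measures what share of the axis the bulk of the data occupies before and after taking `np.log10`; the values are hypothetical and must be strictly positive for the log to be defined.

```python
import numpy as np

# Invented, strictly positive values with one extreme point at the end.
values = np.array([100, 200, 500, 1000, 2000, 5000, 10_000_000], dtype=float)
bulk = values[:-1]  # everything except the extreme point

# Share of the raw axis range taken up by the bulk of the data.
raw_share = (bulk.max() - bulk.min()) / (values.max() - values.min())

# Same share after a log10 transform.
log_values = np.log10(values)
log_bulk = log_values[:-1]
log_share = (log_bulk.max() - log_bulk.min()) / (log_values.max() - log_values.min())

print(f"raw axis share of bulk: {raw_share:.5f}")
print(f"log axis share of bulk: {log_share:.2f}")
```

On the raw scale the bulk is squeezed into a sliver of the axis, while on the log scale it takes up roughly a third, so differences among the ordinary points become visible again.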

Besides identifying individual points as outliers, entire clusters can be labeled outliers. People may run K-means to look at the clusters and discard the groups that look anomalous.

In summary, below are some ways to find outliers:

  1. EDA through Plotting
  2. Data Transformation
  3. Modeling (svm.OneClassSVM, neighbors.LocalOutlierFactor, ensemble.IsolationForest)
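For the modeling route, here is a minimal sketch using scikit-learn's `LocalOutlierFactor` on invented 2D points: a small regular grid plus one point far away. `n_neighbors=5` is an arbitrary choice for such a tiny dataset. `IsolationForest` and `OneClassSVM` expose the same `fit_predict` method, so they could be swapped in, though their results depend on their own parameters.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Invented data: a 5 x 4 grid of points plus one point far from the rest.
grid = np.array([[i, j] for i in range(5) for j in range(4)], dtype=float)
X = np.vstack([grid, [[100.0, 100.0]]])

# fit_predict returns 1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=5)  # n_neighbors chosen arbitrarily here
labels = lof.fit_predict(X)

# More negative score means more abnormal compared with its neighbors.
scores = lof.negative_outlier_factor_

print(labels)
print("most abnormal row:", int(np.argmin(scores)))
```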

They are not mutually exclusive: one can be part of another, and one can be sandwiched between others in a workflow.