Detecting outliers and deleting them in Python or R

Hello all,

can some help me to detect and remove(clean) all the outliers in my Dataset
the dataset contains 5000 rows and 12 columns
I tried using boxplot function in R to detect the outliers but i don’t know how to remove them
I need a clean dataset to do Kmean clustering.

Whitewine.xlsx (351.7 KB) wineDS

Hello @mohamed96.banihani. To get you started, I can provide you with an example of how to delete the outliers in R for a single column in your dataframe, pH. This approach is based on this method for identifying outliers. I tested this locally in R Studio, so hopefully this will work for you.

If your goal is to delete all outliers from all columns and combine them into a single dataframe, you’ll want to be sure to calculate the interquartile range for each column against the original dataset of 5000 rows. For example, the output in this example is a dataframe called no_outliers that filters to include only the pH values that are not outliers, so you probably wouldn’t want to calculate the outliers for the next column against this data.

library(readxl)
library(dplyr)
library(ggplot2)

# Load data
df <- read_xlsx("Whitewine.xlsx")

# Quartiles
df_summary <- summary(df$pH)

# Estimate interquartile range
# (3rd quartile minus 1st quartile)
iqr <- df_summary[[5]] - df_summary[[2]]

# Identify bounds for outliers
lower_bound <- df_summary[[2]] - (1.5 * iqr)
upper_bound <- df_summary[[5]] + (1.5 * iqr)

# Identify outlier(s)
outliers <- df %>% 
  filter(pH > upper_bound | pH < lower_bound)

# Remove outliers from dataframe
# Keep only the points within the bounds
no_outliers <- df %>%
  filter(pH < upper_bound & pH > lower_bound)

# Check range of no_outliers to make sure it's within bounds
range(no_outliers$pH)

Let me know if you have any questions, thanks!

Hey Mr.Casey
thank you for your amazing help
i have tried the code
just to make sure that i understand everything if i want to remove all the outliers from all the columns i should do them separately.
then how can i join them back to gather so i can have one data set that contains all clean data
the aim after having a clean data set is to do K-mean clustering to them

Thanks again for your support

Hi @mohamed96.banihani. This is an interesting and challenging problem. I will continue to think about the approach to take on this. I’m thinking the next step might be to turn what has been created so far into a function, and then iterate with the function to modify the dataframe in place. But modification in place comes with it’s own challenges.

Also, do you have a solid understanding of why you want to remove the outliers from your dataset in the first place? In general it’s best to leave outliers in the data or deal with them in another way.

Hello,
thanks for your help
this data-set was assigned to me to do K-Mean clustering on it and to see how the data is related to the 12th column(Quality column ) after that my instructor told me the first step is to clean the data set by removing the outliers and he mentioned that the outliers should be in the range of 100 to 150 outliers .
again thanks for your support

1 Like

Hi @mohamed96.banihani. Has the instructor provided information about what defines an outlier? Using this quick method, there are over 180 in the volatile acidity variable alone:

boxplot(original$volatile acidity)$out

You can learn more about this quick method here.

1 Like

hello
**yes he mentioned that I can use the function boxplot to remove the outliers but unfortunately he did not mention any more information **
i used boxplot to remove the outlier but as you mentioned the column volatile acidity contains 186 outliers
**I have tried this code with the same result **

white <- read_excel(“C:/Users/GTS/Desktop/ML project/Whitewine.xlsx”)
removing the quality column
whitewine <- white[,-12]

outlier <- boxplot(whitewine$volatile acidity ,plot = FALSE)$out

whitewine <-
whitewine[-which(whitewine$'volatile acidity' %in% outlier ) , ]

This should create a new dataset with 4712 rows

Hi @mohamed96.banihani. I’ll be curious to hear what the instructor says about outliers and why there are only 100 to 150. The code below is exploratory, but it provides and approach to use the original untouched dataset for the outlier boundaries that you can then filter against. Each time you apply this approach to your new dataframe with outliers removed, it will shrink in size. This code is exploratory. If it works as desired, with some work you can turn it into a function, and then iterate. But before building functions, it is good to make sure it works as desired.

Let me know if you have any questions. Best.

library(readxl)
library(dplyr)

# Load data
original <- read_xlsx("Whitewine.xlsx")

# Quick way how many outliers there are in the pH column
boxplot(original$pH)$out

original_summary <- summary(original$pH)

# Estimate interquartile range
# (3rd quartile minus 1st quartile)
iqr <- original_summary[[5]] - original_summary[[2]]

# Identify bounds for outliers
lower_bound <- original_summary[[2]] - (1.5 * iqr)
upper_bound <- original_summary[[5]] + (1.5 * iqr)

# Identify outlier(s)
outliers <- original %>% 
  filter(pH > upper_bound | pH < lower_bound)

# Remove outliers from dataframe, but store as new dataframe "new_outliers"
# Keep only the points within the bounds, store as "no_outliers"
no_outliers <- original %>%
  filter(pH < upper_bound & pH > lower_bound)

# Repeat for alcohol variable (it has no outliers)
# the alcohol variable has no outliers...
# so our no_outliers datafram should remain unaffected
# Use "original" dataset to define outlier boundaries
original_summary <- summary(original$alcohol)
iqr <- original_summary[[5]] - original_summary[[2]]
# The bounds are established with the original data
lower_bound <- original_summary[[2]] - (1.5 * iqr)
upper_bound <- original_summary[[5]] + (1.5 * iqr)

# Filter out the outliers for the alcohol column from "no_outliers"
# We only remove outliers from no_outliers and keep original as is
no_outliers <- no_outliers %>%
  filter(alcohol < upper_bound & alcohol > lower_bound)

# Repeat for fixed accidity
original_summary <- summary(original$`fixed acidity`)
iqr <- original_summary[[5]] - original_summary[[2]]
# Remember that the bounds are based on the original data
lower_bound <- original_summary[[2]] - (1.5 * iqr)
upper_bound <- original_summary[[5]] + (1.5 * iqr)

# Removing fixed acidity outliers from the no_outliers data, not the original
no_outliers <- no_outliers %>%
  filter(`fixed acidity` < upper_bound & `fixed acidity` > lower_bound)

# Over 100 outliers removed...you can see how many there are with this:
boxplot(original$`fixed acidity`)$out
boxplot(original$`volatile acidity`)$out

length(boxplot(original$`fixed acidity`)$out)
length(boxplot(original$`volatile acidity`)$out)
length(boxplot(original$pH)$out)
1 Like

hello ,
Thanks for the help
the code worked really fine
after submitting the code to the instructor i told him that their is more than 150 outlier and he told me that we should only remove that amount to see how the output would change by removing just a few of them

again thanks for the support

2 Likes