Data cleaning- how to identify the patten in a column in a quicker way?

Hi all! The instruction said that there are ‘kg’ and kgs’ in the weight column, so we can clean both easily. But in the real-life, we need to identify it by checking every value in the column. Any way we can let python check it and let us know? If the data is big, it is impossible to check every single value by people. Thank you a lot in advance.

@candiceliu93 From what I have gathered here, when you use the Series.unique() option on a column, it should return all the unique values in that column… So if the kg/kgs is an extension of a number…
Apply the Series.str.split()[-1]. unique()
Here, you first split 20 kgs to [‘20’, ‘kgs’] then pick only the last value and then return only the unique last values… Which will give you all possible options.

I think there maybe other ways of doing it, but that’s what I can think of now. I hope ot helps. Feel free to reach out for clarity.

Thank you for answering!!

For my understanding, Series.str.split() is to split vault by whitespace. For example ‘1.37kg,1.45kg…’ there is no whitespace between the number and the unit. and for the [-1], could you explain it?

Thank you so much!!

Hello @candiceliu93 sorry for the oversight on my end, I had shared a solution to one with white spaces. kindly see the below:

import pandas as pd
import re
#creating sample data
my_list  = ['1.23kg','1.234kgs','1.34Kg','1.56KG','1.45kGs','1.34 kg']
my_list = pd.Series(my_list) 

# either extract the unique string after the decimal by using Series.str.extract() 
my_list.str.extract(r'([a-z]+)',flags=re.I)[0].unique()
# or extract the count of the string after the decimal.
my_list.str.extract(r'([a-z]+)',flags=re.I)[0].value_counts()

# to create uniformity, use Series.replace() 
my_list.replace(to_replace=r' ?[A-Za-z]+',value='kg',regex=True)

I hope this is clear, please feel free to reach out. Also, sorry for the delayed response.