Working With Missing And Duplicate Data - Correcting Duplicate Values

Hi,

I’m working on “Working With Missing And Duplicate Data” and doing the exercise “Correcting Duplicate Values”, and I have a question.
How does pandas choose which row to delete when there are duplicate values?
In the example, the SOMALILAND REGION row for 2016 has values in some columns that the duplicated row doesn’t.

Apart from checking whether the rows we want are the ones deleted, how can I know in advance which rows will be deleted? Is it something like the first row found is the first row deleted?

Thanks for your time

Hi @JulianSanjuan!

Welcome to the DataQuest community!

The data for SOMALILAND REGION for the year 2016 appears twice: one row has actual data (index == 260), while the other has NaN values in every column except COUNTRY and YEAR.
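If you want to see both rows for yourself, you can filter the DataFrame. A minimal sketch, assuming combined is the DataFrame from the exercise and the COUNTRY column has already been upper-cased (as in the snippet further down):

# Show both SOMALILAND REGION rows: one with real data, one that is NaN
# everywhere except COUNTRY and YEAR
print(combined[combined['COUNTRY'] == 'SOMALILAND REGION'])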
How does pandas choose which data to keep and which to mark as duplicates?
Through the keep parameter, whose default value is 'first'.
One can choose from 3 options (see the short sketch after the list):

  1. 'first' - keep the first occurrence and mark the rest as duplicates
  2. 'last' - keep the last occurrence and mark the rest as duplicates
  3. False - keep none, i.e. all occurrences are marked as duplicates and would all be dropped
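Here is a minimal sketch of how each option behaves, using a small made-up DataFrame rather than the exercise data:

import pandas as pd

# Toy data: the ('A', 2016) combination appears twice
df = pd.DataFrame({'country': ['A', 'A', 'B'],
                   'year': [2016, 2016, 2016]})

print(df.duplicated(['country', 'year'], keep='first'))  # False, True, False - second 'A' row is the duplicate
print(df.duplicated(['country', 'year'], keep='last'))   # True, False, False - first 'A' row is the duplicate
print(df.duplicated(['country', 'year'], keep=False))    # True, True, False  - both 'A' rows are marked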

So again, which row stays and which is dropped depends on the keep parameter. With keep specified, you only need to get the index values of the rows marked as duplicates:

# Make the country names consistent before looking for duplicates
combined['COUNTRY'] = combined['COUNTRY'].str.upper()

# Use the 'COUNTRY' and 'YEAR' columns to identify the duplicates
# (keep='first' is the default, so the first occurrence is not marked)
dups = combined.duplicated(['COUNTRY', 'YEAR'])

# Get the rows marked as duplicates
marked_as_duplicates = dups[dups]

# List the index values of the rows marked as duplicates
index_dups = list(marked_as_duplicates.index)
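
From there you can drop those rows, either by the collected index values or with drop_duplicates directly. A minimal sketch, assuming combined is the DataFrame from the exercise (the variable names below are just for illustration):

# Option 1: drop the rows using the index values collected above
combined_clean = combined.drop(index=index_dups)

# Option 2: one step - drop_duplicates takes the same keep parameter,
# which also defaults to 'first'
combined_clean = combined.drop_duplicates(['COUNTRY', 'YEAR'])

# Check which SOMALILAND REGION row survived
print(combined_clean[combined_clean['COUNTRY'] == 'SOMALILAND REGION'])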

Hope this answers your questions.

Thanks a lot kakoori for the explanation, I didn’t know about the keep argument. My bad for not checking the pandas documentation. Next time, before posting, I’ll make sure to check it first.

Thanks again. Really good explanation.
