Hi,
I’m working on “Working With Missing And Duplicate Data” and doing the exercise “Correcting Duplicate Values”.
I have a doubt.
The question is: how does pandas choose which row to delete among duplicate values?
In the example, one Somaliland Region row for 2016 has values in some columns that its duplicate doesn’t have.
Apart from checking afterwards whether the rows we want were deleted, how can I know in advance which rows will be deleted? Is it like “first row found, first row deleted”?
Thanks for your time
Hi @JulianSanjuan!
Welcome to the DataQuest community!
The data for `SOMALILAND REGION` for year 2016 has one row with data (`index == 260`) and another row with `NaN` values for all columns but `Country` and `Year`.
How does pandas choose which rows to keep and which to mark as duplicates?
Through the `keep` parameter, whose default value is `'first'`.
One can choose from 3 options:
- `'first'` - keep the first occurrence
- `'last'` - keep the last occurrence
- `False` - keep none, i.e. all will be dropped.
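To see the three options side by side, here is a minimal sketch on a made-up three-row frame. The `toy` data, its `SCORE` column, and the index labels other than 260 are assumptions for illustration only, not the course's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first two rows share the same COUNTRY/YEAR pair.
toy = pd.DataFrame(
    {
        "COUNTRY": ["SOMALILAND REGION", "SOMALILAND REGION", "NORWAY"],
        "YEAR": [2016, 2016, 2016],
        "SCORE": [5.0, np.nan, 7.5],
    },
    index=[260, 500, 7],  # arbitrary labels for the sketch
)

subset = ["COUNTRY", "YEAR"]

# True marks a row that drop_duplicates() would remove for the same `keep`.
marked_first = toy.duplicated(subset, keep="first").tolist()
marked_last = toy.duplicated(subset, keep="last").tolist()
marked_all = toy.duplicated(subset, keep=False).tolist()

print(marked_first)  # [False, True, False]
print(marked_last)   # [True, False, False]
print(marked_all)    # [True, True, False]
```

Row order is what decides things here: with `'first'` the earlier of the two matching rows survives, with `'last'` the later one does, and with `False` both are marked.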
So which row stays and which is dropped depends on the `keep` parameter. Whatever value you choose, you can see in advance which rows will be dropped by getting the index of the rows marked as duplicates:
# Normalize case so the same country always matches
combined['COUNTRY'] = combined['COUNTRY'].str.upper()
# Use the 'COUNTRY' and 'YEAR' columns to identify the duplicates
dups = combined.duplicated(['COUNTRY', 'YEAR'])
# Keep only the entries marked as duplicates (where dups is True)
marked_as_duplicates = dups[dups]
# List the index values of the rows marked as duplicates
index_dups = list(marked_as_duplicates.index)
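As a quick check of this idea, the labels collected from `duplicated()` should be exactly the rows that `drop_duplicates()` removes when both calls use the same columns and the same `keep`. A small sketch on made-up data (`toy`, `SCORE`, and the index labels other than 260 are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 260 and 500 describe the same country and year.
toy = pd.DataFrame(
    {
        "COUNTRY": ["Somaliland Region", "SOMALILAND REGION", "Norway"],
        "YEAR": [2016, 2016, 2016],
        "SCORE": [5.0, np.nan, 7.5],
    },
    index=[260, 500, 7],
)
toy["COUNTRY"] = toy["COUNTRY"].str.upper()

subset = ["COUNTRY", "YEAR"]

# Rows that would be dropped (keep="first" is the default).
dups = toy.duplicated(subset)
index_dups = dups[dups].index.tolist()

# Rows that actually are dropped with the same settings.
deduped = toy.drop_duplicates(subset)
removed = toy.index.difference(deduped.index).tolist()

print(index_dups)  # [500]
print(removed)     # [500]
```

In this sketch the row with data (label 260) comes first, so the default `keep='first'` happens to keep it and drop the all-`NaN` row. If the rows were in the opposite order, the default would keep the `NaN` row instead.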
Hope this answers your questions.
Thanks a lot, kakoori, for the explanation. I didn’t know about the `keep` argument. My bad for not checking the pandas documentation. Next time I’ll make sure to check it before posting.
Thanks anyway. Really good explanation.