In our case, since the second duplicate row above contains more missing values than the first row... | 7. Correcting Duplicates Values

Screen Link: Learn data science with Python and R projects

What actually happened:

I was studying the behavior of df.drop_duplicates() on the example dataframe, and the lesson makes a statement that seems to me either incorrect, or I just don't understand it.

In the exercise we see:

combined[combined['COUNTRY'] == 'SOMALILAND REGION']

and we get this:

| | COUNTRY | DYSTOPIA RESIDUAL | ECONOMY GDP PER CAPITA | FAMILY | FREEDOM | GENEROSITY | HAPPINESS RANK | HAPPINESS SCORE | HEALTH LIFE EXPECTANCY | LOWER CONFIDENCE INTERVAL | REGION | STANDARD ERROR | TRUST GOVERNMENT CORRUPTION | UPPER CONFIDENCE INTERVAL | WHISKER HIGH | WHISKER LOW | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | SOMALILAND REGION | 2.11032 | 0.18847 | 0.95152 | 0.46582 | 0.50318 | 91.0 | 5.057 | 0.43873 | NaN | Sub-Saharan Africa | 0.06161 | 0.39928 | NaN | NaN | NaN | 2015 |
| 162 | SOMALILAND REGION | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Sub-Saharan Africa | NaN | NaN | NaN | NaN | NaN | 2015 |
| 260 | SOMALILAND REGION | 2.43801 | 0.25558 | 0.75862 | 0.39130 | 0.51479 | 97.0 | 5.057 | 0.33108 | 4.934 | Sub-Saharan Africa | NaN | 0.36794 | 5.18 | NaN | NaN | 2016 |
| 326 | SOMALILAND REGION | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Sub-Saharan Africa | NaN | NaN | NaN | NaN | NaN | 2016 |
| 488 | SOMALILAND REGION | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Sub-Saharan Africa | NaN | NaN | NaN | NaN | NaN | 2017 |
| 489 | SOMALILAND REGION | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Sub-Saharan Africa | NaN | NaN | NaN | NaN | NaN | 2017 |

According to the lesson, the df.drop_duplicates() method by default defines duplicates as rows in which all columns have the same values. We therefore have to specify that rows with the same values only in the COUNTRY and YEAR columns should be dropped.
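
To make that difference concrete, here is a minimal sketch on a toy dataframe. The dataframe `toy` and its values are assumptions made up for illustration, not the lesson's data. It compares the default behavior with the `subset` parameter:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only (not the lesson's dataframe).
toy = pd.DataFrame({
    "COUNTRY": ["SOMALILAND REGION", "SOMALILAND REGION", "SOMALILAND REGION"],
    "YEAR": [2015, 2015, 2016],
    "HAPPINESS SCORE": [5.057, np.nan, 5.057],
})

# Default: a row counts as a duplicate only if ALL columns match,
# so none of these three rows is dropped.
default_result = toy.drop_duplicates()

# With subset: rows that match on COUNTRY and YEAR count as duplicates.
# keep="first" (the default) retains the first occurrence in each group.
subset_result = toy.drop_duplicates(subset=["COUNTRY", "YEAR"], keep="first")

print(len(default_result))  # 3
print(len(subset_result))   # 2
```

With `subset`, the second 2015 row is dropped while the 2016 row survives, because it belongs to a different COUNTRY/YEAR pair.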

DQ says: In our case, since the second duplicate row above contains more missing values than the first row, we’ll keep the first row. (!)

If I count the NaN values in the first row (4) and in the second row (3), the second row would be the correct one to keep.

| | COUNTRY | DYSTOPIA RESIDUAL | ECONOMY GDP PER CAPITA | FAMILY | FREEDOM | GENEROSITY | HAPPINESS RANK | HAPPINESS SCORE | HEALTH LIFE EXPECTANCY | LOWER CONFIDENCE INTERVAL | REGION | STANDARD ERROR | TRUST GOVERNMENT CORRUPTION | UPPER CONFIDENCE INTERVAL | WHISKER HIGH | WHISKER LOW | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | SOMALILAND REGION | 2.11032 | 0.18847 | 0.95152 | 0.46582 | 0.50318 | 91.0 | 5.057 | 0.43873 | NaN | Sub-Saharan Africa | 0.06161 | 0.39928 | NaN | NaN | NaN | 2015 |
| 260 | SOMALILAND REGION | 2.43801 | 0.25558 | 0.75862 | 0.39130 | 0.51479 | 97.0 | 5.057 | 0.33108 | 4.934 | Sub-Saharan Africa | NaN | 0.36794 | 5.18 | NaN | NaN | 2016 |

The row that offers the most data is the one I should choose, isn't it?
Am I wrong?

I hope I didn’t bother you too much.

Thx.

A

Hi Alberto,

You have to compare the rows year by year: row 90 with 162, 260 with 326, and 488 with 489. The rows you are trying to compare, 90 and 260, are from different years (2015 and 2016), so they are not duplicates of each other and it's not correct to compare them.
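
As a rough sketch of that comparison, the snippet below counts the missing values per row and then deduplicates within each COUNTRY/YEAR pair. It reproduces only a few columns of rows 90, 162, 260 and 326 from the output above. The `N_MISSING` helper column and the sorting step are my own assumptions, not something the lesson does:

```python
import numpy as np
import pandas as pd

# A few columns from the rows shown in the thread (other columns omitted).
rows = pd.DataFrame(
    {
        "COUNTRY": ["SOMALILAND REGION"] * 4,
        "HAPPINESS SCORE": [5.057, np.nan, 5.057, np.nan],
        "STANDARD ERROR": [0.06161, np.nan, np.nan, np.nan],
        "YEAR": [2015, 2015, 2016, 2016],
    },
    index=[90, 162, 260, 326],
)

# Hypothetical helper column: number of missing values in each row.
rows["N_MISSING"] = rows.isnull().sum(axis=1)
print(rows[["YEAR", "N_MISSING"]])

# Put the most complete row first, then keep the first row
# of each COUNTRY/YEAR pair.
deduped = (
    rows.sort_values("N_MISSING", kind="stable")
        .drop_duplicates(subset=["COUNTRY", "YEAR"], keep="first")
        .sort_index()
)
print(list(deduped.index))  # [90, 260]
```

In this data the more complete row already comes first within each year, so a plain `drop_duplicates(subset=['COUNTRY', 'YEAR'])` with the default `keep='first'` gives the same result. That matches the lesson's statement about keeping the first row.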

Once again I am glad you are here.

Thank you very much.

A.

That’s great, Alberto, happy it was helpful! :slightly_smiling_face: