Working With Missing And Duplicate Data - DataFrame.duplicated() returned error

combined = pd.merge(left=combined, right=regions, on='COUNTRY', how='left')
combined = combined.drop('REGION_x', axis = 1)
missing = combined.isnull().sum()

https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/6/identifying-duplicates-values

I ran the code above and it works fine, but I get an error on the next step.
In my version of the combined dataframe there is a COUNTRY column, but YEAR has become YEAR_y with a float datatype after running pd.merge. Is that expected? I think it was an int before. If I change YEAR to YEAR_y the code works, but I am not sure why my dataframe differs from the instructions.
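One quick way to see which column names actually exist before calling DataFrame.duplicated() is sketched below. The small frame is made-up data, the `["COUNTRY", "YEAR"]` subset is an assumption about what the lesson step passes in, and the helper name `missing_columns` is hypothetical.

```python
import pandas as pd

# Made-up stand-in for a merged frame in which YEAR ended up as YEAR_y.
combined = pd.DataFrame({
    "COUNTRY": ["NORWAY", "NORWAY", "CHAD"],
    "YEAR_y": [2015.0, 2015.0, 2016.0],
})

def missing_columns(df, subset):
    """Return the names in `subset` that are not columns of `df`."""
    return [col for col in subset if col not in df.columns]

subset = ["COUNTRY", "YEAR"]  # assumed subset from the lesson step
absent = missing_columns(combined, subset)
print(absent)  # ['YEAR']

# Fall back to the suffixed name only when the plain one is absent.
if absent:
    subset = ["COUNTRY", "YEAR_y"]
dups = combined.duplicated(subset)
print(dups.tolist())  # [False, True, False]
```

Swapping in the suffixed name gets the step running, but it does not explain where the suffix came from; that depends on how regions was built.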

Compared with the expected result, my output has lots of duplicated rows. Why?

It works fine in Dataquest. Please share your versions of the combined and regions dataframes so we can confirm why that happened.


Hey,

From what I can see, there is no column named "YEAR_y" in our dataframe. Could you share more details about your earlier steps (the combined dataframe, identifying missing values, the data cleaning corrections, and visualizing missing data) so that we can try to figure out exactly what went wrong?

Best
K!


Hi DishinGoyani,

Please see here:
Please note that I am running this on a local Jupyter installation.

Hi

I believe that happened after this was executed:
regions = pd.merge(left=happiness2015, right=happiness2016, on=['COUNTRY', 'REGION'], how='left')
regions = pd.merge(left=regions, right=happiness2017, on='COUNTRY', how='left')

The suffixes are added for any clashes in column names that are not involved in the merge operation, see online docs.
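A minimal sketch of that suffix behaviour, using made-up two-row frames rather than the real happiness data: any non-key column present on both sides gets `_x` (left) and `_y` (right). It also shows why a column coming from the right side can turn into a float: a left merge fills unmatched rows with NaN, and a column holding NaN cannot stay an int64.

```python
import pandas as pd

# Made-up frames: both carry YEAR and REGION in addition to the key.
left = pd.DataFrame({
    "COUNTRY": ["A", "B"],
    "YEAR": [2015, 2015],
    "REGION": ["X", "Y"],
})
right = pd.DataFrame({
    "COUNTRY": ["A"],
    "YEAR": [2016],
    "REGION": ["X"],
})

merged = pd.merge(left=left, right=right, on="COUNTRY", how="left")

# Clashing non-key columns are suffixed with _x (left) and _y (right).
print(merged.columns.tolist())
# ['COUNTRY', 'YEAR_x', 'REGION_x', 'YEAR_y', 'REGION_y']

# Country "B" has no match on the right, so YEAR_y holds a NaN there
# and the whole column is upcast to float.
print(merged["YEAR_y"].dtype)  # float64
```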

Also, please note that I am running this on a local Jupyter installation.

https://stackoverflow.com/questions/23197537/pandas-merge-returns-a-column-with-x-appended-to-the-name

Oh, I also added the YEAR columns to the dataframes, but they were added before regions = pd.merge(), so the dataframes should be the same.

https://community.dataquest.io/t/cannot-find-year-column-in-csv-world-happiness/488907/5

I am not sure whether you combined the dataframes (all three datasets, happiness2015 to happiness2017). It should look like this (please note that I ran the code below on my local machine):

Then you should see that DQ created a dataframe named regions containing all of the countries and corresponding regions from the happiness2015, happiness2016, and happiness2017 dataframes:

regions = combined[['COUNTRY', 'REGION']].dropna().drop_duplicates()  # dataframe of all countries and their corresponding regions
regions.
The next step is to use the pd.merge() function to assign the REGION in the regions dataframe to the corresponding country in combined. This is how the output looks, and we have only the "YEAR" column.
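To illustrate why YEAR keeps its name with this approach, here is a small sketch on made-up data (not the real happiness files). Because regions holds only COUNTRY and REGION, the only clashing non-key column in the merge is REGION, so YEAR is left alone.

```python
import pandas as pd

# Made-up stand-in for the concatenated frame: REGION is missing
# for the 2017 rows, as it would be before the regions merge.
combined = pd.DataFrame({
    "COUNTRY": ["A", "A", "B", "B"],
    "YEAR": [2015, 2017, 2015, 2017],
    "REGION": ["X", None, "Y", None],
})

# One row per country, and only the two columns needed for the lookup.
regions = combined[["COUNTRY", "REGION"]].dropna().drop_duplicates()

merged = pd.merge(left=combined, right=regions, on="COUNTRY", how="left")
merged = merged.drop("REGION_x", axis=1)

print(merged.columns.tolist())       # ['COUNTRY', 'YEAR', 'REGION_y']
print(len(merged) == len(combined))  # True
```

If regions instead carries extra columns such as YEAR, those clash with the columns in combined and get suffixed.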

I hope this helps.
Let me know your feedback.
Best
K!

Hi

I had already combined them, as per my answer to Dishin above. Did you read it? In order, (a) then (b):
(a) This is what they looked like after these lines

combined = pd.concat([happiness2015, happiness2016, happiness2017],sort=False, 
                     ignore_index=True)

(b) This is what they looked like after these lines

combined = pd.merge(left=combined, right=regions, on='COUNTRY', how='left')
combined = combined.drop('REGION_x', axis = 1)
missing = combined.isnull().sum()

missing output