combined = pd.merge(left=combined, right=regions, on='COUNTRY', how='left')
combined = combined.drop('REGION_x', axis = 1)
missing = combined.isnull().sum()
https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/6/identifying-duplicates-values
I ran the code above, which works fine, but then I get an error on the code below.
In my version of the combined dataframe there is a COUNTRY column, but YEAR has become YEAR_y with a float datatype after running pd.merge. Is this what it should be? I think it was int before. Should I just change YEAR to YEAR_y? I see that it works then, but I am not sure why mine differs from the instructions.
Compare that with my output, which has lots of duplicated rows.
It works fine in Dataquest. Please share your versions of the combined and regions dataframes so we can confirm why that happened.
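One thing worth checking before sharing the dataframes: a left merge returns extra rows whenever the right-hand frame repeats a merge key. The sketch below uses made-up toy data (not the actual World Happiness files) to show how a duplicated COUNTRY in regions inflates the row count of combined.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframes.
combined = pd.DataFrame({"COUNTRY": ["Norway", "Chad"], "YEAR": [2015, 2015]})

# A faulty lookup table in which Norway appears twice.
regions = pd.DataFrame({
    "COUNTRY": ["Norway", "Norway", "Chad"],
    "REGION": ["Western Europe", "Western Europe", "Sub-Saharan Africa"],
})

# Count merge keys that occur more than once in the lookup table.
dup_keys = regions["COUNTRY"].duplicated().sum()
print(dup_keys)  # 1

# Each repeated key on the right multiplies the matching rows on the left.
merged = pd.merge(left=combined, right=regions, on="COUNTRY", how="left")
print(len(combined), len(merged))  # 2 3
```

If the duplicate count is above zero for your own regions dataframe, that would explain duplicated rows after the merge.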
Hey,
From what I can see, there is no column named "YEAR_y" in our dataframe. If you could share more details about your earlier steps (the combined dataframe, identifying missing values, correcting the data cleaning, and visualizing missing data), we can try to figure out exactly what went wrong.
Best
K!
Hi DishinGoyani
Please see here:
Please note I am running this on a local Jupyter installation.
Hi
I believe that happened after this was executed:
regions = pd.merge(left=happiness2015, right=happiness2016, on=['COUNTRY', 'REGION'], how='left')
regions = pd.merge(left=regions, right=happiness2017, on='COUNTRY', how='left')
Suffixes are added to any clashing column names that are not involved in the merge operation; see the online docs.
Also, please note I am running this on a local Jupyter installation.
https://stackoverflow.com/questions/23197537/pandas-merge-returns-a-column-with-x-appended-to-the-name
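To illustrate the suffix behaviour and the int-to-float change, here is a minimal sketch with made-up toy data (not the real happiness files). It assumes the right-hand frame carries its own YEAR column and is missing one country, which may or may not match what happened in the original notebook.

```python
import pandas as pd

# Hypothetical toy data, not the actual World Happiness files.
combined = pd.DataFrame({
    "COUNTRY": ["Norway", "Chad", "Peru"],
    "YEAR": [2015, 2015, 2015],
})

# This lookup table also carries a YEAR column and has no row for Peru.
regions = pd.DataFrame({
    "COUNTRY": ["Norway", "Chad"],
    "REGION": ["Western Europe", "Sub-Saharan Africa"],
    "YEAR": [2016, 2016],
})

merged = pd.merge(left=combined, right=regions, on="COUNTRY", how="left")

# YEAR is in both frames but is not a merge key, so pandas adds suffixes.
print(merged.columns.tolist())  # ['COUNTRY', 'YEAR_x', 'REGION', 'YEAR_y']

# Peru has no match on the right, so YEAR_y holds NaN and becomes float64.
print(merged["YEAR_y"].dtype)  # float64
```

So a YEAR_y column with a float datatype suggests that regions still contained a YEAR column when the merge ran.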
Oh, I also added the YEAR columns to the dataframes, but they were added before regions = pd.merge(), so the dataframes should be the same.
https://community.dataquest.io/t/cannot-find-year-column-in-csv-world-happiness/488907/5
I am not sure whether you combined all three dataframes (happiness2015 to happiness2017). It should look like this. Please note that I ran the code below on my local machine.
Then you should see that DQ created a dataframe named regions containing all of the countries and corresponding regions from the happiness2015, happiness2016, and happiness2017 dataframes:
regions = combined[['COUNTRY', 'REGION']].dropna().drop_duplicates()  # create a dataframe of all countries and corresponding regions
regions.
The next step is to use the pd.merge() function to assign the REGION in the regions dataframe to the corresponding country in combined. This is how the output looks, and we have only a "YEAR" column.
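The two steps described above can be sketched end to end on toy data. The values below are made up, and the final rename of REGION_y back to REGION is an assumption about the intended result rather than something confirmed in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated happiness data:
# REGION is known for 2015 rows but missing for 2017 rows.
combined = pd.DataFrame({
    "COUNTRY": ["Norway", "Norway", "Chad", "Chad"],
    "REGION": ["Western Europe", None, "Sub-Saharan Africa", None],
    "YEAR": [2015, 2017, 2015, 2017],
})

# Keep only the two lookup columns, then drop gaps and repeats,
# leaving one row per country.
regions = combined[["COUNTRY", "REGION"]].dropna().drop_duplicates()

# REGION exists in both frames, so it comes back as REGION_x and REGION_y.
merged = pd.merge(left=combined, right=regions, on="COUNTRY", how="left")

# Drop the incomplete original column; the rename is an assumed final step.
merged = merged.drop("REGION_x", axis=1).rename(columns={"REGION_y": "REGION"})

print(merged.columns.tolist())  # ['COUNTRY', 'YEAR', 'REGION']
print(len(combined) == len(merged))  # True
```

Because regions holds only COUNTRY and REGION, YEAR never clashes and keeps its name and integer datatype.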
I hope this helps.
Let me know your feedback.
Best
K!
Hi
I had already combined them, as per my answer to Dishin above; did you read it? In order, (a) then (b):
(a) This is what they looked like after this line:
combined = pd.concat([happiness2015, happiness2016, happiness2017], sort=False, ignore_index=True)
(b) This is what they looked like after these lines:
combined = pd.merge(left=combined, right=regions, on='COUNTRY', how='left')
combined = combined.drop('REGION_x', axis = 1)
missing = combined.isnull().sum()
missing output