Missing values for regions in 2017

screen link: https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/5/using-data-from-additional-sources-to-fill-in-missing-values

of course, we are given ready “regions” that contain countries and corresponding regions.
Could somebody show the code of how “regions” was built?
I tried to do this on my own, but failed.

Thank you in advance!


regions.csv (5.2 KB)

:point_up_2: is the dataset, and creating the DataFrame is as below :point_down:

import pandas as pd
regions = pd.read_csv('regions.csv')

This would give you your expected results

regions = happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x',axis=1)

Hints are already given in the mission


I think I could not explain my question clearly.
I have already understood your answers. There are missings values in REGION in 2017, and dataquest itself has built regions dataframe from corresponding values from 2015 ,2016 and 2017 dataframes. How does dataquest built this regions dataframe. I myself tried to build dataframe like regions.csv but couldn’t do so.

Could someone show how the regions dataframe was built?
Thank you in advance!

1 Like

Could you show how I can perform first step there?
I cannot do the first step in hints.

Thank you beforehand!

1 Like

I don’t think these are steps to generate regions csv file it just hints.

By first point it saying create regions file somehow using these files happiness2015 or happiness2016 or happiness2017 files.

In second point giving hint to use pd.merge function like

happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY')

and then last to drop extra columns like

happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x', axis=1)

There are some null values with this approach:C

The best way which I found is
combined[['COUNTRY', 'REGION']].dropna().drop_duplicates()
pd.concat([happiness2015[['COUNTRY', 'REGION']], happiness2016[['COUNTRY', 'REGION']]]).drop_duplicates().dropna().reset_index(drop = True)


Hey guys,

I had been struggling to resolve this problem too, and this is my current solution after spending sometimes looking at https://kanoki.org/2019/08/17/pandas-coalesce-replace-value-from-another-column/ to figure out how to replace missing region values in 2016 with region values in 2015 and vice versa.


import numpy as np


# Merge dataframe include only COUNTRY & REGION columns, of happiness2015 and happiness2016


regions = happiness2015[[‘COUNTRY’,‘REGION’]].merge(happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’)


# replace missing values of region in 2015 by values in 2016 and vice versa

regions[‘REGION’] = np.where(regions2015_merged_2016[‘REGION_x’].isnull(),regions2015_merged_2016[‘REGION_y’],regions2015_merged_2016[‘REGION_x’])
regions[‘REGION’] = np.where(regions2015_merged_2016[‘REGION_y’].isnull(),regions2015_merged_2016[‘REGION_x’],regions2015_merged_2016[‘REGION_y’])

1 Like

I tried your solution but it should 7 columns have missing data.

regions1 = happiness2015[[‘COUNTRY’,‘REGION’]].merge(happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’).drop(‘REGION_x’,axis=1)


I tried this:

#Get all rows with REGION not null
#Get country & REGION pair based on the above boolean


#Concatenate x2 and x4 to get the list of all country, region from 2015 and 2016 with region as not null

remove duplicates using drop_duplicates()


#Check shape of combined1 and regions

You could try using this:

regions = pd.merge(happiness2015[[‘COUNTRY’,‘REGION’]], happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’, how= ‘outer’)

regions[“REGION”] = regions[“REGION_x”].fillna(regions[“REGION_y”])
Above step is done to include region for both columns _x and _y. Some columns have region in _y column but NaN in _x column.
Thus we created a new column combining region of both the columns.

regions.drop([“REGION_x”, “REGION_y”], axis= 1)

Please see the solution below. I was able to create same dataframe as the provided regions dataframe by using the steps below:

 data_2015 =happiness2015[['COUNTRY','REGION']]
data_2016 =happiness2016[['COUNTRY','REGION']]

merged = pd.merge(left = data_2015, right= data_2016, how= 'inner', on ='COUNTRY') # this has missing values in REGION_x and REGION_y which has missing values updated. So, we can drop REGION_x as it does not add any benefit to our analysis.

merged = pd.merge(left = data_2015, right= data_2016, how= 'inner', on ='COUNTRY').drop('REGION_x', axis = 'columns')

I used information from This stackoverflow question and think I figured it out.

regions_new = pd.merge(happiness2015[['COUNTRY','REGION']], happiness2016[['COUNTRY','REGION']], on='COUNTRY')

regions_new = regions_new.rename(columns={"REGION_y":"REGION"}) 

regions_new.update(regions_new[["COUNTRY", "REGION_x"]].merge(happiness2015, 'left'))

regions_new = regions_new.drop("REGION_x", axis=1)

I tried .fillna(regions[“REGION_y”]) as mentioned above and that did not work for me. but the update() method did.

Crystal clear sir thank you so much !! .

I tried this and I was able to reproduce the regions df

region = combined.drop_duplicates([‘COUNTRY’, ‘REGION’])
regions = region[region[‘REGION’].notnull()][[‘COUNTRY’, ‘REGION’]]