Missing values for regions in 2017

screen link: https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/5/using-data-from-additional-sources-to-fill-in-missing-values

of course, we are given ready “regions” that contain countries and corresponding regions.
Could somebody show the code of how “regions” was built?
I tried to do this on my own, but failed.

Thank you in advance!

2 Likes

regions.csv (5.2 KB)

:point_up_2: is the dataset, and creating the DataFrame is as below :point_down:

import pandas as pd
regions = pd.read_csv('regions.csv')

This would give you your expected results

regions = happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x',axis=1)

Hints are already given in the mission
image

I think I could not explain my question clearly.
I have already understood your answers. There are missings values in REGION in 2017, and dataquest itself has built regions dataframe from corresponding values from 2015 ,2016 and 2017 dataframes. How does dataquest built this regions dataframe. I myself tried to build dataframe like regions.csv but couldn’t do so.

Could someone show how the regions dataframe was built?
Thank you in advance!

Could you show how I can perform first step there?
I cannot do the first step in hints.

Thank you beforehand!

1 Like

I don’t think these are steps to generate regions csv file it just hints.

By first point it saying create regions file somehow using these files happiness2015 or happiness2016 or happiness2017 files.

In second point giving hint to use pd.merge function like

happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY')

and then last to drop extra columns like

happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x', axis=1)
1 Like

There are some null values with this approach:C
image


The best way which I found is
combined[['COUNTRY', 'REGION']].dropna().drop_duplicates()
or
pd.concat([happiness2015[['COUNTRY', 'REGION']], happiness2016[['COUNTRY', 'REGION']]]).drop_duplicates().dropna().reset_index(drop = True)

2 Likes

Hey guys,

I had been struggling to resolve this problem too, and this is my current solution after spending sometimes looking at https://kanoki.org/2019/08/17/pandas-coalesce-replace-value-from-another-column/ to figure out how to replace missing region values in 2016 with region values in 2015 and vice versa.

`

import numpy as np

`

# Merge dataframe include only COUNTRY & REGION columns, of happiness2015 and happiness2016

`

regions = happiness2015[[‘COUNTRY’,‘REGION’]].merge(happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’)

`

# replace missing values of region in 2015 by values in 2016 and vice versa

regions[‘REGION’] = np.where(regions2015_merged_2016[‘REGION_x’].isnull(),regions2015_merged_2016[‘REGION_y’],regions2015_merged_2016[‘REGION_x’])
regions[‘REGION’] = np.where(regions2015_merged_2016[‘REGION_y’].isnull(),regions2015_merged_2016[‘REGION_x’],regions2015_merged_2016[‘REGION_y’])
regions[‘REGION’].isnull().sum()

1 Like

I tried your solution but it should 7 columns have missing data.

regions1 = happiness2015[[‘COUNTRY’,‘REGION’]].merge(happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’).drop(‘REGION_x’,axis=1)

print(regions1[‘REGION_y’].isnull().sum())

I tried this:

#Get all rows with REGION not null
#Get country & REGION pair based on the above boolean
x1=happiness2015[‘REGION’].notnull()
x2=happiness2015[x1][[‘COUNTRY’,‘REGION’]]

x3=happiness2016[‘REGION’].notnull()
x4=happiness2016[x3][[‘COUNTRY’,‘REGION’]]

#Concatenate x2 and x4 to get the list of all country, region from 2015 and 2016 with region as not null

remove duplicates using drop_duplicates()

combined1=pd.concat([x2,x4]).drop_duplicates()

#Check shape of combined1 and regions
print(combined1.shape)
print(regions.shape)

You could try using this:

regions = pd.merge(happiness2015[[‘COUNTRY’,‘REGION’]], happiness2016[[‘COUNTRY’,‘REGION’]], on=‘COUNTRY’, how= ‘outer’)

regions[“REGION”] = regions[“REGION_x”].fillna(regions[“REGION_y”])
Above step is done to include region for both columns _x and _y. Some columns have region in _y column but NaN in _x column.
Thus we created a new column combining region of both the columns.

regions.drop([“REGION_x”, “REGION_y”], axis= 1)