Of course, we are given a ready-made “regions” dataset that contains countries and their corresponding regions.
Could somebody show the code for how “regions” was built?
I tried to do this on my own, but failed.
Thank you in advance!
regions.csv (5.2 KB)
is the dataset, and the DataFrame is created as below:
import pandas as pd
regions = pd.read_csv('regions.csv')
This would give you your expected results
regions = happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x',axis=1)
Hints are already given in the mission
I think I could not explain my question clearly.
I have already understood your answers. There are missing values in REGION in 2017, and Dataquest itself built the regions dataframe from the corresponding values in the 2015, 2016, and 2017 dataframes. How did Dataquest build this regions dataframe? I tried to build a dataframe like regions.csv myself but couldn’t do so.
Could someone show how the regions dataframe was built?
Thank you in advance!
Could you show how I can perform the first step there?
I cannot do the first step in the hints.
Thank you in advance!
I don’t think these are steps to generate the regions.csv file; they are just hints.
The first point says to create the regions file somehow, using the happiness2015, happiness2016, or happiness2017 files.
The second point gives a hint to use the pd.merge function, like
happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY')
and then, finally, to drop the extra column, like
happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x', axis=1)
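To see where the REGION_x and REGION_y suffixes come from, here is a toy sketch with made-up mini tables (h2015 and h2016 are hypothetical stand-ins, not the actual happiness datasets):

```python
import pandas as pd

# Hypothetical mini versions of the 2015 and 2016 tables
h2015 = pd.DataFrame({'COUNTRY': ['Norway', 'Chad'],
                      'REGION': ['Western Europe', 'Sub-Saharan Africa']})
h2016 = pd.DataFrame({'COUNTRY': ['Norway', 'Chad'],
                      'REGION': ['Western Europe', 'Sub-Saharan Africa']})

# Both inputs have a REGION column, so pandas suffixes them _x and _y
merged = h2015.merge(h2016, on='COUNTRY')
print(merged.columns.tolist())  # ['COUNTRY', 'REGION_x', 'REGION_y']

# Dropping REGION_x leaves one country/region pair per row
regions = merged.drop('REGION_x', axis=1)
```

The drop is what brings the result back down to a two-column COUNTRY/REGION table.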
There are some null values with this approach:
combined[['COUNTRY', 'REGION']].dropna().drop_duplicates()
pd.concat([happiness2015[['COUNTRY', 'REGION']], happiness2016[['COUNTRY', 'REGION']]]).drop_duplicates().dropna().reset_index(drop = True)
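A small sketch of that concat approach on made-up frames (not the real datasets): a country whose REGION is missing in one year but present in the other keeps its non-null region, because drop_duplicates keeps both versions of the pair and dropna then removes only the null one.

```python
import pandas as pd
import numpy as np

# Hypothetical mini tables; Taiwan's region is missing in 2015 but present in 2016
h2015 = pd.DataFrame({'COUNTRY': ['Norway', 'Chad', 'Taiwan'],
                      'REGION': ['Western Europe', 'Sub-Saharan Africa', np.nan]})
h2016 = pd.DataFrame({'COUNTRY': ['Norway', 'Taiwan'],
                      'REGION': ['Western Europe', 'Eastern Asia']})

# Stack both years, drop repeated pairs, then drop the null-region rows
regions = (pd.concat([h2015[['COUNTRY', 'REGION']], h2016[['COUNTRY', 'REGION']]])
             .drop_duplicates()
             .dropna()
             .reset_index(drop=True))
print(regions)  # three rows, no nulls
```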
Hey guys,
I had been struggling to resolve this problem too, and this is my current solution after spending some time looking at https://kanoki.org/2019/08/17/pandas-coalesce-replace-value-from-another-column/ to figure out how to replace missing region values in 2016 with region values in 2015 and vice versa.
import numpy as np

# Merge the COUNTRY & REGION columns of happiness2015 and happiness2016
regions = happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY')

# Replace missing region values from 2015 with values from 2016 and vice versa
regions['REGION'] = np.where(regions['REGION_x'].isnull(), regions['REGION_y'], regions['REGION_x'])
regions['REGION'] = np.where(regions['REGION_y'].isnull(), regions['REGION_x'], regions['REGION_y'])
regions['REGION'].isnull().sum()
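The np.where pattern is a general way to coalesce two columns; a minimal sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Hypothetical rows where each column is missing a different value
df = pd.DataFrame({'REGION_x': ['Western Europe', np.nan],
                   'REGION_y': [np.nan, 'Eastern Asia']})

# Coalesce: take REGION_x where present, otherwise fall back to REGION_y
df['REGION'] = np.where(df['REGION_x'].isnull(), df['REGION_y'], df['REGION_x'])
print(df['REGION'].tolist())  # ['Western Europe', 'Eastern Asia']
```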
I tried your solution, but it showed 7 rows with missing data.
regions1 = happiness2015[['COUNTRY','REGION']].merge(happiness2016[['COUNTRY','REGION']], on='COUNTRY').drop('REGION_x', axis=1)
print(regions1['REGION_y'].isnull().sum())
I tried this:
# Get all rows with REGION not null
x1 = happiness2015['REGION'].notnull()
# Get COUNTRY & REGION pairs based on the above boolean
x2 = happiness2015[x1][['COUNTRY', 'REGION']]
x3 = happiness2016['REGION'].notnull()
x4 = happiness2016[x3][['COUNTRY', 'REGION']]
# Concatenate x2 and x4 to get all country/region pairs from 2015 and 2016 with REGION not null
combined1 = pd.concat([x2, x4]).drop_duplicates()
#Check shape of combined1 and regions
print(combined1.shape)
print(regions.shape)
You could try using this:
regions = pd.merge(happiness2015[['COUNTRY','REGION']], happiness2016[['COUNTRY','REGION']], on='COUNTRY', how='outer')
regions["REGION"] = regions["REGION_x"].fillna(regions["REGION_y"])
The step above fills in a region for every row: some rows have a region in the _y column but NaN in the _x column, so we create a new column combining the regions from both.
regions = regions.drop(["REGION_x", "REGION_y"], axis=1)
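Putting the outer-merge-plus-fillna steps together on tiny made-up frames (left and right here are hypothetical stand-ins for the 2015 and 2016 tables):

```python
import pandas as pd
import numpy as np

left = pd.DataFrame({'COUNTRY': ['Norway', 'Taiwan'],
                     'REGION': ['Western Europe', np.nan]})
right = pd.DataFrame({'COUNTRY': ['Taiwan', 'Chad'],
                      'REGION': ['Eastern Asia', 'Sub-Saharan Africa']})

# Outer merge keeps countries that appear in only one year
regions = pd.merge(left, right, on='COUNTRY', how='outer')

# Coalesce the suffixed columns, then drop them
regions['REGION'] = regions['REGION_x'].fillna(regions['REGION_y'])
regions = regions.drop(['REGION_x', 'REGION_y'], axis=1)
print(regions)  # every country ends up with a non-null region
```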
Please see the solution below. I was able to create same dataframe as the provided regions dataframe by using the steps below:
data_2015 =happiness2015[['COUNTRY','REGION']]
data_2016 =happiness2016[['COUNTRY','REGION']]
merged = pd.merge(left = data_2015, right= data_2016, how= 'inner', on ='COUNTRY') # this has missing values in REGION_x, while REGION_y has those values filled in, so we can drop REGION_x as it does not add any benefit to our analysis
merged = pd.merge(left = data_2015, right= data_2016, how= 'inner', on ='COUNTRY').drop('REGION_x', axis = 'columns')
I used information from this Stack Overflow question and think I figured it out.
regions_new = pd.merge(happiness2015[['COUNTRY','REGION']], happiness2016[['COUNTRY','REGION']], on='COUNTRY')
regions_new = regions_new.rename(columns={"REGION_y":"REGION"})
regions_new.update(regions_new[["COUNTRY", "REGION_x"]].merge(happiness2015, 'left'))
regions_new = regions_new.drop("REGION_x", axis=1)
I tried .fillna(regions[“REGION_y”])
as mentioned above and that did not work for me. but the update()
method did.
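For anyone curious why update() works here: it aligns on the index and overwrites the caller's values with non-NaN values from the other frame. A minimal sketch with made-up rows:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with a missing region, plus a lookup frame to fill from
regions_new = pd.DataFrame({'COUNTRY': ['Norway', 'Taiwan'],
                            'REGION': ['Western Europe', np.nan]})
fills = pd.DataFrame({'COUNTRY': ['Norway', 'Taiwan'],
                      'REGION': ['Western Europe', 'Eastern Asia']})

# update() modifies regions_new in place, aligning on index and columns,
# taking every non-NaN value from fills (so the rows must line up)
regions_new.update(fills)
print(regions_new['REGION'].tolist())  # ['Western Europe', 'Eastern Asia']
```

Note that update() works positionally by index, which is why the merge in the solution above is done first to get the rows aligned.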
Crystal clear, sir. Thank you so much!
I tried this and I was able to reproduce the regions df
region = combined.drop_duplicates(['COUNTRY', 'REGION'])
regions = region[region['REGION'].notnull()][['COUNTRY', 'REGION']]