Working With Missing And Duplicate Data : constructing a dataframe

https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/5/using-data-from-additional-sources-to-fill-in-missing-values

(1)
Regarding the regions dataframe, I wondered how I would construct it in order to complete this task, since I am working with the csv files from Kaggle?

I saw some information here but it is too complex
https://stackoverflow.com/questions/39748413/fill-missing-values-of-1-data-frame-from-another-data-frame-using-pandas

https://datascience.stackexchange.com/questions/17769/how-to-fill-missing-value-based-on-other-columns-in-pandas-dataframe

Recall once more that each year contains the same countries. Since the regions are fixed values - the region a country was assigned to in 2015 or 2016 won’t change - we should be able to assign the 2015 or 2016 region to the 2017 row.

In order to do so, we’ll use the following strategy:

  1. Create a dataframe containing all of the countries and corresponding regions from the happiness2015 , happiness2016 , and happiness2017 dataframes.
  2. Use the pd.merge() function to assign the REGION in the dataframe above to the corresponding country in combined .
  3. The result will have two region columns - the original column with missing values will be named REGION_x . The updated column without missing values will be named REGION_y . We’ll drop REGION_x to eliminate confusion.

We’ve already created a dataframe named regions containing all of the countries and corresponding regions from the happiness2015 , happiness2016 , and happiness2017 dataframes.

  • Use the pd.merge() function to assign the REGION in the regions dataframe to the corresponding country in combined .
    • Set the left parameter equal to combined .
    • Set the right parameter equal to regions .
    • Set the on parameter equal to 'COUNTRY' .
    • Set the how parameter equal to 'left' to make sure we don’t drop any rows from combined .
    • Assign the result back to combined .
  • Use the DataFrame.drop() method to drop the original region column with missing values, now named REGION_x .
    • Pass 'REGION_x' into the df.drop() method.
    • Set the axis parameter equal to 1 .
    • Assign the result back to combined .
  • Use the DataFrame.isnull() and DataFrame.sum() methods to check for missing values. Assign the result to a variable named missing

Hi @jamesberentsen

I don’t actually understand your problem. I mean, the instructions are pretty clear. So can you be more specific about what you don’t understand?

Thanks

Hi alegiraldo666

Sure, so it is mentioned that –

we’ve already created a dataframe named regions containing all of the countries and corresponding regions from the happiness2015 , happiness2016 , and happiness2017 dataframes

I want to know how to create this dataframe.

I get this error otherwise

I think because I am working on jupyter locally and I am working with the csv files from Kaggle not the ones from exercise

Regards
JB

.

Ok i get it now.

regions is a dataframe that only have two columns country and region. So to create it you need to combine the three original dataframes, in the exersice they did this

regions = pd.merge(left=wh_15, right=wh_16, on=['Country', 'Region'], how='left')
regions = pd.merge(left=regions, right=wh_17, on='Country', how='left')

If you look at the files from Kaggle the last 3 years doesn’t have a region column, but it doesn’t matter. You need to do 4 merges. After this, you can continue with the cell that you’re showing.

Hope you understand, i’ll be here if you need more help or if you have more questions

Good luck!

1 Like

Thanks alegiraldo666,


That created the regions dataframe, but when I ran everything again from the reading the csv files to the end I got an error —

Since I got an error
when I tried to run the next part –

combined = pd.merge(left=combined, right=regions, on='COUNTRY', how='left')
combined = combined.drop('REGION_x', axis = 1)
missing = combined.isnull().sum()




    628         # validate the merge keys dtypes. We may need to coerce

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _get_merge_keys(self)
    973                     if not is_rkey(rk):
    974                         if rk is not None:
--> 975                             right_keys.append(right._get_label_or_level_values(rk))
    976                         else:
    977                             # work-around for merge_asof(right_index=True)

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
   1772             values = self.axes[axis].get_level_values(key)._values
   1773         else:
-> 1774             raise KeyError(key)
   1775 
   1776         # Check for duplicates

KeyError: 'COUNTRY'

Hi @jamesberentsen,

COUNTRY

Column names of dataframes are case sensitive.

In this screenshot:

You are using Country, So it is highly possible to be an error due to incorrect case

Best,
Sahil

1 Like

Many thanks Sahil,

Yes that is correct , that was cause of error and so I managed to fix now.

Thanks
JB

1 Like