Advice on solving this problem

I have two CSV files: financial aid status and countries. Both files have two columns: an index column ‘username’ and an ‘answer’ column. I’ve read these into df and created a new df through a concat on these two datafames. The question I need to answer is “Among countries that have 4 or more financial aid recipients, what are the top 4 countries with the highest percentage of financial aid recipients”? A financial aid recipient is someone who answered either B or C in their financial aid status. The percentage of financial aid recipients is the number of users who received financial aid divided by the total number of users.

The way I am thinking of approaching this is:
a) create a series base on the countries column
b) for each unique country in the series, I will select all the rows from that match the country, and the rows where financial_aid_answer is either B or C. If the number of rows > 4, then I will add that a new row to a new_df called that has [country, # of financial aid recipients, # of users]
c) I will sort the new_df and then return new_df.head(4) for the top 4 countries.

I am very new to dataframes and I thought they would be right for solving this challenge. But my logic on how this can be done is quite fuzzy. Would love some advice or thoughts on how best to approach this. Thank you!

I’m thinking of these steps

  1. Join the tables on username column (you can use df.join, df.merge, pd.concat for this
  2. Add boolean column indicating if a user answered B
  3. Add boolean column indicating if a user answered C
  4. OR columns from 2 and 3 indicating if user answered B or C
  5. group by country, aggregate by mean of column generated in step 4. mean works because True False is subtype of integer so integer operations work on them. eg. mean([True,True,False]) = 0.666…

Emphasis on step 5, note that you don’t need to for loop manually or create new dataframes.

Steps 2,3,4 are split up to break down the steps in case the analysis changes direction, helps debugging too.
Some may want to combine the 3 steps into 1 with series.isin or series.str.contains.

1 Like

Thanks hanqi! I am trying to do this but getting stuck. After I join the tables on username column, I can no longer call on the username column. Not sure how to move forward. Any advice would be greatly appreciated!

import pandas as pd

#read in the two CSV files into dataframes
f_c_df = pd.read_csv('fin_aid_status.csv', index_col=0)
countries_df = pd.read_csv('countries.csv', index_col=0)

# Rename the Answer columns to more descriptive column names
f_c_df = f_c_df.rename(columns={'Answer': 'Fin_Aid_Status'})
countries_df = countries_df.rename(columns={'Answer': 'Country'})

# join the two dataframes
user_data = countries_df.join(f_c_df)
#Create a Groupby Object based on user_data and grouped by country and Fin_Aid_Status
data_by_country_by_FAStatus = user_data.groupby(['Country', 'Fin_Aid_Status'])

Traceback (most recent call last):
File “/Users/lancyya/PycharmProjects/DLS_Exercise/venv/”, line 23, in
File “/Users/lancyya/PycharmProjects/DLS_Exercise/venv/lib/python3.9/site-packages/pandas/core/groupby/”, line 1538, in getitem
return super().getitem(key)
File “/Users/lancyya/PycharmProjects/DLS_Exercise/venv/lib/python3.9/site-packages/pandas/core/”, line 232, in getitem
raise KeyError(f"Column not found: {key}")
KeyError: ‘Column not found: username’

df.columns to see what columns are available, or just print the whole df and manually check. Think about what happens when you join two tables and both contain a column with the same name, which one does the result use? How does joining even work? (pandas tries to implement the joining behaviour of sql)