Challenge: Cleaning Data

I answered correctly and then looked up Dataquest's official answer to compare. I then realized that you don't have to iterate over columns in pandas; you can do something much better, like this:

cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]
true_avengers["Deaths"] = (true_avengers[cols] == "YES").apply(sum, axis=1)

Even better would be to get the column names with a regex, something like "^Death\d".
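
The regex idea can be sketched as follows, assuming the frame has columns named Death1 through Death5. The small `true_avengers` frame and the `Deathless` column are made-up sample data for illustration, not the real dataset. `DataFrame.filter(regex=...)` keeps the columns whose labels match the pattern:

```python
import pandas as pd

# Made-up sample rows standing in for the real true_avengers data.
true_avengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Death1": ["YES", "NO", "YES"],
    "Death2": ["YES", None, "NO"],
    "Death3": [None, None, None],
    "Death4": [None, None, None],
    "Death5": [None, None, None],
    "Deathless": [0, 0, 0],  # hypothetical column the regex should skip
})

# filter(regex=...) keeps columns whose label matches the pattern (re.search).
cols = true_avengers.filter(regex=r"^Death\d$").columns.tolist()

# Count the "YES" entries per row across the selected columns.
deaths = (true_avengers[cols] == "YES").sum(axis=1)
```

Anchoring the pattern with `$` keeps a column such as `Deathless` out of the selection.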

But the official solution is much longer. Let's compare:

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)

The main point is that:
true_avengers[cols] == "YES"
returns a boolean DataFrame, so we can call any DataFrame method on it directly. Pretty useful with boolean operations!
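
To make that concrete, here is a minimal sketch on made-up sample rows (not the real dataset). It checks that summing the boolean frame row by row gives the same counts as a row-wise function written in the spirit of the official solution; `count_yes` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Made-up sample rows; NaN marks a missing entry.
sample = pd.DataFrame(
    [
        ["YES", "YES", np.nan, np.nan, np.nan],
        ["NO", np.nan, np.nan, np.nan, np.nan],
        ["YES", "NO", "YES", np.nan, np.nan],
    ],
    columns=cols,
)

# Comparing a DataFrame to a scalar yields a boolean DataFrame of the same shape.
is_yes = sample[cols] == "YES"

# True counts as 1, so a row-wise sum gives the number of "YES" entries.
vectorized = is_yes.sum(axis=1)

def count_yes(row):
    """Row-wise count of "YES" values (hypothetical helper for comparison)."""
    return sum(1 for c in cols if row[c] == "YES")

row_wise = sample.apply(count_yes, axis=1)
```

NaN compared with "YES" is simply False, so missing values need no special handling here.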


This is awesome - thanks for comparing and for your thoughtfulness here.

Hello all!

I too have come up with a different solution than DQ; however, mine doesn't pass the DQ error checks. Here's what I did:

deaths = {"YES": 1, "NO": 0}
death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

for col in death_cols:
    true_avengers[col] = true_avengers[col].map(deaths)

true_avengers["Deaths"] = true_avengers[death_cols].sum(axis=1, skipna=True).astype(int)
true_avengers.Death1 + true_avengers.Death2 + true_avengers.Death3 + true_avengers.Death4 + true_avengers.Death5
true_avengers.Deaths.sum()  # total deaths: 88
true_avengers.Deaths.describe()  # confirmed the numbers to be the same as in the DQ solution

count    159.000000
mean       0.553459
std        0.768426
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        5.000000
Name: Deaths, dtype: float64

Am I missing something, or is DQ very picky about the path you choose to solve the problem? If that's the case, it's very frustrating… Just for reference, here's the original DQ solution:

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)

.sum() and .describe() produce the same results as my code above. However, to pass the mission I had to use DQ's code :-/
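
One possible explanation, which is only a guess since the thread doesn't show how the checker works: the loop in that attempt overwrites Death1 through Death5 with 1/0/NaN, so a checker that compares more than the Deaths column would see changed data. A sketch of the same mapping idea that leaves the original columns untouched, on made-up sample rows:

```python
import numpy as np
import pandas as pd

deaths = {"YES": 1, "NO": 0}
death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Made-up sample rows standing in for the real true_avengers data.
true_avengers = pd.DataFrame(
    [
        ["YES", "YES", np.nan, np.nan, np.nan],
        ["NO", np.nan, np.nan, np.nan, np.nan],
        ["YES", "NO", "YES", np.nan, np.nan],
    ],
    columns=death_cols,
)
original = true_avengers[death_cols].copy()  # kept only to verify nothing changes

# Map each column into a temporary frame instead of assigning back in place.
mapped = true_avengers[death_cols].apply(lambda col: col.map(deaths))

# Missing values are skipped, so a row with no "YES" entries sums to 0.
true_avengers["Deaths"] = mapped.sum(axis=1, skipna=True).astype(int)
```

Whether this variant would pass the mission is untested; it only shows how to avoid mutating the source columns.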

I used a simpler syntax and it passed:

deaths_df = true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']].applymap(lambda x: 1 if x == 'YES' else 0)
              
true_avengers['Deaths'] = deaths_df.sum(axis=1)
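
A caveat for readers on newer pandas: `DataFrame.applymap` is deprecated as of pandas 2.1 in favor of `DataFrame.map`. A small sketch that picks whichever is available, again on made-up sample rows:

```python
import numpy as np
import pandas as pd

death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Made-up sample rows standing in for the real true_avengers data.
true_avengers = pd.DataFrame(
    [
        ["YES", "YES", np.nan, np.nan, np.nan],
        ["NO", np.nan, np.nan, np.nan, np.nan],
        ["YES", "NO", "YES", np.nan, np.nan],
    ],
    columns=death_cols,
)

subset = true_avengers[death_cols]

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = subset.map if hasattr(subset, "map") else subset.applymap

deaths_df = elementwise(lambda x: 1 if x == "YES" else 0)
true_avengers["Deaths"] = deaths_df.sum(axis=1)
```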