A bit disappointed by suggested solution on Data Cleaning Challenge

https://app.dataquest.io/m/102/challenge%3A-cleaning-data/5/consolidating-deaths

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)

The above is the suggested answer. I had submitted an answer quite similar to the above (although I just used nested loops instead of apply), but I had been hoping for a suggested answer that was much more concise and elegant than this. Anyone have a better way of doing this? I could do this faster in Excel

2 Likes

I’m thinking that implementing the above but with a lambda function inside apply could shorten it a lot, but might be hard to read the code.

Hey, Preston. I think the given solution is perfectly fine. It uses a function that abstracts way the death count and then applies it.

At any rate, here’s a one-liner:

true_avengers["Deaths"] = true_avengers[
    [c for c in true_avengers.columns if c.startswith("Death")]
].fillna("").sum(axis=1).str.count("YES")

Here’s the strategy:

  • [c for c in true_avengers.columns if c.startswith("Death")] just gets the column names because I couldn’t be bothered with typing them, even thought it ends up being about the same length of characters.
  • DataFrame.fillna replaces the missing values with empty strings — this is necessary to properly employ the next step.
  • DataFrame.sum(axis=1) sums each row. When you sum strings, this is the same as concatenating them.
  • Finally we count the number of occurrences of YES to get the total number of deaths.

I hope this is the kind of solution you were looking for.

10 Likes

Hey Preston,

I ended up using a lambda function that seemed to work.

death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]
true_avengers["Deaths"] = true_avengers[death_cols].apply(lambda x: (x=="YES").sum(), axis=1)

I don’t think it’s too hard to read. Anyway, just shows how many ways there are to do the same thing!

14 Likes

Hello Preston,

you don’t need any loops to solve the task. You can use the mask function to convert the strings into integers and then sum them up by column:

death_fields = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

fill_na = true_avengers[death_fields].fillna('NO')

true_avengers['Deaths'] = fill_na.mask(fill_na == 'YES', 1).mask(fill_na == 'NO', 0).sum(axis=1)
4 Likes

Hello!

I focused to ‘YES’ values as they give the answer.
And avoided loops :slight_smile: :

Death_cols = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
true_avengers_bool = true_avengers[Death_cols] == 'YES'
true_avengers['Deaths'] = true_avengers_bool.sum(axis=1)


Maxim.

15 Likes

Hi everyone,

I would have the suggestion to update the “Data Cleaning Project Walkthrough”. There were e.g. several examples during this course, where vectorized string methods were not used - although they were taught in earlier missions.

One of the great joys of programming is the myriad ways in which a problem can be solved. Especially these challenges at the end of step 2 would be a great opportunity for synthesis.
Both approaches from @Bruno and @smaximm could be compared side by side and shown as viable approaches to equally solve the same problem using technics thought earlier in this learning path.
Anyway, thanks for all the great content! Dataquest is a great learning resource!

Best, Tim

7 Likes

Hi @tim1albers,

Thank you for the suggestion! I will let the content team know about it.

Best,
Sahil

2 Likes

Yes, Please.
Use the below given much simpler and more elegant solution based on the power of boolean indexing. No need to write such complicated functions. Just write a simple function and use boolean indexing.

print(true_avengers['Death1'].dtypes)
def is_death(x):
    
    if x=='YES':
        return True
    else:
        return False
Death_columns=['Death1','Death2','Death3','Death4','Death5'] 
dead_avengers=true_avengers[Death_columns]
for c in Death_columns:
    dead_avengers[c]=dead_avengers[c].apply(is_death)

dead_avengers['Deaths']=dead_avengers[Death_columns].sum(axis=1)

true_avengers['Deaths']=dead_avengers['Deaths']
print(true_avengers['Deaths'].head(20))
2 Likes

Hi Preston. Here’s another suggestion that may be complementary to the answers above. I think is simple and easy to understand. Let me know your thoughts.

true_avengers['Deaths'] = 0
for i in range(1,6):
    true_avengers['Deaths'] +=  true_avengers["Death{0}".format(i)].apply(lambda x: 1 if str(x).upper() == "YES" else 0)  
1 Like

Hey All,

Just curious if there is any benefit to the proposed solution? It seems unnecessary to loop through each column within the clean_deaths function.

This was my solution, still utilizing a separate function to make the code look cleaner:

def count_deaths(row):
    deaths = row.str.contains("YES").sum()
    return deaths

death_cols = ['Death1','Death2','Death3','Death4','Death5']
true_avengers['Deaths'] = true_avengers[death_cols].apply(count_deaths, axis=1)

Or was the solution provided a more common way to sum data when dealing with more complex data in each column?

1 Like

@smaximm I’ve done the same :smirk:

Since we only want to count towards the “YES” value, here’s what I did

columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
true_avengers['Deaths'] = (true_avengers[columns] == 'YES').sum(axis=1)
1 Like

I did that:

def to_num(x):
    if x == 'YES':
        return 1
    elif x == 'NO':
        return 0
    else:
        return 0

true_avengers['Deaths'] = true_avengers[ ['Death1', 'Death2', 'Death3', 'Death4', 'Death5'] ].applymap(to_num).sum(axis=1)

it’s probably worth comparing performance of each to see if they are equivalent on this aspect.

this was it my answer:

columns = ['Death1','Death2','Death3','Death4','Death5']
true_avengers['Deaths'] = (true_avengers[columns] == 'YES').sum(axis=1)
2 Likes

This solution, i tried on my own. Instead of using looping for columns in Dataframe

col_list =

#Define the range of Columns
for i in range(1,6):
v = “Death{}”.format(i)
#true_avengers[v].fillna(‘NO’)
col_list.append(v)

#Sum along the columns
true_avengers[‘Deaths’] = (true_avengers[col_list]==“YES”).sum(axis=1)

print(true_avengers[‘Deaths’])

1 Like

Hi, everyone!
I´ve checked the performance of all the solutions except suggested by @Bruno (the following error araised: Can only use .str accessor with string values, which use np.object_ dtype in pandas).
The fastest one is the one suggested by @smaximm:

and the second fastest is the one-line representation of the latter suggested by @anthony.cua and @vorunplz:

columns = ['Death1','Death2','Death3','Death4','Death5']
true_avengers['Deaths'] = (true_avengers[columns] == 'YES').sum(axis=1)   

The difference in execution time is minimum, only 0.001 secs, so it´s a question of tastes which one to choose if you are looking for the efficiency.

I like using dictionaries to convert values to numeric.

import numpy as np

convert_map = {'YES': 1, 'NO': 0, np.nan: 0}
columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

true_avengers['Deaths'] = true_avengers[columns].apply(lambda m: sum(m.map(convert_map)), axis=1)
1 Like

It is almost the same as my answer:

death_cols = ['Death'+str(x) for x in range(1,6)]
true_avengers['Deaths'] = (avengers[death_cols]=='YES').sum(axis=1)
1 Like

this method is really smart