def clean_deaths(row):
num_deaths = 0
columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
for c in columns:
death = row[c]
if pd.isnull(death) or death == 'NO':
continue
elif death == 'YES':
num_deaths += 1
return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
The above is the suggested answer. I had submitted an answer quite similar to the above (although I just used nested loops instead of apply), but I had been hoping for a suggested answer that was much more concise and elegant than this. Anyone have a better way of doing this? I could do this faster in Excel
Hey, Preston. I think the given solution is perfectly fine. It uses a function that abstracts way the death count and then applies it.
At any rate, hereās a one-liner:
true_avengers["Deaths"] = true_avengers[
[c for c in true_avengers.columns if c.startswith("Death")]
].fillna("").sum(axis=1).str.count("YES")
Hereās the strategy:
[c for c in true_avengers.columns if c.startswith("Death")] just gets the column names because I couldnāt be bothered with typing them, even thought it ends up being about the same length of characters.
DataFrame.fillna replaces the missing values with empty strings ā this is necessary to properly employ the next step.
DataFrame.sum(axis=1) sums each row. When you sum strings, this is the same as concatenating them.
Finally we count the number of occurrences of YES to get the total number of deaths.
I hope this is the kind of solution you were looking for.
I would have the suggestion to update the āData Cleaning Project Walkthroughā. There were e.g. several examples during this course, where vectorized string methods were not used - although they were taught in earlier missions.
One of the great joys of programming is the myriad ways in which a problem can be solved. Especially these challenges at the end of step 2 would be a great opportunity for synthesis.
Both approaches from @Bruno and @smaximm could be compared side by side and shown as viable approaches to equally solve the same problem using technics thought earlier in this learning path.
Anyway, thanks for all the great content! Dataquest is a great learning resource!
Yes, Please.
Use the below given much simpler and more elegant solution based on the power of boolean indexing. No need to write such complicated functions. Just write a simple function and use boolean indexing.
print(true_avengers['Death1'].dtypes)
def is_death(x):
if x=='YES':
return True
else:
return False
Death_columns=['Death1','Death2','Death3','Death4','Death5']
dead_avengers=true_avengers[Death_columns]
for c in Death_columns:
dead_avengers[c]=dead_avengers[c].apply(is_death)
dead_avengers['Deaths']=dead_avengers[Death_columns].sum(axis=1)
true_avengers['Deaths']=dead_avengers['Deaths']
print(true_avengers['Deaths'].head(20))
Hi Preston. Hereās another suggestion that may be complementary to the answers above. I think is simple and easy to understand. Let me know your thoughts.
true_avengers['Deaths'] = 0
for i in range(1,6):
true_avengers['Deaths'] += true_avengers["Death{0}".format(i)].apply(lambda x: 1 if str(x).upper() == "YES" else 0)
Hi, everyone!
I“ve checked the performance of all the solutions except suggested by @Bruno (the following error araised: Can only use .str accessor with string values, which use np.object_ dtype in pandas).
The fastest one is the one suggested by @smaximm:
and the second fastest is the one-line representation of the latter suggested by @anthony.cua and @vorunplz: