Another approach for summing up death count

Screen Link:
Total Deaths Count

My Code:

import time
import numpy as np
import pandas as pd

start_time = time.time()
# Select every column whose name starts with 'Death'
death_cols = true_avengers.columns[true_avengers.columns.str.startswith('Death')].tolist()
z = pd.DataFrame()

for col in death_cols:
    # Vectorized per-column comparison: 1 where the value is 'YES', else 0
    z[col] = pd.Series(np.where(true_avengers[col] == 'YES', 1, 0),
                       index=true_avengers.index)

true_avengers['Deaths'] = z.sum(axis=1)
print("--- %s seconds ---" % (time.time() - start_time))
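The loop above can be collapsed even further: comparing all the death columns against 'YES' at once yields a boolean frame, and summing it row-wise gives the death count directly. A minimal sketch, using a small hypothetical stand-in for true_avengers (the real column values are 'YES', 'NO', or missing):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for true_avengers
true_avengers = pd.DataFrame({
    'Death1': ['YES', 'YES', np.nan],
    'Death2': ['NO',  'YES', np.nan],
    'Death3': [np.nan, 'YES', np.nan],
})

death_cols = true_avengers.columns[true_avengers.columns.str.startswith('Death')]

# .eq('YES') builds a boolean frame in one vectorized step
# (NaN compares as False); summing across axis=1 counts True per row.
true_avengers['Deaths'] = true_avengers[death_cols].eq('YES').sum(axis=1)
print(true_avengers['Deaths'].tolist())  # → [1, 3, 0]
```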


What I expected to happen:
An alternative that uses vectorized, column-wise NumPy operations instead of a row-wise apply.

What actually happened:
df.apply(..., axis=1) processes the DataFrame row by row in Python, which is slow; the vectorized version avoids that per-row overhead.

#Dataquest Code
import time
start_time = time.time()

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)


print("--- %s seconds ---" % (time.time() - start_time))


The Dataquest approach takes 0.010109663009643555 seconds.
The method above takes only 0.008331060409545898 seconds.

This still needs evaluating as the DataFrame size scales up.
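One way to test the scaling question is to time both approaches on a larger synthetic frame. A rough sketch, assuming 100,000 rows of randomly generated 'YES'/'NO'/missing values (the column names match the dataset, but the data itself is made up):

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for true_avengers, scaled up to 100k rows
rng = np.random.default_rng(0)
n = 100_000
cols = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
df = pd.DataFrame(
    rng.choice(np.array(['YES', 'NO', None], dtype=object), size=(n, len(cols))),
    columns=cols,
)

# Vectorized: one boolean comparison per column, summed row-wise
start = time.time()
vec = df[cols].eq('YES').sum(axis=1)
vec_time = time.time() - start

# Row-wise apply, equivalent to the Dataquest clean_deaths solution
def clean_deaths(row):
    return sum(1 for c in cols if row[c] == 'YES')

start = time.time()
app = df.apply(clean_deaths, axis=1)
app_time = time.time() - start

assert (vec == app).all()  # both methods must agree
print(f"vectorized: {vec_time:.4f}s, apply: {app_time:.4f}s")
```

At this size the row-wise apply typically falls well behind, since the vectorized version does a fixed number of column operations regardless of row count.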


Recategorized your topic @eashwary
