Screen Link:
Total Deaths Count
My Code:
import time

import numpy as np
import pandas as pd

start_time = time.time()
# Select every column whose name starts with 'Death' (Death1..Death5)
death_cols = true_avengers.columns[true_avengers.columns.str.startswith('Death')].tolist()
z = pd.DataFrame()
for col in death_cols:
    # Vectorized comparison: 1 where the cell is 'YES', otherwise 0
    z[col] = pd.Series(np.where(true_avengers[col] == 'YES', 1, 0),
                       index=true_avengers.index)
true_avengers['Deaths'] = z.sum(axis=1)
print("--- %s seconds ---" % (time.time() - start_time))
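As a side note, the per-column loop above can be collapsed into a single vectorized expression. A minimal sketch, using a small hypothetical frame in place of true_avengers but the same Death1..Death5 column naming:

```python
import pandas as pd

# Hypothetical stand-in for true_avengers
true_avengers = pd.DataFrame({
    'Death1': ['YES', 'NO', 'YES'],
    'Death2': ['YES', None, 'NO'],
})

death_cols = true_avengers.columns[true_avengers.columns.str.startswith('Death')]
# .eq('YES') builds a boolean frame; summing across columns counts the YES cells
true_avengers['Deaths'] = true_avengers[death_cols].eq('YES').sum(axis=1)
print(true_avengers['Deaths'].tolist())  # → [2, 0, 1]
```

Missing values compare unequal to 'YES', so they contribute 0 without any explicit null handling.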
What I expected to happen:
An alternative way to compute the total using vectorized, column-wise operations in place of a row-wise loop.
What actually happened:
df.apply with axis=1 calls the function once per row in Python, which is slow compared to operating on whole columns at once.
#Dataquest Code
import time
start_time = time.time()
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
print("--- %s seconds ---" % (time.time() - start_time))
The Dataquest approach takes 0.010109663009643555 seconds, while the method I used takes only 0.008331060409545898 seconds.
The comparison still needs to be evaluated as the DataFrame size scales up.
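To see how the gap behaves at larger sizes, here is a benchmark sketch on synthetic data with the same Death1..Death5 layout (column names and the 100,000-row size are my assumptions; absolute timings will vary by machine):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
# Synthetic frame mimicking the lesson's Death1..Death5 columns
df = pd.DataFrame({f'Death{i}': rng.choice(['YES', 'NO', None], size=n)
                   for i in range(1, 6)})

def clean_deaths(row):
    # Row-wise count, in the style of the Dataquest solution
    return sum(1 for c in df.columns if row[c] == 'YES')

start = time.time()
apply_result = df.apply(clean_deaths, axis=1)
apply_time = time.time() - start

start = time.time()
# One boolean frame, summed across columns
vec_result = df.eq('YES').sum(axis=1)
vec_time = time.time() - start

print(f"apply: {apply_time:.4f}s  vectorized: {vec_time:.4f}s")
```

In my understanding the vectorized version should pull further ahead as rows grow, since apply's per-row Python overhead scales linearly with row count while the column operations stay in compiled code.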