Fuzzy Language in Data Science - 6. Ranking Customers. Feedback on my solution?

Screen Link: https://app.dataquest.io/m/466/fuzzy-language-in-data-science/6/ranking-customers

My Code:

def scale(x, col):
    # Min-max scale a single value against the extrema of its column.
    c_min = best_churn[col].min()
    c_max = best_churn[col].max()
    return (x - c_min) / (c_max - c_min)

best_churn['scaled_tran'] = best_churn['nr_of_transactions'].apply(
    scale, col='nr_of_transactions'
)
best_churn['scaled_amount'] = best_churn['amount_spent'].apply(
    scale, col='amount_spent'
)

# Weighted score on a 0-100 scale, weighting both columns equally.
best_churn['score'] = 100 * (
    0.5 * best_churn['scaled_tran'] + 0.5 * best_churn['scaled_amount']
)

best_churn.sort_values('score', inplace=True, ascending=False)

What I expected to happen:

I got the correct answer, but when I checked against the ‘see answer’ solution, it used a longer form, which fed my suspicions further: is my function above (‘scale’) very unoptimised?

i.e. do ‘best_churn[col].min()’ and ‘best_churn[col].max()’ get recalculated again and again, for each and every row of the pandas.Series? That would be hundreds of min() and max() calculations.

Or does the apply() function know to calculate them just once and then evaluate ‘return (x - c_min) / (c_max - c_min)’ once per row?


Hey, Joseph.

Nicely done. I like your solution better, but not by much: ours is only longer because we repeat the same idea for two columns, so output-wise it’s basically the same thing. As for optimization, read on.

I don’t know. A good theoretical answer would probably require digging into the source code, which I won’t do, and I recommend you don’t either.

A good practical answer is to compare how long your solution takes to execute with how long ours takes, across different scenarios (other datasets, varying numbers of rows, and so on). That said, such differences only really matter for scripts running in production against a very large number of rows.
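For instance, a rough timing sketch with Python’s timeit module could look like this. Everything here is illustrative: toy_churn is a synthetic stand-in for best_churn, and the two function names are made up for the comparison.

import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for best_churn, sized like the real dataset.
rng = np.random.default_rng(0)
toy_churn = pd.DataFrame({
    "nr_of_transactions": rng.integers(1, 50, size=6889),
    "amount_spent": rng.uniform(5, 500, size=6889),
})

def scale_with_apply(df, col):
    # Mirrors your solution: the extrema are looked up inside the
    # per-row function, so min()/max() run on every call.
    def scale(x):
        return (x - df[col].min()) / (df[col].max() - df[col].min())
    return df[col].apply(scale)

def scale_vectorized(df, col):
    # Computes the extrema once and scales the whole column in one step.
    c_min, c_max = df[col].min(), df[col].max()
    return (df[col] - c_min) / (c_max - c_min)

for f in (scale_with_apply, scale_vectorized):
    t = timeit.timeit(lambda: f(toy_churn, "nr_of_transactions"), number=10)
    print(f"{f.__name__}: {t:.3f}s for 10 runs")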

Your use of i.e. suggests that this is a different version of the same question, or that the two questions are equivalent. They are not, or at least not obviously so.

It could be that the extrema are being calculated repeatedly and your solution is still faster.

It seems apply doesn’t know to calculate them just once; the extrema are indeed being calculated repeatedly. We can verify this by counting how many times, for instance, c_max is created as a new object:

ids = []

def scale(x, col):
    c_min = best_churn[col].min()
    c_max = best_churn[col].max()
    # ids grows by one every time scale runs, i.e. every time
    # the extrema are recomputed.
    ids.append(id(c_max))
    return (x - c_min) / (c_max - c_min)

best_churn['scaled_tran'] = best_churn['nr_of_transactions'].apply(
    scale, col='nr_of_transactions'
)
best_churn['scaled_amount'] = best_churn['amount_spent'].apply(
    scale, col='amount_spent'
)

best_churn['score'] = 100 * (
    0.5 * best_churn['scaled_tran']
    + 0.5 * best_churn['scaled_amount']
)

best_churn.sort_values('score', inplace=True, ascending=False)
print(len(ids))
13778

The output matches your intuition nicely: best_churn has 6889 rows, and 13778 (double 6889) is what you get from pandas.Series.apply calling scale once per row, for each of the two columns.
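If the repeated work ever matters, a minimal fix is to hoist the extrema out of the per-row function, or to skip apply altogether, since the arithmetic is already vectorized in pandas. A sketch (scale_column is an illustrative name, not from the course solution):

def scale_column(series):
    # The extrema are computed exactly once per column here; the
    # subtraction and division apply to the whole Series in one
    # vectorized step.
    c_min = series.min()
    c_max = series.max()
    return (series - c_min) / (c_max - c_min)

best_churn['scaled_tran'] = scale_column(best_churn['nr_of_transactions'])
best_churn['scaled_amount'] = scale_column(best_churn['amount_spent'])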


Thank you for the reply, Bruno! That’s a great help. It never occurred to me that I could use ‘id()’ like that; I’ll remember it in future. Thanks!

I didn’t get the 0.5, can anyone explain it?

I believe it was just an arbitrary formula: (1/2 * nr_of_transactions) + (1/2 * amount_spent)

Your question is different from the one in this topic; please ask it in a new post.

There is a little arbitrariness to it, but it’s mostly not arbitrary. The coefficients (which happen to be equal) are weights, and the assumption is that frequency of purchase is worth as much as total amount spent.
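To make that assumption explicit, you could name the weights; a small sketch (the constant names are mine, not from the course):

# The weights should sum to 1 so the score stays on a 0-100 scale.
W_TRANSACTIONS = 0.5  # how much purchase frequency counts
W_AMOUNT = 0.5        # how much total spend counts

best_churn['score'] = 100 * (
    W_TRANSACTIONS * best_churn['scaled_tran']
    + W_AMOUNT * best_churn['scaled_amount']
)

A 0.7/0.3 split, say, would instead express the business assumption that frequent purchasing matters more than total spend.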