Hi,
I’d like to know the difference between min-max feature scaling, mentioned in the Ranking Customers task of the Fuzzy Language in Data Science mission (https://app.dataquest.io/m/466/fuzzy-language-in-data-science/6/ranking-customers), and standardization, mentioned in Using Standardization for Comparisons, which is part of the Z-Scores mission (https://app.dataquest.io/m/309/z-scores/8/using-standardization-for-comparisons). Both are in the Data Analyst Path.
I don’t have a background in mathematics, so I fear there might be a simple answer to this question that I’m not seeing. On the other hand, I tried to solve the task in the fuzzy language mission with z-scores and it worked!
I used the following code, which is pretty similar to the one contained in the answer key of the Dataquest platform, to “rescale”.
scaled_tran = (best_churn["nr_of_transactions"]-best_churn["nr_of_transactions"].min())/(best_churn["nr_of_transactions"].max()-best_churn["nr_of_transactions"].min())
best_churn["scaled_tran"] = scaled_tran
scaled_amount = (best_churn["amount_spent"]-best_churn["amount_spent"].min())/(best_churn["amount_spent"].max()-best_churn["amount_spent"].min())
best_churn["scaled_amount"] = scaled_amount
best_churn["score"] = ((scaled_tran*0.5) + (scaled_amount*0.5))*100
best_churn.sort_values("score",ascending = False).head(10)
To find the z-scores, I did:
test = best_churn.copy()
test["z_score"] = (best_churn["nr_of_transactions"]-best_churn["nr_of_transactions"].mean())/best_churn["nr_of_transactions"].std()
test["z_score_two"] = (best_churn["amount_spent"]-best_churn["amount_spent"].mean())/best_churn["amount_spent"].std()
test["final"] = (test["z_score"]+test["z_score_two"])/2
test.sort_values("final",ascending = False).head(10)
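One way to check whether the two approaches really produce the same ordering is to compute both scores side by side and compare the sorted indexes. The sketch below does this on a small made-up DataFrame; `toy`, `min_max`, and `z_score` are hypothetical names and the numbers are invented, so it only illustrates the comparison and says nothing about the actual `best_churn` data.

```python
import pandas as pd

# Hypothetical stand-in for best_churn; the numbers are invented for illustration.
toy = pd.DataFrame({
    "nr_of_transactions": [1, 2, 3, 4, 20],
    "amount_spent": [500.0, 50.0, 300.0, 100.0, 200.0],
})


def min_max(col):
    """Rescale a column to [0, 1] using its observed minimum and maximum."""
    return (col - col.min()) / (col.max() - col.min())


def z_score(col):
    """Express each value as a number of (sample) standard deviations from the mean."""
    return (col - col.mean()) / col.std()


# Same weighting as in the post: each feature contributes 50%.
toy["score"] = (min_max(toy["nr_of_transactions"]) * 0.5
                + min_max(toy["amount_spent"]) * 0.5) * 100
toy["final"] = (z_score(toy["nr_of_transactions"])
                + z_score(toy["amount_spent"])) / 2

# Row labels from best to worst under each scoring method.
order_min_max = toy.sort_values("score", ascending=False).index.tolist()
order_z = toy.sort_values("final", ascending=False).index.tolist()

print(order_min_max)
print(order_z)
print("same ordering:", order_min_max == order_z)
```

Running the same comparison on `best_churn` (with `score` and `final` as computed above) would show whether the top 10 match by coincidence or across the whole table.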
What I expected to happen: as far as I understand from the formula, min-max scaling expresses the distance from a given value to the minimum value, X - min(X), as a proportion of the range of the scale, max(X) - min(X), much like converting from Celsius to Fahrenheit in high school.
As for the standardization, it’s finding the number of standard deviations that fit in the “distance” between a given value and the mean.
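To make the two descriptions concrete, here is a minimal sketch that applies both formulas to a toy Series. The values are invented for illustration and are not taken from `best_churn`; note also that pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which is an implementation detail of this sketch rather than something stated in the missions.

```python
import pandas as pd

# Toy values, invented purely for illustration.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-max scaling: where each value sits within the observed range.
# The smallest value maps to 0 and the largest to 1.
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: how many standard deviations each value lies from the mean.
# The result has mean 0 and standard deviation 1, but no fixed lower or upper bound.
z_scores = (values - values.mean()) / values.std()

print(pd.DataFrame({"value": values, "min_max": min_max, "z_score": z_scores}))
```

In this sketch the min-max column is anchored to the two extreme values, while the z-score column is anchored to the mean and the spread of all values.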
I get that in both cases I’m finding a common value that allows for comparison between two scales, and this is what I expected from both techniques.
Still, what are the differences between them, and is there anything I should watch out for when using one over the other?