Why does the order of features matter?

I accidentally switched the order of ‘bathrooms’ and ‘bedrooms’ and got the wrong answer. I don’t understand why this would matter, though. As far as I understand Euclidean distance, the squared differences just get summed, and a + b = b + a.
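Here’s the quick sanity check I had in mind (just a sketch with made-up numbers, not the actual listings data):

from math import sqrt
import numpy as np

# two "listings" with four features each, e.g. bedrooms, bathrooms, accommodates, reviews
a = np.array([2.0, 1.0, 3.0, 10.0])
b = np.array([1.0, 2.0, 4.0, 25.0])

perm = [2, 0, 1, 3]  # some reordering of the columns, applied to BOTH points

d_original = sqrt(np.sum((a - b) ** 2))
d_permuted = sqrt(np.sum((a[perm] - b[perm]) ** 2))

print(d_original, d_permuted)  # same distance -- addition is commutative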

This Stack Overflow post explains:

The names of your columns don’t matter, but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those two features in that order.

Here’s a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and the output is n_1 - n_2.

Your model doesn’t process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.

Regardless of what you pass in, you’ll get the first thing you passed in minus the second thing you passed in. You can name those two things whatever you want, but at the end of the day you’ll still get that subtraction.
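To make the thought experiment concrete, here’s a minimal sketch (the made-up data and the choice of LinearRegression are mine, purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[5, 2], [9, 4], [7, 1], [3, 3]])
y_train = X_train[:, 0] - X_train[:, 1]   # output = n_1 - n_2

model = LinearRegression().fit(X_train, y_train)

print(model.predict([[10, 4]]))   # ~6.0, as expected
print(model.predict([[4, 10]]))   # ~-6.0 -- same numbers, wrong order, wrong answer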

To get a little more technical, what’s going on inside your model is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just “tunes” the values in the matrices that your inputs get multiplied by, so that the output of those multiplications ends up as close as possible to your label. If you pass in an input matrix that isn’t like the ones it was trained on, the multiplications still happen, but you’ll almost certainly get a terribly wrong output. There’s no intelligent feature rearranging going on underneath.
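A bare-bones version of that point, with illustrative weights I made up: the tuned weights are tied to column positions, so permuting the input columns changes what comes out.

import numpy as np

W = np.array([[0.5, -2.0, 1.5]])    # weights the training process "tuned"
x = np.array([[3.0, 1.0, 2.0]])     # inputs in the order used during training
x_swapped = x[:, [1, 0, 2]]         # same numbers, different column order

print(x @ W.T)          # the output the model was trained to produce
print(x_swapped @ W.T)  # a different output -- nothing rearranges the features for you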

Hey there! What you are saying makes sense: if the training and test dataframes don’t have the same column order, the result would be wrong. But for this exercise I used a single list of feature column names that was referenced by both the training and test dataframes, so the column order was consistent between them. However, altering the order of that list of feature column names changed the output. I would expect the output to change if we changed the order of the training dataset rows, but I did not expect to see the output change when the columns were rearranged.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
features = ['bedrooms', 'accommodates', 'bathrooms', 'number_of_reviews']
hyper_params = [1,2,3,4,5]
mse_values = []
for i in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=i, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse_values.append(mean_squared_error(test_df['price'], predictions))

outputs:
[26364.92832764505,
15142.672639362912,
14666.83150044242,
16733.581626848692,
14478.088646188853]

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
features = ['accommodates', 'bedrooms',  'bathrooms', 'number_of_reviews']
hyper_params = [1,2,3,4,5]
mse_values = []
for i in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=i, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse_values.append(mean_squared_error(test_df['price'], predictions))

outputs:
[26364.92832764505,
15100.52246871445,
14579.597901655923,
16212.300767918088,
14090.011649601822]

Similar results, but not the same.
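In case it helps anyone reproduce this, here’s the diagnostic I’d run next (just a sketch, assuming the same train_df / test_df as above) to see whether the two column orders actually end up picking different neighbors for some test rows:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

order_a = ['bedrooms', 'accommodates', 'bathrooms', 'number_of_reviews']
order_b = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']

knn_a = KNeighborsRegressor(n_neighbors=5, algorithm='brute').fit(train_df[order_a], train_df['price'])
knn_b = KNeighborsRegressor(n_neighbors=5, algorithm='brute').fit(train_df[order_b], train_df['price'])

_, idx_a = knn_a.kneighbors(test_df[order_a])
_, idx_b = knn_b.kneighbors(test_df[order_b])

# rows where the two column orders select a different set of 5 neighbors
mismatch = (np.sort(idx_a, axis=1) != np.sort(idx_b, axis=1)).any(axis=1)
print(mismatch.sum(), 'of', len(mismatch), 'test rows pick different neighbors')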