Applying a function that uses the training data to the test dataset

In Machine Learning Introduction → Machine Learning Fundamentals → Evaluating Model Performance, the function that is defined copies the training dataset into a temporary dataframe (‘temp_df’):

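def predict_price(new_listing):
    # This is the line in question: temp_df is a copy of the training data
    temp_df = train_df.copy()
    # Distance between each training listing and the new listing
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Prices of the 5 closest training listings
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price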

This function is then applied to the test dataframe. I don’t get this at all. If I apply this function to test_df the results will be from the train_df. Can someone please explain this? Thanks!!!

You are trying to predict the price for each row in the test_df given the accommodates feature. The predicted price is the average of the prices of the nearest neighbours. Those nearest neighbours are found in relation to data we already have - the training set, train_df.

The distance between each training listing and the new listing, measured on the accommodates feature, is calculated in the function as:

temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))

new_listing corresponds to a value from test_df for the accommodates column. Above, an absolute difference is calculated using that value.
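To make the distance step concrete, here is a minimal sketch. The small train_df and the value 3 for new_listing are made up for illustration and are not the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical training data, invented for illustration only
train_df = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 6],
    "price": [50.0, 80.0, 90.0, 150.0, 200.0],
})

# Assumed accommodates value taken from one row of the test set
new_listing = 3

# Absolute difference between each training listing and the new listing
distance = train_df["accommodates"].apply(lambda x: np.abs(x - new_listing))
print(distance.tolist())  # [2, 1, 1, 1, 3]
```

Each entry says how far one training listing is from the test listing, so smaller values mean closer neighbours.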

We then sort temp_df by the distance column, select the first 5 rows, which will be our nearest neighbours, and then find the average price for those neighbours.

That averaged price is our predicted price for one row of the test set. Since we are using the apply() method on the accommodates column of the test set, the above repeats for every row in the test set. So the prices being averaged come from train_df, but the listing being predicted always comes from test_df.
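Putting it all together, a small end-to-end sketch might look like the following. The two dataframes are hypothetical toy data, chosen so that there are no ties at the cut-off for the 5 nearest neighbours, and the predicted_price column name is an assumption made for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only
train_df = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 3, 5, 6, 6, 7, 8],
    "price": [40.0, 70.0, 80.0, 100.0, 110.0, 160.0, 190.0, 210.0, 230.0, 260.0],
})
test_df = pd.DataFrame({"accommodates": [2, 6]})

def predict_price(new_listing):
    # Neighbours are always looked up in the training data
    temp_df = train_df.copy()
    temp_df["distance"] = temp_df["accommodates"].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values("distance")
    # Average price of the 5 closest training listings
    return temp_df.iloc[0:5]["price"].mean()

# Called once per test row; new_listing is that row's accommodates value
test_df["predicted_price"] = test_df["accommodates"].apply(predict_price)
print(test_df)
```

With this toy data the two predictions are 80.0 and 210.0. train_df itself is never changed, because the function only works on a copy; it just acts as the lookup table of known prices.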
