temp_df = train_df.copy()
temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
temp_df = temp_df.sort_values('distance')
nearest_neighbors_prices = temp_df.iloc[0:5]['price']
predicted_price = nearest_neighbors_prices.mean()
What I expected to happen:
I understand that we’re supposed to use test_df["predicted_price"] = test_df["bathrooms"].apply(lambda x: predict_price(x)) to predict the price of the listings in the testing set.
But when I go back to the code, I see that within the predict_price function we have the following important piece of code: temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
Here we are using the train section of the data set, which has more rows than the test section. How are those compatible?
I’d appreciate a small walkthrough of what happens at, say, two specific rows of the testing dataset, following what the code is doing.
You use the data in the train dataframe to predict the house price for points in the test dataframe. But how is this done?
With an element-wise operation, you take one value at a time from test_df['bathrooms']. The value in one cell of the bathrooms column is the x in the lambda, and it is supplied to the predict_price function as new_listing. So new_listing is always a single number, never the whole column, which is why it does not matter that train_df has more rows than test_df.
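To see that apply really hands over one value at a time, here is a minimal sketch. The two-row test_df and the stand-in predict_price (which only records its argument) are made up for illustration and are not the course code.

```python
import pandas as pd

# Hypothetical two-row test set; the values are invented for illustration.
test_df = pd.DataFrame({"bathrooms": [1.0, 2.5]})

seen = []  # collects every value that reaches predict_price

def predict_price(new_listing):
    # Stand-in for the real function: it only records what it receives.
    seen.append(new_listing)
    return new_listing

test_df["bathrooms"].apply(lambda x: predict_price(x))

# One scalar per test row, never the whole column.
print(seen)
```

The function is called once per test row, each time with a single number.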
The idea of nearest neighbors is to use the prices of houses with a similar number of bathrooms to predict the price of a new house.
Here, we create a new column called distance in temp_df, which is a copy of train_df. We fill this column with the absolute difference between the number of bathrooms in each training row and the number of bathrooms in the one test cell. Then we sort the dataframe so that the rows with the smallest difference are at the top. The smallest difference means they are the closest neighbors.
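As a rough sketch of this step for one test value, assume a toy train_df with six invented rows and a test listing with 2.5 bathrooms. Every training row gets its own distance to that single number.

```python
import numpy as np
import pandas as pd

# Hypothetical training data; the numbers are invented for illustration.
train_df = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 1.5, 2.0, 2.5, 3.0],
    "price": [80, 90, 100, 150, 200, 300],
})
new_listing = 2.5  # a single value taken from one row of the test set

temp_df = train_df.copy()
# Each training row is compared against the same scalar.
temp_df["distance"] = temp_df["bathrooms"].apply(lambda x: np.abs(x - new_listing))
# Smallest distances (closest neighbors) end up at the top.
temp_df = temp_df.sort_values("distance")
print(temp_df)
```

The distance column has as many rows as the training data, regardless of how many rows the test set has.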
Here, we take the first five neighbors and use the mean of their prices as the prediction. That value is stored in the predicted_price column for that particular row of the test data. It is the price predicted for this particular house from the train data.
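Putting the pieces together, here is a small walkthrough for two test rows. The function body follows the code from the question. The toy train_df and test_df are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the course uses a much larger listings dataset.
train_df = pd.DataFrame({
    "bathrooms": [1.0, 1.5, 2.0, 2.5, 3.0, 4.0],
    "price": [100, 120, 150, 200, 260, 400],
})
test_df = pd.DataFrame({"bathrooms": [1.0, 3.0]})

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df["distance"] = temp_df["bathrooms"].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values("distance")
    nearest_neighbors_prices = temp_df.iloc[0:5]["price"]
    return nearest_neighbors_prices.mean()

test_df["predicted_price"] = test_df["bathrooms"].apply(lambda x: predict_price(x))
print(test_df)
```

For the first test row (1.0 bathrooms), the five closest training rows have 1.0, 1.5, 2.0, 2.5 and 3.0 bathrooms, so the prediction is (100 + 120 + 150 + 200 + 260) / 5 = 166. For the second test row (3.0 bathrooms), the five closest have 3.0, 2.5, 2.0, 4.0 and 1.5 bathrooms, so the prediction is (260 + 200 + 150 + 400 + 120) / 5 = 226.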
Here, we find the difference between the price predicted for a house in the test data and the actual price of that house.
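A minimal sketch of that comparison, assuming test_df has both a price and a predicted_price column. The values, and the choice to summarize with the mean absolute error, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical actual and predicted prices for two test listings.
test_df = pd.DataFrame({
    "price": [150, 250],
    "predicted_price": [166.0, 226.0],
})

# Per-row difference between the prediction and the actual price.
test_df["error"] = test_df["predicted_price"] - test_df["price"]

# One common summary (an assumption here): the mean absolute error.
mae = test_df["error"].abs().mean()
print(test_df)
print(mae)
```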
Hope it helps!