Why do we need to reshape the rows before calculating Euclidean distance? 26 - 4

Hi,

I have two questions:

  1. I don’t completely understand why I have to reshape the rows before passing them to the euclidean_distances function of the scikit-learn library.

  2. What does reshape(1, -1) do? How does it work? What is the role of -1 in the reshape method?

Screen Link: https://app.dataquest.io/m/26/clustering-basics/4/distance-between-senators

Thanks,

Hi
To answer your first question: if you don’t reshape the arrays before passing them to euclidean_distances(), you’ll get an error. For example, this call fails:

euclidean_distances(votes.iloc[0,3:].values, votes.iloc[2,3:].values)

As per the error message, the API expects us to pass a 2D array, and since our input is a 1D array we have to reshape it.

To reshape the input we can use the numpy.ndarray.reshape method, which answers your second question.
This method returns an array containing the same data with a new shape.
When one dimension of the new shape is unknown, we can pass -1 for it, and NumPy will calculate that dimension from the array's length and the other dimensions. Only one dimension can be -1.
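
To make the role of -1 concrete, here is a minimal sketch. The array `row` is a made-up example, not the votes data; the point is only how NumPy fills in the dimension marked -1.

```python
import numpy as np

# Made-up 1D array with 6 elements (not the votes data).
row = np.arange(6)            # shape (6,)

one_row = row.reshape(1, -1)  # 1 row, NumPy infers 6 columns
one_col = row.reshape(-1, 1)  # NumPy infers 6 rows, 1 column
grid = row.reshape(2, -1)     # 2 rows, NumPy infers 3 columns

print(row.shape, one_row.shape, one_col.shape, grid.shape)
# (6,) (1, 6) (6, 1) (2, 3)
```

In each call NumPy divides the total number of elements by the known dimensions to work out the one marked -1.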
For more details you can check

Hope it helps.

Thanks,
Debasmita


Thanks for your answer.

I know what reshape does and how it works, but I’m confused by its argument: (1, -1)
As you said, Euclidean distance function expects a 2D array, but (1, -1) is a 1D array, isn’t it? Maybe I’m missing something here.

Thanks for your time.

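reshape(1, -1) returns a 2D array. The (1, -1) is the requested shape, not the array itself: it has two entries, so the result has two dimensions. You can check using the shape attribute.

Hope it’s clear now.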
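
A quick way to check this is to compare ndim and shape before and after the reshape. In this sketch `row` is a hypothetical stand-in for votes.iloc[0, 3:].values, assumed to hold 15 values:

```python
import numpy as np

# Hypothetical stand-in for votes.iloc[0, 3:].values (assumed 15 vote columns).
row = np.zeros(15)

reshaped = row.reshape(1, -1)

print(row.ndim, row.shape)            # 1 (15,)
print(reshaped.ndim, reshaped.shape)  # 2 (1, 15)
```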

Thanks


I have the same question, but put another way: why do we need to use
distance = euclidean_distances(votes.iloc[0,3:].values.reshape(1,-1),votes.iloc[2,3:].values.reshape(1,-1))
when, if we use
distance = euclidean_distances(votes.iloc[0,3:],votes.iloc[2,3:])
we get the same answer?


No, we don’t get the same answer. Did you run your code?

distance = euclidean_distances(votes.iloc[0,3:],votes.iloc[2,3:])

Please run it again and let me know.

Although I haven’t done this mission yet, I was curious to test it out and found that what @esramgamal is saying is true! So the question still stands: Why do we need to reshape the series objects when passing them to euclidean_distances?

I think something must have been updated in scikit-learn since this post was first created, because I do not get an error using this code:

Instead I get an output of:

array([[3.31662479]])

I did this and got the same answer:

a = votes.iloc[0,3:].values
b = votes.iloc[1,3:].values

euclidean_distances([a],[b])

The output:

array([[1.73205081]])

For reference this is the DQ answer:

euclidean_distances(votes.iloc[0,3:].values.reshape(1, -1), votes.iloc[1,3:].values.reshape(1, -1))

Is there any reason why doing this is incorrect?

I was basically thinking about the “double brackets” that you use when passing pandas objects into scikit-learn.
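
As a rough check of the “double brackets” idea, the sketch below compares wrapping a 1D array in a list with calling reshape(1, -1). The array `a` is a made-up stand-in for votes.iloc[0, 3:].values, and np.asarray is used here only to mimic the array conversion that scikit-learn performs on array-like input.

```python
import numpy as np

# Made-up stand-in for votes.iloc[0, 3:].values.
a = np.array([1.0, 0.0, 1.0])

wrapped = np.asarray([a])    # list holding one 1D array -> shape (1, 3)
reshaped = a.reshape(1, -1)  # explicit reshape          -> shape (1, 3)

print(wrapped.shape, reshaped.shape)      # (1, 3) (1, 3)
print(np.array_equal(wrapped, reshaped))  # True
```

Both forms describe one row with several columns, which would explain why [a] and a.reshape(1, -1) give the same distance.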

We don’t have to use the reshape method, but my guess is that the advantage of using it is that we get to use vectorized operations to speed things up. In this case votes.iloc[0,3:] has the shape (15,), which means it is a 1D object with 15 elements and no separate row and column dimensions. Adding .values converts the Series into a NumPy array (interestingly, the pandas documentation says to use .to_numpy() instead), and .reshape(1,-1) changes the shape from (15,) to (1, 15): a 2D array with 1 row and 15 columns. The -1 is just a placeholder that tells NumPy you want 1 row, with however many columns it takes.

So all that .values.reshape(1,-1) is doing is turning each 1D array into a 2D array with a single row, and my guess is that it allows for more efficient, vectorized processing. Maybe @Bruno, @the_doctor or someone can tell us for real.
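
One way to see why a 2D input is natural here: scikit-learn documents the inputs of euclidean_distances as arrays of shape (n_samples, n_features) and returns a matrix of distances between every pair of rows. The sketch below is a hypothetical NumPy-only helper written for illustration, not scikit-learn's implementation; `pairwise_euclidean`, `a` and `b` are made-up names and data.

```python
import numpy as np

def pairwise_euclidean(X, Y):
    """Hypothetical helper (not scikit-learn's code): distance from
    every row of X to every row of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim != 2 or Y.ndim != 2:
        raise ValueError("expected 2D arrays of shape (n_rows, n_features)")
    # Broadcast to shape (n_X, n_Y, n_features), then reduce over features.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])

# One row against one row -> a 1x1 matrix, hence the double brackets.
single = pairwise_euclidean(a.reshape(1, -1), b.reshape(1, -1))
# Two rows against two rows -> a 2x2 matrix of all pairwise distances.
many = pairwise_euclidean(np.array([a, b]), np.array([a, b]))

print(single)      # [[5.]]
print(many.shape)  # (2, 2)
```

Under this reading, a single senator's votes are just the special case of a one-row matrix, which is what reshape(1, -1) produces.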
