154-2 Pandas dataframe question

Hey! Question regarding course 1 of step 6: in the “cross validation” mission, in 2. Holdout Validation:
if I use:

split_one["accommodates"]

instead of:

split_one[["accommodates"]]

I am getting this error:

"ValueError: Found input variables with inconsistent numbers of samples: [1, 1862]" 

Can you please tell me why?
I can’t think of how those two lines differ…

The double brackets return a two-dimensional object, in this case a dataframe, as opposed to a series, which is one-dimensional.

So while the output of split_one["accommodates"] and split_one[["accommodates"]] probably looks very similar, they are actually different object types!

You can verify this yourself!

print(type(split_one["accommodates"]))
print(split_one["accommodates"].shape)

Output:

<class 'pandas.core.series.Series'>
(1862,)

When a single bracket is used to isolate that column, you get back a series object of shape (1862,). This is not a 2D object: the shape has only one entry, the number of rows, and there is no column dimension at all.

This is different when you use double brackets, because you’re now returning that column not as a series, but as a single-column dataframe.

print(type(split_one[["accommodates"]]))
print(split_one[["accommodates"]].shape)

Output:

<class 'pandas.core.frame.DataFrame'>
(1862, 1)

You see that both dimensions of the dataframe object are now specified, because it actually is a dataframe object, and not a series, even though it might look really similar to its series counterpart.
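If you already have the series in hand, you can also turn it into a 2D object after the fact. Here is a minimal sketch of that; the small split_one below is a made-up stand-in (the real one has 1862 rows), and the variable names are just for illustration.

```python
import pandas as pd

# Hypothetical stand-in for split_one; values are invented for the example.
split_one = pd.DataFrame({
    "accommodates": [2, 4, 1, 6],
    "price": [80.0, 150.0, 60.0, 220.0],
})

series = split_one["accommodates"]    # single brackets: 1D Series
frame = split_one[["accommodates"]]   # double brackets: 2D DataFrame

print(series.shape)  # (4,)
print(frame.shape)   # (4, 1)

# Two ways to go from the 1D series to a 2D object:
as_frame = series.to_frame()                     # single-column DataFrame
as_2d_array = series.to_numpy().reshape(-1, 1)   # NumPy array with one column

print(as_frame.shape)     # (4, 1)
print(as_2d_array.shape)  # (4, 1)
```

Either result has both a row and a column dimension, just like the double-bracket version.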

As to why that matters, understand that in any generic model.fit(X, y) line of code using scikit-learn, the data represented by the predictor variable X is expected by scikit-learn models to be 2-dimensional!

Incidentally, this is also why the “X” in model.fit(X, y) is represented using an upper case “X”, because an upper-case variable is conventionally used to define a matrix (i.e. 2D) object. The y variable is lower case because it doesn’t have to be 2-dimensional.
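To see this in action, here is a small sketch on toy data. I'm assuming a KNeighborsRegressor here, which may not be the exact model from the mission, and the six-row split_one is invented. My understanding is that the scikit-learn version behind your error treated the 1D input as a single row with 1862 features, which would explain the [1, 1862]; newer releases reject a 1D X outright with a different message, so the exact wording you see depends on the version.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data standing in for split_one.
split_one = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5, 6],
    "price": [50.0, 70.0, 90.0, 120.0, 150.0, 200.0],
})

# Assumed model choice, for illustration only.
model = KNeighborsRegressor(n_neighbors=2)

# Single brackets: X is a 1D Series, so fitting fails with a ValueError.
try:
    model.fit(split_one["accommodates"], split_one["price"])
    series_fit_ok = True
except ValueError as err:
    series_fit_ok = False
    print("1D X rejected with:", type(err).__name__)

# Double brackets: X is a 2D DataFrame of shape (6, 1), so fitting works.
model.fit(split_one[["accommodates"]], split_one["price"])
predictions = model.predict(split_one[["accommodates"]])
print(predictions.shape)  # (6,)
```

Note that y is passed as a plain series in both calls; only X needs the extra dimension.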

Besides dataframe column selection, you can see the same optional-bracket ambiguity in df.groupby('col') vs df.groupby(['col']).
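A quick sketch of the groupby case, on a made-up df. As far as I can tell, for a single key the aggregated result comes out the same either way, unlike column selection, where the brackets change the object type.

```python
import pandas as pd

# Hypothetical data, invented for the example.
df = pd.DataFrame({"col": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

by_label = df.groupby("col")["value"].sum()    # key passed as a plain string
by_list = df.groupby(["col"])["value"].sum()   # key passed as a one-item list

# Compare the two aggregated results.
print(by_label.equals(by_list))
```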

Matplotlib has a similar quirk: set_ylim accepts its limits either as two separate arguments or wrapped in () as a single tuple.
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.axes.Axes.set_ylim.html#matplotlib.axes.Axes.set_ylim
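For example, both calls below should leave the axis with the same limits. This is just a sketch; the Agg backend line is only there so it runs without a display, and the limit values are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Limits passed as two separate arguments.
ax.set_ylim(0, 10)
limits_from_args = ax.get_ylim()

# Same limits passed as a single tuple.
ax.set_ylim((0, 10))
limits_from_tuple = ax.get_ylim()

print(limits_from_args == limits_from_tuple)
plt.close(fig)
```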

Wow… thank you for your answer @blueberrypudding85.