Double brackets return the output as a two-dimensional object, in this case a dataframe, as opposed to a series.
So while the outputs of split_one["accommodates"] and split_one[["accommodates"]] probably look very similar, they are actually different object types!
You can verify this yourself!
print(type(split_one["accommodates"]))
print(split_one["accommodates"].shape)
Output:
<class 'pandas.core.series.Series'>
(1862,)
When single brackets are used to isolate that column, you get back a series object of shape (1862,). This is not a 2D object: a series is one-dimensional, so its shape records only the number of rows and has no column dimension at all.
This is different when you use double brackets, because you’re now returning that column not as a series, but as a single-column dataframe.
print(type(split_one[["accommodates"]]))
print(split_one[["accommodates"]].shape)
Output:
<class 'pandas.core.frame.DataFrame'>
(1862, 1)
You see that both dimensions are now specified (1862 rows, 1 column), because the object really is a dataframe and not a series, even though it might look very similar to its series counterpart.
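If you don’t have split_one loaded, the same comparison can be reproduced on a small stand-in dataframe. This is only a sketch: the toy dataframe and its values are made up for illustration, and the ndim attribute is used here as a quick way to read off the number of dimensions.

```python
import pandas as pd

# Hypothetical stand-in for split_one; the values are invented for illustration.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1, 6],
    "price": [80.0, 150.0, 60.0, 220.0],
})

as_series = toy["accommodates"]    # single brackets -> Series (1D)
as_frame = toy[["accommodates"]]   # double brackets -> single-column DataFrame (2D)

print(type(as_series).__name__, as_series.ndim, as_series.shape)
print(type(as_frame).__name__, as_frame.ndim, as_frame.shape)

# The underlying values are identical; only the container differs.
same_values = as_series.equals(as_frame["accommodates"])
print(same_values)
```

The values are the same either way; what changes is whether pandas hands them back as a 1D or a 2D container.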
As to why that matters, understand that in any generic model.fit(X, y) line of code using scikit-learn, the predictor data X is expected to be 2-dimensional, with one row per sample and one column per feature!
Incidentally, this is also why the “X” in model.fit(X, y) is represented using an upper case “X”, because an upper-case variable is conventionally used to define a matrix (i.e. 2D) object. The y variable is lower case because it doesn’t have to be 2-dimensional.
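To see the 2D requirement in action, here is a minimal sketch that fits a model once with a double-bracket X and once with a single-bracket one. The toy data is invented, and LinearRegression is only an assumed stand-in for whichever scikit-learn estimator you are actually using.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5, 6],
    "price": [50.0, 75.0, 100.0, 125.0, 150.0, 175.0],
})

X = toy[["accommodates"]]   # 2D: shape (6, 1)
y = toy["price"]            # 1D: shape (6,)

model = LinearRegression()  # stand-in estimator, chosen only for the example
model.fit(X, y)             # works, because X is 2D

# Passing a 1D series as the predictors is rejected by scikit-learn.
try:
    LinearRegression().fit(toy["accommodates"], y)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])
```

The second fit fails because scikit-learn cannot tell whether a 1D input is meant as many samples of one feature or one sample of many features, so it asks for an explicit 2D shape.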