I think the documentation for the fit method could be clearer on this. Notice this line in its parameter description:
X : array-like or sparse matrix, shape (n_samples, n_features)
The shape (n_samples, n_features) part hints at what the input should look like: two-dimensional.
Let’s take a look at the error we get when we pass a one-dimensional argument:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X1d = [0, 1, 2]
>>> y = [0, 1, 2]
>>> lr.fit(X1d, y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py", line 458, in fit
File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 552, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The key line is "ValueError: Expected 2D array, got 1D array instead", which confirms what the documentation suggests. Let’s now pass a 2D version of X1d as an argument:
>>> X2d = [[0], [1], [2]]
>>> lr.fit(X2d, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
It worked fine! The problem with passing a series is that it is one-dimensional (try printing its shape attribute). When you pass a dataframe, even if it has just one column, it is in a 2D format (check its shape attribute as well).
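To make the shape difference concrete, here is a minimal sketch that follows the advice in the error message. It uses plain NumPy arrays as a stand-in for a series (1D) and a single-column dataframe (2D); the variable names as_column and as_row are mine, chosen only for illustration.

```python
import numpy as np

X1d = np.asarray([0, 1, 2])      # one-dimensional, like a single series
as_column = X1d.reshape(-1, 1)   # 3 samples, 1 feature
as_row = X1d.reshape(1, -1)      # 1 sample, 3 features

print(X1d.shape)        # (3,)
print(as_column.shape)  # (3, 1)
print(as_row.shape)     # (1, 3)
```

For the case discussed here (one feature, several samples), the column form is the one that matches (n_samples, n_features).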
My answer may raise another question: how does fit know the dimensions of inputs like X2d when they are neither series nor dataframes? And what does “array-like” even mean?
I’ll try to give some insight into this in this reply.
A dive into the source code (accessible by clicking where it says source in the documentation) answers this question.
First we see that the input is potentially modified by a function called check_X_y, the same one that shows up in the traceback above.
In the definition of this function we see that, once again, “X” (i.e. our first argument) is potentially modified by a function called check_array.
And now we look into the definition of check_array. When “X” isn’t a sparse matrix, we fall into a branch of a conditional statement that converts it into an array.
So we see that our input is transformed into a numpy.ndarray object, and these happen to have shape and ndim attributes. For instance, X1d becomes:
array([0, 1, 2])
In fact, just below this we can find the code that checks how many dimensions our converted input has and raises an error when it isn’t two-dimensional.
Do the error messages look familiar?
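The gist of that check can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not scikit-learn’s actual source: the function name ensure_2d is made up, and the real check_array handles many more cases (sparse matrices, dtypes, and so on).

```python
import numpy as np

def ensure_2d(X):
    """Hypothetical helper: convert to ndarray and insist on two dimensions."""
    array = np.asarray(X)  # lists, tuples, dataframes... all become ndarrays
    if array.ndim != 2:
        raise ValueError(
            "Expected 2D array, got {}D array instead".format(array.ndim)
        )
    return array

print(ensure_2d([[0], [1], [2]]).shape)  # (3, 1)

try:
    ensure_2d([0, 1, 2])
except ValueError as exc:
    print(exc)  # Expected 2D array, got 1D array instead
```

The point is only that the dimension check happens after the conversion, so it does not matter what kind of container we passed in.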
Checking the ndim attribute is equivalent to checking the length of the shape attribute, i.e., it counts the number of dimensions. See the documentation.
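A quick way to convince yourself of this is to compare the two attributes on the inputs used earlier. This is just a small sketch; the results list is a name I introduced to collect the values.

```python
import numpy as np

results = []
for data in ([0, 1, 2], [[0], [1], [2]]):
    arr = np.asarray(data)
    # ndim should always equal the number of entries in shape
    results.append((arr.shape, arr.ndim, len(arr.shape) == arr.ndim))

for row in results:
    print(row)
# ((3,), 1, True)
# ((3, 1), 2, True)
```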
And this is why we can pass lists, dataframes, and so on. It all boils down to NumPy arrays. Try printing np.asarray(train['Gr Liv Area']).ndim.
We reached the end of the rabbit hole.