Hey, Xuehong.
I think the documentation for the fit method could be clearer on this. Notice the bit below:
X : array-like or sparse matrix, shape (n_samples, n_features)
Specifically, shape (n_samples, n_features). This hints at what the input should look like: two-dimensional.
Let’s take a look at the error we get when we pass a one-dimensional argument:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X1d = [0, 1, 2]
>>> y = [0, 1, 2]
>>> lr.fit(X1d, y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
    estimator=estimator)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Note the message ValueError: Expected 2D array, got 1D array instead. It confirms what the documentation suggests. Let’s now pass a 2D version of X1d as the argument:
>>> X2d = [[0], [1], [2]]
>>> lr.fit(X2d, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
It worked fine! The problem with passing a series is that it is one-dimensional (try printing its shape attribute). When you pass a dataframe, even if it has just one column, it is in a 2D format (check its shape attribute).
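If you’d like to check this yourself, here is a minimal sketch. The dataframe is a made-up stand-in for train: the values are invented, and only the column name Gr Liv Area is taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real training data; the numbers are made up.
train = pd.DataFrame({"Gr Liv Area": [856, 1262, 920]})

series = train["Gr Liv Area"]    # single brackets select a Series (1D)
frame = train[["Gr Liv Area"]]   # double brackets select a one-column DataFrame (2D)

print(series.shape)  # (3,)
print(frame.shape)   # (3, 1)
```

So selecting the column with double brackets is one way to hand fit something two-dimensional.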
My answer may raise another question: how does fit know the dimensions of X1d and X2d when they are neither series nor dataframes? And what does “array-like” even mean?
I’ll try to give some insight into this in this reply.
A dive into the source code (accessible by clicking where it says source in the documentation) answers this question. Note that all images are clickable to the relevant code snippet.
First we see that the input is potentially modified by a function called check_X_y:

In the definition of this function we see that once again “X” (i.e. our first argument) is potentially modified by a function called check_array:

And now we look into the definition of check_array. When “X” isn’t a sparse matrix, we fall into the following case in a conditional statement:

So we see that our input is transformed into a numpy.ndarray object, and these happen to have a shape attribute.
>>> import numpy as np
>>> np.asarray(X1d)
array([0, 1, 2])
>>> np.asarray(X2d)
array([[0],
       [1],
       [2]])
>>> np.asarray(X1d).shape
(3,)
>>> np.asarray(X2d).shape
(3, 1)
In fact, just below this we can find the code that checks how many dimensions our modified input has:

Do the error messages look familiar?
The ndim attribute is equivalent to checking the length of the output of the shape attribute, i.e., it counts the number of dimensions. See the documentation. Here’s an example:
>>> np.asarray(X1d).ndim
1
>>> np.asarray(X2d).ndim
2
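To make the idea concrete, here is a heavily simplified sketch of that kind of check. Note that require_2d is a hypothetical helper I made up for illustration; it is not scikit-learn’s code, and the real check_array does a lot more (sparse input, dtypes, distinct messages for different cases).

```python
import numpy as np


def require_2d(X):
    """Hypothetical helper, not part of scikit-learn.

    Converts X to a NumPy array and insists on exactly two dimensions.
    """
    array = np.asarray(X)
    if array.ndim != 2:
        raise ValueError(
            "Expected 2D array, got {}D array instead".format(array.ndim)
        )
    return array


# A list of lists converts to a 2D array, so it passes.
print(require_2d([[0], [1], [2]]).shape)  # (3, 1)

# A flat list converts to a 1D array, so it is rejected.
try:
    require_2d([0, 1, 2])
except ValueError as err:
    print(err)  # Expected 2D array, got 1D array instead
```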
And this is why we can pass lists, dataframes, and so on. It all boils down to NumPy arrays. Try printing np.asarray(train['Gr Liv Area']).ndim.
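Here is a small sketch of that last experiment, together with the reshape that the error message recommends. Again, the dataframe is a made-up stand-in for train with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training data; the numbers are made up.
train = pd.DataFrame({"Gr Liv Area": [856, 1262, 920]})

from_series = np.asarray(train["Gr Liv Area"])     # column selected as a series
from_frame = np.asarray(train[["Gr Liv Area"]])    # column selected as a dataframe

print(from_series.ndim, from_frame.ndim)  # 1 2

# The fix from the error message: one feature, as many rows as there are samples.
reshaped = from_series.reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```

Either route, the one-column dataframe or the reshaped array, ends up with the (n_samples, n_features) layout the documentation asks for.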
We reached the end of the rabbit hole.