X dtype for linear regression fit()

Hi all,

I went back to review The Linear Regression Model mission (https://app.dataquest.io/m/235/the-linear-regression-model/5/using-scikit-learn-to-train-and-predict) along with the documentation for sklearn.linear_model.LinearRegression and found something that I don’t quite get.

From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), in “fit(self, X, y, sample_weight=None)”, “X” is array-like or sparse matrix while “y” is array-like.

However, in the course solution, we used train[['Gr Liv Area']] for X and train['SalePrice'] for y. Since train[['Gr Liv Area']] selects only one column, I thought I could just use train['Gr Liv Area'] instead, but I got an error. Can someone explain this?

Hey, Xuehong.

I think the documentation for the fit method could be clearer on this.

Notice the bit below:

    X : array-like or sparse matrix, shape (n_samples, n_features)

Specifically, shape (n_samples, n_features). This hints at what the input should look like: two-dimensional, with one row per sample and one column per feature.

Let’s take a look at the error when we pass a one-dimensional parameter:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X1d = [0, 1, 2]
>>> y = [0, 1, 2]
>>> lr.fit(X1d, y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
    estimator=estimator)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Note the line "ValueError: Expected 2D array, got 1D array instead". It confirms what the documentation suggests. Let’s now pass a 2D version of X1d as a parameter:

>>> X2d = [[0], [1], [2]]
>>> lr.fit(X2d, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

It worked fine! The problem with passing a series is that it is one-dimensional (try printing its shape attribute). When you pass a dataframe, even if it has just one column, it is in a 2D format (check its shape attribute).
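To make the series versus dataframe difference concrete, here is a minimal sketch. The `train` dataframe below is a hypothetical stand-in with made-up values; only the column names come from the course. It also shows two common ways to turn a series into a 2D input, which is one possible fix rather than the course's solution.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the course's `train` dataframe (values are made up).
train = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500],
    "SalePrice": [100000, 150000, 200000, 250000],
})

series_x = train["Gr Liv Area"]    # single brackets: a Series, one-dimensional
frame_x = train[["Gr Liv Area"]]   # double brackets: a DataFrame, two-dimensional

print(series_x.shape)  # (4,)
print(frame_x.shape)   # (4, 1)

# A 2D X with a 1D y is accepted.
lr = LinearRegression()
lr.fit(frame_x, train["SalePrice"])

# Two ways to get a 2D input from the Series.
as_frame = series_x.to_frame()
as_array = series_x.to_numpy().reshape(-1, 1)
print(as_frame.shape, as_array.shape)  # (4, 1) (4, 1)

# Passing the 1D Series as X is rejected.
raised = False
try:
    LinearRegression().fit(series_x, train["SalePrice"])
except ValueError:
    raised = True
print("ValueError raised:", raised)
```

The reshape(-1, 1) call is the same one the error message recommends for data with a single feature.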


My answer may raise another question: How does fit know the dimensions of X1d and X2d when they are neither series nor dataframes? And what does “array-like” even mean?

I’ll try to give some insight into this in this reply.

A dive into the source code (accessible through the source link in the documentation) answers this question. Note that all images are clickable to the relevant code snippet.

First we see that the input is potentially modified by a function called check_X_y:

[image: the source of fit, where check_X_y is called]

In the definition of this function we see that once again “X” (i.e. our first argument) is potentially modified by a function called check_array:

[image: the source of check_X_y, where check_array is called]

And now we look into the definition of check_array. When “X” isn’t a sparse matrix, we fall into the following case in a conditional statement:

[image: the branch of check_array that converts the input to a numpy.ndarray]

So we see that our input is transformed into a numpy.ndarray object and these happen to have a shape attribute.

>>> import numpy as np
>>> np.asarray(X1d)
array([0, 1, 2])
>>> np.asarray(X2d)
array([[0],
       [1],
       [2]])
>>> np.asarray(X1d).shape
(3,)
>>> np.asarray(X2d).shape
(3, 1)

In fact, just below this we can find the code that checks how many dimensions our modified input has:

[image: the code in check_array that checks the number of dimensions and raises the errors]

Do the error messages look familiar?
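For readers who can't open the screenshots, the idea can be sketched in a few lines. The function `check_2d` below is hypothetical and heavily simplified; it is not scikit-learn's code, and the real check_array handles many more cases (sparse input, dtypes, empty arrays, and so on). It only illustrates the two steps described above: convert with np.asarray, then inspect the number of dimensions.

```python
import numpy as np


def check_2d(X):
    """Hypothetical, simplified stand-in for the 2D check (not scikit-learn's code)."""
    array = np.asarray(X)  # lists, tuples, series, dataframes all become ndarrays
    if array.ndim != 2:
        raise ValueError(
            "Expected 2D array, got {}D array instead".format(array.ndim)
        )
    return array


print(check_2d([[0], [1], [2]]).shape)  # (3, 1)

try:
    check_2d([0, 1, 2])
except ValueError as err:
    print(err)  # Expected 2D array, got 1D array instead
```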

The ndim attribute gives the number of dimensions of the array, which is the same as the length of the tuple returned by the shape attribute. See the documentation. Here’s an example:

>>> np.asarray(X1d).ndim
1
>>> np.asarray(X2d).ndim
2

And this is why we can pass lists, dataframes, and so on. It all boils down to NumPy arrays. Try printing np.asarray(train['Gr Liv Area']).ndim.
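As a rough illustration of what “array-like” covers, the sketch below converts a few different inputs with np.asarray and reports their ndim. The `train` dataframe is again a hypothetical stand-in with made-up values, and the `inputs` dictionary is only there for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's `train` dataframe (values are made up).
train = pd.DataFrame({"Gr Liv Area": [1000, 1500, 2000]})

inputs = {
    "flat list": [0, 1, 2],
    "nested list": [[0], [1], [2]],
    "tuple of tuples": ((0,), (1,), (2,)),
    "Series": train["Gr Liv Area"],
    "DataFrame": train[["Gr Liv Area"]],
}

# Everything is converted to an ndarray first, so ndim is always available.
ndims = {name: np.asarray(value).ndim for name, value in inputs.items()}

for name, ndim in ndims.items():
    print(name, ndim)
# flat list 1
# nested list 2
# tuple of tuples 2
# Series 1
# DataFrame 2
```

Only the inputs with ndim equal to 2 would pass as X in fit.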

We reached the end of the rabbit hole.

Please see this markdown guide to learn how to format your question (add hyperlinks, format code appropriately, among other things).

If you could format your question, that would be great.

Thanks!

Hi Bruno,

Apologies for not using the correct format. I will try it next time. Thank you so much for the detailed explanation.

Xuehong
