# X dtype for linear regression fit()

Hi all,

I went back to review The Linear Regression Model mission (https://app.dataquest.io/m/235/the-linear-regression-model/5/using-scikit-learn-to-train-and-predict) along with the documentation for sklearn.linear_model.LinearRegression and found something that I don’t quite get.

From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), in “fit(self, X, y, sample_weight=None)”, “X” is array-like or sparse matrix while “y” is array-like.

However, in the course solution, we used train[['Gr Liv Area']] for X and train['SalePrice'] for y. Since train[['Gr Liv Area']] only concerns one column, I thought I could just use train['Gr Liv Area'] instead, but I got an error. Can someone explain this?

Hey, Xuehong.

I think the documentation for the `fit` method could be clearer on this.

Notice the bit below:

```
X : array-like or sparse matrix, shape (n_samples, n_features)
```

Specifically, note `shape (n_samples, n_features)`. This hints at what the input should look like: two-dimensional, with one row per sample and one column per feature.

Let’s take a look at the error when we pass a one-dimensional argument:

```python
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X1d = [0, 1, 2]
>>> y = [0, 1, 2]
>>> lr.fit(X1d, y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
    estimator=estimator)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0 1 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```

Note `ValueError: Expected 2D array, got 1D array instead`. It confirms what the documentation suggests. Let’s now pass a 2D version of `X1d` as a parameter:

```python
>>> X2d = [[0], [1], [2]]
>>> lr.fit(X2d, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
```

It worked fine! The problem with passing a series is that it is one-dimensional (try printing its `shape` attribute). When you pass a dataframe, even if it has just one column, it is in a 2D format (check its `shape` attribute).
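To see this with pandas, here is a minimal sketch that uses a small made-up dataframe in place of the course's `train` data. The values are invented for illustration; only the column names come from the question. It also shows two other ways to get a 2D input from a single column, in case you prefer them to double brackets.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the course's `train` dataframe (values are invented).
train = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500],
    "SalePrice": [100000, 150000, 200000, 250000],
})

series_shape = train["Gr Liv Area"].shape    # single brackets -> Series, 1D
frame_shape = train[["Gr Liv Area"]].shape   # double brackets -> DataFrame, 2D
print(series_shape)  # (4,)
print(frame_shape)   # (4, 1)

# Other ways to turn the single column into a 2D input:
as_frame = train["Gr Liv Area"].to_frame()                  # one-column DataFrame
as_array = train["Gr Liv Area"].to_numpy().reshape(-1, 1)   # array of shape (n_samples, 1)

# Fitting works once X is 2D; y can stay 1D.
lr = LinearRegression()
lr.fit(train[["Gr Liv Area"]], train["SalePrice"])
slope = lr.coef_[0]
```

Any of the three 2D forms can be passed as `X`.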


My answer can possibly raise another question: How is it that `fit` knows the dimensions of `X1d` and `X2d` when they aren’t even series, nor dataframes? And what does “array-like” even mean?

I’ll try to give some insight into this in this reply.

A dive into the source code (accessible by clicking where it says source in the documentation) answers this question.

First, we see that the input is potentially modified by a function called `check_X_y`.

In the definition of this function, we see that `X` (i.e. our first argument) is once again potentially modified, this time by a function called `check_array`.

Now we look into the definition of `check_array`. When `X` isn’t a sparse matrix, we fall into the branch of a conditional statement that converts it to an array.

So we see that our input is transformed into a `numpy.ndarray` object and these happen to have a `shape` attribute.

```python
>>> import numpy as np
>>> np.asarray(X1d)
array([0, 1, 2])
>>> np.asarray(X2d)
array([[0],
       [1],
       [2]])
>>> np.asarray(X1d).shape
(3,)
>>> np.asarray(X2d).shape
(3, 1)
```

In fact, just below this we can find the code that checks how many dimensions our converted input has. Its error messages should look familiar: they include the `Expected 2D array, got 1D array instead` text from the traceback above.
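As a rough illustration of what such a check amounts to, here is a simplified sketch. It is a paraphrase for this thread, not the actual scikit-learn source, and `check_2d` is a hypothetical name.

```python
import numpy as np

def check_2d(X):
    """Hypothetical, simplified stand-in for the dimension check (not scikit-learn's code)."""
    array = np.asarray(X)  # lists, series and dataframes all become ndarrays
    if array.ndim != 2:
        raise ValueError(
            "Expected 2D array, got {}D array instead".format(array.ndim)
        )
    return array

checked = check_2d([[0], [1], [2]])  # passes: ndim is 2

try:
    check_2d([0, 1, 2])              # fails: ndim is 1
except ValueError as err:
    message = str(err)
    print(message)
```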

The `ndim` attribute gives the number of dimensions, so it is equivalent to taking the length of the `shape` attribute. See the documentation. Here’s an example:

```python
>>> np.asarray(X1d).ndim
1
>>> np.asarray(X2d).ndim
2
```

And this is why we can pass lists, dataframes, and so on. It all boils down to NumPy arrays. Try printing `np.asarray(train['Gr Liv Area']).ndim`.
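If you don't have the course data at hand, a small sketch with an invented dataframe shows the same thing (the `train` values below are made up; only the column name comes from the question):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course's `train` dataframe (values are invented).
train = pd.DataFrame({"Gr Liv Area": [1000, 1500, 2000]})

ndim_series = np.asarray(train["Gr Liv Area"]).ndim    # Series -> 1D array
ndim_frame = np.asarray(train[["Gr Liv Area"]]).ndim   # DataFrame -> 2D array
print(ndim_series, ndim_frame)  # 1 2
```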

We reached the end of the rabbit hole.

If you could format your question, that would be great.

Thanks!

Hi Bruno,

Apologies for not using the correct format. I will try it next time. Thank you so much for the detailed explanation.

Xuehong
