Gradient Descent vs. lr.coef_ From sklearn LR Model?

I am trying to get a better understanding of where this gradient descent process fits into the big picture of what we are doing:

What is the difference between using the gradient descent process to calculate a_1 (the parameter values) as opposed to calling:

lr.coef_ from the sklearn LinearRegression() model?

a_1 is just the slope of the regression line, correct? Isn’t that what lr.coef_ is calculating for you? How is using the gradient descent process different?

When the lesson talks about the “optimization process”, how is this different than going through the steps we learned in the last lessons:

lr = LinearRegression()
and then calling:
lr.coef_ to find the coefficient

How is the “optimization” described in this lesson different from choosing the features that yield the lowest MSE?

Sorry if I am missing something or if this will soon be answered in the next lessons. I just don’t want to miss the big picture.

Thank you for your time!

There are 2 main ways to solve a linear regression:

  1. Analytical Solution
  2. Gradient Descent

From the docs you can see that sklearn’s LinearRegression uses scipy.linalg.lstsq, or scipy.optimize.nnls when the additional non-negative coefficient constraint is requested.
For a better understanding, you can open the source (scikit-learn/ at 95119c13af77c76e150b753485c662b7c52a41a2 in the scikit-learn/scikit-learn repository on GitHub) and observe how the class is initialized and fit.

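positive is given as an input parameter when initializing LinearRegression. As you can see in the source, it controls which of the 2 scipy solvers is called. Neither of them is gradient descent; both belong on the analytical side of the list above. (Strictly speaking, nnls is an iterative active-set routine, but it solves the constrained least-squares problem exactly rather than approximating it step by step.)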

If you wanted gradient descent, you would use sklearn.linear_model.SGDRegressor (stochastic gradient descent) with alpha = 0 to turn off regularization, which makes it ordinary linear regression.
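To make the comparison concrete, here is a minimal sketch that fits both estimators on the same data. The synthetic data (y = 3x + 2 plus noise), the seed, and the variable names are my own assumptions for illustration; they are not from the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Synthetic data (assumed for illustration): y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=1000)

# Direct least-squares solve.
lr = LinearRegression().fit(X, y)

# Stochastic gradient descent; alpha=0 switches off the regularization term.
sgd = SGDRegressor(alpha=0.0, max_iter=1000, random_state=0).fit(X, y)

print("LinearRegression:", lr.coef_, lr.intercept_)
print("SGDRegressor:    ", sgd.coef_, sgd.intercept_)
```

The two sets of numbers should be close but not identical, since SGD only approaches the minimum. SGD is also sensitive to feature scale, so on real data you would normally standardize the features first.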

lr.coef_ does not calculate anything. The calculation is done in def fit, and in the source you can see self.coef_ being assigned there from the output of the solver. Exposing results as attributes is how the API lets users get information out of a fitted model. If you study the source enough, you will also find things you can access on these objects that are not in the docs. Developers leave them out of the docs because they are not recommended for normal use.
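A quick way to convince yourself that coef_ is only a stored result is to check for it before and after fit. This is a small sketch on made-up toy data (exactly y = 2x); the variable names are mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): exactly y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

lr = LinearRegression()
had_coef_before_fit = hasattr(lr, "coef_")  # nothing has been computed yet

lr.fit(X, y)                                # all the computation happens here
has_coef_after_fit = hasattr(lr, "coef_")

print(had_coef_before_fit, has_coef_after_fit)
print(lr.coef_)                             # just reads the value stored by fit
```

Accessing lr.coef_ triggers no computation. It only reads what fit already stored on the object.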

Gradient descent comes into the picture when the analytical solution (try this first) fails, either because the matrix to solve is non-invertible or because solving it is too slow. Gradient descent gives a good-enough approximate solution. It is iterative (improving step by step) rather than jumping to the best answer in 1 step (the analytical closed-form solution). Gradient descent is also used when no closed-form solution is readily available, as with neural networks, where the function is too big and complex to even write down in closed form.

The analytical solution sets the derivative of the loss with respect to the coefficients to 0 and finds the global optimum immediately. GD also uses the derivative of the loss with respect to the coefficients, but rather than jumping to the global optimum immediately, it moves toward it iteratively in the direction of steepest descent.
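Here is a rough sketch of both routes for a single-feature model y = a_0 + a_1 * x, in plain Python. The toy data, the learning rate, and the iteration count are my own assumptions, chosen so that the loop converges on this particular data; they are not from the lesson.

```python
# Toy data (assumed for illustration): roughly y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
n = len(xs)

# Analytical route: set dMSE/da_0 = dMSE/da_1 = 0 and solve for a_0, a_1.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
a1_exact = s_xy / s_xx
a0_exact = y_mean - a1_exact * x_mean

# Gradient descent route: start anywhere and step against the gradient of MSE.
a0, a1 = 0.0, 0.0
learning_rate = 0.02  # assumed; too large a value would make the loop diverge
for _ in range(20000):
    errors = [a0 + a1 * x - y for x, y in zip(xs, ys)]
    grad_a0 = 2 / n * sum(errors)
    grad_a1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print("closed form:     ", a0_exact, a1_exact)
print("gradient descent:", a0, a1)
```

Both routes minimize the same MSE, so the printed pairs should agree to many decimal places. The difference is only in how they get there: one solve versus many small steps.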

The lesson teaches the theory. The framework gives a convenient way to implement that theory. By knowing the theory, you can tweak it and push the field forward, then contribute to the framework source code to improve the workflow for future coders. The framework/workflow by itself is not an optimization process; it is just a set of abstracted code to call for people who don’t care about the theory and trust that it works.

Both the analytical solution and gradient descent have the same goal: to minimize MSE. They just do it in different ways, which results in different characteristics, e.g. training speed, accuracy of the solution, or whether the global optimum is found.

When there are more predictors than observations, the closed-form equations do not have a unique solution (the non-invertible point mentioned above). GD will still run (it runs even with 1 row of data), but I am not sure how well GD performs in this case.
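As a small experiment on that last point, here is a sketch with 3 observations and 5 predictors. The random data, the step-size rule, and the iteration count are assumptions of mine. It relies on the standard result that plain gradient descent on least squares, when started from zeros, heads toward the minimum-norm solution, which is also what np.linalg.lstsq returns. It says nothing about how well such a fit generalizes.

```python
import numpy as np

# Random data (assumed for illustration): fewer rows than columns.
rng = np.random.default_rng(0)
n_obs, n_features = 3, 5
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# X^T X is 5x5 but its rank is at most 3, so it cannot be inverted.
rank = np.linalg.matrix_rank(X.T @ X)
print("rank of X^T X:", rank)

# lstsq still returns one answer: the minimum-norm one among infinitely many exact fits.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on MSE, started from zeros.
w = np.zeros(n_features)
learning_rate = n_obs / (2 * np.linalg.norm(X, 2) ** 2)  # small enough to be stable
for _ in range(50000):
    grad = 2 / n_obs * X.T @ (X @ w - y)
    w -= learning_rate * grad

train_mse = np.mean((X @ w - y) ** 2)
max_gap = np.max(np.abs(w - w_lstsq))
print("training MSE after GD:", train_mse)
print("largest difference from lstsq solution:", max_gap)
```

On the training data both fit essentially perfectly, which is exactly the problem: with more predictors than observations the model can memorize the rows, so a low training MSE tells you little.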

Sklearn’s Lasso uses coordinate descent. Ridge can use either a closed-form or a gradient-based solver, depending on which solver you initialize it with. In practice people just start with ElasticNet because it’s basically a combination of Lasso and Ridge that combines feature selection and handling multicollinearity.
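To illustrate the Ridge point, this sketch fits the same model with a direct solver ("cholesky") and with a stochastic-gradient-based one ("sag"). The synthetic data, alpha value, and tolerance settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed for illustration) with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same model and penalty, two different ways of finding the coefficients.
ridge_direct = Ridge(alpha=1.0, solver="cholesky").fit(X, y)
ridge_sag = Ridge(alpha=1.0, solver="sag", max_iter=10000, tol=1e-6,
                  random_state=0).fit(X, y)

print("cholesky:", ridge_direct.coef_)
print("sag:     ", ridge_sag.coef_)
```

The coefficients should agree closely. The solver changes how the minimum is found, not which minimum is being looked for.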


Thank you @hanqi for this detailed explanation!