There are mainly two ways to solve a linear regression:
- Analytical (closed-form) solution
- Gradient descent
From the docs you can see that sklearn's `LinearRegression` uses `scipy.linalg.lstsq`, or `scipy.optimize.nnls` when the additional non-negative-coefficients constraint is requested.
For a better understanding, you can open the source (`_base.py` in the scikit-learn GitHub repo, at commit 95119c1, where `LinearRegression` is defined) and observe how the class is initialized and fit.
`positive` is given as an input parameter when initializing `LinearRegression`. As you can see in the source, it controls which of the two scipy solvers gets called, and both correspond to the closed-form analytical solution.
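As a rough illustration (toy data and variable names are my own, not from the docs), the default `LinearRegression` fit should closely match calling `scipy.linalg.lstsq` yourself:

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Default positive=False -> closed-form least squares
lr = LinearRegression().fit(X, y)

# The same closed-form fit done directly, with an explicit intercept column
X1 = np.column_stack([np.ones(len(X)), X])
coef_lstsq, *_ = lstsq(X1, y)

print(lr.intercept_, lr.coef_)   # should closely match coef_lstsq[0] and coef_lstsq[1:]
print(coef_lstsq)
```

With `positive=True` the same fit would instead be routed through `scipy.optimize.nnls`, which constrains the coefficients to be non-negative.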
If you wanted gradient descent, you would use `sklearn.linear_model.SGDRegressor` with `alpha=0` to turn off regularization, which makes it plain linear regression.
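A minimal sketch of that (my own toy data; the exact spelling of the no-penalty option varies a little between sklearn versions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # features already roughly unit-scaled
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# penalty=None (older versions: penalty='none') plus alpha=0 removes the
# regularization term, so this is plain linear regression fit by SGD.
sgd = SGDRegressor(penalty=None, alpha=0.0, max_iter=5000, tol=1e-6, random_state=0)
sgd.fit(X, y)
print(sgd.coef_, sgd.intercept_)  # approximate; in practice scale your features first
```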
`lr.coef_` does not calculate anything. The calculation is done in `def fit`; you can see `self.coef_` being assigned there (in the source linked above) from the output of the solver. That's how the API lets users get at this information. There are things you can access on objects that are not in the docs, if you study the source enough; developers don't put them in the docs because they are not recommended for normal use.
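You can check this directly; attributes like `coef_` simply do not exist until `fit` has run (a small sketch with made-up data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.3

lr = LinearRegression()
print(hasattr(lr, "coef_"))      # False -- nothing is computed at __init__ time

lr.fit(X, y)                     # the actual solve happens here
print(lr.coef_, lr.intercept_)   # plain attributes set by fit, no extra computation
```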
Gradient descent comes into the picture when the analytical solution (try this first) fails, either because the matrix to solve is non-invertible or because solving it is too slow. Gradient descent gives a good-enough approximate solution. It is iterative (improving the fit step by step) rather than jumping to the best answer in one step like the analytical closed-form solution. Gradient descent is also used when there is no closed-form solution readily available, e.g. neural networks, where the function is too big and complex to even write down in closed form.
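To make the step-by-step idea concrete, here is a minimal batch gradient descent loop for linear regression in NumPy (toy data, no intercept, fixed step size; all of it my own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)          # start from an arbitrary point
step_size = 0.1
for _ in range(500):
    grad = 2.0 / len(X) * X.T @ (X @ w - y)   # gradient of MSE wrt w
    w -= step_size * grad                     # move against the gradient

print(w)   # after enough steps, close to the closed-form solution
```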
The analytical solution sets the derivative of the loss with respect to the coefficients to zero and finds the global optimum immediately. GD also uses the derivative of the loss with respect to the coefficients, but rather than jumping to the global optimum immediately, it moves towards it iteratively in the direction of steepest descent.
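Written out in standard OLS notation (my own summary of the step described above):

```latex
L(\beta) = \tfrac{1}{n}\lVert y - X\beta \rVert^2,
\qquad
\nabla_\beta L = -\tfrac{2}{n} X^\top (y - X\beta)

\text{Analytical: } \nabla_\beta L = 0
\;\Rightarrow\; X^\top X \hat\beta = X^\top y
\;\Rightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y

\text{Gradient descent: } \beta_{t+1} = \beta_t - \eta\, \nabla_\beta L(\beta_t)
```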
The lesson teaches the theory. The framework gives you a convenient way to implement that theory. By knowing the theory, you can tweak it and push the field forward, then contribute to the framework's source code to improve the workflow for future coders. The framework/workflow by itself is not an optimization process; it is just a set of abstracted code to call, for people who don't care about the theory and trust that it works.
Both the analytical solution and gradient descent pursue the same goal: minimizing MSE. They just do it in different ways, which results in different characteristics, e.g. training speed, accuracy of the solution, and whether the global optimum is found.
When there are more predictors than observations, the closed form cannot be solved for a unique solution (the non-invertibility point mentioned above). GD will still run (it runs even with one row of data), but I'm not sure how well GD performs in this case.
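You can see the non-invertibility directly: with more predictors than observations, X^T X is rank-deficient (toy shapes of my own below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))            # 5 observations, 10 predictors
y = rng.normal(size=5)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))       # at most 5 < 10, so XtX is singular

# Inverting XtX fails or is numerically meaningless, but a least-squares
# routine still returns one of the infinitely many minimizers
# (the minimum-norm one).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```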
Sklearn's Lasso uses coordinate descent; Ridge can use either a closed-form solver or a gradient-based one depending on which `solver` you initialize it with. In practice people often just start with ElasticNet, because it is basically a combination of Lasso and Ridge that combines feature selection with handling multicollinearity.
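For reference, a sketch of how those choices look when constructing the estimators (the `alpha` values are arbitrary placeholders):

```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Lasso: fit by coordinate descent
lasso = Lasso(alpha=0.1)

# Ridge: solver is selectable, e.g. 'cholesky'/'svd' are direct/closed-form,
# while 'sag'/'saga' are stochastic gradient-based iterative solvers
ridge_closed_form = Ridge(alpha=1.0, solver="cholesky")
ridge_gradient = Ridge(alpha=1.0, solver="saga")

# ElasticNet: l1_ratio mixes the two penalties (1.0 -> pure L1, 0.0 -> pure L2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
```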