
The Math Behind Gradient Descent

Imagine that you are standing at the top of a slope and have to walk down to its lowest point. Now imagine that the slope is slippery and you have only seconds to complete your walk. How would you manage that? Welcome to gradient descent!

Gradient descent is a “first-order iterative optimization algorithm for finding the local minimum of a differentiable function”.

To implement gradient descent, there must be a differentiable function. A differentiable function is one whose derivative exists with respect to a particular variable. This differentiable function is what we call the cost or loss function.
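As a minimal sketch of such a cost function (the function names here are my own, not from the article), mean squared error is differentiable, so both its value and its derivative can be computed at any point:

```python
import numpy as np

def mse(y, yhat):
    """Mean squared error between targets y and predictions yhat."""
    return np.mean((y - yhat) ** 2)

def mse_grad(y, yhat):
    """Derivative of MSE with respect to the predictions yhat."""
    return 2 * (yhat - y) / len(y)

y = np.array([1.0, 0.0, 1.0])
yhat = np.array([0.8, 0.2, 0.6])
print(mse(y, yhat))       # 0.08
print(mse_grad(y, yhat))  # points in the direction of steepest ascent
```

Because the derivative exists everywhere, gradient descent can repeatedly step against it to reduce the cost.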

Gradient descent is the optimization technique used to find the bias and coefficient(s) in the linear regression and logistic regression algorithms. In this article, I shall apply both the mean squared error and the log-loss cost functions to the logistic function.

The aim of this article is to show the math behind gradient descent when the cost function is mean squared error or log-loss. Hopefully, this article will inspire you to apply the same technique to other cost functions, for example tanh.




This article has shown the math behind applying gradient descent to the logistic function when the cost functions are mean squared error and log-loss. The process finds the local minimum of a differentiable function using its first-order derivatives.
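The derivation above can be sketched numerically. This is a hedged illustration rather than the article's own code, using synthetic data and the standard log-loss gradient for logistic regression, (yhat − y)·x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, linearly separable data labeled by a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w > 0).astype(float)

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    yhat = sigmoid(X @ w)
    grad = X.T @ (yhat - y) / len(y)  # gradient of the mean log-loss
    w -= lr * grad                    # first-order descent step

print(w)  # signs should match true_w after training
```

Swapping `grad` for the mean-squared-error gradient changes the update rule but not the overall descent loop.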


Could you explain the part about “Combining both”? Why do you need to sum the differentials of the cost function for a and b?


Whenever the bias and the coefficient(s) are updated, this is the same as summing the partial derivatives.

What I was expecting is that when updating a, we use the equation for a, and similarly when updating b, we use the equation for b, without summing in the term coming from the other variable.
Are you saying we should use the same summed equation to update both a and b?
Is that really correct, since we would then be applying an extra term from b’s equation to a and vice versa?

Or is that fine because it works like the log-loss equation, where even though there is a sum of two terms, only one of them will be non-zero at any moment depending on whether y = 1 or y = 0?

Could you explain this further: how does summing the partial derivatives apply during learning?
I understand why there are partial derivatives and how they apply individually to the variables they are differentiated with respect to, but I don’t understand why we sum them.

Thanks for the question!

You can do as you have described, or you can use matrix algebra.

With your method, you update a and b separately.

With matrix algebra, you update them at the same time.

For example, with log-loss we can take a common factor out of the update equations: (y − yhat) multiplied by [1, x], where [1, x] is the input vector with a leading 1 for the bias.

If (y − yhat) is a float or int multiplying [1, x], which is a numpy array/matrix, you can update everything at once.

And this is exactly how I implemented it.

For “summing the partial derivatives,” imagine that your coef_ is a numpy array whose first item is the bias. When you update coef_ += (y − yhat) * [1, x], you update the items of coef_ individually, in place.

Because coef_ holds both the bias and the coefficients, your coef_ update is the sum of the partial derivatives.
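Here is a sketch of that “update at once” idea. I am assuming, as described above, that `coef_` stores the bias as its first item and that the sample is augmented with a leading 1, so one vector operation adjusts bias and coefficients together; the variable names and learning rate are mine, not from the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

coef_ = np.zeros(3)        # [bias, w1, w2]
x = np.array([0.5, -1.0])  # one training sample
y = 1.0
lr = 0.1

x_aug = np.r_[1.0, x]      # prepend 1 so the bias shares the same update
yhat = sigmoid(coef_ @ x_aug)
coef_ += lr * (y - yhat) * x_aug  # every entry updated in place at once

print(coef_)  # [0.05, 0.025, -0.05]
```

With your method you would write three scalar updates, one per partial derivative; the vector form performs the same three updates in a single line, which is what “summing the partial derivatives” amounts to here.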