Deriving the Derivative of the "Sum of Mean Squared Errors" Equation for Multi-variate Linear Equations

The questions I have are:

  • How do you derive the equation for the MSE derivative of a multi-variate case of linear regression?
  • How are the two resulting equations from using the chain rule proven to be equivalent with respect to this specific equation?

I used the linearity of differentiation property: [image]

I wrote the equation as the summation of derivatives instead of the derivative of a summation.
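
For reference, the linearity step can be written out as follows. This is only a sketch: it assumes the cost is the plain mean of squared errors over n samples, with x_1^{(i)} and y^{(i)} denoting the i-th sample. The original images may have used a different normalization (for example 1/(2n)), which would only change the constant factor in front.

```latex
\frac{\partial}{\partial a_0}
  \left[ \frac{1}{n} \sum_{i=1}^{n} \left( a_0 + a_1 x_1^{(i)} - y^{(i)} \right)^2 \right]
= \frac{1}{n} \sum_{i=1}^{n}
  \frac{\partial}{\partial a_0} \left( a_0 + a_1 x_1^{(i)} - y^{(i)} \right)^2
```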
Next, I used the chain rule assuming:

  • g(x) = x^2
  • h(x) = a0 + a1x1 - y
  • f(x) = g(h(x)) = (a0 + a1x1 - y)^2


That is where I am stuck. I can only change the above function to: [image]
I treated [image] as the sum of two functions, then simplified their derivatives using the “Derivative of a Constant” rule. Next, I used the “Derivative of a Simple Linear Equation” rule: [image]

That brings me back to the two questions. The minor question is: doesn’t using the chain rule inflate the value of the derivative for this statement?
How is [image] = [image]?

The more important question of course is where I went wrong or where I go from here to get the DQ answer. Feel free to point to links. I only reviewed the properties/rules of derivatives and did not make any effort to use the proofs of the rules to find an answer or determine the equivalence of the terms after using the Chain Rule.

In the image with two rows, can you check your brackets in the first row, after the * sign? That line does not follow the chain rule you correctly specified above. Why does the last term, y, have a derivative applied to it again? Basically, I’m saying there should only be two d/da0 in that first row, not three.

It’s better to start with simpler examples if you are learning the chain rule, to prevent mistakes with complicated brackets like this. I sense it’s not just a simple typo here, but that you are not yet familiar with the chain rule.

I hope this helps you.

It is actually a mistake. I had solved it by hand the “long way”. While writing the post, I immediately wrote the d/da0 symbol because I assumed the next step after simplifying was to treat the second term after * as two terms, then cancel out y(i), treating it as a constant just like in the MSE derivative equation for a simple linear equation. I will also correct the images I posted. I was experimenting with whether it was faster to have expressions generated by a GUI program that uses some LaTeX and then take pictures of them, instead of using MS Word or writing them like Python expressions. I rushed the images and did not match the equations with the text I wrote.

Does this step: [image]
mean that I should have treated a1 as a constant as well? For ax, with x being a variable and a being a constant, the derivative is a. So, I should have treated a1 as a constant?

As a side note, I now have the impression that a1, a0 and x1 can all be “legally” or mathematically treated as constants. Based on the situation though, we have to pick which can be logically and reasonably treated as a constant.

Last, may I know how you make those images?

Yes, you should, because you are differentiating with respect to a0, so everything else is a constant, and the derivative of a constant is zero.

No, a0 cannot be treated as constant in this case. We do not have to pick which can be treated as a constant either. If you are differentiating it with respect to a0, then a0 is not a constant, everything else is.
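
As a worked illustration of that rule, here is the chain rule applied to a single squared-error term, using the g and h from the original post (g(u) = u^2, h = a0 + a1x1 - y). Treat it as a sketch for one sample. The full MSE derivative is the average of these terms over all samples, assuming a plain 1/n normalization. Note that d/da0 appears only twice in the first line.

```latex
\frac{\partial}{\partial a_0} \left( a_0 + a_1 x_1 - y \right)^2
  = 2 \left( a_0 + a_1 x_1 - y \right) \cdot
    \frac{\partial}{\partial a_0} \left( a_0 + a_1 x_1 - y \right)
  = 2 \left( a_0 + a_1 x_1 - y \right) \cdot (1 + 0 - 0)
  = 2 \left( a_0 + a_1 x_1 - y \right)

\frac{\partial}{\partial a_1} \left( a_0 + a_1 x_1 - y \right)^2
  = 2 \left( a_0 + a_1 x_1 - y \right) \cdot
    \frac{\partial}{\partial a_1} \left( a_0 + a_1 x_1 - y \right)
  = 2 \left( a_0 + a_1 x_1 - y \right) \cdot x_1
```

In the first case a1, x1 and y are held constant, so their derivatives vanish. In the second case a0, x1 and y are held constant, and a1x1 behaves like "ax" with x1 playing the role of the constant coefficient.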

I use Google Colab to create my notebooks. Its text cells support LaTeX, which allows you to format equations like this.

As additional information, the reason that the derivative of a variable x with respect to itself equals one is that applying the power rule to x^1 results in 1 * x^0 = 1.
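
The rule discussed in this thread (differentiate with respect to one coefficient and hold everything else fixed) can be sanity-checked numerically. The following is a minimal Python sketch, not code from the thread: the function names, the sample data, and the plain mean (1/n) normalization are all assumptions made for illustration. It compares the chain-rule result with a central finite-difference estimate.

```python
import numpy as np


def mse(a0, a1, x1, y):
    """Mean squared error of the line a0 + a1*x1 against targets y."""
    return np.mean((a0 + a1 * x1 - y) ** 2)


def mse_grad(a0, a1, x1, y):
    """Partial derivatives of the MSE obtained with the chain rule."""
    residual = a0 + a1 * x1 - y
    d_a0 = np.mean(2 * residual * 1.0)  # inner derivative w.r.t. a0 is 1
    d_a1 = np.mean(2 * residual * x1)   # inner derivative w.r.t. a1 is x1
    return d_a0, d_a1


def central_difference(f, value, eps=1e-6):
    """Numerical estimate of df/dvalue using a central difference."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


# Made-up data and coefficients, for illustration only.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])
a0, a1 = 0.5, 1.5

analytic_a0, analytic_a1 = mse_grad(a0, a1, x1, y)

# Vary only one coefficient at a time; everything else stays constant.
numeric_a0 = central_difference(lambda v: mse(v, a1, x1, y), a0)
numeric_a1 = central_difference(lambda v: mse(a0, v, x1, y), a1)

print("d/da0 analytic:", analytic_a0, "numeric:", numeric_a0)
print("d/da1 analytic:", analytic_a1, "numeric:", numeric_a1)
```

If the two columns agree to several decimal places, the chain-rule derivation is consistent with the definition of the derivative, and the factor of 2 from the outer function is not "inflating" anything.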