Abhijeet Pokhriyal

Nov 26, 2019

7 min read

Gradient Descent

One parameter

  • Let’s suppose the true phenomenon is y = 6x.

• but we don’t know the exact relationship, so we assume a model of the form y = a·x.

• using the sum of squared errors as the loss function: L(a) = Σᵢ (yᵢ − a·xᵢ)²

  • The only thing we can vary the residuals with is a; x and y are constants given by our dataset.
  • Consider the derivative of the loss with respect to a: dL/da = −2 Σᵢ xᵢ(yᵢ − a·xᵢ)

Let’s see how a change in a changes the residuals.

  • We expected a parabola, because the loss is a quadratic function of a.

• We see that the minimum residual is achieved when the coefficient a = 6, which is our true parameter.
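The sweep above can be sketched in a few lines of Python. The dataset here is an assumption: seven points generated from y = 6x (the post’s true coefficient), and the grid of candidate coefficients is likewise illustrative.

```python
import numpy as np

# Hypothetical dataset generated from the assumed true relationship y = 6x
x = np.arange(1, 8)  # 7 points, matching the post's small sample
y = 6 * x

# Sweep candidate coefficients and record the sum of squared errors for each
coeffs = np.arange(0, 12.5, 0.5)
sse = [np.sum((y - a * x) ** 2) for a in coeffs]

# The loss traces out a parabola in a, with its minimum at the true coefficient
best = coeffs[np.argmin(sse)]
print(best)  # 6.0
```

Because the data contain no noise, the loss is exactly zero at a = 6 and grows quadratically on either side.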

• What if we had two parameters?

Two Parameters

• Suppose our true relationship is linear, y = a·x + b, for some fixed true a and b.

• which we don’t know, so we assume a model of the form ŷ = a·x + b.

• We want to find the best a and b: the values that fit our data well.

• We again use the least squares loss function: L(a, b) = Σᵢ (yᵢ − a·xᵢ − b)²

• What we can change here are both a and b.

But now it is difficult for us to visualize, since there are three dimensions:

1. Total loss
2. a
3. b

We get slight errors in our a and b estimates, but that is due to the small sample size. If we increase the sample size to 100 instead of 7, the estimates move much closer to the true values.
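That effect is easy to reproduce. Everything below is illustrative: the true coefficients (3 and 2), the noise level, and the random seed are assumptions, since the post does not give its underlying data. `np.polyfit` with degree 1 performs the least squares line fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_line(n, a_true=3.0, b_true=2.0, noise=1.0):
    """Generate n noisy points from y = a_true*x + b_true and fit a line by least squares."""
    x = np.linspace(0, 10, n)
    y = a_true * x + b_true + rng.normal(0, noise, n)
    a_hat, b_hat = np.polyfit(x, y, 1)  # degree-1 polynomial = straight line
    return a_hat, b_hat

print(fit_line(7))    # small sample: noticeable error in the a and b estimates
print(fit_line(100))  # larger sample: estimates much closer to 3 and 2
```

With only 7 points the noise has a large influence on the fit; averaging over 100 points washes most of it out.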

Gradient Descent

• Now the idea is to use a loss function together with an algorithm that finds the minimum of that loss function. One such algorithm is gradient descent.

• The algorithm travels down the green line in our examples above, finds the minimum, and therefore the corresponding values of a and b that minimize the loss function (in our case above, the sum of squared errors).

• No matter how many parameters there are, the concept remains the same: find the minimum of the loss function.
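For the two-parameter case, a minimal gradient descent sketch looks like this. The data (assumed true a = 3 and b = 2, plus a little noise), the learning rate, and the iteration count are all illustrative; each step moves a and b against the partial derivatives of the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)  # hypothetical truth: a = 3, b = 2

a, b = 0.0, 0.0   # starting estimates
lr = 1e-4         # learning rate, small enough to stay stable on this data
for _ in range(20000):
    resid = y - (a * x + b)
    a -= lr * (-2 * np.sum(x * resid))  # dL/da
    b -= lr * (-2 * np.sum(resid))      # dL/db
print(round(a, 2), round(b, 2))  # close to the assumed truth of 3 and 2
```

The same loop structure extends to any number of parameters: one partial derivative and one update per parameter.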

• Let’s return to the simple one-parameter example.

  • What if we change our starting estimate of a, or our learning rate?

We can see that it converges to the true value of 6.
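Here is a sketch of that run: the same assumed dataset (y = 6x on seven points), a starting estimate of a = 0, and a small learning rate. The gradient is the derivative of the sum of squared errors with respect to a.

```python
import numpy as np

x = np.arange(1, 8).astype(float)  # 7 points, assumed dataset
y = 6 * x                          # true coefficient is 6

a = 0.0       # starting estimate
lr = 0.001    # learning rate (step size)
for _ in range(1000):
    grad = -2 * np.sum(x * (y - a * x))  # dL/da for L(a) = sum((y - a*x)^2)
    a -= lr * grad

print(round(a, 4))  # 6.0
```

Starting from a different initial estimate only changes how far the algorithm has to travel, not where it ends up, since the loss has a single minimum.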

  • Let’s change the learning rate.
  • We see that the algorithm converges much more quickly when we increase the learning rate.

• But what if we increase it too much?

  • It converges at an even faster rate.


• Now what we see is that it fails to converge. It begins to diverge and then never looks back.
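All three regimes can be reproduced with the same one-parameter sketch by varying only the learning rate. The specific rates below are illustrative, chosen around the stability threshold for this assumed dataset.

```python
import numpy as np

x = np.arange(1, 8).astype(float)  # 7 points, assumed dataset
y = 6 * x                          # true coefficient is 6

def descend(lr, steps=50, a=0.0):
    """Run gradient descent on the SSE loss and return the final estimate of a."""
    for _ in range(steps):
        grad = -2 * np.sum(x * (y - a * x))
        a -= lr * grad
    return a

print(descend(0.0005))  # small rate: steadily approaching 6
print(descend(0.003))   # larger rate: reaches 6 much faster
print(descend(0.01))    # too large: each step overshoots and the estimate blows up
```

The intuition: each update multiplies the current error by a fixed factor that depends on the learning rate. While that factor stays below 1 in magnitude, a larger rate shrinks the error faster; push the rate past the threshold and the factor exceeds 1, so every step overshoots the minimum by more than the last, and the estimate diverges.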