Gradient Descent

One parameter

  • Let's suppose the true phenomenon is
y <- 6x

• but we don’t know the exact relationship so we assume

y <- ax

• using the sum of squared errors as the loss function

yhat <- ax
residual <- yhat - y
residual <- ax - y

residual² <- (ax - y)²

• let residual² be our loss function, so lossfunction() <- (ax - y)²
  • The only way we can change the residuals is by changing a; x and y are constants given by our dataset.
  • Consider the derivative below (a quick numerical check follows)

d lossfunction() / d a = 2x(ax - y)
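As a quick numerical check (a small sketch with made-up numbers, not code from the original post; loss, analytic and finite_diff are names introduced here), the analytic derivative 2x(ax - y), summed over the data points, should agree with a finite-difference approximation of the total loss:

# Sketch: compare the analytic gradient with a finite-difference approximation
x <- c(8, 9, 10, 11, 12, 13)   # made-up sample
y <- 6 * x                     # true relationship, no noise

loss <- function(a) sum((a*x - y)^2)

a <- 4
analytic    <- sum(2 * x * (a*x - y))                  # sum of 2x(ax - y) over the data
finite_diff <- (loss(a + 1e-6) - loss(a - 1e-6)) / (2e-6)

c(analytic = analytic, finite_diff = finite_diff)      # the two should (nearly) agree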

Let's see how changing a changes the total squared residuals

library(ggplot2)
library(dplyr)
library(tidyr)

a <- c(1,2,3,4,5,6,7,8,9,10,11)

x <- c(8,9,10,11,12,13)
y <- 6*x + rnorm(length(x), sd = 1.5) # random noise

residuals <- sapply(a, function(a) sum((a*x - y)^2))

ggplot(data = data.frame(coff = a, res = residuals), aes(x = coff, y = res)) +
  geom_point() +
  geom_smooth()
  • We expected a parabola because
lossfunction() <- Σ (ax - y)²

• We see that the minimum residuals are achieved when coff = 6, which is our true parameter (a closed-form check follows after the next bullet).

• What if we had 2 parameters?
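Before moving on to two parameters, here is a quick closed-form check of the one-parameter fit (a sketch, not part of the original post; a_hat is a name introduced here). For the loss Σ(ax - y)², calculus gives the minimizer a = sum(x*y)/sum(x^2), and a no-intercept lm() returns the same number:

# closed-form minimizer of sum((a*x - y)^2), using the x and y simulated above
a_hat <- sum(x*y) / sum(x^2)
a_hat

# the same answer from a no-intercept linear model
coef(lm(y ~ 0 + x))

Both should land near the true value of 6, matching where the parabola above bottoms out.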

Two Parameters

• Suppose our true relationship is

y = 7*x1 + 2*x2

• which we don't know, so we assume

y = a*x1 + b*x2

• we want to find the best a and b that fit our data well.

• we again use our least squares loss function.

residual = (yhat - y)
         = (ax1 + bx2 - y)

residual² = (ax1 + bx2 - y)²

• here we can change both a and b.

so   d lossfunction() / d a = 2x1(ax1 + bx2 - y)

and  d lossfunction() / d b = 2x2(ax1 + bx2 - y)

But now it's difficult for us to visualize, since there are 3 dimensions: 1. total loss, 2. a, 3. b.

a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)

x1 <- c(8,9,10,11,12,13)
x2 <- c(100,120,130,140,80,70)


# since it's a linear combination we scale the independent variables

work <- function(x1, x2, a, b){

  x1 <- scale(x1)
  x2 <- scale(x2)

  # True relationship
  # y = 7*x1 + 2*x2

  y <- 7*x1 + 2*x2 + rnorm(length(x1), sd = 1.5) # random noise

  # every combination of the candidate coefficients
  ecomb <- expand.grid(a, b) %>% as.data.frame()
  colnames(ecomb) <- c("a", "b")

  # total squared residuals for each (a, b) pair
  ecomb$residuals <- mapply(function(a, b) sum((a*x1 + b*x2 - y)^2),
                            ecomb$a, ecomb$b)

  print(head(ecomb %>% filter(a == 7 | b == 2)))

  gathered <- ecomb %>% gather(key = "coefname", value = "coefvalue", -residuals)

  ggplot(data = gathered, aes(x = coefvalue, y = residuals)) +
    geom_point() +
    geom_smooth(color = 'green') +
    geom_hline(yintercept = 0, color = 'blue') +
    facet_grid(~coefname) +
    geom_vline(data = data.frame(coefname = c("a", "b"), z = c(7.0, 2.0)),
               aes(xintercept = z), color = 'yellow')
}

work(x1 ,x2, a,b)
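Another way around the three-dimension problem is to look at the loss surface directly as a heat map over a grid of (a, b). This is a sketch that is not part of the original workflow; x1g, x2g, yg and grid are names introduced here, with the data re-simulated the same way as inside work():

# Sketch: the same loss surface as a heat map over a grid of (a, b)
x1g <- scale(c(8, 9, 10, 11, 12, 13))[, 1]
x2g <- scale(c(100, 120, 130, 140, 80, 70))[, 1]
yg  <- 7*x1g + 2*x2g + rnorm(length(x1g), sd = 1.5)

grid <- expand.grid(a = seq(4, 10, by = 0.25), b = seq(-1, 5, by = 0.25))
grid$loss <- mapply(function(a, b) sum((a*x1g + b*x2g - yg)^2), grid$a, grid$b)

ggplot(grid, aes(x = a, y = b, fill = loss)) +
  geom_tile() +
  annotate("point", x = 7, y = 2, colour = 'red') # true (a, b)

The lowest-loss tiles should cluster around the true (a, b) = (7, 2).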

We get slight errors in our a and b estimates, but that's due to the small sample size. If we increase the sample size to 100 instead of 6:

## Same coef values as above
a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)


### Increased Sample Size to 100
x1 <- rnorm(100 , 10 , 5)
x2 <- rnorm(100 , 100 , 10)


work(x1 ,x2, a,b)

Gradient Descent

• Now the idea is to use a loss function together with an algorithm that finds the minimum of that loss function. One such algorithm is gradient descent.

• The algorithm travels down the green line in our examples above to find the minimum, and therefore the corresponding values of a and b that minimize the loss function (in our case above, the sum of squared errors).

• No matter how many parameters there are, the concept remains the same: find the minimum of the loss function (see the two-parameter sketch below).
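For example, a minimal two-parameter sketch (not code from the post; gradDescent2 is a name introduced here, and the data is re-simulated as in the scaled example above) that steps along the two partial derivatives from the previous section looks like this:

# Sketch: gradient descent with two parameters on loss = sum((a*x1 + b*x2 - y)^2)
set.seed(1)
x1 <- scale(rnorm(100, 10, 5))[, 1]
x2 <- scale(rnorm(100, 100, 10))[, 1]
y  <- 7*x1 + 2*x2 + rnorm(100, sd = 1.5)

gradDescent2 <- function(a, b, learningRate = 0.001, iter = 100){
  for(i in seq_len(iter)){
    r <- a*x1 + b*x2 - y                  # residuals at the current estimates
    a <- a - learningRate * sum(2*x1*r)   # step along d loss / d a
    b <- b - learningRate * sum(2*x2*r)   # step along d loss / d b
  }
  c(a = a, b = b)
}

gradDescent2(a = 0, b = 0) # should land close to the true a = 7, b = 2

With scaled predictors, a learning rate of 0.001 is small enough here for the steps to shrink steadily.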

• Let's return to the simple 1-parameter example.

a <- c(1,2,3,4,5,6,7,8,9,10,11)

x <- rnorm(100)
y <- 6*x + rnorm(length(x), sd = 1.5) # random noise

residFunc <- function(a) sum((a*x - y)^2)

residuals <- sapply(a, residFunc)

# derivative of the loss with respect to a, summed over the data
slope <- function(a){ sum(2*x*(a*x - y)) }


learningRate <- 0.001
startingA <- 11


gradiantDescent <- function(startingA, learningRate, iter = 10){
  aestimate <- startingA
  estimates <- c()
  stepStop <- 0.1   # stop once the step gets this small

  for(i in seq(1, iter, 1)){
    step <- slope(aestimate) * learningRate
    if(abs(step) < stepStop) break
    newAEstimate <- aestimate - step
    aestimate <- newAEstimate
    print(aestimate)
    estimates <- c(estimates, aestimate)
  }
  estimates
}

estimates <- gradiantDescent(startingA , learningRate , 20)
[1] 10.13894
[1] 9.421883
[1] 8.824738
[1] 8.327455
[1] 7.913334
[1] 7.568467
[1] 7.281272
[1] 7.042106
[1] 6.842936
[1] 6.677074
[1] 6.538949
[1] 6.423924

ggplot(data = data.frame(coff = a, res = residuals), aes(x = coff, y = res)) +
  geom_point() +
  geom_smooth() +
  geom_point(data = data.frame(x = estimates, y = sapply(estimates, residFunc)),
             aes(x = x, y = y), color = 'red', size = 5)

ggplot(data = data.frame(x = estimates), aes(y = x, x = seq_along(x))) +
  geom_point() +
  geom_smooth()
  • What if we change our starting estimate of a, or our learning rate?
startingA = 100
learningRate = 0.001
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 84.25529
[1] 71.14359
[1] 60.2246
[1] 51.13161
[1] 43.55926
[1] 37.25325
[1] 32.00181
[1] 27.62857
[1] 23.98669
[1] 20.95384
[1] 18.42818
[1] 16.32489
[1] 14.57334
[1] 13.11471
[1] 11.90001
[1] 10.88844
[1] 10.04604
[1] 9.344515
[1] 8.760308
[1] 8.2738
[1] 7.868652
[1] 7.531257
[1] 7.250285
[1] 7.016301
[1] 6.821447
[1] 6.659178
[1] 6.524046
[1] 6.411513
ggplot(data = data.frame(x = estimates), aes(y = x, x = seq_along(x))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 6, color = 'red')

We can see that it converges to the true value of 6.

  • Let's change the learning rate.
startingA = 100
learningRate = 0.002
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 68.51057
[1] 47.55323
[1] 33.60537
[1] 24.32257
[1] 18.14453
[1] 14.03283
[1] 11.29634
[1] 9.475114
[1] 8.26302
[1] 7.456329
[1] 6.919446
[1] 6.562132
[1] 6.324327
[1] 6.166058
[1] 6.060725
ggplot(data = data.frame(x = estimates), aes(y = x, x = seq_along(x))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 6, color = 'red')
  • We see that the algorithm converges much more quickly when we increase the learning rate

• But what if we increase it too much?

startingA = 100
learningRate = 0.005
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 21.27643
[1] 8.378402
[1] 6.265195
[1] 5.918968
ggplot(data = data.frame(x = estimates), aes(y = x, x = seq_along(x))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 6, color = 'red')
  • It converges at an even faster rate.
startingA = 100
learningRate = 0.1
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] -1474.471
[1] 23281.27
[1] -365958.5
[1] 5754140
[1] -90473445
[1] 1422533037
[1] -22366785121
[1] 351677652498
[1] -5.529501e+12
[1] 8.694151e+13
[1] -1.367e+15
[1] 2.149362e+16
[1] -3.379487e+17
[1] 5.313639e+18
[1] -8.354745e+19
[1] 1.313634e+21
[1] -2.065454e+22
[1] 3.247557e+23
[1] -5.106203e+24
[1] 8.028589e+25
[1] -1.262352e+27
[1] 1.984822e+28
[1] -3.120778e+29
[1] 4.906865e+30
[1] -7.715166e+31
[1] 1.213072e+33
[1] -1.907338e+34
[1] 2.998948e+35
[1] -4.715308e+36
[1] 7.413978e+37
[1] -1.165715e+39
[1] 1.832879e+40
[1] -2.881874e+41
[1] 4.53123e+42
[1] -7.124548e+43
[1] 1.120208e+45
[1] -1.761326e+46
[1] 2.76937e+47
[1] -4.354338e+48
[1] 6.846417e+49
[1] -1.076477e+51
[1] 1.692567e+52
[1] -2.661258e+53
[1] 4.184352e+54
[1] -6.579144e+55
[1] 1.034453e+57
[1] -1.626492e+58
[1] 2.557367e+59
[1] -4.021001e+60
[1] 6.322305e+61
[1] -9.940692e+62
[1] 1.562996e+64
[1] -2.457531e+65
[1] 3.864028e+66
[1] -6.075492e+67
[1] 9.552624e+68
[1] -1.501979e+70
[1] 2.361593e+71
[1] -3.713182e+72
[1] 5.838315e+73
[1] -9.179704e+74
[1] 1.443344e+76
[1] -2.2694e+77
[1] 3.568226e+78
[1] -5.610396e+79
[1] 8.821343e+80
[1] -1.386998e+82
[1] 2.180807e+83
[1] -3.428928e+84
[1] 5.391375e+85
[1] -8.476972e+86
[1] 1.332852e+88
[1] -2.095671e+89
[1] 3.295068e+90
[1] -5.180904e+91
[1] 8.146045e+92
[1] -1.28082e+94
[1] 2.01386e+95
[1] -3.166434e+96
[1] 4.97865e+97
[1] -7.828036e+98
[1] 1.230819e+100
[1] -1.935242e+101
[1] 3.042821e+102
[1] -4.784292e+103
[1] 7.522442e+104
[1] -1.182769e+106
[1] 1.859693e+107
[1] -2.924034e+108
[1] 4.597521e+109
[1] -7.228778e+110
[1] 1.136596e+112
[1] -1.787094e+113
[1] 2.809885e+114
[1] -4.418041e+115
[1] 6.946578e+116
[1] -1.092225e+118
[1] 1.717328e+119
[1] -2.700191e+120
[1] 4.245567e+121
ggplot(data = data.frame(x = estimates), aes(y = x, x = seq_along(x))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 6, color = 'red')

estimates

[1] -1.474471e+03 2.328127e+04 -3.659585e+05 5.754140e+06 -9.047344e+07
[6] 1.422533e+09 -2.236679e+10 3.516777e+11 -5.529501e+12 8.694151e+13
[11] -1.367000e+15 2.149362e+16 -3.379487e+17 5.313639e+18 -8.354745e+19
[16] 1.313634e+21 -2.065454e+22 3.247557e+23 -5.106203e+24 8.028589e+25
[21] -1.262352e+27 1.984822e+28 -3.120778e+29 4.906865e+30 -7.715166e+31
[26] 1.213072e+33 -1.907338e+34 2.998948e+35 -4.715308e+36 7.413978e+37
[31] -1.165715e+39 1.832879e+40 -2.881874e+41 4.531230e+42 -7.124548e+43
[36] 1.120208e+45 -1.761326e+46 2.769370e+47 -4.354338e+48 6.846417e+49
[41] -1.076477e+51 1.692567e+52 -2.661258e+53 4.184352e+54 -6.579144e+55
[46] 1.034453e+57 -1.626492e+58 2.557367e+59 -4.021001e+60 6.322305e+61
[51] -9.940692e+62 1.562996e+64 -2.457531e+65 3.864028e+66 -6.075492e+67
[56] 9.552624e+68 -1.501979e+70 2.361593e+71 -3.713182e+72 5.838315e+73
[61] -9.179704e+74 1.443344e+76 -2.269400e+77 3.568226e+78 -5.610396e+79
[66] 8.821343e+80 -1.386998e+82 2.180807e+83 -3.428928e+84 5.391375e+85
[71] -8.476972e+86 1.332852e+88 -2.095671e+89 3.295068e+90 -5.180904e+91
[76] 8.146045e+92 -1.280820e+94 2.013860e+95 -3.166434e+96 4.978650e+97
[81] -7.828036e+98 1.230819e+100 -1.935242e+101 3.042821e+102 -4.784292e+103
[86] 7.522442e+104 -1.182769e+106 1.859693e+107 -2.924034e+108 4.597521e+109
[91] -7.228778e+110 1.136596e+112 -1.787094e+113 2.809885e+114 -4.418041e+115
[96] 6.946578e+116 -1.092225e+118 1.717328e+119 -2.700191e+120 4.245567e+121

• Now what we see is that it fails to converge: it begins to diverge and never looks back (a rough explanation follows below).
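A rough way to see why (a sketch, not from the original post; contraction is a name introduced here, and x is the same simulated vector as above): because the loss is quadratic in a, each update rescales the distance from the optimum by a fixed factor that depends on the learning rate.

# For loss = sum((a*x - y)^2) the update is
#   a_new = a - learningRate * 2*sum(x^2)*a + learningRate * 2*sum(x*y)
# so the distance from the optimum is rescaled each step by roughly
# |1 - 2*learningRate*sum(x^2)|.
contraction <- function(learningRate) abs(1 - 2 * learningRate * sum(x^2))

contraction(0.001) # below 1: slow, steady convergence
contraction(0.005) # near 0: very fast convergence
contraction(0.1)   # above 1: every step overshoots more and the estimates blow up

With x <- rnorm(100), sum(x^2) is roughly 100, so a learning rate near 0.005 is close to ideal and anything much above 0.01 should diverge, which matches what we saw.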
