Gradient Descent

Abhijeet Pokhriyal
7 min read · Nov 26, 2019

One parameter

  • Let’s suppose the true phenomenon is

y <- 6*x

• but we don’t know the exact relationship, so we assume

y <- a*x

• using the sum of squared errors as the loss function

yhat <- a*x
residual <- yhat - y
residual <- a*x - y

residual^2 <- (a*x - y)^2

  • Let the squared residual be our loss function, so lossfunction(a) <- (a*x - y)^2
  • The only thing we can vary is a; x and y are constants given by our dataset.
  • Consider the derivative below

d lossfunction() / d a = 2*x*(a*x - y)
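
As a quick sanity check on that derivative, here is a minimal sketch that compares the analytic gradient 2*x*(a*x - y) against a finite-difference approximation. The helper names and the values a0, x0, y0 are made up for illustration, not taken from the dataset below.

# analytic gradient of the loss (a*x - y)^2 with respect to a
analytic_grad <- function(a, x, y) 2 * x * (a*x - y)

# central finite-difference approximation of the same gradient
numeric_grad <- function(a, x, y, h = 1e-6) {
  (((a + h)*x - y)^2 - ((a - h)*x - y)^2) / (2 * h)
}

a0 <- 3; x0 <- 2; y0 <- 12  # made-up values, just to compare the two
analytic_grad(a0, x0, y0)   # 2*2*(3*2 - 12) = -24
numeric_grad(a0, x0, y0)    # agrees to within ~1e-6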

Let’s see how changing a changes the residuals.

library(ggplot2)
library(dplyr)
library(tidyr)
a <- c(1,2,3,4,5,6,7,8,9,10,11)

x <- c(8,9,10,11,12,13)
y <- 6*x + rnorm(length(x) , sd = 1.5) # random noise

residuals <- sapply(a, function(a) sum((a*x - y)^2))

ggplot(data = data.frame(coff = a , res = residuals), aes(x = coff , y = res)) +
geom_point() +
geom_smooth()
  • We expected a parabola because

lossfunction(a) <- sum((a*x - y)^2)

• We see that the minimum of the loss is achieved when coff = 6, which is our true parameter.

• What if we had two parameters?

Two Parameters

• Suppose our true relationship is

y = 7*x1 + 2*x2

• which we don’t know, so we assume

y = a*x1 + b*x2

• We want to find the best a and b that fit our data well.

• We again use our least-squares loss function.

residual = yhat - y
         = a*x1 + b*x2 - y

residual^2 = (a*x1 + b*x2 - y)^2

• What we can change here are both a and b.

d lossfunction() / d a = 2*x1*(a*x1 + b*x2 - y)
d lossfunction() / d b = 2*x2*(a*x1 + b*x2 - y)
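
The same check works in two dimensions. Below is a minimal sketch of the two partial derivatives as R functions; the helper names grad_a and grad_b and the single made-up data point are illustrative only.

# partial derivatives of the loss (a*x1 + b*x2 - y)^2,
# one with respect to each parameter
grad_a <- function(a, b, x1, x2, y) 2 * x1 * (a*x1 + b*x2 - y)
grad_b <- function(a, b, x1, x2, y) 2 * x2 * (a*x1 + b*x2 - y)

# at the true parameters the residual, and hence both gradients, is zero
grad_a(7, 2, x1 = 1, x2 = 3, y = 7*1 + 2*3)  # 0
grad_b(7, 2, x1 = 1, x2 = 3, y = 7*1 + 2*3)  # 0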

But now it’s difficult for us to visualize, since there are three dimensions: (1) total loss, (2) a, and (3) b.

a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)

x1 <- c(8,9,10,11,12,13)
x2 <- c(100,120,130,140,80,70)


# since it's a linear combination, we scale all the independent variables

work <- function(x1 , x2, a,b){

x1 <- scale(x1)
x2 <- scale(x2)

# True
# y = 7*x1 + 2*x2

y <- 7*x1 + 2*x2 + rnorm(length(x1) , sd = 1.5) # random noise

ecomb <- expand.grid(a, b) %>% as.data.frame()
colnames(ecomb) <- c("a", "b")

# loss for every (a, b) combination in the grid
ecomb$residuals <- mapply(function(a, b) sum((a*x1 + b*x2 - y)^2),
                          ecomb$a, ecomb$b)

print(head(ecomb %>% filter(a == 7 | b == 2)))

gathered <- ecomb %>% gather(key = "coefname" , value = "coefvalue" , -residuals)

ggplot(data = gathered , aes(x = coefvalue , y = residuals)) +
geom_jitter(width = 0.1) +
geom_smooth(color = 'green') +
geom_hline(yintercept = 0 , color = 'blue') +
facet_grid(~coefname) +
geom_vline(data = data.frame(coefname = c("a" , "b") , z = c(7.0, 2.0)) , aes(xintercept = z) , color = 'yellow')
}

work(x1 ,x2, a,b)

We get slight errors in our a and b estimates, but that’s due to the small sample size. Let’s increase the sample size to 100 instead of 6.

## Same coef values as above
a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)


### Increased Sample Size to 100
x1 <- rnorm(100 , 10 , 5)
x2 <- rnorm(100 , 100 , 10)


work(x1 ,x2, a,b)

Gradient Descent

• Now the idea is to use a loss function and an algorithm that finds the minimum of that loss function. One such algorithm is gradient descent.

• The algorithm would travel along the green line in the examples above, find the minimum, and therefore the corresponding values of a and b that minimize the loss function (in our case, the sum of squared errors).

• No matter how many parameters there are, the concept remains the same: find the minimum of the loss function (a generic sketch follows below).
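
To make the update rule concrete, here is a minimal, generic sketch: new estimate = old estimate - learningRate * gradient, applied to a whole parameter vector at once. The function descend and the data below are illustrative, not part of the worked example that follows.

# generic gradient descent over a parameter vector;
# gradFunc must return the gradient vector at params
descend <- function(params, gradFunc, learningRate, iter = 100){
  for(i in seq_len(iter)){
    params <- params - learningRate * gradFunc(params)
  }
  params
}

# two-parameter loss sum((a*x1 + b*x2 - y)^2), made-up data with true a = 7, b = 2
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 7*x1 + 2*x2 + rnorm(100, sd = 1.5)
gradFunc <- function(p) c(sum(2*x1*(p[1]*x1 + p[2]*x2 - y)),
                          sum(2*x2*(p[1]*x1 + p[2]*x2 - y)))

descend(c(0, 0), gradFunc, learningRate = 0.001, iter = 500)  # lands close to c(7, 2)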

• Let’s return to the simple one-parameter example.

a <- c(1,2,3,4,5,6,7,8,9,10,11)

x <- rnorm(100)
y <- 6*x + rnorm(length(x) , sd = 1.5) # random noise

residFunc <- function(a) sum((a*x - y)^2)

residuals <- sapply(a, residFunc)

# derivative of the loss with respect to a
slope <- function(a){ sum(2*x*(a*x - y)) }


learningRate <- 0.001
startingA <- 11


gradiantDescent <- function(startingA, learningRate , iter = 10){
  aestimate <- startingA
  estimates <- c()
  stepStop <- 0.1  # stop once the step becomes this small

  for( i in seq(1, iter, 1) ){
    step <- slope(aestimate)*learningRate
    if(abs(step) < stepStop) break
    newAEstimate <- aestimate - step  # move against the gradient
    aestimate <- newAEstimate
    print(aestimate)
    estimates <- c(estimates , aestimate)
  }
  estimates
}

estimates <- gradiantDescent(startingA , learningRate , 20)
[1] 10.13894
[1] 9.421883
[1] 8.824738
[1] 8.327455
[1] 7.913334
[1] 7.568467
[1] 7.281272
[1] 7.042106
[1] 6.842936
[1] 6.677074
[1] 6.538949
[1] 6.423924

ggplot(data = data.frame(coff = a , res = residuals), aes(x = coff , y = res)) +
geom_point() +
geom_smooth() +
geom_point(data = data.frame(x = estimates , y = sapply(estimates, residFunc)) , aes(x = x, y = y) , color = 'red' , size = 5)
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq(1:length(x)))) +
geom_point() +
geom_smooth()
  • What if we change our starting estimate of a or our learning rate?
startingA = 100
learningRate = 0.001
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 84.25529
[1] 71.14359
[1] 60.2246
[1] 51.13161
[1] 43.55926
[1] 37.25325
[1] 32.00181
[1] 27.62857
[1] 23.98669
[1] 20.95384
[1] 18.42818
[1] 16.32489
[1] 14.57334
[1] 13.11471
[1] 11.90001
[1] 10.88844
[1] 10.04604
[1] 9.344515
[1] 8.760308
[1] 8.2738
[1] 7.868652
[1] 7.531257
[1] 7.250285
[1] 7.016301
[1] 6.821447
[1] 6.659178
[1] 6.524046
[1] 6.411513
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq(1:length(x)))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')

We can see that it converges to the true value of 6.

  • Let’s change the learning rate
startingA = 100
learningRate = 0.002
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 68.51057
[1] 47.55323
[1] 33.60537
[1] 24.32257
[1] 18.14453
[1] 14.03283
[1] 11.29634
[1] 9.475114
[1] 8.26302
[1] 7.456329
[1] 6.919446
[1] 6.562132
[1] 6.324327
[1] 6.166058
[1] 6.060725
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq(1:length(x)))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')
  • We see that the algorithm converges much more quickly when we increase the learning rate

• But what if we increase it too much?

startingA = 100
learningRate = 0.005
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 21.27643
[1] 8.378402
[1] 6.265195
[1] 5.918968
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq(1:length(x)))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')
  • It converges at an even faster rate. But let’s push the learning rate even higher:
startingA = 100
learningRate = 0.1
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] -1474.471
[1] 23281.27
[1] -365958.5
[1] 5754140
[1] -90473445
[1] 1422533037
[1] -22366785121
[1] 351677652498
[1] -5.529501e+12
[1] 8.694151e+13
[1] -1.367e+15
[1] 2.149362e+16
[1] -3.379487e+17
[1] 5.313639e+18
[1] -8.354745e+19
[1] 1.313634e+21
[1] -2.065454e+22
[1] 3.247557e+23
[1] -5.106203e+24
[1] 8.028589e+25
[1] -1.262352e+27
[1] 1.984822e+28
[1] -3.120778e+29
[1] 4.906865e+30
[1] -7.715166e+31
[1] 1.213072e+33
[1] -1.907338e+34
[1] 2.998948e+35
[1] -4.715308e+36
[1] 7.413978e+37
[1] -1.165715e+39
[1] 1.832879e+40
[1] -2.881874e+41
[1] 4.53123e+42
[1] -7.124548e+43
[1] 1.120208e+45
[1] -1.761326e+46
[1] 2.76937e+47
[1] -4.354338e+48
[1] 6.846417e+49
[1] -1.076477e+51
[1] 1.692567e+52
[1] -2.661258e+53
[1] 4.184352e+54
[1] -6.579144e+55
[1] 1.034453e+57
[1] -1.626492e+58
[1] 2.557367e+59
[1] -4.021001e+60
[1] 6.322305e+61
[1] -9.940692e+62
[1] 1.562996e+64
[1] -2.457531e+65
[1] 3.864028e+66
[1] -6.075492e+67
[1] 9.552624e+68
[1] -1.501979e+70
[1] 2.361593e+71
[1] -3.713182e+72
[1] 5.838315e+73
[1] -9.179704e+74
[1] 1.443344e+76
[1] -2.2694e+77
[1] 3.568226e+78
[1] -5.610396e+79
[1] 8.821343e+80
[1] -1.386998e+82
[1] 2.180807e+83
[1] -3.428928e+84
[1] 5.391375e+85
[1] -8.476972e+86
[1] 1.332852e+88
[1] -2.095671e+89
[1] 3.295068e+90
[1] -5.180904e+91
[1] 8.146045e+92
[1] -1.28082e+94
[1] 2.01386e+95
[1] -3.166434e+96
[1] 4.97865e+97
[1] -7.828036e+98
[1] 1.230819e+100
[1] -1.935242e+101
[1] 3.042821e+102
[1] -4.784292e+103
[1] 7.522442e+104
[1] -1.182769e+106
[1] 1.859693e+107
[1] -2.924034e+108
[1] 4.597521e+109
[1] -7.228778e+110
[1] 1.136596e+112
[1] -1.787094e+113
[1] 2.809885e+114
[1] -4.418041e+115
[1] 6.946578e+116
[1] -1.092225e+118
[1] 1.717328e+119
[1] -2.700191e+120
[1] 4.245567e+121
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq(1:length(x)))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')

estimates

[1] -1.474471e+03 2.328127e+04 -3.659585e+05 5.754140e+06 -9.047344e+07
[6] 1.422533e+09 -2.236679e+10 3.516777e+11 -5.529501e+12 8.694151e+13
[11] -1.367000e+15 2.149362e+16 -3.379487e+17 5.313639e+18 -8.354745e+19
[16] 1.313634e+21 -2.065454e+22 3.247557e+23 -5.106203e+24 8.028589e+25
[21] -1.262352e+27 1.984822e+28 -3.120778e+29 4.906865e+30 -7.715166e+31
[26] 1.213072e+33 -1.907338e+34 2.998948e+35 -4.715308e+36 7.413978e+37
[31] -1.165715e+39 1.832879e+40 -2.881874e+41 4.531230e+42 -7.124548e+43
[36] 1.120208e+45 -1.761326e+46 2.769370e+47 -4.354338e+48 6.846417e+49
[41] -1.076477e+51 1.692567e+52 -2.661258e+53 4.184352e+54 -6.579144e+55
[46] 1.034453e+57 -1.626492e+58 2.557367e+59 -4.021001e+60 6.322305e+61
[51] -9.940692e+62 1.562996e+64 -2.457531e+65 3.864028e+66 -6.075492e+67
[56] 9.552624e+68 -1.501979e+70 2.361593e+71 -3.713182e+72 5.838315e+73
[61] -9.179704e+74 1.443344e+76 -2.269400e+77 3.568226e+78 -5.610396e+79
[66] 8.821343e+80 -1.386998e+82 2.180807e+83 -3.428928e+84 5.391375e+85
[71] -8.476972e+86 1.332852e+88 -2.095671e+89 3.295068e+90 -5.180904e+91
[76] 8.146045e+92 -1.280820e+94 2.013860e+95 -3.166434e+96 4.978650e+97
[81] -7.828036e+98 1.230819e+100 -1.935242e+101 3.042821e+102 -4.784292e+103
[86] 7.522442e+104 -1.182769e+106 1.859693e+107 -2.924034e+108 4.597521e+109
[91] -7.228778e+110 1.136596e+112 -1.787094e+113 2.809885e+114 -4.418041e+115
[96] 6.946578e+116 -1.092225e+118 1.717328e+119 -2.700191e+120 4.245567e+121

• Now what we see is that it fails to converge: it begins to diverge and then never looks back.
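
Why 0.1 blows up while 0.005 still converges can be made precise for this quadratic loss: each update multiplies the distance from the least-squares minimum by (1 - 2*learningRate*sum(x^2)), so the iteration shrinks the error only while learningRate < 1/sum(x^2). The short sketch below reuses the x drawn above, so the exact threshold depends on that random draw (roughly 0.01 here), and checks both learning rates against that bound.

# largest stable learning rate for this loss; about 0.01 for 100 standard-normal x values
1 / sum(x^2)

abs(1 - 2 * 0.005 * sum(x^2)) < 1  # TRUE:  learningRate = 0.005 shrinks the error each step
abs(1 - 2 * 0.1   * sum(x^2)) < 1  # FALSE: learningRate = 0.1 grows it, hence the divergence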


Abhijeet Pokhriyal

School of Data Science @ University of North Carolina at Charlotte