Gradient Descent

Abhijeet Pokhriyal
7 min read · Nov 26, 2019


One Parameter

  • Let's suppose the true phenomenon is

y <- 6*x

• but we don't know the exact relationship, so we assume

y <- a*x

• using the sum of squared errors as the loss function

yhat <- a*x
residual <- yhat - y
residual <- a*x - y

residual² <- (a*x - y)²

• Let residual² be our loss function, so lossfunction() <- (a*x - y)²
  • The only thing we can vary is a; x and y are constants given by our dataset.
  • Consider the derivative of the loss with respect to a (a quick numerical check follows below):

d lossfunction() / d a <- 2*x*(a*x - y)
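As a sanity check on that derivative (this check is my addition, not part of the original walkthrough), we can compare the analytic gradient against a finite-difference approximation; the x, y and a values below are just illustrative:

# Hedged sketch: verify d/da sum((a*x - y)^2) = sum(2*x*(a*x - y))
# with a central finite difference. The data here are illustrative.
x <- c(8, 9, 10, 11, 12, 13)
y <- 6 * x
a <- 4

lossFunction <- function(a) sum((a * x - y)^2)
analyticGrad <- sum(2 * x * (a * x - y))

eps <- 1e-6
numericGrad <- (lossFunction(a + eps) - lossFunction(a - eps)) / (2 * eps)

analyticGrad
numericGrad   # agrees with analyticGrad up to rounding error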

Let's see how changing a changes the residuals.

library(ggplot2)
library(dplyr)
library(tidyr)
a <- c(1,2,3,4,5,6,7,8,9,10,11)  # candidate values for the slope a

x <- c(8,9,10,11,12,13)
y <- 6*x + rnorm(length(x), sd = 1.5)  # true relationship plus random noise

# total squared error for each candidate value of a
residuals <- sapply(a, function(a) sum((a*x - y)^2))

ggplot(data = data.frame(coff = a , res = residuals), aes(x = coff , y = res)) +
geom_point() +
geom_smooth()
  • We expected a parabola because

lossfunction() <- Σ (a*x - y)²

• We see that the minimum residual is achieved when coff = 6, which is our true parameter (a closed-form check is sketched below).

• What if we had two parameters?
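Before moving on to two parameters: for this one-parameter least-squares problem we can also read the minimizer off in closed form, since setting the derivative sum(2*x*(a*x - y)) to zero gives a = sum(x*y) / sum(x^2). A minimal sketch of this check (my addition, using the same x and y as above):

# Closed-form least-squares minimizer for y ~ a*x (no intercept)
aClosedForm <- sum(x * y) / sum(x^2)
aClosedForm   # close to 6, up to the noise added to y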

Two Parameters

• Suppose our true relationship is

y = 7*x1 + 2*x2

• which we don't know, so we assume

y = a*x1 + b*x2

• We want to find the best a and b that fit our data well.

• We again use the least-squares loss function.

residual = yhat - y
         = a*x1 + b*x2 - y

residual² = (a*x1 + b*x2 - y)²

• Here we can change both a and b.

d lossfunction() / d a = 2*x1*(a*x1 + b*x2 - y)

d lossfunction() / d b = 2*x2*(a*x1 + b*x2 - y)

But now it's difficult for us to visualize, since there are three dimensions: total loss, a, and b (a heatmap of the full surface is sketched after the examples below).

a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)

x1 <- c(8,9,10,11,12,13)
x2 <- c(100,120,130,140,80,70)


# since it's a linear combination, we scale all the independent variables (inside work() below)

work <- function(x1, x2, a, b){

  # scale the independent variables
  x1 <- scale(x1)
  x2 <- scale(x2)

  # True relationship: y = 7*x1 + 2*x2
  y <- 7*x1 + 2*x2 + rnorm(length(x1), sd = 1.5)  # random noise

  # every combination of candidate a and b values
  ecomb <- expand.grid(a = a, b = b)

  # total squared error for each (a, b) pair
  ecomb$residuals <- mapply(function(a, b) sum((a*x1 + b*x2 - y)^2),
                            ecomb$a, ecomb$b)

  print(head(ecomb %>% filter(a == 7 | b == 2)))

  # long format: one row per (coefficient name, coefficient value, loss) triple
  gathered <- ecomb %>% gather(key = "coefname", value = "coefvalue", -residuals)

  ggplot(data = gathered, aes(x = coefvalue, y = residuals)) +
    geom_jitter(width = 0.1) +
    geom_smooth(color = 'green') +
    geom_hline(yintercept = 0, color = 'blue') +
    facet_grid(~coefname) +
    geom_vline(data = data.frame(coefname = c("a", "b"), z = c(7.0, 2.0)),
               aes(xintercept = z), color = 'yellow')
}

work(x1 ,x2, a,b)

We get slight errors in our a and b estimates, but that's due to the small sample size. Let's increase the sample size to 100.

## Same coef values as above
a <- c(4,5,6,7,8,9,10)
b <- c(-1,0,1,2,3,4,5)


### Increased Sample Size to 100
x1 <- rnorm(100 , 10 , 5)
x2 <- rnorm(100 , 100 , 10)


work(x1 ,x2, a,b)
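Another way to look at the full three-dimensional loss surface, rather than the two faceted slices produced by work(), is a heatmap over the (a, b) grid. This sketch is my addition: it mirrors the data-generating step inside work(), and the variable names and grid ranges are assumptions:

# Hedged sketch: heatmap of the loss surface over a finer (a, b) grid
x1s <- scale(x1)
x2s <- scale(x2)
ysurf <- 7*x1s + 2*x2s + rnorm(length(x1s), sd = 1.5)

grid <- expand.grid(a = seq(4, 10, by = 0.25), b = seq(-1, 5, by = 0.25))
grid$residuals <- mapply(function(a, b) sum((a*x1s + b*x2s - ysurf)^2),
                         grid$a, grid$b)

ggplot(grid, aes(x = a, y = b, fill = residuals)) +
  geom_tile() +
  annotate("point", x = 7, y = 2, color = 'red', size = 3)  # true (a, b)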

Gradient Descent

• Now the idea is to use a loss function and an algorithm that finds the minimum of that loss function. One such algorithm is gradient descent.

• The algorithm would travel along the green line in the examples above and find the minimum, and therefore the corresponding values of a and b that minimize the loss function (in our case, the sum of squared errors).

• No matter how many parameters there are, the concept remains the same: find the minimum of the loss function. (A two-parameter sketch appears at the end of this post.)

• Let's return to the simple one-parameter example.

a <- c(1,2,3,4,5,6,7,8,9,10,11)

x <- rnorm(100)
y <- 6*x + rnorm(length(x) , sd = 1.5) # random noise

residFunc <- function(a) sum((a*x - y)^2)

residuals <- sapply(a, residFunc)

# derivative of the loss with respect to a
slope <- function(a){ sum(2*x*(a*x - y)) }


learningRate <- 0.001
startingA <- 11


gradiantDescent <- function(startingA, learningRate, iter = 10){
  aestimate <- startingA
  estimates <- c()
  stepStop <- 0.1   # stop once the step size gets small enough

  for( i in seq(1, iter, 1) ){
    step <- slope(aestimate)*learningRate   # step = gradient * learning rate
    if(abs(step) < stepStop) break
    newAEstimate <- aestimate - step        # move against the gradient
    aestimate <- newAEstimate
    print(aestimate)
    estimates <- c(estimates, aestimate)
  }
  estimates
}

estimates <- gradiantDescent(startingA , learningRate , 20)
[1] 10.13894
[1] 9.421883
[1] 8.824738
[1] 8.327455
[1] 7.913334
[1] 7.568467
[1] 7.281272
[1] 7.042106
[1] 6.842936
[1] 6.677074
[1] 6.538949
[1] 6.423924

ggplot(data = data.frame(coff = a , res = residuals), aes(x = coff , y = res)) +
geom_point() +
geom_smooth() +
geom_point(data = data.frame(x = estimates , y = sapply(estimates, residFunc)) , aes(x = x, y = y) , color = 'red' , size = 5)

ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq_along(x))) +
geom_point() +
geom_smooth()
  • What if we change our starting a estimate or our learning rate?
startingA = 100
learningRate = 0.001
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 84.25529
[1] 71.14359
[1] 60.2246
[1] 51.13161
[1] 43.55926
[1] 37.25325
[1] 32.00181
[1] 27.62857
[1] 23.98669
[1] 20.95384
[1] 18.42818
[1] 16.32489
[1] 14.57334
[1] 13.11471
[1] 11.90001
[1] 10.88844
[1] 10.04604
[1] 9.344515
[1] 8.760308
[1] 8.2738
[1] 7.868652
[1] 7.531257
[1] 7.250285
[1] 7.016301
[1] 6.821447
[1] 6.659178
[1] 6.524046
[1] 6.411513
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq_along(x))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')

We can see that it converges to the true value of 6.

  • Let's change the learning rate.
startingA = 100
learningRate = 0.002
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 68.51057
[1] 47.55323
[1] 33.60537
[1] 24.32257
[1] 18.14453
[1] 14.03283
[1] 11.29634
[1] 9.475114
[1] 8.26302
[1] 7.456329
[1] 6.919446
[1] 6.562132
[1] 6.324327
[1] 6.166058
[1] 6.060725
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq_along(x))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')
  • We see that the algorithm converges much more quickly when we increase the learning rate.

• But what if we increase it too much?

startingA = 100
learningRate = 0.005
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] 21.27643
[1] 8.378402
[1] 6.265195
[1] 5.918968
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq_along(x))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')
  • It converges at an even faster rate.
startingA = 100
learningRate = 0.1
iter = 100

estimates <- gradiantDescent(startingA , learningRate , iter)
[1] -1474.471
[1] 23281.27
[1] -365958.5
[1] 5754140
[1] -90473445
[1] 1422533037
[1] -22366785121
[1] 351677652498
[1] -5.529501e+12
[1] 8.694151e+13
[1] -1.367e+15
[1] 2.149362e+16
[1] -3.379487e+17
[1] 5.313639e+18
[1] -8.354745e+19
[1] 1.313634e+21
[1] -2.065454e+22
[1] 3.247557e+23
[1] -5.106203e+24
[1] 8.028589e+25
[1] -1.262352e+27
[1] 1.984822e+28
[1] -3.120778e+29
[1] 4.906865e+30
[1] -7.715166e+31
[1] 1.213072e+33
[1] -1.907338e+34
[1] 2.998948e+35
[1] -4.715308e+36
[1] 7.413978e+37
[1] -1.165715e+39
[1] 1.832879e+40
[1] -2.881874e+41
[1] 4.53123e+42
[1] -7.124548e+43
[1] 1.120208e+45
[1] -1.761326e+46
[1] 2.76937e+47
[1] -4.354338e+48
[1] 6.846417e+49
[1] -1.076477e+51
[1] 1.692567e+52
[1] -2.661258e+53
[1] 4.184352e+54
[1] -6.579144e+55
[1] 1.034453e+57
[1] -1.626492e+58
[1] 2.557367e+59
[1] -4.021001e+60
[1] 6.322305e+61
[1] -9.940692e+62
[1] 1.562996e+64
[1] -2.457531e+65
[1] 3.864028e+66
[1] -6.075492e+67
[1] 9.552624e+68
[1] -1.501979e+70
[1] 2.361593e+71
[1] -3.713182e+72
[1] 5.838315e+73
[1] -9.179704e+74
[1] 1.443344e+76
[1] -2.2694e+77
[1] 3.568226e+78
[1] -5.610396e+79
[1] 8.821343e+80
[1] -1.386998e+82
[1] 2.180807e+83
[1] -3.428928e+84
[1] 5.391375e+85
[1] -8.476972e+86
[1] 1.332852e+88
[1] -2.095671e+89
[1] 3.295068e+90
[1] -5.180904e+91
[1] 8.146045e+92
[1] -1.28082e+94
[1] 2.01386e+95
[1] -3.166434e+96
[1] 4.97865e+97
[1] -7.828036e+98
[1] 1.230819e+100
[1] -1.935242e+101
[1] 3.042821e+102
[1] -4.784292e+103
[1] 7.522442e+104
[1] -1.182769e+106
[1] 1.859693e+107
[1] -2.924034e+108
[1] 4.597521e+109
[1] -7.228778e+110
[1] 1.136596e+112
[1] -1.787094e+113
[1] 2.809885e+114
[1] -4.418041e+115
[1] 6.946578e+116
[1] -1.092225e+118
[1] 1.717328e+119
[1] -2.700191e+120
[1] 4.245567e+121
ggplot(data = data.frame(x = estimates ), aes(y = x , x = seq_along(x))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 6 , color = 'red')

estimates

[1] -1.474471e+03 2.328127e+04 -3.659585e+05 5.754140e+06 -9.047344e+07
[6] 1.422533e+09 -2.236679e+10 3.516777e+11 -5.529501e+12 8.694151e+13
[11] -1.367000e+15 2.149362e+16 -3.379487e+17 5.313639e+18 -8.354745e+19
[16] 1.313634e+21 -2.065454e+22 3.247557e+23 -5.106203e+24 8.028589e+25
[21] -1.262352e+27 1.984822e+28 -3.120778e+29 4.906865e+30 -7.715166e+31
[26] 1.213072e+33 -1.907338e+34 2.998948e+35 -4.715308e+36 7.413978e+37
[31] -1.165715e+39 1.832879e+40 -2.881874e+41 4.531230e+42 -7.124548e+43
[36] 1.120208e+45 -1.761326e+46 2.769370e+47 -4.354338e+48 6.846417e+49
[41] -1.076477e+51 1.692567e+52 -2.661258e+53 4.184352e+54 -6.579144e+55
[46] 1.034453e+57 -1.626492e+58 2.557367e+59 -4.021001e+60 6.322305e+61
[51] -9.940692e+62 1.562996e+64 -2.457531e+65 3.864028e+66 -6.075492e+67
[56] 9.552624e+68 -1.501979e+70 2.361593e+71 -3.713182e+72 5.838315e+73
[61] -9.179704e+74 1.443344e+76 -2.269400e+77 3.568226e+78 -5.610396e+79
[66] 8.821343e+80 -1.386998e+82 2.180807e+83 -3.428928e+84 5.391375e+85
[71] -8.476972e+86 1.332852e+88 -2.095671e+89 3.295068e+90 -5.180904e+91
[76] 8.146045e+92 -1.280820e+94 2.013860e+95 -3.166434e+96 4.978650e+97
[81] -7.828036e+98 1.230819e+100 -1.935242e+101 3.042821e+102 -4.784292e+103
[86] 7.522442e+104 -1.182769e+106 1.859693e+107 -2.924034e+108 4.597521e+109
[91] -7.228778e+110 1.136596e+112 -1.787094e+113 2.809885e+114 -4.418041e+115
[96] 6.946578e+116 -1.092225e+118 1.717328e+119 -2.700191e+120 4.245567e+121

• Now we see that it fails to converge: it begins to diverge and never looks back.
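Why does this happen? For this quadratic loss the update has a simple closed form: the gradient is 2*sum(x^2)*a - 2*sum(x*y), so each step multiplies the distance from the optimum by (1 - 2*learningRate*sum(x^2)). Once that factor has magnitude greater than 1, i.e. once learningRate > 1/sum(x^2), every step overshoots and the estimates blow up. A minimal sketch of this check (my addition, assuming the same x as in the gradient-descent code above):

# Hedged sketch: stability threshold for the one-parameter case.
# Each step multiplies (a - a_optimal) by (1 - 2*learningRate*sum(x^2)),
# so the iteration diverges once learningRate exceeds 1 / sum(x^2).
criticalRate <- 1 / sum(x^2)
criticalRate                         # roughly 0.01 for 100 draws from rnorm()

abs(1 - 2 * 0.005 * sum(x^2)) < 1    # TRUE: 0.005 converges
abs(1 - 2 * 0.1   * sum(x^2)) < 1    # FALSE: 0.1 diverges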

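Finally, the same recipe extends beyond one parameter: compute each partial derivative and update all coefficients together. Here is a minimal two-parameter sketch (my addition, not the original post's code) using the partial derivatives derived earlier; the data, starting values, and learning rate are illustrative assumptions:

# Hedged sketch: gradient descent for y = a*x1 + b*x2, using
# d/da = sum(2*x1*(a*x1 + b*x2 - y)) and d/db = sum(2*x2*(a*x1 + b*x2 - y))
set.seed(1)
x1 <- scale(rnorm(100, 10, 5))
x2 <- scale(rnorm(100, 100, 10))
y  <- 7*x1 + 2*x2 + rnorm(100, sd = 1.5)   # true a = 7, b = 2

gradDescent2 <- function(a, b, learningRate = 0.001, iter = 100){
  for(i in seq_len(iter)){
    resid <- a*x1 + b*x2 - y
    stepA <- learningRate * sum(2 * x1 * resid)   # step along the a gradient
    stepB <- learningRate * sum(2 * x2 * resid)   # step along the b gradient
    if(abs(stepA) < 0.001 && abs(stepB) < 0.001) break
    a <- a - stepA
    b <- b - stepB
  }
  c(a = a, b = b)
}

gradDescent2(a = 0, b = 0)   # ends up close to the true (7, 2)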