
Gradient Descent With Nesterov Momentum From Scratch


Last Updated on October 12, 2023

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it can get stuck in flat regions or bounce around if the objective function returns noisy gradients. Momentum is an approach that accelerates the progress of the search to skim across flat regions and smooth out bouncy gradients.

In some cases, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. Nesterov momentum is an extension of momentum that involves calculating the decaying moving average of the gradients of projected positions in the search space rather than the actual positions themselves.

This has the effect of harnessing the accelerating benefit of momentum while allowing the search to slow down when approaching the optima, reducing the likelihood of missing or overshooting it.

In this tutorial, you will discover how to develop the gradient descent optimization algorithm with Nesterov Momentum from scratch.

After completing this tutorial, you will know:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • The convergence of the gradient descent optimization algorithm can be accelerated by extending the algorithm with Nesterov Momentum.
  • How to implement the Nesterov Momentum optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Gradient Descent With Nesterov Momentum From Scratch
Photo by Bonnie Moreland, some rights reserved.

Tutorial Overview

This tutorial is split into three parts; they are:

  1. Gradient Descent
  2. Nesterov Momentum
  3. Gradient Descent With Nesterov Momentum
    1. Two-Dimensional Test Problem
    2. Gradient Descent Optimization With Nesterov Momentum
    3. Visualization of Nesterov Momentum

Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm, as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the objective function at a specific point, e.g. for a specific input.

If the objective function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate objective function may also be taken as a vector and is referred to generally as the “gradient.”

  • Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the objective function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the objective function to locate the minimum of the function.

The gradient descent algorithm requires an objective function that is being optimized and the derivative function for the objective function. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the objective function, assuming we are minimizing the objective function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the objective function.

  • x(t+1) = x(t) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

  • Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
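As a rough sketch of the update rule above (not the implementation developed later in this tutorial), plain gradient descent on a one-dimensional f(x) = x^2 might look as follows; the starting point, step size, and iteration count are illustrative values chosen for this example:

```python
# minimal sketch of gradient descent for f(x) = x^2, so f'(x) = 2x
def f_prime(x):
    return 2.0 * x

step_size = 0.1  # the alpha hyperparameter
x = 1.0          # an arbitrary starting point
for _ in range(50):
    # x(t+1) = x(t) - step_size * f'(x(t))
    x = x - step_size * f_prime(x)
print(x)  # close to the minimum at 0.0
```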

Now that we are familiar with the gradient descent optimization algorithm, let's take a look at Nesterov Momentum.

Want to Get Started With Optimization Algorithms?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Nesterov Momentum

Nesterov Momentum is an extension to the gradient descent optimization algorithm.

The technique was described by (and named for) Yurii Nesterov in his 1983 paper titled “A Method For Solving The Convex Programming Problem With Convergence Rate O(1/k^2).”

Ilya Sutskever, et al. are responsible for popularizing the application of Nesterov Momentum in the training of neural networks with stochastic gradient descent, described in their 2013 paper “On The Importance Of Initialization And Momentum In Deep Learning.” They referred to the approach as “Nesterov’s Accelerated Gradient,” or NAG for short.

Nesterov Momentum is just like more traditional momentum except the update is performed using the partial derivative of the projected update rather than the derivative of the current variable value.

While NAG is not typically thought of as a type of momentum, it indeed turns out to be closely related to classical momentum, differing only in the precise update of the velocity vector …

— On The Importance Of Initialization And Momentum In Deep Learning, 2013.

Traditional momentum involves maintaining an additional variable that represents the last update performed to the variable, an exponentially decaying moving average of past gradients.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

— Page 296, Deep Learning, 2016.

This last update, or last change to the variable, is then added to the variable, scaled by a “momentum” hyperparameter that controls how much of the last change to add, e.g. 0.9 for 90 percent.

It is easier to think about this update in terms of two steps, e.g. calculate the change in the variable using the partial derivative, then calculate the new value for the variable.

  • change(t+1) = (momentum * change(t)) – (step_size * f'(x(t)))
  • x(t+1) = x(t) + change(t+1)

We can think of momentum in terms of a ball rolling downhill that will accelerate and continue to go in the same direction even in the presence of small hills.

Momentum can be interpreted as a ball rolling down a nearly horizontal incline. The ball naturally gathers momentum as gravity causes it to accelerate, just as the gradient causes momentum to accumulate in this descent method.

— Page 75, Algorithms for Optimization, 2019.

A problem with momentum is that acceleration can sometimes cause the search to overshoot the minima at the bottom of a basin or valley floor.

Nesterov Momentum can be thought of as a modification to momentum to overcome this problem of overshooting the minima.

It involves first calculating the projected position of the variable using the change from the last iteration, and using the derivative of the projected position in the calculation of the new position for the variable.

Calculating the gradient of the projected position acts like a correction factor for the acceleration that has been accumulated.

With Nesterov momentum, the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.

— Page 300, Deep Learning, 2016.

It is easier to think about Nesterov Momentum in terms of four steps:

  • 1. Project the position of the solution.
  • 2. Calculate the gradient of the projection.
  • 3. Calculate the change in the variable using the partial derivative.
  • 4. Update the variable.

Let's go through these steps in more detail.

First, the projected position of the entire solution is calculated using the change calculated in the last iteration of the algorithm.

  • projection(t+1) = x(t) + (momentum * change(t))

We can then calculate the gradient for this new position.

  • gradient(t+1) = f'(projection(t+1))

Now we can calculate the new position of each variable using the gradient of the projection, first by calculating the change in each variable.

  • change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

And finally, we calculate the new value for each variable using the calculated change.

  • x(t+1) = x(t) + change(t+1)
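The four steps above can be sketched for a single variable using f(x) = x^2 (so f'(x) = 2x); the momentum and step size values here are illustrative, not the ones used later in the tutorial:

```python
# one nesterov momentum update for f(x) = x^2, sketched for a single variable
momentum, step_size = 0.9, 0.1
x, change = 1.0, 0.0  # current position and last change
# 1. project the position of the solution
projection = x + momentum * change
# 2. calculate the gradient of the projection, f'(x) = 2x
gradient = 2.0 * projection
# 3. calculate the change in the variable
change = (momentum * change) - (step_size * gradient)
# 4. update the variable
x = x + change
print(x)  # moved downhill toward the minimum at 0.0
```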

In the field of convex optimization more generally, Nesterov Momentum is known to improve the rate of convergence of the optimization algorithm (e.g. reduce the number of iterations required to find the solution).

Like momentum, NAG is a first-order optimization method with a better convergence rate guarantee than gradient descent in certain situations.

— On The Importance Of Initialization And Momentum In Deep Learning, 2013.

Although the technique is effective in training neural networks, it may not have the same general effect of accelerating convergence.

Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.

— Page 300, Deep Learning, 2016.

Now that we are familiar with the Nesterov Momentum algorithm, let's explore how we might implement it and evaluate its performance.

Gradient Descent With Nesterov Momentum

In this section, we will explore how to implement the gradient descent optimization algorithm with Nesterov Momentum.

Two-Dimensional Test Problem

First, let's define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.
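A minimal version of this function might look as follows:

```python
# objective function: f(x, y) = x^2 + y^2, a simple bowl with its minimum at (0, 0)
def objective(x, y):
    return x**2.0 + y**2.0
```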

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.
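One possible version of this plotting example, assuming NumPy and Matplotlib are available (the 0.1 sampling increment and the jet color scheme are presentation choices):

```python
# 3d surface plot of the test objective function
from numpy import arange, meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define the range for input
r_min, r_max = -1.0, 1.0
# sample the input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axes
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```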

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.
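A sketch of the contour-plot example, again assuming NumPy and Matplotlib; the 50 contour levels are an illustrative choice:

```python
# contour plot of the test objective function
from numpy import arange, meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# sample the input range uniformly at 0.1 increments
r_min, r_max = -1.0, 1.0
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh and compute targets
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
# create a filled contour plot with 50 levels and the jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
```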

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let's look at how we might implement the Nesterov Momentum optimization algorithm.

Gradient Descent Optimization With Nesterov Momentum

We can apply gradient descent with Nesterov Momentum to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x * 2 in each dimension, and the derivative() function below implements this.
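A minimal version of this derivative function might be:

```python
from numpy import asarray

# derivative of the objective function: the gradient [2x, 2y]
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
```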

Next, we can implement gradient descent optimization.

First, we can select a random point within the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.
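One way to express this, assuming NumPy is used for the bounds array:

```python
# generate a random starting point within the bounds of the search space
from numpy import asarray
from numpy.random import rand

# bounds: one row per dimension, first column the minimum, second the maximum
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# random point within the bounds
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```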

Next, we need to calculate the projected point from the current position and calculate its derivative.
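A sketch of this step; the momentum value is illustrative, and the change list starts at zero before the first iteration:

```python
# calculate the projected point and its derivative
from numpy import asarray
from numpy.random import rand

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

momentum = 0.3
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# list of per-variable changes, initially zero
change = [0.0 for _ in range(bounds.shape[0])]
# calculate the projected solution
projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
# calculate the gradient of the projection
gradient = derivative(projected[0], projected[1])
```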

We can then create the new solution, one variable at a time.

First, the change in the variable is calculated using the partial derivative and the learning rate with the momentum from the last change in the variable. This change is saved for the next iteration of the algorithm. Then the change is used to calculate the new value for the variable.

This is repeated for each variable for the objective function, then repeated for each iteration of the algorithm.
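A self-contained sketch of one such update, with illustrative step size and momentum values:

```python
# one nesterov momentum update, one variable at a time
from numpy import asarray
from numpy.random import rand

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

step_size, momentum = 0.1, 0.3
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
change = [0.0 for _ in range(bounds.shape[0])]
# project the solution and take the gradient of the projection
projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
gradient = derivative(projected[0], projected[1])
# build the new solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the change, saved for the next iteration
    change[i] = (momentum * change[i]) - step_size * gradient[i]
    # calculate the new position in this variable
    new_solution.append(solution[i] + change[i])
solution = asarray(new_solution)
```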

This new solution can then be evaluated using the objective() function, and the performance of the search can be reported.

And that's it.

We can tie all of this together into a function named nesterov() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations, the learning rate, and the momentum, and returns the final solution and its evaluation.

This complete function is listed below.
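A sketch of such a function, assembled from the pieces above (the helper definitions for the test problem are repeated so the listing is self-contained):

```python
# gradient descent optimization with nesterov momentum
from numpy import asarray
from numpy.random import rand, seed

# objective function for the test problem
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate a random initial point within the bounds of the search
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of per-variable changes, initially zero
    change = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate the projected solution
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        # calculate the gradient of the projection
        gradient = derivative(projected[0], projected[1])
        # build the new solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the change, saved for the next iteration
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            # calculate the new position in this variable
            new_solution.append(solution[i] + change[i])
        # evaluate the new solution and report progress
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]
```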

Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.

We can then define our hyperparameters and call the nesterov() function to optimize our test objective function.

In this case, we will use 30 iterations of the algorithm with a learning rate of 0.1 and a momentum of 0.3. These hyperparameter values were found after a little trial and error.

Tying all of this together, the complete example of gradient descent optimization with Nesterov Momentum is listed below.
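One possible version of the complete example, repeating the earlier pieces; the random seed of 1 is an arbitrary choice for reproducibility:

```python
# complete example of gradient descent optimization with nesterov momentum
from numpy import asarray
from numpy.random import rand, seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate a random initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of per-variable changes, initially zero
    change = [0.0 for _ in range(bounds.shape[0])]
    for it in range(n_iter):
        # calculate the projected solution and its gradient
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        gradient = derivative(projected[0], projected[1])
        # build the new solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            new_solution.append(solution[i] + change[i])
        # evaluate the new solution and report progress
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define the bounds of the search
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the hyperparameters
n_iter = 30
step_size = 0.1
momentum = 0.3
# perform the search
best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
print('Done!')
print('f(%s) = %f' % (best, score))
```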

Running the example applies the optimization algorithm with Nesterov Momentum to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 15 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

Visualization of Nesterov Momentum

We can plot the progress of the Nesterov Momentum search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the nesterov() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.
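A sketch of the updated function, with the helpers repeated and a run included so the listing stands on its own; the only substantive change from the earlier version is the solutions list:

```python
# nesterov momentum, updated to record and return all solutions found
from numpy import asarray
from numpy.random import rand, seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # track all solutions found during the search
    solutions = list()
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    change = [0.0 for _ in range(bounds.shape[0])]
    for it in range(n_iter):
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        gradient = derivative(projected[0], projected[1])
        new_solution = list()
        for i in range(solution.shape[0]):
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            new_solution.append(solution[i] + change[i])
        solution = asarray(new_solution)
        # keep track of this solution
        solutions.append(solution)
        solution_eval = objective(solution[0], solution[1])
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    # return the list of solutions rather than only the final one
    return solutions

# run the search with the same hyperparameters as before
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
solutions = nesterov(objective, derivative, bounds, 30, 0.1, 0.3)
```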

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

We can then create a contour plot of the objective function, as before.

Finally, we can plot each solution found during the search as a white dot connected by a line.

Tying this all together, the complete example of performing the Nesterov Momentum optimization on the test problem and plotting the results on a contour plot is listed below.
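One possible version of the complete visualization example, assuming NumPy and Matplotlib; the progress printout is omitted here to keep the listing focused on the plot:

```python
# example of plotting the nesterov momentum search on a contour plot
from numpy import asarray, arange, meshgrid
from numpy.random import rand, seed
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of the objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# nesterov momentum search that records all solutions found
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    solutions = list()
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    change = [0.0 for _ in range(bounds.shape[0])]
    for it in range(n_iter):
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        gradient = derivative(projected[0], projected[1])
        new_solution = list()
        for i in range(solution.shape[0]):
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            new_solution.append(solution[i] + change[i])
        solution = asarray(new_solution)
        solutions.append(solution)
    return solutions

# perform the search
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
solutions = nesterov(objective, derivative, bounds, 30, 0.1, 0.3)
# sample the input range uniformly at 0.1 increments
xaxis = arange(bounds[0, 0], bounds[0, 1], 0.1)
yaxis = arange(bounds[1, 0], bounds[1, 1], 0.1)
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
# create a filled contour plot with 50 levels and the jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sequence of solutions as white dots joined by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
pyplot.show()
```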

Running the example performs the search as before, except in this case the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

Contour Plot of the Test Objective Function With Nesterov Momentum Search Results Shown

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • A Method For Solving The Convex Programming Problem With Convergence Rate O(1/k^2), 1983.
  • On The Importance Of Initialization And Momentum In Deep Learning, 2013.

Books

  • Algorithms for Optimization, 2019.
  • Deep Learning, 2016.

Summary

In this tutorial, you discovered how to develop the gradient descent optimization algorithm with Nesterov Momentum from scratch.

Specifically, you learned:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • The convergence of the gradient descent optimization algorithm can be accelerated by extending the algorithm with Nesterov Momentum.
  • How to implement the Nesterov Momentum optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Optimization Algorithms!

Optimization for Machine Learning

Develop Your Understanding of Optimization

…with just a few lines of python code

Discover how in my new Ebook:
Optimization for Machine Learning

It provides self-study tutorials with full working code on:
Gradient Descent, Genetic Algorithms, Hill Climbing, Curve Fitting, RMSProp, Adam,
and far more…

Bring Modern Optimization Algorithms to
Your Machine Learning Projects

See What’s Inside




