Visualizing the vanishing gradient problem
Last Updated on November 26, 2023
Deep learning was a recent invention. Partially, it is due to improved computation power that allows us to use more layers of perceptrons in a neural network. But at the same time, we can train a deep network only after we know how to work around the vanishing gradient problem.
In this tutorial, we visually examine why the vanishing gradient problem exists.
After finishing this tutorial, you will know:
- What a vanishing gradient is
- Which configurations of a neural network are susceptible to the vanishing gradient
- How to run a manual training loop in Keras
- How to extract weights and gradients from a Keras model
Let's get started.

Visualizing the vanishing gradient problem
Photo by Alisa Anton, some rights reserved.
Tutorial overview
This tutorial is divided into five parts; they are:
- Configuration of multilayer perceptron models
- Example of the vanishing gradient problem
- Looking at the weights of each layer
- Looking at the gradients of each layer
- The Glorot initialization
Configuration of multilayer perceptron models
Because neural networks are trained by gradient descent, people believed that a differentiable function is required as the activation function in neural networks. This caused us to conventionally use the sigmoid function or hyperbolic tangent as the activation.
For a binary classification problem, if we want to do logistic regression such that 0 and 1 are the target outputs, the sigmoid function is preferred, because its output always lies in this range:
$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$
and if we want sigmoidal activation at the output, it is natural to use it in all layers of the neural network. Additionally, each layer in a neural network has a weight parameter. Initially, the weights have to be randomized, and naturally we would use some simple way to do it, such as drawing from a uniform or normal distribution.
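Before moving on, it helps to recall one property of the sigmoid that foreshadows the rest of this tutorial: its derivative never exceeds 0.25, so when backpropagation multiplies such factors across many layers, the product shrinks quickly. The short NumPy sketch below is not part of the tutorial's main code; it only illustrates this fact:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 101)
dsigma = sigmoid(x) * (1 - sigmoid(x))   # derivative of the sigmoid
print(dsigma.max())                      # about 0.25, attained at x = 0

# Chained through several layers, such factors multiply and shrink toward zero:
print(0.25 ** np.arange(1, 6))           # [0.25, 0.0625, 0.015625, ...]
```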
Example of the vanishing gradient problem
To illustrate the vanishing gradient problem, let's try an example. A neural network is a nonlinear function; hence it should be best suited to classification of a nonlinear dataset. We make use of scikit-learn's make_circles() function to generate some data:
```python
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# Make data: Two circles on x-y plane as a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.show()
```

This should not be hard to classify. A naive approach is to build a 3-layer neural network, which can give a reasonably good result:
```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential

model = Sequential([
    Input(shape=(2,)),
    Dense(5, "relu"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))
```
```
32/32 [==============================] - 0s 1ms/step - loss: 0.2404 - acc: 0.9730
[0.24042171239852905, 0.9729999899864197]
```
Note that we used the rectified linear unit (ReLU) in the hidden layer above. By default, the dense layer in Keras uses linear activation (i.e., no activation), which is mostly not useful. We usually use ReLU in modern neural networks. But we can also try the old-fashioned way, as everybody did 20 years ago:
```python
model = Sequential([
    Input(shape=(2,)),
    Dense(5, "sigmoid"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))
```
```
32/32 [==============================] - 0s 1ms/step - loss: 0.6927 - acc: 0.6540
[0.6926590800285339, 0.6539999842643738]
```
The accuracy is much worse. It turns out, it becomes even worse if we add more layers (at least in my experiment):
```python
model = Sequential([
    Input(shape=(2,)),
    Dense(5, "sigmoid"),
    Dense(5, "sigmoid"),
    Dense(5, "sigmoid"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))
```
```
32/32 [==============================] - 0s 1ms/step - loss: 0.6922 - acc: 0.5330
[0.6921834349632263, 0.5329999923706055]
```
Your result may vary given the stochastic nature of the training algorithm. You may or may not see the 5-layer sigmoidal network performing much worse than the 3-layer one. But the idea here is that you can't recover the high accuracy we achieved with rectified linear unit activation by merely adding layers.
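If you want the runs in this tutorial to be more repeatable while you experiment, you can fix the random seeds before building the models, as the complete code listing at the end of this post also does:
```python
import numpy as np
import tensorflow as tf

# Fix the seeds so repeated runs start from the same random weights and the same data shuffling
tf.random.set_seed(42)
np.random.seed(42)
```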
Looking at the weights of each layer
Shouldn't we get a more powerful neural network with more layers?
Yes, it should be. But it turns out that as we added more layers, we triggered the vanishing gradient problem. To illustrate what happened, let's see what the weights look like as we trained our network.
In Keras, we are allowed to plug in a callback function to the training process. We are going to create our own callback object to intercept and record the weights of each layer of our multilayer perceptron (MLP) model at the end of each epoch.
```python
from tensorflow.keras.callbacks import Callback

class WeightCapture(Callback):
    "Capture the weights of each layer of the model"
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.weights = []
        self.epochs = []

    def on_epoch_end(self, epoch, logs=None):
        self.epochs.append(epoch)   # remember the epoch axis
        weight = {}
        for layer in self.model.layers:
            if not layer.weights:
                continue
            name = layer.weights[0].name.split("/")[0]
            weight[name] = layer.weights[0].numpy()
        self.weights.append(weight)
```
We derive from the Callback class and define the on_epoch_end() function. This class needs the created model to initialize. At the end of each epoch, it reads each layer and saves the weights into a numpy array.
For the convenience of experimenting with different ways of building an MLP, we make a helper function to set up the neural network model:
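The helper function, as it also appears in the complete code listing at the end of this post, is:
```python
def make_mlp(activation, initializer, name):
    "Create a model with specified activation and initializer"
    model = Sequential([
        Input(shape=(2,), name=name+"0"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"),
        Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5")
    ])
    return model
```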
We deliberately create a neural network with 4 hidden layers so we can see how each layer responds to the training. We will vary the activation function of each hidden layer as well as the weight initialization. To make things easier to tell apart, we will name each layer instead of letting Keras assign a name. The input is a coordinate on the xy-plane, hence the input shape is a vector of 2. The output is binary classification; therefore we use sigmoid activation at the output to make it fall in the range of 0 to 1.
Then we can compile() the model to provide the evaluation metrics and pass the callback in the fit() call to train the model:
```python
from tensorflow.keras.initializers import RandomNormal

initializer = RandomNormal(mean=0.0, stddev=1.0)
batch_size = 32
n_epochs = 100

model = make_mlp("sigmoid", initializer, "sigmoid")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=1)
```
Here we create the neural network by calling make_mlp() first. Then we set up our callback object. Since the weights of each layer in the neural network are initialized at creation, we deliberately call the callback function once to record what they were initialized to. Then we call compile() and fit() on the model as usual, with the callback object provided.
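To convince yourself of what the callback captured, you can peek at the stored dictionaries. The layer names below assume the naming scheme of make_mlp() with the "sigmoid" prefix used in this run:
```python
# Inspect the snapshot taken before training started (the manual epoch -1 call)
snapshot = capture_cb.weights[0]
print(list(snapshot.keys()))        # expected: ['sigmoid1', 'sigmoid2', 'sigmoid3', 'sigmoid4', 'sigmoid5']
print(snapshot["sigmoid1"].shape)   # (2, 5): kernel of the first hidden layer
print(snapshot["sigmoid5"].shape)   # (5, 1): kernel of the output layer
```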
After we fit the model, we can evaluate it with the entire dataset:
```python
...
print(model.evaluate(X, y))
```
```
[0.6649572253227234, 0.5879999995231628]
```
This means the log-loss is 0.665 and the accuracy is 0.588 for this model, which has all layers using sigmoid activation.
What we can look into further is how the weights behave along the iterations of training. All the layers except the first and the last have their weights as a 5×5 matrix. We can check the mean and standard deviation of the weights to get a sense of what the weights look like:
```python
def plotweight(capture_cb):
    "Plot the weights' mean and s.d. across epochs"
    fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10))
    ax[0].set_title("Mean weight")
    for key in capture_cb.weights[0]:
        ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key)
    ax[0].legend()
    ax[1].set_title("S.D.")
    for key in capture_cb.weights[0]:
        ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key)
    ax[1].legend()
    plt.show()

plotweight(capture_cb)
```
This results in the following figure:
We see the mean weight moved quickly only in the first 10 iterations or so. Only the weights of the first layer get more diversified, as their standard deviation moves up.
We can restart with the hyperbolic tangent (tanh) activation, using the same process:
```python
# tanh activation, large-variance Gaussian initialization
model = make_mlp("tanh", initializer, "tanh")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
print(model.evaluate(X, y))
plotweight(capture_cb)
```
```
[0.012918001972138882, 0.9929999709129333]
```
The log-loss and accuracy both improved. If we look at the plot, we don't see the abrupt change in the mean and standard deviation of the weights; instead, those of all layers converge slowly.

A similar case can be seen with ReLU activation:
```python
# relu activation, large-variance Gaussian initialization
model = make_mlp("relu", initializer, "relu")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
print(model.evaluate(X, y))
plotweight(capture_cb)
```
```
[0.016895903274416924, 0.9940000176429749]
```

Looking at the gradients of each layer
We saw the effect of different activation functions above. But what really matters is the gradient, since we are running gradient descent during training. The paper by Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", suggested looking at the gradient of each layer in each training iteration, as well as its standard deviation.
Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network
— “Understanding the difficulty of training deep feedforward neural networks” (2010)
To understand how the activation function relates to the gradient as perceived during training, we need to run the training loop manually.
In TensorFlow-Keras, a training loop can be run by turning on the gradient tape, making the neural network model produce an output, and then obtaining the gradient by automatic differentiation from the gradient tape. Subsequently, we can update the parameters (weights and biases) according to the gradient descent update rule.
Because the gradient is readily obtained in this loop, we can make a copy of it. The following is how we implement the training loop and, at the same time, keep a copy of the gradients:
```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop()
loss_fn = tf.keras.losses.BinaryCrossentropy()

def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size):
    "Run training loop manually"
    train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
    train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

    gradhistory = []
    losshistory = []
    def recordweight():
        data = {}
        for g, w in zip(grads, model.trainable_weights):
            if '/kernel:' not in w.name:
                continue    # skip bias
            name = w.name.split("/")[0]
            data[name] = g.numpy()
        gradhistory.append(data)
        losshistory.append(loss_value.numpy())

    for epoch in range(n_epochs):
        for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
            with tf.GradientTape() as tape:
                y_pred = model(x_batch_train, training=True)
                loss_value = loss_fn(y_batch_train, y_pred)

            grads = tape.gradient(loss_value, model.trainable_weights)
            optimizer.apply_gradients(zip(grads, model.trainable_weights))

            if step == 0:
                recordweight()
    # After all epochs, record once more
    recordweight()
    return gradhistory, losshistory
```
The key in the function above is the nested for-loop. In it, we launch tf.GradientTape() and pass a batch of data to the model to get a prediction, which is then evaluated using the loss function. Afterwards, we can pull out the gradient from the tape by differentiating the loss with respect to the trainable weights of the model. Next, we update the weights using the optimizer, which handles the learning rate and momentum of the gradient descent algorithm implicitly.
As a refresher, the gradient here means the following. For a loss value $L$ and a layer with weights $W=[w_1, w_2, w_3, w_4, w_5]$ (e.g., the output layer), the gradient is the matrix
$$
\frac{\partial L}{\partial W} = \Big[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}, \frac{\partial L}{\partial w_5}\Big]
$$
But before we start the next iteration of training, we have a chance to further inspect the gradient: we match the gradients with the weights to get the name of each, then save a copy of the gradient as a numpy array. We sample the gradient and loss only once per epoch, but you can change that to sample at a higher frequency, as sketched below.
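For example, a minimal variation of the inner loop (keeping the same recordweight() helper) records every few batches instead of only the first batch of each epoch; the interval of 8 batches here is an arbitrary choice for illustration:
```python
# Inside train_model(): record every few batches instead of once per epoch.
for epoch in range(n_epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            y_pred = model(x_batch_train, training=True)
            loss_value = loss_fn(y_batch_train, y_pred)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        if step % 8 == 0:       # was: if step == 0
            recordweight()
```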
With these, we can plot the gradient across epochs. In the following, we create the model (but do not call compile(), because we will not call fit() afterwards), run the manual training loop, and then plot the gradient as well as the standard deviation of the gradient:
```python
from sklearn.metrics import accuracy_score

def plot_gradient(gradhistory, losshistory):
    "Plot gradient mean and sd across epochs"
    fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12))
    ax[0].set_title("Mean gradient")
    for key in gradhistory[0]:
        ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key)
    ax[0].legend()
    ax[1].set_title("S.D.")
    for key in gradhistory[0]:
        ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key)
    ax[1].legend()
    ax[2].set_title("Loss")
    ax[2].plot(range(len(losshistory)), losshistory)
    plt.show()

model = make_mlp("sigmoid", initializer, "sigmoid")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```
It reported a weak classification result:
```
Before training: Accuracy 0.5
After training: Accuracy 0.652
```
and the plot we obtained shows the vanishing gradient:

From the plot, the loss was not significantly decreased. The mean of the gradient (i.e., the mean of all elements in the gradient matrix) has a noticeable value only for the last layer, while all other layers are virtually zero. The standard deviation of the gradient is roughly between 0.01 and 0.001.
Repeating this with tanh activation, we see a different result, which explains why the performance is better:
```python
model = make_mlp("tanh", initializer, "tanh")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```
```
Before training: Accuracy 0.502
After training: Accuracy 0.994
```

From the plot of the mean of the gradients, we see the gradients from every layer are wiggling equally. The standard deviation of the gradients is also an order of magnitude larger than in the case of sigmoid activation, at around 0.1 to 0.01.
Finally, we can also see the same with rectified linear unit (ReLU) activation. In this case the loss dropped quickly; hence we see it as the more efficient activation to use in neural networks:
```python
model = make_mlp("relu", initializer, "relu")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```
```
Before training: Accuracy 0.503
After training: Accuracy 0.995
```

The following is the complete code:
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import RandomNormal
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score

tf.random.set_seed(42)
np.random.seed(42)

# Make data: Two circles on x-y plane as a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.show()

# Test performance with 3-layer binary classification network
model = Sequential([
    Input(shape=(2,)),
    Dense(5, "relu"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))

# Test performance with 3-layer network with sigmoid activation
model = Sequential([
    Input(shape=(2,)),
    Dense(5, "sigmoid"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))

# Test performance with 5-layer network with sigmoid activation
model = Sequential([
    Input(shape=(2,)),
    Dense(5, "sigmoid"),
    Dense(5, "sigmoid"),
    Dense(5, "sigmoid"),
    Dense(1, "sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y))

# Illustrate weights across epochs
class WeightCapture(Callback):
    "Capture the weights of each layer of the model"
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.weights = []
        self.epochs = []

    def on_epoch_end(self, epoch, logs=None):
        self.epochs.append(epoch)   # remember the epoch axis
        weight = {}
        for layer in self.model.layers:
            if not layer.weights:
                continue
            name = layer.weights[0].name.split("/")[0]
            weight[name] = layer.weights[0].numpy()
        self.weights.append(weight)

def make_mlp(activation, initializer, name):
    "Create a model with specified activation and initializer"
    model = Sequential([
        Input(shape=(2,), name=name+"0"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"),
        Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5")
    ])
    return model

def plotweight(capture_cb):
    "Plot the weights' mean and s.d. across epochs"
    fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10))
    ax[0].set_title("Mean weight")
    for key in capture_cb.weights[0]:
        ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key)
    ax[0].legend()
    ax[1].set_title("S.D.")
    for key in capture_cb.weights[0]:
        ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key)
    ax[1].legend()
    plt.show()

initializer = RandomNormal(mean=0, stddev=1)
batch_size = 32
n_epochs = 100

# Sigmoid activation
model = make_mlp("sigmoid", initializer, "sigmoid")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
print(model.evaluate(X, y))
plotweight(capture_cb)

# tanh activation
model = make_mlp("tanh", initializer, "tanh")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
print(model.evaluate(X, y))
plotweight(capture_cb)

# relu activation
model = make_mlp("relu", initializer, "relu")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
print(model.evaluate(X, y))
plotweight(capture_cb)

# Show gradient across epochs
optimizer = tf.keras.optimizers.RMSprop()
loss_fn = tf.keras.losses.BinaryCrossentropy()

def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size):
    "Run training loop manually"
    train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
    train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

    gradhistory = []
    losshistory = []
    def recordweight():
        data = {}
        for g, w in zip(grads, model.trainable_weights):
            if '/kernel:' not in w.name:
                continue    # skip bias
            name = w.name.split("/")[0]
            data[name] = g.numpy()
        gradhistory.append(data)
        losshistory.append(loss_value.numpy())

    for epoch in range(n_epochs):
        for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
            with tf.GradientTape() as tape:
                y_pred = model(x_batch_train, training=True)
                loss_value = loss_fn(y_batch_train, y_pred)

            grads = tape.gradient(loss_value, model.trainable_weights)
            optimizer.apply_gradients(zip(grads, model.trainable_weights))

            if step == 0:
                recordweight()
    # After all epochs, record once more
    recordweight()
    return gradhistory, losshistory

def plot_gradient(gradhistory, losshistory):
    "Plot gradient mean and sd across epochs"
    fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12))
    ax[0].set_title("Mean gradient")
    for key in gradhistory[0]:
        ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key)
    ax[0].legend()
    ax[1].set_title("S.D.")
    for key in gradhistory[0]:
        ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key)
    ax[1].legend()
    ax[2].set_title("Loss")
    ax[2].plot(range(len(losshistory)), losshistory)
    plt.show()

model = make_mlp("sigmoid", initializer, "sigmoid")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)

model = make_mlp("tanh", initializer, "tanh")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)

model = make_mlp("relu", initializer, "relu")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```
The Glorot initialization
We didn't demonstrate it in the code above, but the most famous result from the paper by Glorot and Bengio is the Glorot initialization, which suggests initializing the weights of a layer of the neural network with a uniform distribution:
The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization:
$$
W \sim U\Big[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\Big]
$$
— “Understanding the difficulty of training deep feedforward neural networks” (2010)
This is derived from the linear activation under the condition that the standard deviation of the gradient is kept constant across the layers. With sigmoid and tanh activation, the linear region is narrow. Therefore we can understand why ReLU is the key to working around the vanishing gradient problem. Compared to changing the activation function, changing the weight initialization is less pronounced in helping to resolve the vanishing gradient problem. But this can be an exercise for you to explore, to see how it can help improve the result; a sketch to start from is given below.
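As a starting point for that exercise, here is a minimal sketch that reuses the make_mlp() helper from above but swaps the large-variance Gaussian initializer for Keras' built-in GlorotUniform; whether and how much it improves the all-sigmoid network is left for you to verify:
```python
from tensorflow.keras.initializers import GlorotUniform

# Same sigmoid experiment as before, but with the Glorot (normalized) initialization
glorot = GlorotUniform()
model = make_mlp("sigmoid", glorot, "sigmoidglorot")
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.fit(X, y, batch_size=batch_size, epochs=n_epochs, verbose=0)
print(model.evaluate(X, y))
```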
Further readings
The Glorot and Bengio paper is available at:
- “Understanding the difficulty of training deep feedforward neural networks”, by Xavier Glorot and Yoshua Bengio, 2010.
(https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
The vanishing gradient problem is well known enough in machine learning that many books cover it. For example:
- Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016.
(https://www.amazon.com/dp/0262035618)
Previously we had posts about vanishing and exploding gradients:
- How to fix vanishing gradients using the rectified linear activation function
- Exploding gradients in neural networks
You may find the following documentation helpful to explain some of the syntax we used above:
- Writing a training loop from scratch in Keras: https://keras.io/guides/writing_a_training_loop_from_scratch/
- Writing your own callbacks in Keras: https://keras.io/guides/writing_your_own_callbacks/
Summary
In this tutorial, you visually saw how a rectified linear unit (ReLU) can help resolve the vanishing gradient problem.
Specifically, you learned:
- How the vanishing gradient problem affects the performance of a neural network
- Why ReLU activation is the solution to the vanishing gradient problem
- How to use a custom callback to extract data in the middle of a training loop in Keras
- How to write a custom training loop
- How to read the weights and gradients from a layer in the neural network