Training the Transformer Model
Last Updated on January 6, 2023
We have put together the complete Transformer model, and now we are ready to train it for neural machine translation. We shall use a training dataset for this purpose, which contains short English and German sentence pairs. We will also revisit the role of masking in computing the accuracy and loss metrics during the training process.
In this tutorial, you will discover how to train the Transformer model for neural machine translation.
After completing this tutorial, you will know:
- How to prepare the training dataset
- How to apply a padding mask to the loss and accuracy computations
- How to train the Transformer model
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...
Let’s get started.

Training the transformer model
Photo by v2osk, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- Preparing the Training Dataset
- Applying a Padding Mask to the Loss and Accuracy Computations
- Training the Transformer Model
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The theory behind the Transformer model
- An implementation of the Transformer model
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen how to implement the complete Transformer model, so now you can proceed to train it for neural machine translation.
Let's start by preparing the dataset for training.
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Preparing the Training Dataset
For this purpose, you can refer to a previous tutorial that covers material about preparing the text data for training.
You will also use a dataset that contains short English and German sentence pairs, which you may download here. This particular dataset has already been cleaned by removing non-printable and non-alphabetic characters and punctuation characters, further normalizing all Unicode characters to ASCII, and changing all uppercase letters to lowercase ones. Hence, you can skip the cleaning step, which is typically part of the data preparation process. However, if you use a dataset that does not come readily cleaned, you can refer to this previous tutorial to learn how to do so.
Let's proceed by creating the PrepareDataset class that implements the following steps:
- Loads the dataset from a specified filename.
Python
clean_dataset = load(open(filename, 'rb'))
- Selects the number of sentences to use from the dataset. Since the dataset is large, you will reduce its size to limit the training time. However, you may explore using the full dataset as an extension to this tutorial.
Python
dataset = clean_dataset[:self.n_sentences, :]
- Appends start (<START>) and end-of-string (<EOS>) tokens to each sentence. For example, the English sentence, i want to run, now becomes, <START> i want to run <EOS>. This also applies to its corresponding translation in German, ich gehe gerne joggen, which now becomes, <START> ich gehe gerne joggen <EOS>.
Python
for i in range(dataset[:, 0].size):
    dataset[i, 0] = "<START> " + dataset[i, 0] + " <EOS>"
    dataset[i, 1] = "<START> " + dataset[i, 1] + " <EOS>"
- Shuffles the dataset randomly.
Python
shuffle(dataset)
- Splits the shuffled dataset based on a pre-defined ratio.
Python
train = dataset[:int(self.n_sentences * self.train_split)]
- Creates and trains a tokenizer on the text sequences that will be fed into the encoder, and finds the length of the longest sequence as well as the vocabulary size.
Python
enc_tokenizer = self.create_tokenizer(train[:, 0])
enc_seq_length = self.find_seq_length(train[:, 0])
enc_vocab_size = self.find_vocab_size(enc_tokenizer, train[:, 0])
- Tokenizes the sequences of text that will be fed into the encoder by creating a vocabulary of words and replacing each word with its corresponding vocabulary index. The <START> and <EOS> tokens also form part of this vocabulary. Each sequence is also padded to the maximum sequence length.
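(For reference, these are the corresponding encoder-side lines, reproduced from the complete listing further below.)
Python
trainX = enc_tokenizer.texts_to_sequences(train[:, 0])
trainX = pad_sequences(trainX, maxlen=enc_seq_length, padding='post')
trainX = convert_to_tensor(trainX, dtype=int64)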
- Creates and trains a tokenizer on the text sequences that will be fed into the decoder, and finds the length of the longest sequence as well as the vocabulary size.
Python
dec_tokenizer = self.create_tokenizer(train[:, 1])
dec_seq_length = self.find_seq_length(train[:, 1])
dec_vocab_size = self.find_vocab_size(dec_tokenizer, train[:, 1])
- Repeats a similar tokenization and padding process for the sequences of text that will be fed into the decoder.
Python
trainY = dec_tokenizer.texts_to_sequences(train[:, 1])
trainY = pad_sequences(trainY, maxlen=dec_seq_length, padding='post')
trainY = convert_to_tensor(trainY, dtype=int64)
The complete code listing is as follows (refer to this previous tutorial for further details):
Python
from pickle import load
from numpy.random import shuffle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow import convert_to_tensor, int64


class PrepareDataset:
    def __init__(self, **kwargs):
        super(PrepareDataset, self).__init__(**kwargs)
        self.n_sentences = 10000  # Number of sentences to include in the dataset
        self.train_split = 0.9  # Ratio of the training data split

    # Fit a tokenizer
    def create_tokenizer(self, dataset):
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(dataset)

        return tokenizer

    def find_seq_length(self, dataset):
        return max(len(seq.split()) for seq in dataset)

    def find_vocab_size(self, tokenizer, dataset):
        tokenizer.fit_on_texts(dataset)

        return len(tokenizer.word_index) + 1

    def __call__(self, filename, **kwargs):
        # Load a clean dataset
        clean_dataset = load(open(filename, 'rb'))

        # Reduce dataset size
        dataset = clean_dataset[:self.n_sentences, :]

        # Include start and end of string tokens
        for i in range(dataset[:, 0].size):
            dataset[i, 0] = "<START> " + dataset[i, 0] + " <EOS>"
            dataset[i, 1] = "<START> " + dataset[i, 1] + " <EOS>"

        # Random shuffle the dataset
        shuffle(dataset)

        # Split the dataset
        train = dataset[:int(self.n_sentences * self.train_split)]

        # Prepare tokenizer for the encoder input
        enc_tokenizer = self.create_tokenizer(train[:, 0])
        enc_seq_length = self.find_seq_length(train[:, 0])
        enc_vocab_size = self.find_vocab_size(enc_tokenizer, train[:, 0])

        # Encode and pad the input sequences
        trainX = enc_tokenizer.texts_to_sequences(train[:, 0])
        trainX = pad_sequences(trainX, maxlen=enc_seq_length, padding='post')
        trainX = convert_to_tensor(trainX, dtype=int64)

        # Prepare tokenizer for the decoder input
        dec_tokenizer = self.create_tokenizer(train[:, 1])
        dec_seq_length = self.find_seq_length(train[:, 1])
        dec_vocab_size = self.find_vocab_size(dec_tokenizer, train[:, 1])

        # Encode and pad the target sequences
        trainY = dec_tokenizer.texts_to_sequences(train[:, 1])
        trainY = pad_sequences(trainY, maxlen=dec_seq_length, padding='post')
        trainY = convert_to_tensor(trainY, dtype=int64)

        return trainX, trainY, train, enc_seq_length, dec_seq_length, enc_vocab_size, dec_vocab_size
Before moving on to train the Transformer model, let's first have a look at the output of the PrepareDataset class, corresponding to the first sentence in the training dataset:
Python
# Prepare the training data
dataset = PrepareDataset()
trainX, trainY, train_orig, enc_seq_length, dec_seq_length, enc_vocab_size, dec_vocab_size = dataset('english-german-both.pkl')

print(train_orig[0, 0], '\n', trainX[0, :])
Python
<START> did tom tell you <EOS>
tf.Tensor([ 1 25  4 97  5  2  0], shape=(7,), dtype=int64)
(Note: Since the dataset has been randomly shuffled, you will likely see a different output.)
You can see that, initially, you had a three-word sentence (did tom tell you) to which you appended the start and end-of-string tokens. Then you proceeded to vectorize it (you may notice that the <START> and <EOS> tokens are assigned the vocabulary indices 1 and 2, respectively). The vectorized text was also padded with zeros, such that the length of the end result matches the maximum sequence length of the encoder:
Python
print('Encoder sequence length:', enc_seq_length)
Python
Encoder sequence length: 7
You can similarly check out the corresponding target data that is fed into the decoder:
Python
print(train_orig[0, 1], '\n', trainY[0, :])
Python
<START> hat tom es dir gesagt <EOS>
tf.Tensor([  1  14   5   7  42 162   2   0   0   0   0   0], shape=(12,), dtype=int64)
Here, the length of the end result matches the maximum sequence length of the decoder:
Python
print('Decoder sequence length:', dec_seq_length)
Python
Decoder sequence length: 12
Applying a Padding Mask to the Loss and Accuracy Computations
Recall seeing that the importance of having a padding mask at the encoder and decoder is to make sure that the zero values that we have just appended to the vectorized inputs are not processed along with the actual input values.
This also holds true for the training process, where a padding mask is required so that the zero padding values in the target data are not considered in the computation of the loss and accuracy.
Let's have a look at the computation of the loss first.
The loss will be computed by means of a sparse categorical cross-entropy loss function between the target and predicted values, and subsequently multiplied by a padding mask so that only the valid non-zero values are considered. The returned loss is the mean of the unmasked values:
Python
def loss_fcn(target, prediction):
    # Create a mask so that the zero padding values are not included in the computation of loss
    padding_mask = math.logical_not(equal(target, 0))
    padding_mask = cast(padding_mask, float32)

    # Compute a sparse categorical cross-entropy loss on the unmasked values
    loss = sparse_categorical_crossentropy(target, prediction, from_logits=True) * padding_mask

    # Compute the mean loss over the unmasked values
    return reduce_sum(loss) / reduce_sum(padding_mask)
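As a quick, optional sanity check of the masking behaviour (not part of the tutorial code, and using made-up token indices and random logits), you could call the loss function on a small dummy batch:
Python
from tensorflow import constant, random, int64

dummy_target = constant([[5, 3, 0, 0]], dtype=int64)  # two real tokens followed by two padding zeros
dummy_logits = random.uniform((1, 4, 10))             # random logits over a hypothetical 10-word vocabulary

# Only the two non-zero positions contribute to the averaged loss
print(loss_fcn(dummy_target, dummy_logits))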
For the computation of accuracy, the predicted and target values are first compared. The predicted output is a tensor of size (batch_size, dec_seq_length, dec_vocab_size) and contains probability values (generated by the softmax function on the decoder side) for the tokens in the output. In order to be able to perform the comparison with the target values, only the token with the highest probability value is considered, with its dictionary index being retrieved through the operation argmax(prediction, axis=2). Following the application of a padding mask, the returned accuracy is the mean of the unmasked values:
Python
def accuracy_fcn(target, prediction):
    # Create a mask so that the zero padding values are not included in the computation of accuracy
    padding_mask = math.logical_not(equal(target, 0))

    # Find equal prediction and target values, and apply the padding mask
    accuracy = equal(target, argmax(prediction, axis=2))
    accuracy = math.logical_and(padding_mask, accuracy)

    # Cast the True/False values to 32-bit-precision floating-point numbers
    padding_mask = cast(padding_mask, float32)
    accuracy = cast(accuracy, float32)

    # Compute the mean accuracy over the unmasked values
    return reduce_sum(accuracy) / reduce_sum(padding_mask)
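A similar spot check (again with hypothetical values, assuming the function above is in scope) confirms that padded positions are excluded from the accuracy as well:
Python
from tensorflow import constant, random, int64

dummy_target = constant([[5, 3, 0, 0]], dtype=int64)  # two real tokens followed by two padding zeros
dummy_logits = random.uniform((1, 4, 10))             # random logits over a hypothetical 10-word vocabulary

# The denominator is 2 (the number of unmasked tokens), not 4
print(accuracy_fcn(dummy_target, dummy_logits))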
Training the Transformer Model
Let's first define the model and training parameters as specified by Vaswani et al. (2017):
Python
# Define the model parameters
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model layers' outputs
d_ff = 2048  # Dimensionality of the inner fully connected layer
n = 6  # Number of layers in the encoder stack

# Define the training parameters
epochs = 2
batch_size = 64
beta_1 = 0.9
beta_2 = 0.98
epsilon = 1e-9
dropout_rate = 0.1
(Note: Only consider two epochs to limit the training time. However, you may explore training the model further as an extension to this tutorial.)
You also need to implement a learning rate scheduler that initially increases the learning rate linearly for the first warmup_steps and then decreases it proportionally to the inverse square root of the step number. Vaswani et al. express this with the following formula:
$$\text{learning\_rate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup\_steps}^{-1.5}\right)$$
Python
class LRScheduler(LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000, **kwargs):
        super(LRScheduler, self).__init__(**kwargs)

        self.d_model = cast(d_model, float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step_num):
        # Linearly increasing the learning rate for the first warmup_steps, and decreasing it thereafter
        arg1 = step_num ** -0.5
        arg2 = step_num * (self.warmup_steps ** -1.5)

        return (self.d_model ** -0.5) * math.minimum(arg1, arg2)
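To get a feel for the schedule, you could evaluate it at a few hypothetical step values (a minimal check, assuming the class and its TensorFlow imports above are in scope, and passing the steps as floating-point tensors):
Python
from tensorflow import constant, float32

lr = LRScheduler(d_model=512)

print(lr(constant(1.0, dtype=float32)))      # tiny learning rate at the very start of the warmup
print(lr(constant(4000.0, dtype=float32)))   # peak learning rate of roughly 7e-4 at the end of the warmup
print(lr(constant(16000.0, dtype=float32)))  # then decays with the inverse square root of the step number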
An instance of the LRScheduler class is subsequently passed on as the learning_rate argument of the Adam optimizer:
Python
optimizer = Adam(LRScheduler(d_model), beta_1, beta_2, epsilon)
Next, split the dataset into batches in preparation for training:
Python
train_dataset = data.Dataset.from_tensor_slices((trainX, trainY))
train_dataset = train_dataset.batch(batch_size)
This is followed by the creation of a model instance:
Python
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
In training the Transformer model, you will write your own training loop, which incorporates the loss and accuracy functions that were implemented earlier.
The default runtime in TensorFlow 2.0 is eager execution, which means that operations execute immediately, one after the other. Eager execution is simple and intuitive, making debugging easier. Its downside, however, is that it cannot take advantage of the global performance optimizations that come with running the code using graph execution. In graph execution, a graph is first built before the tensor computations can be executed, which gives rise to a computational overhead. For this reason, the use of graph execution is mostly recommended for large model training rather than for small model training, where eager execution may be more suited to performing simpler operations. Since the Transformer model is sufficiently large, apply graph execution to train it.
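As a small aside, the following self-contained sketch (not part of the tutorial code; the function names are purely illustrative) shows the pattern: the same computation can be run eagerly, or wrapped with function() so that it is traced into a graph once and then reused:
Python
from time import time
from tensorflow import function, matmul, random

def multiply(x, y):
    return matmul(x, y)

graph_multiply = function(multiply)  # the same computation, compiled into a graph on first call

x = random.uniform((256, 256))

start = time()
for _ in range(1000):
    multiply(x, x)                   # eager execution: each call dispatches ops one by one
print("Eager: %.2fs" % (time() - start))

start = time()
for _ in range(1000):
    graph_multiply(x, x)             # graph execution: traced once, then the graph is reused
print("Graph: %.2fs" % (time() - start))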
In order to do so, you will use the @function decorator as follows:
Python
@function
def train_step(encoder_input, decoder_input, decoder_output):
    with GradientTape() as tape:

        # Run the forward pass of the model to generate a prediction
        prediction = training_model(encoder_input, decoder_input, training=True)

        # Compute the training loss
        loss = loss_fcn(decoder_output, prediction)

        # Compute the training accuracy
        accuracy = accuracy_fcn(decoder_output, prediction)

    # Retrieve gradients of the trainable variables with respect to the training loss
    gradients = tape.gradient(loss, training_model.trainable_weights)

    # Update the values of the trainable variables by gradient descent
    optimizer.apply_gradients(zip(gradients, training_model.trainable_weights))

    train_loss(loss)
    train_accuracy(accuracy)
With the addition of the @function decorator, a function that takes tensors as input will be compiled into a graph. If the @function decorator is commented out, the function is, alternatively, run with eager execution.
The next step is implementing the training loop that will call the train_step function above. The training loop will iterate over the specified number of epochs and the dataset batches. For each batch, the train_step function computes the training loss and accuracy measures and applies the optimizer to update the trainable model parameters. A checkpoint manager is also included to save a checkpoint after every five epochs:
Python
train_loss = Mean(name='train_loss')
train_accuracy = Mean(name='train_accuracy')

# Create a checkpoint object and manager to manage multiple checkpoints
ckpt = train.Checkpoint(model=training_model, optimizer=optimizer)
ckpt_manager = train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

for epoch in range(epochs):

    train_loss.reset_states()
    train_accuracy.reset_states()

    print("\nStart of epoch %d" % (epoch + 1))

    # Iterate over the dataset batches
    for step, (train_batchX, train_batchY) in enumerate(train_dataset):

        # Define the encoder and decoder inputs, and the decoder output
        encoder_input = train_batchX[:, 1:]
        decoder_input = train_batchY[:, :-1]
        decoder_output = train_batchY[:, 1:]

        train_step(encoder_input, decoder_input, decoder_output)

        if step % 50 == 0:
            print(f'Epoch {epoch + 1} Step {step} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

    # Print epoch number and loss value at the end of every epoch
    print("Epoch %d: Training Loss %.4f, Training Accuracy %.4f" % (epoch + 1, train_loss.result(), train_accuracy.result()))

    # Save a checkpoint after every 5 epochs
    if (epoch + 1) % 5 == 0:
        save_path = ckpt_manager.save()
        print("Saved checkpoint at epoch %d" % (epoch + 1))
An important point to keep in mind is that the input to the decoder is offset by one position to the right with respect to the encoder input. The idea behind this offset, combined with a look-ahead mask in the first multi-head attention block of the decoder, is to ensure that the prediction for the current token can only depend on the previous tokens.
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
– Attention Is All You Need, 2017.
It is for this reason that the encoder and decoder inputs are fed into the Transformer model in the following way:
encoder_input = train_batchX[:, 1:]
decoder_input = train_batchY[:, :-1]
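As a concrete illustration, using the tokenized German target from the earlier example:
Python
# trainY row for "<START> hat tom es dir gesagt <EOS>" (indices from the earlier example output):
#   [1, 14, 5, 7, 42, 162, 2, 0, 0, 0, 0, 0]
#
# decoder_input  = train_batchY[:, :-1] -> [1, 14, 5, 7, 42, 162, 2, 0, 0, 0, 0]
#   (starts with <START>; the last element is dropped)
# decoder_output = train_batchY[:, 1:]  -> [14, 5, 7, 42, 162, 2, 0, 0, 0, 0, 0]
#   (shifted one position to the left, so each position holds the token to be predicted next)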
Putting together the complete code listing produces the following:
Python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import LearningRateSchedule
from tensorflow.keras.metrics import Mean
from tensorflow import data, train, math, reduce_sum, cast, equal, argmax, float32, GradientTape, TensorSpec, function, int64
from keras.losses import sparse_categorical_crossentropy
from model import TransformerModel
from prepare_dataset import PrepareDataset
from time import time


# Define the model parameters
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model layers' outputs
d_ff = 2048  # Dimensionality of the inner fully connected layer
n = 6  # Number of layers in the encoder stack

# Define the training parameters
epochs = 2
batch_size = 64
beta_1 = 0.9
beta_2 = 0.98
epsilon = 1e-9
dropout_rate = 0.1


# Implementing a learning rate scheduler
class LRScheduler(LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000, **kwargs):
        super(LRScheduler, self).__init__(**kwargs)

        self.d_model = cast(d_model, float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step_num):
        # Linearly increasing the learning rate for the first warmup_steps, and decreasing it thereafter
        arg1 = step_num ** -0.5
        arg2 = step_num * (self.warmup_steps ** -1.5)

        return (self.d_model ** -0.5) * math.minimum(arg1, arg2)


# Instantiate an Adam optimizer
optimizer = Adam(LRScheduler(d_model), beta_1, beta_2, epsilon)

# Prepare the training and test splits of the dataset
dataset = PrepareDataset()
trainX, trainY, train_orig, enc_seq_length, dec_seq_length, enc_vocab_size, dec_vocab_size = dataset('english-german-both.pkl')

# Prepare the dataset batches
train_dataset = data.Dataset.from_tensor_slices((trainX, trainY))
train_dataset = train_dataset.batch(batch_size)

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)


# Defining the loss function
def loss_fcn(target, prediction):
    # Create a mask so that the zero padding values are not included in the computation of loss
    padding_mask = math.logical_not(equal(target, 0))
    padding_mask = cast(padding_mask, float32)

    # Compute a sparse categorical cross-entropy loss on the unmasked values
    loss = sparse_categorical_crossentropy(target, prediction, from_logits=True) * padding_mask

    # Compute the mean loss over the unmasked values
    return reduce_sum(loss) / reduce_sum(padding_mask)


# Defining the accuracy function
def accuracy_fcn(target, prediction):
    # Create a mask so that the zero padding values are not included in the computation of accuracy
    padding_mask = math.logical_not(equal(target, 0))

    # Find equal prediction and target values, and apply the padding mask
    accuracy = equal(target, argmax(prediction, axis=2))
    accuracy = math.logical_and(padding_mask, accuracy)

    # Cast the True/False values to 32-bit-precision floating-point numbers
    padding_mask = cast(padding_mask, float32)
    accuracy = cast(accuracy, float32)

    # Compute the mean accuracy over the unmasked values
    return reduce_sum(accuracy) / reduce_sum(padding_mask)


# Include metrics monitoring
train_loss = Mean(name='train_loss')
train_accuracy = Mean(name='train_accuracy')

# Create a checkpoint object and manager to manage multiple checkpoints
ckpt = train.Checkpoint(model=training_model, optimizer=optimizer)
ckpt_manager = train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)


# Speeding up the training process
@function
def train_step(encoder_input, decoder_input, decoder_output):
    with GradientTape() as tape:

        # Run the forward pass of the model to generate a prediction
        prediction = training_model(encoder_input, decoder_input, training=True)

        # Compute the training loss
        loss = loss_fcn(decoder_output, prediction)

        # Compute the training accuracy
        accuracy = accuracy_fcn(decoder_output, prediction)

    # Retrieve gradients of the trainable variables with respect to the training loss
    gradients = tape.gradient(loss, training_model.trainable_weights)

    # Update the values of the trainable variables by gradient descent
    optimizer.apply_gradients(zip(gradients, training_model.trainable_weights))

    train_loss(loss)
    train_accuracy(accuracy)


for epoch in range(epochs):

    train_loss.reset_states()
    train_accuracy.reset_states()

    print("\nStart of epoch %d" % (epoch + 1))

    start_time = time()

    # Iterate over the dataset batches
    for step, (train_batchX, train_batchY) in enumerate(train_dataset):

        # Define the encoder and decoder inputs, and the decoder output
        encoder_input = train_batchX[:, 1:]
        decoder_input = train_batchY[:, :-1]
        decoder_output = train_batchY[:, 1:]

        train_step(encoder_input, decoder_input, decoder_output)

        if step % 50 == 0:
            print(f'Epoch {epoch + 1} Step {step} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
            # print("Samples so far: %s" % ((step + 1) * batch_size))

    # Print epoch number and loss value at the end of every epoch
    print("Epoch %d: Training Loss %.4f, Training Accuracy %.4f" % (epoch + 1, train_loss.result(), train_accuracy.result()))

    # Save a checkpoint after every 5 epochs
    if (epoch + 1) % 5 == 0:
        save_path = ckpt_manager.save()
        print("Saved checkpoint at epoch %d" % (epoch + 1))

print("Total time taken: %.2fs" % (time() - start_time))
Running the code produces output similar to the following (you will likely see different loss and accuracy values because training is from scratch, and the training time depends on the computational resources that you have available):
Python
Start of epoch 1
Epoch 1 Step 0 Loss 8.4525 Accuracy 0.0000
Epoch 1 Step 50 Loss 7.6768 Accuracy 0.1234
Epoch 1 Step 100 Loss 7.0360 Accuracy 0.1713
Epoch 1: Training Loss 6.7109, Training Accuracy 0.1924

Start of epoch 2
Epoch 2 Step 0 Loss 5.7323 Accuracy 0.2628
Epoch 2 Step 50 Loss 5.4360 Accuracy 0.2756
Epoch 2 Step 100 Loss 5.2638 Accuracy 0.2839
Epoch 2: Training Loss 5.1468, Training Accuracy 0.2908
Total time taken: 87.98s
For comparison, the code takes 155.13s to run using eager execution alone on the same CPU-only platform, which shows the benefit of using graph execution.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Websites
- Writing a training loop from scratch in Keras: https://keras.io/guides/writing_a_training_loop_from_scratch/
Summary
In this tutorial, you discovered how to train the Transformer model for neural machine translation.
Specifically, you learned:
- How to prepare the training dataset
- How to apply a padding mask to the loss and accuracy computations
- How to train the Transformer model
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
...using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...
Give magical power of understanding human language for
Your Projects
See What’s Inside