Implementing the Transformer Decoder from Scratch in TensorFlow and Keras
Last Updated on January 6, 2023
There are many similarities between the Transformer encoder and decoder, such as their implementation of multi-head attention, layer normalization, and a fully connected feed-forward network as their final sub-layer. Having implemented the Transformer encoder, we will now go ahead and apply that knowledge to implementing the Transformer decoder as a further step toward implementing the complete Transformer model. Your end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer decoder
- How to implement the Transformer decoder from scratch
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you in building a fully working transformer model that can
translate sentences from one language to another…
Let’s get started.

Implementing the Transformer decoder from scratch in TensorFlow and Keras
Photo by François Kaiser, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Decoder
- Implementing the Transformer Decoder From Scratch
- The Decoder Layer
- The Transformer Decoder
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
- The Transformer encoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will explore these similarities.
The Transformer Decoder
Similar to the Transformer encoder, the Transformer decoder also consists of a stack of $N$ identical layers. The Transformer decoder, however, implements an additional multi-head attention block for a total of three main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- The second sub-layer comprises a second multi-head attention mechanism.
- The third sub-layer comprises a fully connected feed-forward network.

The decoder block of the Transformer architecture
Taken from “Attention Is All You Need“
Each of these three sub-layers is also followed by layer normalization, where the input to the layer normalization step is its corresponding sub-layer input (through a residual connection) and output.
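The Add & Norm operation itself is reused from the encoder implementation. As a reminder, here is a minimal sketch only (the authoritative version is the AddNormalization class imported from encoder.py later in this tutorial): it sums the sub-layer input and output and applies layer normalization.

Python

from tensorflow.keras.layers import Layer, LayerNormalization

class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # normalizes over the feature dimension

    def call(self, x, sublayer_x):
        # Residual connection: add the sub-layer input to its output, then apply layer normalization
        return self.layer_norm(x + sublayer_x)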
On the decoder side, the queries, keys, and values that are fed into the first multi-head attention block also represent the same input sequence. However, this time around, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. On the other hand, the second multi-head attention block receives the encoder output in the form of keys and values and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_k$, while the dimensionality of the values remains equal to $d_v$.
Vaswani et al. introduce regularization into the model on the decoder side, too, by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the decoder.
Let’s now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Implementing the Transformer Decoder from Scratch
The Decoder Layer
Since you have already implemented the required sub-layers when you covered the implementation of the Transformer encoder, you can create a class for the decoder layer that makes use of these sub-layers directly:
Python
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
        ...
Notice here that since the code for the different sub-layers has been saved into several Python scripts (namely, multihead_attention.py and encoder.py), it was necessary to import them in order to use the required classes.
As you did for the Transformer encoder, you will now create the class method, call(), that implements all the decoder sub-layers:
Python
...
    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head self-attention layer (masked with the look-ahead mask)
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer attending over the encoder output
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)
The multi-head attention sub-layers can also receive a padding mask or a look-ahead mask. As a quick reminder of what was said in a previous tutorial, the padding mask is necessary to prevent the zero padding in the input sequence from being processed along with the actual input values. The look-ahead mask prevents the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
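The mask-creation helpers are not part of this tutorial's code. As a rough, hypothetical sketch of the idea (the function names are illustrative, and it assumes the convention from the earlier attention tutorial in which a value of 1 marks a position to be suppressed), the two masks could be built along these lines:

Python

import tensorflow as tf

def padding_mask(input_seq):
    # Mark zero-padded token positions with 1 so they can be suppressed in the attention scores
    return tf.cast(tf.math.equal(input_seq, 0), tf.float32)

def lookahead_mask(seq_length):
    # Strictly upper-triangular matrix of 1s: each position may not attend to later positions
    return 1 - tf.linalg.band_part(tf.ones((seq_length, seq_length)), -1, 0)

print(lookahead_mask(4))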
The same call() class method can also receive a training flag so that the Dropout layers are only applied during training, when the flag's value is set to True.
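As a quick, hypothetical usage sketch (the dummy tensors below are for illustration only; the proper test of the full decoder follows in a later section), a single decoder layer could be exercised as follows, toggling the Dropout layers with the training flag:

Python

from numpy import random

# Dummy inputs already projected to d_model = 512: a batch of 64 sequences of length 5
x = random.random((64, 5, 512)).astype("float32")
enc_output = random.random((64, 5, 512)).astype("float32")

decoder_layer = DecoderLayer(h=8, d_k=64, d_v=64, d_model=512, d_ff=2048, rate=0.1)

# training=True enables the Dropout layers; pass training=False at inference time to disable them
output = decoder_layer(x, enc_output, None, None, training=True)
print(output.shape)  # (64, 5, 512)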
The Transformer Decoder
The Transformer decoder takes the decoder layer you have just implemented and replicates it identically $N$ times.
You will create the following Decoder() class to implement the Transformer decoder:
Python
from tensorflow.keras.layers import Layer, Dropout
from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
As in the Transformer encoder, the input to the first multi-head attention block on the decoder side receives the input sequence after it has undergone word embedding and positional encoding. For this purpose, an instance of the PositionEmbeddingFixedWeights class (covered in the positional encoding tutorial) is initialized, and its output is assigned to the pos_encoding variable.
The final step is to create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to $N$ decoder layers:
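This call() method reads as follows (it appears again as part of the full code listing below):

Python

...
    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x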
The code listing for the full Transformer decoder is the following:
Python
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head self-attention layer (masked with the look-ahead mask)
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer attending over the encoder output
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
Testing Out the Code
You will work with the parameter values specified in the paper Attention Is All You Need, by Vaswani et al. (2017):
Python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, you will work with dummy data for the time being, until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences:
Python
...
dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...
Next, you will create a new instance of the Decoder class, assigning its output to the decoder variable, subsequently passing in the input arguments, and printing the result. You will set the padding and look-ahead masks to None for the time being, but you will return to these when you implement the complete Transformer model:
Python
...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Tying everything together (with the DecoderLayer and Decoder classes from above in scope) produces the following code listing:
Python
from numpy import random

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
Python
tf.Tensor(
[[[-0.04132953 -1.7236308   0.5391184  ... -0.76394725  1.4969798   0.37682498]
  [ 0.05501875 -1.7523409   0.58404493 ... -0.70776534  1.4498456   0.32555297]
  [ 0.04983566 -1.8431275   0.55850077 ... -0.68202356  1.4222856   0.32104644]
  [-0.05684051 -1.8862512   0.4771412  ... -0.7101341   1.431343    0.39346313]
  [-0.15625843 -1.7992781   0.40803364 ... -0.75190556  1.4602519   0.53546077]]

 ...

 [[-0.58847624 -1.646842    0.5973466  ... -0.47778523  1.2060764   0.34091905]
  [-0.48688865 -1.6809179   0.6493542  ... -0.41274604  1.188649    0.27100053]
  [-0.49568555 -1.8002801   0.61536175 ... -0.38540334  1.2023914   0.24383534]
  [-0.59913146 -1.8598882   0.5098136  ... -0.3984461   1.2115746   0.3186561 ]
  [-0.71045107 -1.7778647   0.43008155 ... -0.42037937  1.2255307   0.47380894]]], shape=(64, 5, 512), dtype=float32)
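As a quick sanity check (a minimal sketch, not part of the original test code), you can also verify the shape programmatically:

Python

output = decoder(input_seq, enc_output, None, None, True)
assert output.shape == (batch_size, input_seq_length, d_model)  # expects (64, 5, 512)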
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017
Summary
In this tutorial, you discovered how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer decoder
- How to implement the Transformer decoder from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
…using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you in building a fully working transformer model that can
translate sentences from one language to another…
Give magical power of understanding human language for
Your Projects
See What’s Inside