Joining the Transformer Encoder and Decoder Plus Masking
Last Updated on January 6, 2023
We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks, which suppress the input values that should not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks.
After completing this tutorial, you will know:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Let’s get started.

Joining the Transformer encoder and decoder and Masking
Photo by John O’Nolan, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- Masking
  - Creating a Padding Mask
  - Creating a Look-Ahead Mask
- Joining the Transformer Encoder and Decoder
- Creating an Instance of the Transformer Model
  - Printing Out a Summary of the Encoder and Decoder Layers
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The Transformer encoder
- The Transformer decoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.
Let's start first by discovering how to apply masking.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can translate sentences from one language to another...
Masking
Creating a Padding Mask
You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder.
As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The importance of having a padding mask is to make sure that these zero values are not processed along with the actual input values by both the encoder and decoder.
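As a quick illustration of the kind of zero-padding referred to here (not part of the original listings), the Keras pad_sequences utility can pad a batch of tokenized sentences to a fixed length; the token IDs below are made up:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two tokenized sentences of different lengths (dummy token IDs)
sequences = [[71, 3, 9, 45], [8, 22]]

# Zero-pad both sequences at the end, up to a fixed length of 7
print(pad_sequences(sequences, maxlen=7, padding='post'))
# [[71  3  9 45  0  0  0]
#  [ 8 22  0  0  0  0  0]]
```

The padding mask developed next marks exactly those trailing zero positions so that they do not contribute to the attention computations.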
Let's create the following function to generate a padding mask for both the encoder and decoder:
Python
```python
from tensorflow import math, cast, float32

def padding_mask(input):
    # Create a mask that marks the zero padding values in the input with a 1
    mask = math.equal(input, 0)
    mask = cast(mask, float32)

    return mask
```
Upon receiving an input, this function will generate a tensor that marks with a value of one wherever the input contains a value of zero.
Hence, if you input the following array:
Python
```python
from numpy import array

input = array([1, 2, 3, 4, 0, 0, 0])

print(padding_mask(input))
```
Then the output of the padding_mask function will be the following:
Python
```
tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)
```
Creating a Look-Ahead Mask
A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
For this purpose, let's create the following function to generate a look-ahead mask for the decoder:
Python
```python
from tensorflow import linalg, ones

def lookahead_mask(shape):
    # Mask out future entries by marking them with a 1.0
    mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

    return mask
```
You will pass to it the length of the decoder input. Let's make this length equal to 5, as an example:
Python
```python
print(lookahead_mask(5))
```
Then the output that the lookahead_mask function returns is the following:
Python
```
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
```
Again, the one values mask out the entries that should not be used. In this manner, the prediction of every word only depends on those that come before it.
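To make the effect concrete, here is a minimal sketch (not part of the original code) of how such a mask is typically consumed inside scaled dot-product attention: a large negative number is added at the masked positions before the softmax, driving their attention weights towards zero.

```python
from numpy import random
from tensorflow import cast, float32
from tensorflow.keras.backend import softmax

# Toy attention scores for a sequence of 5 tokens (queries x keys)
scores = cast(random.random((5, 5)), float32)

# Push the masked (future) positions towards -infinity before the softmax,
# so each token only attends to itself and the tokens that precede it
weights = softmax(scores + -1e9 * lookahead_mask(5))
print(weights)
```

The printed weights are (near-)zero in the upper-triangular positions, matching the ones in the mask above.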
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day e-mail crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Joining the Transformer Encoder and Decoder
Let's start by creating the class, TransformerModel, which inherits from the Model base class in Keras:
Python
```python
class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)
        ...
```
Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign their outputs to the variables, encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.
You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017).
Next, you shall create the class method, call(), to feed the relevant inputs into the encoder and decoder.
A padding mask is first generated to mask the encoder input, as well as the encoder output when this is fed into the second self-attention block of the decoder:
Python
```python
...
def call(self, encoder_input, decoder_input, training):

    # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
    enc_padding_mask = self.padding_mask(encoder_input)
    ...
```
A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined through an element-wise maximum operation:
Python
```python
...
# Create and combine padding and look-ahead masks to be fed into the decoder
dec_in_padding_mask = self.padding_mask(decoder_input)
dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)
...
```
Next, the relevant inputs are fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer:
Python
```python
...
# Feed the input into the encoder
encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

# Feed the encoder output into the decoder
decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

# Pass the decoder output through a final dense layer
model_output = self.model_last_layer(decoder_output)

return model_output
```
Combining all the steps gives us the following complete code listing:
Python
```python
from encoder import Encoder
from decoder import Decoder
from tensorflow import math, cast, float32, linalg, ones, maximum, newaxis
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense


class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        # Create a mask that marks the zero padding values in the input with a 1.0
        mask = math.equal(input, 0)
        mask = cast(mask, float32)

        # The shape of the mask should be broadcastable to the shape
        # of the attention weights that it will be masking later on
        return mask[:, newaxis, newaxis, :]

    def lookahead_mask(self, shape):
        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

        return mask

    def call(self, encoder_input, decoder_input, training):

        # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)

        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)

        return model_output
```
Note that you have made a small change to the output that is returned by the padding_mask function. Its shape is made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.
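As a quick check of that broadcastability, here is a sketch using a standalone copy of the method and a made-up batch of two sequences (for illustration only):

```python
from numpy import array
from tensorflow import math, cast, float32, newaxis

# Standalone copy of the padding_mask method, for illustration only
def padding_mask(input):
    mask = cast(math.equal(input, 0), float32)
    return mask[:, newaxis, newaxis, :]

# A batch of two zero-padded sequences of length 5
batch = array([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])

# Shape (2, 1, 1, 5), which broadcasts against attention weights of
# shape (batch_size, heads, seq_length, seq_length)
print(padding_mask(batch).shape)
```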
Creating an Instance of the Transformer Model
You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
Python
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
```
As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences:
Python
```python
...
enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence
...
```
You can now create an instance of the TransformerModel class as follows:
Python
```python
from model import TransformerModel

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
```
The complete code listing is as follows:
Python
```python
enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
```
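As an optional sanity check (a sketch that assumes the Encoder and Decoder classes from the previous tutorials are importable and accept the arguments shown above), you can call the freshly created model on batches of random token IDs and confirm the output shape:

```python
from numpy import random

# Dummy batches of 64 token-ID sequences within the dummy vocabulary sizes
enc_input = random.randint(0, enc_vocab_size, size=(64, enc_seq_length))
dec_input = random.randint(0, dec_vocab_size, size=(64, dec_seq_length))

# Expected output shape: (64, dec_seq_length, dec_vocab_size) = (64, 5, 20)
print(training_model(enc_input, dec_input, training=False).shape)
```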
Printing Out a Summary of the Encoder and Decoder Layers
You can also print out a summary of the encoder and decoder blocks of the Transformer model. The choice to print them out separately will allow you to see the details of their individual sub-layers. In order to do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes:
Python
```python
self.build(input_shape=[None, sequence_length, d_model])
```
Then you need to add the following method to the EncoderLayer class:
Python
```python
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))
```
And the following method to the DecoderLayer class:
Python
```python
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))
```
This results in the EncoderLayer class being modified as follows (the three dots under the call() method indicate that it remains the same as the one implemented earlier):
Python
```python
from tensorflow.keras.layers import Input
from tensorflow.keras import Model


class EncoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        ...
```
Similar changes can be made to the DecoderLayer class too.
Once you have the required changes in place, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows:
Python
```python
from encoder import EncoderLayer
from decoder import DecoderLayer

encoder = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
encoder.build_graph().summary()

decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
decoder.build_graph().summary()
```
The resulting summary for the encoder is the following:
Python
```
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_1 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_18 (Multi (None, 5, 512)       131776      ['input_1[0][0]',
 HeadAttention)                                                   'input_1[0][0]',
                                                                  'input_1[0][0]']

 dropout_32 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_18[0][0]']

 add_normalization_30 (AddNorma (None, 5, 512)       1024        ['input_1[0][0]',
 lization)                                                        'dropout_32[0][0]']

 feed_forward_12 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_30[0][0]']

 dropout_33 (Dropout)           (None, 5, 512)       0           ['feed_forward_12[0][0]']

 add_normalization_31 (AddNorma (None, 5, 512)       1024        ['add_normalization_30[0][0]',
 lization)                                                        'dropout_33[0][0]']

==================================================================================================
Total params: 2,233,536
Trainable params: 2,233,536
Non-trainable params: 0
__________________________________________________________________________________________________
```
While the resulting summary for the decoder is the following:
Python
```
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_2 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_19 (Multi (None, 5, 512)       131776      ['input_2[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_34 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_19[0][0]']

 add_normalization_32 (AddNorma (None, 5, 512)       1024        ['input_2[0][0]',
 lization)                                                        'dropout_34[0][0]',
                                                                  'add_normalization_32[0][0]',
                                                                  'dropout_35[0][0]']

 multi_head_attention_20 (Multi (None, 5, 512)       131776      ['add_normalization_32[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_35 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_20[0][0]']

 feed_forward_13 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_32[1][0]']

 dropout_36 (Dropout)           (None, 5, 512)       0           ['feed_forward_13[0][0]']

 add_normalization_34 (AddNorma (None, 5, 512)       1024        ['add_normalization_32[1][0]',
 lization)                                                        'dropout_36[0][0]']

==================================================================================================
Total params: 2,365,312
Trainable params: 2,365,312
Non-trainable params: 0
__________________________________________________________________________________________________
```
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Summary
In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.
Specifically, you learned:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
...using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...
Give magical power of understanding human language for
Your Projects
See What’s Inside