Joining the Transformer Encoder and Decoder Plus Masking
Last Updated on January 6, 2023
We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks, which suppress the input values that should not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks.
After completing this tutorial, you will know:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Let’s get started.

Joining the Transformer encoder and decoder and Masking
Photo by John O’Nolan, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- Masking
  - Creating a Padding Mask
  - Creating a Look-Ahead Mask
- Joining the Transformer Encoder and Decoder
- Creating an Instance of the Transformer Model
  - Printing Out a Summary of the Encoder and Decoder Layers
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The Transformer encoder
- The Transformer decoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.
Let's start first by discovering how to apply masking.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can translate sentences from one language to another...
Masking
Creating a Padding Mask
You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder.
As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The importance of having a padding mask is to make sure that these zero values are not processed along with the actual input values by both the encoder and decoder.
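As a quick illustration of the kind of zero-padding referred to here (not part of the original listings), the Keras pad_sequences utility can pad a batch of tokenized sentences to a fixed length; the token IDs below are made up:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two tokenized sentences of different lengths (dummy token IDs)
sequences = [[71, 3, 9, 45], [8, 22]]

# Zero-pad both sequences at the end, up to a fixed length of 7
print(pad_sequences(sequences, maxlen=7, padding='post'))
# [[71  3  9 45  0  0  0]
#  [ 8 22  0  0  0  0  0]]
```

The padding mask developed next marks exactly those trailing zero positions so that they do not contribute to the attention computations.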
Let's create the following function to generate a padding mask for both the encoder and decoder:
Python
```python
from tensorflow import math, cast, float32

def padding_mask(input):
    # Create a mask that marks the zero padding values in the input with a 1
    mask = math.equal(input, 0)
    mask = cast(mask, float32)

    return mask
```
Upon receiving an input, this function will generate a tensor that marks with a value of one wherever the input contains a value of zero.
Hence, if you input the following array:
Python
```python
from numpy import array

input = array([1, 2, 3, 4, 0, 0, 0])

print(padding_mask(input))
```
Then the output of the padding_mask function will be the following:
Python
```
tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)
```
Creating a Look-Ahead Mask
A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
For this purpose, let's create the following function to generate a look-ahead mask for the decoder:
Python
```python
from tensorflow import linalg, ones

def lookahead_mask(shape):
    # Mask out future entries by marking them with a 1.0
    mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

    return mask
```
You will pass to it the length of the decoder input. Let's make this length equal to 5, as an example:
Python
```python
print(lookahead_mask(5))
```
Then the output that the lookahead_mask function returns is the following:
Python
```
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
```
Again, the one values mask out the entries that should not be used. In this manner, the prediction of every word only depends on those that come before it.
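To make the effect concrete, here is a minimal sketch (not part of the original code) of how such a mask is typically consumed inside scaled dot-product attention: a large negative number is added at the masked positions before the softmax, driving their attention weights towards zero.

```python
from numpy import random
from tensorflow import cast, float32
from tensorflow.keras.backend import softmax

# Toy attention scores for a sequence of 5 tokens (queries x keys)
scores = cast(random.random((5, 5)), float32)

# Push the masked (future) positions towards -infinity before the softmax,
# so each token only attends to itself and the tokens that precede it
weights = softmax(scores + -1e9 * lookahead_mask(5))
print(weights)
```

The printed weights are (near-)zero in the upper-triangular positions, matching the ones in the mask above.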
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day e-mail crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Joining the Transformer Encoder and Decoder
Let's start by creating the class, TransformerModel, which inherits from the Model base class in Keras:
Python
```python
class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)
        ...
```
Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign their outputs to the variables, encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.
You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017).
Next, you shall create the class method, call(), to feed the relevant inputs into the encoder and decoder.
A padding mask is first generated to mask the encoder input, as well as the encoder output when this is fed into the second self-attention block of the decoder:
Python
```python
...
def call(self, encoder_input, decoder_input, training):

    # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
    enc_padding_mask = self.padding_mask(encoder_input)
    ...
```
A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined through an element-wise maximum operation:
Python
```python
...
# Create and combine padding and look-ahead masks to be fed into the decoder
dec_in_padding_mask = self.padding_mask(decoder_input)
dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)
...
```
Next, the relevant inputs are fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer:
Python
```python
...
# Feed the input into the encoder
encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

# Feed the encoder output into the decoder
decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

# Pass the decoder output through a final dense layer
model_output = self.model_last_layer(decoder_output)

return model_output
```
Combining all the steps gives us the following complete code listing:
Python
```python
from encoder import Encoder
from decoder import Decoder
from tensorflow import math, cast, float32, linalg, ones, maximum, newaxis
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense


class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        # Create a mask that marks the zero padding values in the input with a 1.0
        mask = math.equal(input, 0)
        mask = cast(mask, float32)

        # The shape of the mask should be broadcastable to the shape
        # of the attention weights that it will be masking later on
        return mask[:, newaxis, newaxis, :]

    def lookahead_mask(self, shape):
        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

        return mask

    def call(self, encoder_input, decoder_input, training):

        # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)

        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)

        return model_output
```
Note that you have made a small change to the output that is returned by the padding_mask function. Its shape is made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.
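As a quick check of that broadcastability, here is a sketch using a standalone copy of the method and a made-up batch of two sequences (for illustration only):

```python
from numpy import array
from tensorflow import math, cast, float32, newaxis

# Standalone copy of the padding_mask method, for illustration only
def padding_mask(input):
    mask = cast(math.equal(input, 0), float32)
    return mask[:, newaxis, newaxis, :]

# A batch of two zero-padded sequences of length 5
batch = array([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])

# Shape (2, 1, 1, 5), which broadcasts against attention weights of
# shape (batch_size, heads, seq_length, seq_length)
print(padding_mask(batch).shape)
```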
Creating an Instance of the Transformer Model
You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
Python
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
```
As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences:
Python
```python
...
enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence
...
```
You can now create an instance of the TransformerModel class as follows:
Python
```python
from model import TransformerModel

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
```
The complete code listing is as follows:
Python
```python
enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
```
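As an optional sanity check (a sketch that assumes the Encoder and Decoder classes from the previous tutorials are importable and accept the arguments shown above), you can call the freshly created model on batches of random token IDs and confirm the output shape:

```python
from numpy import random

# Dummy batches of 64 token-ID sequences within the dummy vocabulary sizes
enc_input = random.randint(0, enc_vocab_size, size=(64, enc_seq_length))
dec_input = random.randint(0, dec_vocab_size, size=(64, dec_seq_length))

# Expected output shape: (64, dec_seq_length, dec_vocab_size) = (64, 5, 20)
print(training_model(enc_input, dec_input, training=False).shape)
```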
Printing Out a Summary of the Encoder and Decoder Layers
You can also print out a summary of the encoder and decoder blocks of the Transformer model. The choice to print them out separately will allow you to see the details of their individual sub-layers. In order to do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes:
Python
```python
self.build(input_shape=[None, sequence_length, d_model])
```
Then you need to add the following method to the EncoderLayer class:
Python
```python
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))
```
And the following method to the DecoderLayer class:
Python
```python
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))
```
This results in the EncoderLayer class being modified as follows (the three dots under the call() method indicate that it remains the same as the one implemented earlier):
Python
```python
from tensorflow.keras.layers import Input
from tensorflow.keras import Model


class EncoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        ...
```
Similar changes can be made to the DecoderLayer class too.
Once you have the required changes in place, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows:
Python
```python
from encoder import EncoderLayer
from decoder import DecoderLayer

encoder = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
encoder.build_graph().summary()

decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
decoder.build_graph().summary()
```
The resulting summary for the encoder is the following:
Python
```
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_1 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_18 (Multi (None, 5, 512)       131776      ['input_1[0][0]',
 HeadAttention)                                                   'input_1[0][0]',
                                                                  'input_1[0][0]']

 dropout_32 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_18[0][0]']

 add_normalization_30 (AddNorma (None, 5, 512)       1024        ['input_1[0][0]',
 lization)                                                        'dropout_32[0][0]']

 feed_forward_12 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_30[0][0]']

 dropout_33 (Dropout)           (None, 5, 512)       0           ['feed_forward_12[0][0]']

 add_normalization_31 (AddNorma (None, 5, 512)       1024        ['add_normalization_30[0][0]',
 lization)                                                        'dropout_33[0][0]']

==================================================================================================
Total params: 2,233,536
Trainable params: 2,233,536
Non-trainable params: 0
__________________________________________________________________________________________________
```
While the resulting summary for the decoder is the following:
Python
```
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_2 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_19 (Multi (None, 5, 512)       131776      ['input_2[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_34 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_19[0][0]']

 add_normalization_32 (AddNorma (None, 5, 512)       1024        ['input_2[0][0]',
 lization)                                                        'dropout_34[0][0]',
                                                                  'add_normalization_32[0][0]',
                                                                  'dropout_35[0][0]']

 multi_head_attention_20 (Multi (None, 5, 512)       131776      ['add_normalization_32[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_35 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_20[0][0]']

 feed_forward_13 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_32[1][0]']

 dropout_36 (Dropout)           (None, 5, 512)       0           ['feed_forward_13[0][0]']

 add_normalization_34 (AddNorma (None, 5, 512)       1024        ['add_normalization_32[1][0]',
 lization)                                                        'dropout_36[0][0]']

==================================================================================================
Total params: 2,365,312
Trainable params: 2,365,312
Non-trainable params: 0
__________________________________________________________________________________________________
```
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Summary
In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.
Specifically, you learned:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
...using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...
Give magical power of understanding human language for
Your Projects
See What’s Inside