Implementing the Transformer Encoder from Scratch in TensorFlow and Keras
Last Updated on January 6, 2023
Having seen how to implement the scaled dot-product attention and integrate it within the multi-head attention of the Transformer model, let's progress one step further toward implementing a complete Transformer model by applying its encoder. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer encoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer encoder.
- How to implement the Transformer encoder from scratch.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you through building a fully working transformer model that can
translate sentences from one language to another…
Let’s get started.

Implementing the transformer encoder from scratch in TensorFlow and Keras
Photo by ian dooley, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Encoder
- Implementing the Transformer Encoder From Scratch
- The Fully Connected Feed-Forward Neural Network and Layer Normalization
- The Encoder Layer
- The Transformer Encoder
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In producing an output sequence, the Transformer does not rely upon recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will focus on the components that form part of the Transformer encoder.
The Transformer Encoder
The Transformer encoder consists of a stack of $N$ identical layers, where each layer further consists of two main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- A second sub-layer comprises a fully connected feed-forward network.

The encoder block of the Transformer architecture
Taken from “Attention Is All You Need“
Following each of these two sub-layers is layer normalization, into which the sub-layer input (through a residual connection) and output are fed. The output of each layer normalization step is the following:
$\text{layernorm}(\text{sublayer input} + \text{sublayer output})$
In order to facilitate such an operation, which involves an addition between the sub-layer input and output, Vaswani et al. designed all sub-layers and embedding layers in the model to produce outputs of dimension $d_\text{model} = 512$.
Also, recall the queries, keys, and values as the inputs to the Transformer encoder.
Here, the queries, keys, and values carry the same input sequence after it has been embedded and augmented with positional information, where the queries and keys are of dimensionality $d_k$, and the values are of dimensionality $d_v$.
Furthermore, Vaswani et al. also introduce regularization into the model by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before they are fed into the encoder.
Let's now see how to implement the Transformer encoder from scratch in TensorFlow and Keras.
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day e-mail crash course now (with sample code).
Click to sign up and also get a free PDF Ebook version of the course.
Implementing the Transformer Encoder from Scratch
The Fully Connected Feed-Forward Neural Network and Layer Normalization
Let's begin by creating classes for the Feed Forward and Add & Norm layers that are shown in the diagram above.
Vaswani et al. tell us that the fully connected feed-forward network consists of two linear transformations with a ReLU activation in between. The first linear transformation produces an output of dimensionality $d_{ff} = 2048$, while the second linear transformation produces an output of dimensionality $d_\text{model} = 512$.
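In the notation of the paper, this position-wise feed-forward network is $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, where $\max(0, \cdot)$ is the ReLU activation, $W_1$ and $b_1$ project the input from $d_\text{model}$ up to $d_{ff}$, and $W_2$ and $b_2$ project it back down to $d_\text{model}$.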
For this purpose, let's first create the class FeedForward, which inherits from the Layer base class in Keras, and initialize the dense layers and the ReLU activation:
```python
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)     # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()                # ReLU activation layer
        ...
```
We will add to it the class method, call(), which receives an input and passes it through the two fully connected layers with ReLU activation, returning an output of dimensionality equal to 512:
```python
...

def call(self, x):
    # The input is passed into the two fully-connected layers, with a ReLU in between
    x_fc1 = self.fully_connected1(x)

    return self.fully_connected2(self.activation(x_fc1))
```
The next step is to create another class, AddNormalization, that also inherits from the Layer base class in Keras and initializes a layer normalization layer:
```python
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer
        ...
```
In it, include the following class method that sums its sub-layer's input and output, which it receives as inputs, and applies layer normalization to the result:
```python
...

def call(self, x, sublayer_x):
    # The sublayer input and output need to be of the same shape to be summed
    add = x + sublayer_x

    # Apply layer normalization to the sum
    return self.layer_norm(add)
```
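As a quick sanity check, here is a minimal sketch (assuming the two classes above and import tensorflow as tf) confirming that a dummy tensor passed through FeedForward and then AddNormalization keeps the model dimensionality:
```python
import tensorflow as tf

feed_forward = FeedForward(d_ff=2048, d_model=512)
add_norm = AddNormalization()

x = tf.random.normal((64, 5, 512))     # (batch_size, sequence_length, d_model)
output = add_norm(x, feed_forward(x))  # residual sum followed by layer normalization

print(output.shape)  # (64, 5, 512)
```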
The Encoder Layer
Next, you will implement the encoder layer, which the Transformer encoder will replicate identically $N$ times.
For this purpose, let's create the class EncoderLayer and initialize all of the sub-layers that it consists of:
```python
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        ...
```
Here, you may notice that you have initialized instances of the FeedForward and AddNormalization classes, which you just created in the previous section, and assigned their outputs to the respective variables, feed_forward and add_norm (1 and 2). The Dropout layer is self-explanatory, where rate defines the frequency at which the input units are set to 0. You created the MultiHeadAttention class in a previous tutorial, and if you saved the code into a separate Python script, then remember to import it. I saved mine in a Python script named multihead_attention.py, and for this reason, I need to include the line of code from multihead_attention import MultiHeadAttention.
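If you do not have that script from the previous tutorial at hand, the following is a rough stand-in built on the Keras MultiHeadAttention layer, kept to the same constructor and call signature assumed in this tutorial; it is a convenience sketch, not the implementation developed previously:
```python
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import MultiHeadAttention as KerasMultiHeadAttention

class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        # Delegate the attention computation to the built-in Keras layer
        self.attention = KerasMultiHeadAttention(num_heads=h, key_dim=d_k,
                                                 value_dim=d_v, output_shape=d_model)

    def call(self, queries, keys, values, mask=None):
        # The mask, if provided, must be broadcastable to
        # (batch_size, num_heads, query_length, key_length)
        return self.attention(query=queries, value=values, key=keys, attention_mask=mask)
```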
Let's now proceed to create the class method, call(), which implements all of the encoder sub-layers:
```python
...

def call(self, x, padding_mask, training):
    # Multi-head attention layer
    multihead_output = self.multihead_attention(x, x, x, padding_mask)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    multihead_output = self.dropout1(multihead_output, training=training)

    # Followed by an Add & Norm layer
    addnorm_output = self.add_norm1(x, multihead_output)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Followed by a fully connected layer
    feedforward_output = self.feed_forward(addnorm_output)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in another dropout layer
    feedforward_output = self.dropout2(feedforward_output, training=training)

    # Followed by another Add & Norm layer
    return self.add_norm2(addnorm_output, feedforward_output)
```
In addition to the input data, the call() method can also receive a padding mask. As a brief reminder of what was said in a previous tutorial, the padding mask is necessary to prevent the zero padding in the input sequence from being processed along with the actual input values.
The same class method can receive a training flag which, when set to True, will only apply the Dropout layers during training.
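As a brief illustration of the training flag (a sketch assuming the classes above, a MultiHeadAttention implementation, and import tensorflow as tf), dropout is applied only in the call made with training=True:
```python
import tensorflow as tf

encoder_layer = EncoderLayer(h=8, d_k=64, d_v=64, d_model=512, d_ff=2048, rate=0.1)
x = tf.random.normal((64, 5, 512))  # (batch_size, sequence_length, d_model)

train_output = encoder_layer(x, None, True)   # dropout layers are active
infer_output = encoder_layer(x, None, False)  # dropout layers are bypassed
```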
The Transformer Encoder
The last step is to create a class for the Transformer encoder, which will be named Encoder:
```python
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
```
The Transformer encoder receives an input sequence after it has undergone a process of word embedding and positional encoding. In order to compute the positional encoding, let's make use of the PositionEmbeddingFixedWeights class described by Mehreen Saeed in this tutorial.
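If that class is not available to you, the following is a minimal stand-in with the interface assumed in this tutorial (token indices in, word embedding plus a fixed sinusoidal positional encoding out); treat it as a sketch rather than the original implementation:
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer, Embedding

class PositionEmbeddingFixedWeights(Layer):
    def __init__(self, sequence_length, vocab_size, d_model, **kwargs):
        super(PositionEmbeddingFixedWeights, self).__init__(**kwargs)
        self.word_embedding = Embedding(vocab_size, d_model)
        self.position_encoding = tf.constant(
            self.sinusoidal_encoding(sequence_length, d_model), dtype=tf.float32)

    @staticmethod
    def sinusoidal_encoding(length, d_model, n=10000):
        # Fixed sine/cosine positional encoding, following Vaswani et al.
        position = np.arange(length)[:, np.newaxis]
        div_term = n ** (2 * np.arange(d_model // 2) / d_model)
        encoding = np.zeros((length, d_model))
        encoding[:, 0::2] = np.sin(position / div_term)
        encoding[:, 1::2] = np.cos(position / div_term)
        return encoding

    def call(self, inputs):
        # inputs: (batch_size, sequence_length) token indices
        return self.word_embedding(inputs) + self.position_encoding
```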
As you have done in the previous sections, here, too, you will create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result to $N$ encoder layers:
```python
...

def call(self, input_sentence, padding_mask, training):
    # Generate the positional encoding
    pos_encoding_output = self.pos_encoding(input_sentence)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    x = self.dropout(pos_encoding_output, training=training)

    # Pass on the positional encoded values to each encoder layer
    for i, layer in enumerate(self.encoder_layer):
        x = layer(x, padding_mask, training)

    return x
```
The code listing for the full Transformer encoder is the following:
```python
from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights

# Implementing the Add & Norm Layer
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer

    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)

# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)     # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()                # ReLU activation layer

    def call(self, x):
        # The input is passed into the two fully-connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))

# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

# Implementing the Encoder
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x
```
Testing Out the Code
You will work with the parameter values specified in the paper Attention Is All You Need by Vaswani et al. (2017):
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
```
As for the input sequence, you will work with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will be using actual sentences:
```python
...
enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
...
```
Next, you will create a new instance of the Encoder class, assigning its output to the encoder variable, subsequently feeding in the input arguments, and printing the result. You will set the padding mask argument to None for the time being, but you will return to this when you implement the complete Transformer model:
```python
...
encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
```
Tying everything together produces the following code listing:
```python
from numpy import random

enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))

encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
```
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
```
tf.Tensor(
[[[-0.4214715  -1.1246173  -0.8444572  ...  1.6388322  -0.1890367   1.0173352 ]
  [ 0.21662089 -0.61147404 -1.0946581  ...  1.4627445  -0.6000164  -0.64127874]
  [ 0.46674493 -1.4155326  -0.5686513  ...  1.1790234  -0.94788337  0.1331717 ]
  [-0.30638126 -1.9047263  -1.8556844  ...  0.9130118  -0.47863355  0.00976158]
  [-0.22600567 -0.9702025  -0.91090447 ...  1.7457147  -0.139926   -0.07021569]]

 ...

 [[-0.48047638 -1.1034104  -0.16164204 ...  1.5588069   0.08743562 -0.08847156]
  [-0.61683714 -0.8403657  -1.0450369  ...  2.3587787  -0.76091915 -0.02891812]
  [-0.34268388 -0.65042275 -0.6715749  ...  2.8530657  -0.33631966  0.5215888 ]
  [-0.6288677  -1.0030932  -0.9749813  ...  2.1386387   0.0640307  -0.69504136]
  [-1.33254    -1.2524267  -0.230098   ...  2.515467   -0.04207756 -0.3395423 ]]], shape=(64, 5, 512), dtype=float32)
```
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Summary
In this tutorial, you discovered how to implement the Transformer encoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer encoder
- How to implement the Transformer encoder from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
…using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you through building a fully working transformer model that can
translate sentences from one language to another…
Give magical power of understanding human language for
Your Projects
See What’s Inside