How to Implement Multi-Head Attention from Scratch in TensorFlow and Keras
Last Updated on January 6, 2023
We have already familiarized ourselves with the theory behind the Transformer model and its attention mechanism. We have also started our journey of implementing a complete model by seeing how to implement the scaled dot-product attention. We shall now progress one step further by encapsulating the scaled dot-product attention into a multi-head attention mechanism, which is a core component. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement multi-head attention from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the multi-head attention mechanism.
- How to implement the multi-head attention mechanism from scratch.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully working transformer model that can
translate sentences from one language to another…
Let’s get started.

How to implement multi-head attention from scratch in TensorFlow and Keras
Photo by Everaldo Coelho, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- The Transformer Multi-Head Attention
- Implementing Multi-Head Attention From Scratch
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The concept of attention
- The Transformer attention mechanism
- The Transformer model
- The scaled dot-product consideration
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core mechanisms that both the encoder and decoder share is the multi-head attention mechanism.
The Transformer Multi-Head Attention
Each multi-head attention block is made up of four consecutive levels:
- On the first level, three linear (dense) layers that each receive the queries, keys, or values
- On the second level, a scaled dot-product attention function. The operations performed on both the first and second levels are repeated h times and carried out in parallel, according to the number of heads composing the multi-head attention block.
- On the third level, a concatenation operation that joins the outputs of the different heads
- On the fourth level, a final linear (dense) layer that produces the output

Multi-head attention
Taken from “Attention Is All You Need“
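As a compact reference for the code that follows, these four levels correspond to the multi-head attention formula from the paper:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$

where $\mathrm{attention}(\cdot)$ is the scaled dot-product attention and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are the learned projection matrices.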
Recall as well the important components that will serve as building blocks for your implementation of the multi-head attention:
- The queries, keys, and values: These are the inputs to each multi-head attention block. In the encoder stage, they each carry the same input sequence after this has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after this would have also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.
- The projection matrices: When applied to the queries, keys, and values, these projection matrices generate different subspace representations of each. Each attention head then works on one of these projected versions of the queries, keys, and values. An additional projection matrix is applied to the output of the multi-head attention block after the outputs of each individual head have been concatenated together. The projection matrices are learned during training.
Let’s now see how to implement the multi-head attention from scratch in TensorFlow and Keras.
Implementing Multi-Head Attention from Scratch
Let’s start by creating the class, MultiHeadAttention, which inherits from the Layer base class in Keras, and initialize several instance attributes that you will be working with (attribute descriptions may be found in the comments):
Python
```python
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot-product attention
        self.heads = h              # Number of attention heads to use
        self.d_k = d_k              # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v              # Dimensionality of the linearly projected values
        self.W_q = Dense(d_k)       # Learned projection matrix for the queries
        self.W_k = Dense(d_k)       # Learned projection matrix for the keys
        self.W_v = Dense(d_v)       # Learned projection matrix for the values
        self.W_o = Dense(d_model)   # Learned projection matrix for the multi-head output
        ...
```
Here, note that an instance of the DotProductAttention class that was implemented earlier has been created, and its output has been assigned to the variable attention. Recall that you implemented the DotProductAttention class as follows:
Python
```python
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)
```
Next, you will be reshaping the linearly projected queries, keys, and values in such a manner as to allow the attention heads to be computed in parallel.
The queries, keys, and values will be fed as input into the multi-head attention block with a shape of (batch size, sequence length, model dimensionality), where the batch size is a hyperparameter of the training process, the sequence length defines the maximum length of the input/output phrases, and the model dimensionality is the dimensionality of the outputs produced by all sub-layers of the model. They are then passed through the respective dense layer to be linearly projected to a shape of (batch size, sequence length, queries/keys/values dimensionality).
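As a quick, standalone sanity check of these shapes (a minimal sketch using the parameter values adopted later in this tutorial, not part of the MultiHeadAttention class itself), note that a Dense layer applied to a rank-3 tensor projects only the last dimension:

```python
from numpy import random
from tensorflow.keras.layers import Dense

# Dummy input of shape (batch size, sequence length, model dimensionality)
dummy_queries = random.random((64, 5, 512))

# Linear projection of the queries to a dimensionality of d_k = 64
W_q = Dense(64)
print(W_q(dummy_queries).shape)  # (64, 5, 64)
```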
The linearly projected queries, keys, and values will be rearranged into (batch size, number of heads, sequence length, depth), by first reshaping them into (batch size, sequence length, number of heads, depth) and then transposing the second and third dimensions. For this purpose, you will create the class method, reshape_tensor, as follows:
Python
```python
def reshape_tensor(self, x, heads, flag):
    if flag:
        # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
        x = transpose(x, perm=(0, 2, 1, 3))
    else:
        # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_k)
        x = transpose(x, perm=(0, 2, 1, 3))
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], -1))
    return x
```
The reshape_tensor method receives the linearly projected queries, keys, or values as input (while setting the flag to True) to be rearranged as previously explained. Once the multi-head attention output has been generated, this is also fed into the same function (this time setting the flag to False) to perform a reverse operation, effectively concatenating the results of all heads together.
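If you would like to convince yourself of what reshape_tensor does before wiring it into the class, the following standalone sketch (assuming 8 heads and a projected dimensionality of 64, as used later in this tutorial) reproduces the forward and reverse rearrangements on a dummy tensor:

```python
from numpy import random
from tensorflow import reshape, shape, transpose

# Dummy projected queries: (batch size, sequence length, d_k) = (64, 5, 64)
x = random.random((64, 5, 64))
heads = 8

# Forward rearrangement: (batch size, heads, sequence length, depth)
x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
x = transpose(x, perm=(0, 2, 1, 3))
print(x.shape)  # (64, 8, 5, 8)

# Reverse rearrangement: back to (batch size, sequence length, heads * depth)
x = transpose(x, perm=(0, 2, 1, 3))
x = reshape(x, shape=(shape(x)[0], shape(x)[1], -1))
print(x.shape)  # (64, 5, 64)
```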
Hence, the next step is to feed the linearly projected queries, keys, and values into the reshape_tensor method to be rearranged, then feed them into the scaled dot-product attention function. In order to do so, let’s create another class method, call, as follows:
Python
```python
def call(self, queries, keys, values, mask=None):
    # Rearrange the queries to be able to compute all heads in parallel
    q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the keys to be able to compute all heads in parallel
    k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the values to be able to compute all heads in parallel
    v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Compute the multi-head attention output using the reshaped queries, keys, and values
    o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
    ...
```
Note that the call method can also receive a mask (whose value defaults to None) as input, in addition to the queries, keys, and values.
Recall that the Transformer model introduces a look-ahead mask to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it. Furthermore, since the word embeddings are zero-padded to a specific sequence length, a padding mask also needs to be introduced to prevent the zero values from being processed along with the input. These look-ahead and padding masks can be passed on to the scaled dot-product attention through the mask argument.
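Building these masks is covered in a later tutorial of this series. Purely as an illustrative sketch of the idea (the helper names and the zero-valued token id convention here are assumptions, not the code used later), they might look as follows:

```python
from tensorflow import cast, float32, linalg, math, ones

# Look-ahead mask: 1 marks the positions a word is not allowed to attend to
def lookahead_mask(seq_length):
    return 1 - linalg.band_part(ones((seq_length, seq_length)), -1, 0)

# Padding mask: 1 marks the zero-padded positions of a (hypothetical) tokenized input
def padding_mask(token_ids):
    return cast(math.equal(token_ids, 0), float32)

print(lookahead_mask(4))
print(padding_mask([[7, 12, 0, 0]]))
```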
Once you have generated the multi-head attention output from all the attention heads, the final steps are to concatenate back all outputs together into a tensor of shape (batch size, sequence length, values dimensionality) and pass the result through one final dense layer. For this purpose, you will add the next two lines of code to the call method.
Python
```python
    ...
    # Rearrange back the output into concatenated form
    output = self.reshape_tensor(o_reshaped, self.heads, False)
    # Resulting tensor shape: (batch_size, input_seq_length, d_v)

    # Apply one final linear projection to the output to generate the multi-head attention
    # Resulting tensor shape: (batch_size, input_seq_length, d_model)
    return self.W_o(output)
```
Putting everything together, you have the following implementation of the multi-head attention:
Python
```python
from tensorflow import math, matmul, reshape, shape, transpose, cast, float32
from tensorflow.keras.layers import Dense, Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot-product attention
        self.heads = h              # Number of attention heads to use
        self.d_k = d_k              # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v              # Dimensionality of the linearly projected values
        self.d_model = d_model      # Dimensionality of the model
        self.W_q = Dense(d_k)       # Learned projection matrix for the queries
        self.W_k = Dense(d_k)       # Learned projection matrix for the keys
        self.W_v = Dense(d_v)       # Learned projection matrix for the values
        self.W_o = Dense(d_model)   # Learned projection matrix for the multi-head output

    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_k)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_k))
        return x

    def call(self, queries, keys, values, mask=None):
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Compute the multi-head attention output using the reshaped queries, keys, and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)

        # Apply one final linear projection to the output to generate the multi-head attention
        # Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)
```
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Testing Out the Code
You will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
Python
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process
...
```
As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will be using actual sentences:
Python
```python
...
input_seq_length = 5  # Maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
...
```
In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. We will be covering this in a separate tutorial.
Returning to the testing process, the next step is to create a new instance of the MultiHeadAttention class, assigning its output to the multihead_attention variable:
Python
```python
...
multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
...
```
Since the MultiHeadAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to pass in the input arguments and print the result:
Python
```python
...
print(multihead_attention(queries, keys, values))
```
Tying everything together produces the following code listing:
Python
```python
from numpy import random

input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))

multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
print(multihead_attention(queries, keys, values))
```
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output because of the random initialization of the queries, keys, and values and the parameter values of the dense layers.
Python
```
tf.Tensor(
[[[-0.02185373  0.32784638  0.15958631 ... -0.0353895   0.6645204  -0.2588266 ]
  [-0.02272229  0.32292002  0.16208754 ... -0.03644213  0.66478664 -0.26139447]
  [-0.01876744  0.32900316  0.16190802 ... -0.03548665  0.6645842  -0.26155376]
  [-0.02193783  0.32687354  0.15801215 ... -0.03232524  0.6642926  -0.25795174]
  [-0.02224652  0.32437912  0.1596448  ... -0.0340827   0.6617497  -0.26065096]]

 ...

 [[ 0.05414441  0.27019292  0.1845745  ...  0.0809482   0.63738805 -0.34231138]
  [ 0.05546578  0.27191412  0.18483458 ...  0.08379208  0.6366671  -0.34372023]
  [ 0.05190979  0.27185103  0.18378328 ...  0.08341806  0.63851804 -0.3422392 ]
  [ 0.05437043  0.27318984  0.18792395 ...  0.08043509  0.6391771  -0.34357914]
  [ 0.05406848  0.27073097  0.18579456 ...  0.08388947  0.6376929  -0.34230167]]], shape=(64, 5, 512), dtype=float32)
```
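If you prefer a programmatic check over inspecting the printed tensor, you can also assert the expected shape directly; this short snippet assumes the complete listing above has already been run:

```python
output = multihead_attention(queries, keys, values)
print("Output shape:", output.shape)  # (64, 5, 512), i.e. (batch_size, input_seq_length, d_model)
assert output.shape == (batch_size, input_seq_length, d_model)
```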
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Summary
In this tutorial, you discovered how to implement multi-head attention from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the multi-head attention mechanism
- How to implement the multi-head attention mechanism from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
…using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you into building a fully working transformer model that can
translate sentences from one language to another…
Give magical power of understanding human language for
Your Projects
See What’s Inside