How to Implement Multi-Head Attention from Scratch in TensorFlow and Keras
Last Updated on January 6, 2023
We have already familiarized ourselves with the theory behind the Transformer model and its attention mechanism. We have also started our journey of implementing a complete model by seeing how to implement the scaled dot-product attention. We shall now progress one step further by encapsulating the scaled dot-product attention into a multi-head attention mechanism, which is a core component. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement multi-head attention from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the multi-head attention mechanism.
- How to implement the multi-head attention mechanism from scratch.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully working transformer model that can
translate sentences from one language to another…
Let’s get started.

How to implement multi-head attention from scratch in TensorFlow and Keras
Photo by Everaldo Coelho, some rights reserved.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- The Transformer Multi-Head Attention
- Implementing Multi-Head Attention From Scratch
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The concept of attention
- The Transformer attention mechanism
- The Transformer model
- The scaled dot-product consideration
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core mechanisms that both the encoder and decoder share is the multi-head attention mechanism.
The Transformer Multi-Head Attention
Each multi-head attention block is made up of four consecutive levels:
- On the first level, three linear (dense) layers that each receive the queries, keys, or values
- On the second level, a scaled dot-product attention function. The operations performed on both the first and second levels are repeated h times and carried out in parallel, according to the number of heads composing the multi-head attention block.
- On the third level, a concatenation operation that joins the outputs of the different heads
- On the fourth level, a final linear (dense) layer that produces the output

Multi-head attention
Taken from “Attention Is All You Need“
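As a compact reference for the code that follows, these four levels correspond to the multi-head attention formula from the paper:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$

where $\mathrm{attention}(\cdot)$ is the scaled dot-product attention and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are the learned projection matrices.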
Recall as well the important components that will serve as building blocks for your implementation of the multi-head attention:
- The queries, keys, and values: These are the inputs to each multi-head attention block. In the encoder stage, they each carry the same input sequence after this has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after this would have also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.
- The projection matrices: When applied to the queries, keys, and values, these projection matrices generate different subspace representations of each. Each attention head then works on one of these projected versions of the queries, keys, and values. An additional projection matrix is applied to the output of the multi-head attention block after the outputs of each individual head have been concatenated together. The projection matrices are learned during training.
Let’s now see how to implement the multi-head attention from scratch in TensorFlow and Keras.
Implementing Multi-Head Attention from Scratch
Let’s start by creating the class, MultiHeadAttention, which inherits from the Layer base class in Keras, and initialize several instance attributes that you will be working with (attribute descriptions may be found in the comments):
Python
```python
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot-product attention
        self.heads = h              # Number of attention heads to use
        self.d_k = d_k              # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v              # Dimensionality of the linearly projected values
        self.W_q = Dense(d_k)       # Learned projection matrix for the queries
        self.W_k = Dense(d_k)       # Learned projection matrix for the keys
        self.W_v = Dense(d_v)       # Learned projection matrix for the values
        self.W_o = Dense(d_model)   # Learned projection matrix for the multi-head output
        ...
```
Here, note that an instance of the DotProductAttention class that was implemented earlier has been created, and its output has been assigned to the variable attention. Recall that you implemented the DotProductAttention class as follows:
Python
```python
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)
```
Next, you will be reshaping the linearly projected queries, keys, and values in such a manner as to allow the attention heads to be computed in parallel.
The queries, keys, and values will be fed as input into the multi-head attention block with a shape of (batch size, sequence length, model dimensionality), where the batch size is a hyperparameter of the training process, the sequence length defines the maximum length of the input/output phrases, and the model dimensionality is the dimensionality of the outputs produced by all sub-layers of the model. They are then passed through the respective dense layer to be linearly projected to a shape of (batch size, sequence length, queries/keys/values dimensionality).
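As a quick, standalone sanity check of these shapes (a minimal sketch using the parameter values adopted later in this tutorial, not part of the MultiHeadAttention class itself), note that a Dense layer applied to a rank-3 tensor projects only the last dimension:

```python
from numpy import random
from tensorflow.keras.layers import Dense

# Dummy input of shape (batch size, sequence length, model dimensionality)
dummy_queries = random.random((64, 5, 512))

# Linear projection of the queries to a dimensionality of d_k = 64
W_q = Dense(64)
print(W_q(dummy_queries).shape)  # (64, 5, 64)
```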
The linearly projected queries, keys, and values will be rearranged into (batch size, number of heads, sequence length, depth), by first reshaping them into (batch size, sequence length, number of heads, depth) and then transposing the second and third dimensions. For this purpose, you will create the class method, reshape_tensor, as follows:
Python
```python
def reshape_tensor(self, x, heads, flag):
    if flag:
        # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
        x = transpose(x, perm=(0, 2, 1, 3))
    else:
        # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_k)
        x = transpose(x, perm=(0, 2, 1, 3))
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], -1))
    return x
```
The reshape_tensor method receives the linearly projected queries, keys, or values as input (while setting the flag to True) to be rearranged as previously explained. Once the multi-head attention output has been generated, this is also fed into the same function (this time setting the flag to False) to perform a reverse operation, effectively concatenating the results of all heads together.
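If you would like to convince yourself of what reshape_tensor does before wiring it into the class, the following standalone sketch (assuming 8 heads and a projected dimensionality of 64, as used later in this tutorial) reproduces the forward and reverse rearrangements on a dummy tensor:

```python
from numpy import random
from tensorflow import reshape, shape, transpose

# Dummy projected queries: (batch size, sequence length, d_k) = (64, 5, 64)
x = random.random((64, 5, 64))
heads = 8

# Forward rearrangement: (batch size, heads, sequence length, depth)
x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
x = transpose(x, perm=(0, 2, 1, 3))
print(x.shape)  # (64, 8, 5, 8)

# Reverse rearrangement: back to (batch size, sequence length, heads * depth)
x = transpose(x, perm=(0, 2, 1, 3))
x = reshape(x, shape=(shape(x)[0], shape(x)[1], -1))
print(x.shape)  # (64, 5, 64)
```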
Hence, the next step is to feed the linearly projected queries, keys, and values into the reshape_tensor method to be rearranged, then feed them into the scaled dot-product attention function. In order to do so, let’s create another class method, call, as follows:
Python
```python
def call(self, queries, keys, values, mask=None):
    # Rearrange the queries to be able to compute all heads in parallel
    q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the keys to be able to compute all heads in parallel
    k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the values to be able to compute all heads in parallel
    v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Compute the multi-head attention output using the reshaped queries, keys, and values
    o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
    ...
```
Note that the call method can also receive a mask (whose value defaults to None) as input, in addition to the queries, keys, and values.
Recall that the Transformer model introduces a look-ahead mask to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it. Furthermore, since the word embeddings are zero-padded to a specific sequence length, a padding mask also needs to be introduced to prevent the zero values from being processed along with the input. These look-ahead and padding masks can be passed on to the scaled dot-product attention through the mask argument.
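Building these masks is covered in a later tutorial of this series. Purely as an illustrative sketch of the idea (the helper names and the zero-valued token id convention here are assumptions, not the code used later), they might look as follows:

```python
from tensorflow import cast, float32, linalg, math, ones

# Look-ahead mask: 1 marks the positions a word is not allowed to attend to
def lookahead_mask(seq_length):
    return 1 - linalg.band_part(ones((seq_length, seq_length)), -1, 0)

# Padding mask: 1 marks the zero-padded positions of a (hypothetical) tokenized input
def padding_mask(token_ids):
    return cast(math.equal(token_ids, 0), float32)

print(lookahead_mask(4))
print(padding_mask([[7, 12, 0, 0]]))
```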
Once you have generated the multi-head attention output from all the attention heads, the final steps are to concatenate back all outputs together into a tensor of shape (batch size, sequence length, values dimensionality) and pass the result through one final dense layer. For this purpose, you will add the next two lines of code to the call method.
Python
```python
    ...
    # Rearrange back the output into concatenated form
    output = self.reshape_tensor(o_reshaped, self.heads, False)
    # Resulting tensor shape: (batch_size, input_seq_length, d_v)

    # Apply one final linear projection to the output to generate the multi-head attention
    # Resulting tensor shape: (batch_size, input_seq_length, d_model)
    return self.W_o(output)
```
Putting everything together, you have the following implementation of the multi-head attention:
Python
```python
from tensorflow import math, matmul, reshape, shape, transpose, cast, float32
from tensorflow.keras.layers import Dense, Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot-product attention
        self.heads = h              # Number of attention heads to use
        self.d_k = d_k              # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v              # Dimensionality of the linearly projected values
        self.d_model = d_model      # Dimensionality of the model
        self.W_q = Dense(d_k)       # Learned projection matrix for the queries
        self.W_k = Dense(d_k)       # Learned projection matrix for the keys
        self.W_v = Dense(d_v)       # Learned projection matrix for the values
        self.W_o = Dense(d_model)   # Learned projection matrix for the multi-head output

    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_k)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_k))
        return x

    def call(self, queries, keys, values, mask=None):
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Compute the multi-head attention output using the reshaped queries, keys, and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)

        # Apply one final linear projection to the output to generate the multi-head attention
        # Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)
```
Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Testing Out the Code
You will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
Python
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process
...
```
As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will be using actual sentences:
Python
```python
...
input_seq_length = 5  # Maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
...
```
In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. We will be covering this in a separate tutorial.
Returning to the testing process, the next step is to create a new instance of the MultiHeadAttention class, assigning its output to the multihead_attention variable:
Python
```python
...
multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
...
```
Since the MultiHeadAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to pass in the input arguments and print the result:
Python
```python
...
print(multihead_attention(queries, keys, values))
```
Tying everything together produces the following code listing:
Python
```python
from numpy import random

input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))

multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
print(multihead_attention(queries, keys, values))
```
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output because of the random initialization of the queries, keys, and values and the parameter values of the dense layers.
Python
```
tf.Tensor(
[[[-0.02185373  0.32784638  0.15958631 ... -0.0353895   0.6645204  -0.2588266 ]
  [-0.02272229  0.32292002  0.16208754 ... -0.03644213  0.66478664 -0.26139447]
  [-0.01876744  0.32900316  0.16190802 ... -0.03548665  0.6645842  -0.26155376]
  [-0.02193783  0.32687354  0.15801215 ... -0.03232524  0.6642926  -0.25795174]
  [-0.02224652  0.32437912  0.1596448  ... -0.0340827   0.6617497  -0.26065096]]

 ...

 [[ 0.05414441  0.27019292  0.1845745  ...  0.0809482   0.63738805 -0.34231138]
  [ 0.05546578  0.27191412  0.18483458 ...  0.08379208  0.6366671  -0.34372023]
  [ 0.05190979  0.27185103  0.18378328 ...  0.08341806  0.63851804 -0.3422392 ]
  [ 0.05437043  0.27318984  0.18792395 ...  0.08043509  0.6391771  -0.34357914]
  [ 0.05406848  0.27073097  0.18579456 ...  0.08388947  0.6376929  -0.34230167]]], shape=(64, 5, 512), dtype=float32)
```
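If you prefer a programmatic check over inspecting the printed tensor, you can also assert the expected shape directly; this short snippet assumes the complete listing above has already been run:

```python
output = multihead_attention(queries, keys, values)
print("Output shape:", output.shape)  # (64, 5, 512), i.e. (batch_size, input_seq_length, d_model)
assert output.shape == (batch_size, input_seq_length, d_model)
```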
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
Summary
In this tutorial, you discovered how to implement multi-head attention from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the multi-head attention mechanism
- How to implement the multi-head attention mechanism from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
Learn Transformers and Attention!
Teach your deep learning model to read a sentence
…using transformer models with attention
Discover how in my new Ebook:
Building Transformer Models with Attention
It provides self-study tutorials with working code to guide you into building a fully working transformer model that can
translate sentences from one language to another…
Give magical power of understanding human language for
Your Projects
See What’s Inside