A Gentle Introduction to Positional Encoding in Transformer Models, Part 1
Last Updated on January 6, 2023
In languages, the order of the words and their position in a sentence truly matters. The meaning of a sentence can change if the words are re-ordered. When implementing NLP solutions, recurrent neural networks have an inbuilt mechanism that deals with the order of sequences. The transformer model, however, does not use recurrence or convolution and treats each data point as independent of the others. Hence, positional information is added to the model explicitly to retain the information regarding the order of words in a sentence. Positional encoding is the scheme through which the knowledge of the order of objects in a sequence is maintained.
For this tutorial, we’ll simplify the notation used in the original paper, Attention Is All You Need by Vaswani et al. After completing this tutorial, you will know:
- What is positional encoding, and why it’s important
- Positional encoding in transformers
- Code and visualize a positional encoding matrix in Python using NumPy
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another…
Let’s get started.

A gentle introduction to positional encoding in transformer models
Photo by Muhammad Murtaza Ghani on Unsplash, some rights reserved
Tutorial Overview
This tutorial is divided into four parts; they are:
- What is positional encoding
- Mathematics behind positional encoding in transformers
- Implementing the positional encoding matrix using NumPy
- Understanding and visualizing the positional encoding matrix
What Is Positional Encoding?
Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are many reasons why a single number, such as the index value, is not used to represent an item’s position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it can create problems for variable-length sequences, as they would be normalized differently.
Transformers use a smart positional encoding scheme, where each position/index is mapped to a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the matrix represents an encoded object of the sequence summed with its positional information. An example of a matrix that encodes only the positional information is shown in the figure below.

A Quick Run-Through of the Trigonometric Sine Function
This is a quick recap of sine functions; you can work equivalently with cosine functions. The function’s range is [-1, +1]. The frequency of this waveform is the number of cycles completed in one second. The wavelength is the distance over which the waveform repeats itself. The wavelength and frequency for different waveforms are shown below:
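To make this concrete, here is a minimal sketch (my own addition, with arbitrarily chosen wavelengths) that plots two sine waves of different wavelengths:
Python
import numpy as np
import matplotlib.pyplot as plt

# Plot sine waves with two different wavelengths; the longer the wavelength,
# the lower the frequency (fewer cycles over the same interval).
x = np.arange(0, 4, 0.01)
for wavelength in (1.0, 2.0):  # arbitrary wavelengths chosen for illustration
    plt.plot(x, np.sin(2 * np.pi * x / wavelength), label='wavelength = ' + str(wavelength))
plt.xlabel('x')
plt.legend()
plt.show()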

Want to Get Started With Building Transformer Models with Attention?
Take my free 12-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Positional Encoding Layer in Transformers
Let’s dive straight into this. Suppose you have an input sequence of length $L$ and require the position of the $k^{th}$ object within this sequence. The positional encoding is given by sine and cosine functions of varying frequencies:
\begin{eqnarray}
P(k, 2i) &=& \sin\Big(\frac{k}{n^{2i/d}}\Big)\\
P(k, 2i+1) &=& \cos\Big(\frac{k}{n^{2i/d}}\Big)
\end{eqnarray}
Here:
$k$: Position of an object in the input sequence, $0 \leq k < L$
$d$: Dimension of the output embedding space
$P(k, j)$: Position function for mapping a position $k$ in the input sequence to index $(k, j)$ of the positional matrix
$n$: User-defined scalar, set to 10,000 by the authors of Attention Is All You Need.
$i$: Used for mapping to column indices $0 \leq i < d/2$, with a single value of $i$ mapping to both the sine and cosine functions
In the above expression, you can see that even positions of the encoding vector correspond to a sine function and odd positions correspond to a cosine function.
Example
To understand the above expression, let’s take the example of the phrase “I am a robot,” with n=100 and d=4. The following table shows the positional encoding matrix for this phrase. In fact, the positional encoding matrix would be the same for any four-word phrase with n=100 and d=4.
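This table is a direct application of the formula with n=100 and d=4, so it can be reconstructed from the formula itself (values rounded to three decimals; columns are $P(k,0)$ through $P(k,3)$, one row per word; the NumPy output later in the post gives the full precision):
k=0 (“I”):      0.000    1.000    0.000    1.000
k=1 (“am”):     0.841    0.540    0.100    0.995
k=2 (“a”):      0.909   -0.416    0.199    0.980
k=3 (“robot”):  0.141   -0.990    0.296    0.955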

Coding the Positional Encoding Matrix from Scratch
Here is a short Python snippet that implements positional encoding using NumPy. The code is simplified to make it easier to understand positional encoding.
Python
import numpy as np
import matplotlib.pyplot as plt

def getPositionEncoding(seq_len, d, n=10000):
    P = np.zeros((seq_len, d))
    for k in range(seq_len):
        for i in np.arange(int(d/2)):
            denominator = np.power(n, 2*i/d)
            P[k, 2*i] = np.sin(k/denominator)
            P[k, 2*i+1] = np.cos(k/denominator)
    return P

P = getPositionEncoding(seq_len=4, d=4, n=100)
print(P)
[[ 0.          1.          0.          1.        ]
 [ 0.84147098  0.54030231  0.09983342  0.99500417]
 [ 0.90929743 -0.41614684  0.19866933  0.98006658]
 [ 0.14112001 -0.9899925   0.29552023  0.95533649]]
Understanding the Positional Encoding Matrix
Python
def plotSinusoid(k, d=512, n=10000):
    x = np.arange(0, 100, 1)
    denominator = np.power(n, 2*x/d)
    y = np.sin(k/denominator)
    plt.plot(x, y)
    plt.title('k = ' + str(k))

fig = plt.figure(figsize=(15, 4))
for i in range(4):
    plt.subplot(141 + i)
    plotSinusoid(i*4)
The following figure shows the output of the above code:

Sine waves for different position indices
You can see that each position $k$ corresponds to a different sinusoid, which encodes a single position into a vector. If you look closely at the positional encoding function, you can see that the wavelength for a fixed $i$ is given by:
$$
\lambda_{i} = 2 \pi n^{2i/d}
$$
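As a quick sanity check (my own sketch, not part of the original post), you can evaluate this wavelength for a few column indices and watch it grow geometrically:
Python
import numpy as np

# Wavelength of the sinusoid used for column pair i: lambda_i = 2*pi*n^(2i/d)
def wavelength(i, d=512, n=10000):
    return 2 * np.pi * np.power(n, 2 * i / d)

for i in [0, 64, 128, 192, 255]:
    print('i =', i, ' wavelength =', round(wavelength(i), 2))
# i = 0 gives 2*pi (about 6.28); i near d/2 approaches 2*pi*n (about 62,832)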
Hence, the wavelengths of the sinusoids form a geometric progression and vary from $2\pi$ to $2\pi n$. The scheme for positional encoding has a number of advantages:
- The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
- As the sinusoid for each position is different, you have a unique way of encoding each position.
- You have a way of measuring or quantifying the similarity between different positions, hence enabling you to encode the relative positions of words, as the sketch below illustrates.
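Here is a minimal sketch of that last point (my own illustration; it assumes the getPositionEncoding() function defined above has already been run). The dot product between two positional vectors depends on how far apart the positions are rather than on where they sit in the sequence, which is what gives the model a handle on relative positions:
Python
import numpy as np

# Assumes getPositionEncoding() from the earlier code block is defined.
P = getPositionEncoding(seq_len=100, d=512, n=10000)

# The dot product of two positional vectors depends only on the offset
# between the positions, not on their absolute location in the sequence.
print(np.dot(P[10], P[15]))  # positions 5 apart
print(np.dot(P[50], P[55]))  # also 5 apart: essentially the same value
print(np.dot(P[10], P[60]))  # positions 50 apart: a noticeably smaller value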
Visualizing the Positional Matrix
Let’s visualize the positional matrix for larger values. Use Python’s matshow() method from the matplotlib library. Setting n=10,000 as done in the original paper, you get the following:
Python
P = getPositionEncoding(seq_len=100, d=512, n=10000)
cax = plt.matshow(P)
plt.gcf().colorbar(cax)

The positional encoding matrix for n=10,000, d=512, sequence length=100
What Is the Final Output of the Positional Encoding Layer?
The positional encoding layer sums the positional vector with the word embedding and outputs this matrix for the subsequent layers. The entire process is shown below, followed by a short code sketch of the summation.

The positional encoding layer in the transformer
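As a minimal sketch of that summation (my own illustration: word_embedding below is a random stand-in for a learned embedding layer, and the snippet assumes the getPositionEncoding() function defined earlier):
Python
import numpy as np

# Random stand-in for the learned word embeddings of a 4-token sentence (d=4).
rng = np.random.default_rng(0)
word_embedding = rng.normal(size=(4, 4))

# Positional encodings for the same sequence length and embedding dimension.
P = getPositionEncoding(seq_len=4, d=4, n=100)

# Output of the positional encoding layer: word embeddings plus positional
# encodings, one row per token, passed on to the subsequent layers.
output = word_embedding + P
print(output)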
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Transformers for Natural Language Processing, by Denis Rothman.
Papers
- Attention Is All You Need, 2017.
Articles
- The Transformer Attention Mechanism
- The Transformer Model
- Transformer model for language understanding
Summary
In this tutorial, you discovered positional encoding in transformers.
Specifically, you realized:
- What is positional encoding, and why it is needed.
- How to implement positional encoding in Python using NumPy
- How to visualise the positional encoding matrix
Do you have any questions about positional encoding discussed in this post? Ask your questions in the comments below, and I will do my best to answer.
