A Gentle Introduction to Vector Space Models
Last Updated on October 23, 2023
Vector space models are about considering the relationship between data represented by vectors. They are popular in information retrieval systems but also useful for other purposes. Generally, they allow us to compare the similarity of two vectors from a geometric perspective.
In this tutorial, we will see what a vector space model is and what it can do.
After finishing this tutorial, you will know:
- What a vector space model is and the properties of cosine similarity
- How cosine similarity can help you compare two vectors
- What the difference is between cosine similarity and L2 distance
Let’s get started.

A Gentle Introduction to Vector Space Models
Photo by liamfletch, some rights reserved.
Tutorial overview
This tutorial is split into three parts; they are:
- Vector space and cosine formula
- Using vector space model for similarity
- Common uses of vector space models and cosine distance
Vector space and cosine formula
A vector space is a mathematical term that refers to a collection of vectors together with some vector operations. In layman’s terms, we can think of it as an $n$-dimensional metric space in which each point is represented by an $n$-dimensional vector. In this space, we can do any vector addition or scalar-vector multiplication.
It is useful to consider a vector space because it is useful to represent things as vectors. For example in machine learning, we usually have a data point with multiple features. Therefore, it is convenient for us to represent a data point as a vector.
With a vector, we can compute its norm. The most common one is the L2-norm, or the length of the vector. With two vectors in the same vector space, we can find their difference. Assume it is a 3-dimensional vector space and the two vectors are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$. Their difference is the vector $(y_1-x_1, y_2-x_2, y_3-x_3)$, and the L2-norm of the difference is the distance, or more precisely the Euclidean distance, between these two vectors:
$$
\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+(y_3-x_3)^2}
$$
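As a quick check, here is a small numpy sketch (ours, with made-up vectors, not part of the original tutorial) showing that `np.linalg.norm()` of the difference gives the same value as the formula above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Euclidean distance computed directly from the formula
dist_formula = np.sqrt(np.sum((y - x)**2))
# Euclidean distance as the L2-norm of the difference vector
dist_norm = np.linalg.norm(y - x)

print(dist_formula, dist_norm)  # both print 7.0710678...
```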
Besides distance, we can also consider the angle between two vectors. If we consider the vector $(x_1, x_2, x_3)$ as a line segment from the point $(0,0,0)$ to $(x_1,x_2,x_3)$ in the 3D coordinate system, then there is another line segment from $(0,0,0)$ to $(y_1,y_2,y_3)$. They make an angle at their intersection.
The angle between the two line segments can be found using the cosine formula:
$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2}
$$
where $a \cdot b$ is the vector dot-product and $\lVert a \rVert_2$ is the L2-norm of vector $a$. This formula arises from considering the dot-product as the projection of vector $a$ onto the direction pointed to by vector $b$. The nature of cosine tells us that, as the angle $\theta$ increases from 0 to 90 degrees, cosine decreases from 1 to 0. Sometimes we call $1-\cos\theta$ the cosine distance because it runs from 0 to 1 as the two vectors move further away from each other. This is an important property that we are going to exploit in the vector space model.
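To make the formula concrete, below is a minimal numpy sketch (our own, with made-up vectors). Note that a vector pointing in the same direction as another has cosine similarity 1, and hence cosine distance 0, regardless of its length:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, but twice as long
c = np.array([3.0, 0.0, -1.0])  # a different direction

def cos_sim(u, v):
    # cosine of the angle between u and v
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(a, b))      # 1.0: parallel vectors make a zero angle
print(1 - cos_sim(a, b))  # cosine distance 0.0
print(1 - cos_sim(a, c))  # 1.0 here: a and c happen to be orthogonal
```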
Using vector space model for similarity
Let’s look at an example of how the vector space model is useful.
World Bank collects various data about countries and regions in the world. While every country is different, we can try to compare countries under the vector space model. For convenience, we will use the `pandas_datareader` module in Python to read data from World Bank. You may install `pandas_datareader` using the `pip` or `conda` command:
```
pip install pandas_datareader
```
The data series collected by World Bank are named by an identifier. For example, “SP.URB.TOTL” is the total urban population of a country. Many of the series are yearly. When we download a series, we have to provide the start and end years. Usually the data are not updated on time. Hence it is best to look at the data a few years back rather than the most recent year, to avoid missing data.
Below, we try to collect some economic data of every country in 2010:
```python
from pandas_datareader import wb
import pandas as pd

pd.options.display.width = 0

names = [
    "NE.EXP.GNFS.CD",   # Exports of goods and services (current US$)
    "NE.IMP.GNFS.CD",   # Imports of goods and services (current US$)
    "NV.AGR.TOTL.CD",   # Agriculture, forestry, and fishing, value added (current US$)
    "NY.GDP.MKTP.CD",   # GDP (current US$)
    "NE.RSB.GNFS.CD",   # External balance on goods and services (current US$)
]
df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index()
countries = wb.get_countries()
non_aggregates = countries[countries["region"] != "Aggregates"].name
df_nonagg = df[df["country"].isin(non_aggregates)].dropna()
print(df_nonagg)
```
```
                 country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD
50               Albania  2010    3.337089e+09    5.792189e+09    2.141580e+09    1.192693e+10   -2.455100e+09
51               Algeria  2010    6.197541e+10    5.065473e+10    1.364852e+10    1.612073e+11    1.132067e+10
54                Angola  2010    5.157282e+10    3.568226e+10    5.179055e+09    8.379950e+10    1.589056e+10
55   Antigua and Barbuda  2010    9.142222e+08    8.415185e+08    1.876296e+07    1.148700e+09    7.270370e+07
56             Argentina  2010    8.020887e+10    6.793793e+10    3.021382e+10    4.236274e+11    1.227093e+10
..                   ...   ...             ...             ...             ...             ...             ...
259        Venezuela, RB  2010    1.121794e+11    6.922736e+10    2.113513e+10    3.931924e+11    4.295202e+10
260              Vietnam  2010    8.347359e+10    9.299467e+10    2.130649e+10    1.159317e+11   -9.521076e+09
262   West Bank and Gaza  2010    1.367300e+09    5.264300e+09    8.716000e+08    9.681500e+09   -3.897000e+09
264               Zambia  2010    7.503513e+09    6.256989e+09    1.909207e+09    2.026556e+10    1.246524e+09
265             Zimbabwe  2010    3.569254e+09    6.440274e+09    1.157187e+09    1.204166e+10   -2.871020e+09

[174 rows x 7 columns]
```
In the above, we obtained some economic metrics of each country in 2010. The function `wb.download()` downloads the data from World Bank and returns a pandas dataframe. Similarly, `wb.get_countries()` gets the names of the countries and regions as identified by World Bank, which we use to filter out the non-country aggregates such as “East Asia” and “World”. Pandas allows filtering rows by boolean indexing: `df["country"].isin(non_aggregates)` gives a boolean vector of which rows are in the list of `non_aggregates`, and based on that, `df[df["country"].isin(non_aggregates)]` selects only those rows. For various reasons, not all countries have all the data. Hence we use `dropna()` to remove those with missing data. In practice, we may want to apply some imputation techniques instead of merely removing them, but as an example, we proceed with the 174 remaining data points.
To better illustrate the idea, rather than hiding the actual manipulation in pandas or numpy functions, we first extract the data for each country as a vector:
```python
...
vectors = {}
for rowid, row in df_nonagg.iterrows():
    vectors[row["country"]] = row[names].values
print(vectors)
```
```
{'Albania': array([3337088824.25553, 5792188899.58985, 2141580308.0144,
        11926928505.5231, -2455100075.33431], dtype=object),
 'Algeria': array([61975405318.205, 50654732073.2396, 13648522571.4516,
        161207310515.42, 11320673244.9655], dtype=object),
 'Angola': array([51572818660.8665, 35682259098.1843, 5179054574.41704,
        83799496611.2004, 15890559562.6822], dtype=object),
...
 'West Bank and Gaza': array([1367300000.0, 5264300000.0, 871600000.0,
        9681500000.0, -3897000000.0], dtype=object),
 'Zambia': array([7503512538.82554, 6256988597.27752, 1909207437.82702,
        20265559483.8548, 1246523941.54802], dtype=object),
 'Zimbabwe': array([3569254400.0, 6440274000.0, 1157186600.0,
        12041655200.0, -2871019600.0], dtype=object)}
```
The Python dictionary we created has the name of each country as a key and the economic metrics as a numpy array. There are 5 metrics, hence each country is represented by a vector of 5 dimensions.
What this gives us is that we can use the vector representation of each country to see how similar it is to another. Let’s try both the L2-norm of the difference (the Euclidean distance) and the cosine distance. We pick one country, such as Australia, and compare it to all other countries on the list based on the selected economic metrics.
```python
...
import numpy as np

euclid = {}
cosine = {}

target = "Australia"
for country in vectors:
    vecA = vectors[target]
    vecB = vectors[country]
    dist = np.linalg.norm(vecA - vecB)
    cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
    euclid[country] = dist     # Euclidean distance
    cosine[country] = 1 - cos  # cosine distance
```
In the for-loop above, we set `vecA` as the vector of the target country (i.e., Australia) and `vecB` as that of the other country. Then we compute the L2-norm of their difference as the Euclidean distance between the two vectors. We also compute the cosine similarity using the formula and subtract it from 1 to get the cosine distance. With more than a hundred countries, we can see which one has the shortest Euclidean distance to Australia:
```python
...
import pandas as pd

df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine})
print(df_distance.sort_values(by="euclid").head())
```
```
                 euclid           cos
Australia  0.000000e+00 -2.220446e-16
Mexico     1.533802e+11  7.949549e-03
Spain      3.411901e+11  3.057903e-03
Turkey     3.798221e+11  3.502849e-03
Indonesia  4.083531e+11  7.417614e-03
```
By sorting the result, we can see that Mexico is the closest to Australia under the Euclidean distance. However, with the cosine distance, it is Colombia that is the closest to Australia.
```python
...
df_distance.sort_values(by="cos").head()
```
To understand why the two distances give different results, we can observe how the three countries’ metrics compare to one another:
```python
...
print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])
```
```
       country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD
59   Australia  2010    2.270501e+11    2.388514e+11    2.518718e+10    1.146138e+12   -1.180129e+10
91    Colombia  2010    4.682683e+10    5.136288e+10    1.812470e+10    2.865631e+11   -4.536047e+09
176     Mexico  2010    3.141423e+11    3.285812e+11    3.405226e+10    1.057801e+12   -1.443887e+10
```
From this table, we see that the metrics of Australia and Mexico are very close to each other in magnitude. However, if you compare the ratios of the metrics within the same country, it is Colombia that matches Australia better. In fact, from the cosine formula, we can see that
$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2} = \frac{a}{\lVert a \rVert_2} \cdot \frac{b}{\lVert b \rVert_2}
$$
which means the cosine of the angle between the two vectors is the dot-product of the corresponding vectors after they have been normalized to length 1. Hence the cosine distance is effectively applying a scaling to the data before computing the distance.
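We can verify this numerically with a short sketch (ours, reusing the `vectors` dictionary and `target` from the code above; the choice of Colombia as the comparison is just for illustration). For unit vectors, the squared Euclidean distance is $2(1-\cos\theta)$, so ranking by Euclidean distance after normalization agrees with ranking by cosine distance:

```python
...
vecA = vectors[target].astype(float)
vecB = vectors["Colombia"].astype(float)

# Normalize both vectors to length 1, then take the dot-product
unitA = vecA / np.linalg.norm(vecA)
unitB = vecB / np.linalg.norm(vecB)
print(unitA @ unitB)                         # same as the cosine similarity
print(np.linalg.norm(unitA - unitB)**2 / 2)  # equals 1 - cos, the cosine distance
```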
Putting these all together, the following is the complete code:
```python
from pandas_datareader import wb
import numpy as np
import pandas as pd

pd.options.display.width = 0

# Download data from World Bank
names = [
    "NE.EXP.GNFS.CD",   # Exports of goods and services (current US$)
    "NE.IMP.GNFS.CD",   # Imports of goods and services (current US$)
    "NV.AGR.TOTL.CD",   # Agriculture, forestry, and fishing, value added (current US$)
    "NY.GDP.MKTP.CD",   # GDP (current US$)
    "NE.RSB.GNFS.CD",   # External balance on goods and services (current US$)
]
df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index()

# We remove aggregates and keep only countries with no missing data
countries = wb.get_countries()
non_aggregates = countries[countries["region"] != "Aggregates"].name
df_nonagg = df[df["country"].isin(non_aggregates)].dropna()

# Extract vector for each country
vectors = {}
for rowid, row in df_nonagg.iterrows():
    vectors[row["country"]] = row[names].values

# Compute the Euclidean and cosine distances
euclid = {}
cosine = {}

target = "Australia"
for country in vectors:
    vecA = vectors[target]
    vecB = vectors[country]
    dist = np.linalg.norm(vecA - vecB)
    cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
    euclid[country] = dist     # Euclidean distance
    cosine[country] = 1 - cos  # cosine distance

# Print the results
df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine})
print("Closest by Euclidean distance:")
print(df_distance.sort_values(by="euclid").head())
print()
print("Closest by Cosine distance:")
print(df_distance.sort_values(by="cos").head())

# Print the detail metrics
print()
print("Detail metrics:")
print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])
```
Common uses of vector space models and cosine distance
Vector space models are common in information retrieval systems. We can present documents (e.g., a paragraph, a long passage, a book, or even a sentence) as vectors. This vector can be as simple as the counts of the words that the document contains (i.e., a bag-of-words model) or a complicated embedding vector (e.g., Doc2Vec). Then a query to find the most relevant document can be answered by ranking all documents by the cosine distance. The cosine distance should be used because we do not want to favor longer or shorter documents, but to focus on what they contain. Hence we leverage the normalization that comes with it to consider how relevant the documents are to the query rather than how many times the query words are mentioned in a document.
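As an illustration, here is a minimal bag-of-words retrieval sketch (ours, with a made-up toy corpus, not part of the original tutorial) that ranks documents against a query by the cosine distance:

```python
import numpy as np

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "stock markets fell sharply today",
}
query = "cat on a mat"

# Build a shared vocabulary and represent each text by its word counts
vocab = sorted(set(" ".join(list(docs.values()) + [query]).split()))
def bow(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

qvec = bow(query)
for name, text in docs.items():
    dvec = bow(text)
    cos = (qvec @ dvec) / (np.linalg.norm(qvec) * np.linalg.norm(dvec))
    print(name, 1 - cos)  # smaller cosine distance means more relevant
```

Here doc1 comes out closest to the query because it shares the most words with it, and the normalization in the cosine formula keeps longer documents from dominating the ranking.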
If we consider each word in a document as a feature and compute the cosine distance, it is the “hard” distance because we do not care about words with similar meanings (e.g., “document” and “passage” have similar meanings but “distance” does not). Embedding vectors such as word2vec would allow us to consider the ontology. Computing the cosine distance with the meaning of words taken into account is the “soft cosine distance”. Libraries such as gensim provide a way to do this.
Another use case of the cosine distance and vector space model is in computer vision. Imagine the task of recognizing hand gestures: we can make certain parts of the hand (e.g., the five fingers) the key points. Then, with the (x,y) coordinates of the key points laid out as a vector, we can compare it with those in our existing database to see which has the closest cosine distance and determine which hand gesture it is. We need the cosine distance because everyone’s hand is a different size. We do not want that to affect our decision on what gesture is being shown.
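A small sketch (ours, with fabricated keypoint coordinates) shows why: a hand 1.5 times larger making the same gesture has essentially zero cosine distance from the template, while its Euclidean distance is large:

```python
import numpy as np

# Hypothetical flattened (x, y) coordinates of gesture keypoints
template = np.array([0.0, 0.0, 1.0, 4.0, 2.0, 5.0, 3.0, 4.5, 4.0, 3.5])
bigger_hand = 1.5 * template  # the same gesture made by a larger hand

cos = (template @ bigger_hand) / (np.linalg.norm(template) * np.linalg.norm(bigger_hand))
print(1 - cos)                                 # ~0: recognized as the same gesture
print(np.linalg.norm(template - bigger_hand))  # large, despite the same gesture
```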
As you may imagine, there are many more examples where this technique can be used.
Further reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Introduction to Linear Algebra, Fifth Edition, 2016.
- Introduction to Information Retrieval, 2008.
Summary
In this tutorial, you discovered the vector space model for measuring the similarities of vectors.
Specifically, you learned:
- How to construct a vector space model
- How to compute the cosine similarity, and hence the cosine distance, between two vectors in the vector space model
- How to interpret the difference between the cosine distance and other distance metrics such as the Euclidean distance
- What the vector space model can be used for