A Gentle Introduction to Vector Space Models
Last Updated on October 23, 2023
Vector space models are about considering the relationship between data represented by vectors. They are popular in information retrieval systems but also useful for other purposes. Generally, they allow us to compare the similarity of two vectors from a geometric perspective.
In this tutorial, we will see what a vector space model is and what it can do.
After finishing this tutorial, you will know:
- What a vector space model is and the properties of cosine similarity
- How cosine similarity can help you compare two vectors
- What the difference is between cosine similarity and L2 distance
Let’s get started.

A Gentle Introduction to Vector Space Models
Photo by liamfletch, some rights reserved.
Tutorial overview
This tutorial is split into three parts; they are:
- Vector space and cosine formula
- Using vector space model for similarity
- Common uses of vector space models and cosine distance
Vector space and cosine formula
A vector space is a mathematical term that refers to a collection of vectors together with some vector operations. In layman’s terms, we can think of it as an $n$-dimensional metric space in which each point is represented by an $n$-dimensional vector. In this space, we can do any vector addition or scalar-vector multiplication.
It is useful to consider a vector space because it is useful to represent things as vectors. For example in machine learning, we usually have a data point with multiple features. Therefore, it is convenient for us to represent a data point as a vector.
With a vector, we can compute its norm. The most common one is the L2-norm, or the length of the vector. With two vectors in the same vector space, we can find their difference. Assume it is a 3-dimensional vector space and the two vectors are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$. Their difference is the vector $(y_1-x_1, y_2-x_2, y_3-x_3)$, and the L2-norm of the difference is the distance, or more precisely the Euclidean distance, between these two vectors:
$$
\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+(y_3-x_3)^2}
$$
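As a quick check, here is a small numpy sketch (ours, with made-up vectors, not part of the original tutorial) showing that `np.linalg.norm()` of the difference gives the same value as the formula above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Euclidean distance computed directly from the formula
dist_formula = np.sqrt(np.sum((y - x)**2))
# Euclidean distance as the L2-norm of the difference vector
dist_norm = np.linalg.norm(y - x)

print(dist_formula, dist_norm)  # both print 7.0710678...
```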
Besides distance, we can also consider the angle between two vectors. If we consider the vector $(x_1, x_2, x_3)$ as a line segment from the point $(0,0,0)$ to $(x_1,x_2,x_3)$ in the 3D coordinate system, then there is another line segment from $(0,0,0)$ to $(y_1,y_2,y_3)$. They make an angle at their intersection.
The angle between the two line segments can be found using the cosine formula:
$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2}
$$
where $a \cdot b$ is the vector dot-product and $\lVert a \rVert_2$ is the L2-norm of vector $a$. This formula arises from considering the dot-product as the projection of vector $a$ onto the direction pointed to by vector $b$. The nature of cosine tells us that, as the angle $\theta$ increases from 0 to 90 degrees, cosine decreases from 1 to 0. Sometimes we call $1-\cos\theta$ the cosine distance because it runs from 0 to 1 as the two vectors move further away from each other. This is an important property that we are going to exploit in the vector space model.
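To make the formula concrete, below is a minimal numpy sketch (our own, with made-up vectors). Note that a vector pointing in the same direction as another has cosine similarity 1, and hence cosine distance 0, regardless of its length:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, but twice as long
c = np.array([3.0, 0.0, -1.0])  # a different direction

def cos_sim(u, v):
    # cosine of the angle between u and v
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(a, b))      # 1.0: parallel vectors make a zero angle
print(1 - cos_sim(a, b))  # cosine distance 0.0
print(1 - cos_sim(a, c))  # 1.0 here: a and c happen to be orthogonal
```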
Using vector space model for similarity
Let’s look at an example of how the vector space model is useful.
World Bank collects various data about countries and regions in the world. While every country is different, we can try to compare countries under the vector space model. For convenience, we will use the `pandas_datareader` module in Python to read data from World Bank. You may install `pandas_datareader` using the `pip` or `conda` command:
```
pip install pandas_datareader
```
The data series collected by World Bank are named by an identifier. For example, “SP.URB.TOTL” is the total urban population of a country. Many of the series are yearly. When we download a series, we have to provide the start and end years. Usually the data are not updated on time. Hence it is best to look at the data a few years back rather than the most recent year, to avoid missing data.
Below, we try to collect some economic data of every country in 2010:
```python
from pandas_datareader import wb
import pandas as pd

pd.options.display.width = 0

names = [
    "NE.EXP.GNFS.CD",   # Exports of goods and services (current US$)
    "NE.IMP.GNFS.CD",   # Imports of goods and services (current US$)
    "NV.AGR.TOTL.CD",   # Agriculture, forestry, and fishing, value added (current US$)
    "NY.GDP.MKTP.CD",   # GDP (current US$)
    "NE.RSB.GNFS.CD",   # External balance on goods and services (current US$)
]
df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index()
countries = wb.get_countries()
non_aggregates = countries[countries["region"] != "Aggregates"].name
df_nonagg = df[df["country"].isin(non_aggregates)].dropna()
print(df_nonagg)
```
```
                 country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD
50               Albania  2010    3.337089e+09    5.792189e+09    2.141580e+09    1.192693e+10   -2.455100e+09
51               Algeria  2010    6.197541e+10    5.065473e+10    1.364852e+10    1.612073e+11    1.132067e+10
54                Angola  2010    5.157282e+10    3.568226e+10    5.179055e+09    8.379950e+10    1.589056e+10
55   Antigua and Barbuda  2010    9.142222e+08    8.415185e+08    1.876296e+07    1.148700e+09    7.270370e+07
56             Argentina  2010    8.020887e+10    6.793793e+10    3.021382e+10    4.236274e+11    1.227093e+10
..                   ...   ...             ...             ...             ...             ...             ...
259        Venezuela, RB  2010    1.121794e+11    6.922736e+10    2.113513e+10    3.931924e+11    4.295202e+10
260              Vietnam  2010    8.347359e+10    9.299467e+10    2.130649e+10    1.159317e+11   -9.521076e+09
262   West Bank and Gaza  2010    1.367300e+09    5.264300e+09    8.716000e+08    9.681500e+09   -3.897000e+09
264               Zambia  2010    7.503513e+09    6.256989e+09    1.909207e+09    2.026556e+10    1.246524e+09
265             Zimbabwe  2010    3.569254e+09    6.440274e+09    1.157187e+09    1.204166e+10   -2.871020e+09

[174 rows x 7 columns]
```
In the above, we obtained some economic metrics of each country in 2010. The function `wb.download()` downloads the data from World Bank and returns a pandas dataframe. Similarly, `wb.get_countries()` gets the names of the countries and regions as identified by World Bank, which we use to filter out the non-country aggregates such as “East Asia” and “World”. Pandas allows filtering rows by boolean indexing: `df["country"].isin(non_aggregates)` gives a boolean vector of which rows are in the list of `non_aggregates`, and based on that, `df[df["country"].isin(non_aggregates)]` selects only those rows. For various reasons, not all countries have all the data. Hence we use `dropna()` to remove those with missing data. In practice, we may want to apply some imputation techniques instead of merely removing them, but as an example, we proceed with the 174 remaining data points.
To better illustrate the idea, rather than hiding the actual manipulation in pandas or numpy functions, we first extract the data for each country as a vector:
```python
...
vectors = {}
for rowid, row in df_nonagg.iterrows():
    vectors[row["country"]] = row[names].values
print(vectors)
```
```
{'Albania': array([3337088824.25553, 5792188899.58985, 2141580308.0144,
        11926928505.5231, -2455100075.33431], dtype=object),
 'Algeria': array([61975405318.205, 50654732073.2396, 13648522571.4516,
        161207310515.42, 11320673244.9655], dtype=object),
 'Angola': array([51572818660.8665, 35682259098.1843, 5179054574.41704,
        83799496611.2004, 15890559562.6822], dtype=object),
...
 'West Bank and Gaza': array([1367300000.0, 5264300000.0, 871600000.0,
        9681500000.0, -3897000000.0], dtype=object),
 'Zambia': array([7503512538.82554, 6256988597.27752, 1909207437.82702,
        20265559483.8548, 1246523941.54802], dtype=object),
 'Zimbabwe': array([3569254400.0, 6440274000.0, 1157186600.0,
        12041655200.0, -2871019600.0], dtype=object)}
```
The Python dictionary we created has the name of each country as a key and the economic metrics as a numpy array. There are 5 metrics, hence each country is represented by a vector of 5 dimensions.
What this gives us is that we can use the vector representation of each country to see how similar it is to another. Let’s try both the L2-norm of the difference (the Euclidean distance) and the cosine distance. We pick one country, such as Australia, and compare it to all other countries on the list based on the selected economic metrics.
```python
...
import numpy as np

euclid = {}
cosine = {}

target = "Australia"
for country in vectors:
    vecA = vectors[target]
    vecB = vectors[country]
    dist = np.linalg.norm(vecA - vecB)
    cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
    euclid[country] = dist     # Euclidean distance
    cosine[country] = 1 - cos  # cosine distance
```
In the for-loop above, we set `vecA` as the vector of the target country (i.e., Australia) and `vecB` as that of the other country. Then we compute the L2-norm of their difference as the Euclidean distance between the two vectors. We also compute the cosine similarity using the formula and subtract it from 1 to get the cosine distance. With more than a hundred countries, we can see which one has the shortest Euclidean distance to Australia:
```python
...
import pandas as pd

df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine})
print(df_distance.sort_values(by="euclid").head())
```
```
                 euclid           cos
Australia  0.000000e+00 -2.220446e-16
Mexico     1.533802e+11  7.949549e-03
Spain      3.411901e+11  3.057903e-03
Turkey     3.798221e+11  3.502849e-03
Indonesia  4.083531e+11  7.417614e-03
```
By sorting the result, we can see that Mexico is the closest to Australia under the Euclidean distance. However, with the cosine distance, it is Colombia that is the closest to Australia.
```python
...
df_distance.sort_values(by="cos").head()
```
To understand why the two distances give different results, we can observe how the three countries’ metrics compare to one another:
```python
...
print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])
```
```
       country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD
59   Australia  2010    2.270501e+11    2.388514e+11    2.518718e+10    1.146138e+12   -1.180129e+10
91    Colombia  2010    4.682683e+10    5.136288e+10    1.812470e+10    2.865631e+11   -4.536047e+09
176     Mexico  2010    3.141423e+11    3.285812e+11    3.405226e+10    1.057801e+12   -1.443887e+10
```
From this table, we see that the metrics of Australia and Mexico are very close to each other in magnitude. However, if you compare the ratios of the metrics within the same country, it is Colombia that matches Australia better. In fact, from the cosine formula, we can see that
$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \lVert b \rVert_2} = \frac{a}{\lVert a \rVert_2} \cdot \frac{b}{\lVert b \rVert_2}
$$
which means the cosine of the angle between the two vectors is the dot-product of the corresponding vectors after they have been normalized to length 1. Hence the cosine distance is effectively applying a scaling to the data before computing the distance.
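We can verify this numerically with a short sketch (ours, reusing the `vectors` dictionary and `target` from the code above; the choice of Colombia as the comparison is just for illustration). For unit vectors, the squared Euclidean distance is $2(1-\cos\theta)$, so ranking by Euclidean distance after normalization agrees with ranking by cosine distance:

```python
...
vecA = vectors[target].astype(float)
vecB = vectors["Colombia"].astype(float)

# Normalize both vectors to length 1, then take the dot-product
unitA = vecA / np.linalg.norm(vecA)
unitB = vecB / np.linalg.norm(vecB)
print(unitA @ unitB)                         # same as the cosine similarity
print(np.linalg.norm(unitA - unitB)**2 / 2)  # equals 1 - cos, the cosine distance
```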
Putting these all together, the following is the complete code:
```python
from pandas_datareader import wb
import numpy as np
import pandas as pd

pd.options.display.width = 0

# Download data from World Bank
names = [
    "NE.EXP.GNFS.CD",   # Exports of goods and services (current US$)
    "NE.IMP.GNFS.CD",   # Imports of goods and services (current US$)
    "NV.AGR.TOTL.CD",   # Agriculture, forestry, and fishing, value added (current US$)
    "NY.GDP.MKTP.CD",   # GDP (current US$)
    "NE.RSB.GNFS.CD",   # External balance on goods and services (current US$)
]
df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index()

# We remove aggregates and keep only countries with no missing data
countries = wb.get_countries()
non_aggregates = countries[countries["region"] != "Aggregates"].name
df_nonagg = df[df["country"].isin(non_aggregates)].dropna()

# Extract vector for each country
vectors = {}
for rowid, row in df_nonagg.iterrows():
    vectors[row["country"]] = row[names].values

# Compute the Euclidean and cosine distances
euclid = {}
cosine = {}

target = "Australia"
for country in vectors:
    vecA = vectors[target]
    vecB = vectors[country]
    dist = np.linalg.norm(vecA - vecB)
    cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
    euclid[country] = dist     # Euclidean distance
    cosine[country] = 1 - cos  # cosine distance

# Print the results
df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine})
print("Closest by Euclidean distance:")
print(df_distance.sort_values(by="euclid").head())
print()
print("Closest by Cosine distance:")
print(df_distance.sort_values(by="cos").head())

# Print the detail metrics
print()
print("Detail metrics:")
print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])
```
Common uses of vector space models and cosine distance
Vector space models are common in information retrieval systems. We can present documents (e.g., a paragraph, a long passage, a book, or even a sentence) as vectors. This vector can be as simple as the counts of the words that the document contains (i.e., a bag-of-words model) or a complicated embedding vector (e.g., Doc2Vec). Then a query to find the most relevant document can be answered by ranking all documents by the cosine distance. The cosine distance should be used because we do not want to favor longer or shorter documents, but to focus on what they contain. Hence we leverage the normalization that comes with it to consider how relevant the documents are to the query rather than how many times the query words are mentioned in a document.
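As an illustration, here is a minimal bag-of-words retrieval sketch (ours, with a made-up toy corpus, not part of the original tutorial) that ranks documents against a query by the cosine distance:

```python
import numpy as np

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "stock markets fell sharply today",
}
query = "cat on a mat"

# Build a shared vocabulary and represent each text by its word counts
vocab = sorted(set(" ".join(list(docs.values()) + [query]).split()))
def bow(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

qvec = bow(query)
for name, text in docs.items():
    dvec = bow(text)
    cos = (qvec @ dvec) / (np.linalg.norm(qvec) * np.linalg.norm(dvec))
    print(name, 1 - cos)  # smaller cosine distance means more relevant
```

Here doc1 comes out closest to the query because it shares the most words with it, and the normalization in the cosine formula keeps longer documents from dominating the ranking.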
If we consider each word in a document as a feature and compute the cosine distance, it is the “hard” distance because we do not care about words with similar meanings (e.g., “document” and “passage” have similar meanings but “distance” does not). Embedding vectors such as word2vec would allow us to consider the ontology. Computing the cosine distance with the meaning of words taken into account is the “soft cosine distance”. Libraries such as gensim provide a way to do this.
Another use case of the cosine distance and vector space model is in computer vision. Imagine the task of recognizing hand gestures: we can make certain parts of the hand (e.g., the five fingers) the key points. Then, with the (x,y) coordinates of the key points laid out as a vector, we can compare it with those in our existing database to see which has the closest cosine distance and determine which hand gesture it is. We need the cosine distance because everyone’s hand is a different size. We do not want that to affect our decision on what gesture is being shown.
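A small sketch (ours, with fabricated keypoint coordinates) shows why: a hand 1.5 times larger making the same gesture has essentially zero cosine distance from the template, while its Euclidean distance is large:

```python
import numpy as np

# Hypothetical flattened (x, y) coordinates of gesture keypoints
template = np.array([0.0, 0.0, 1.0, 4.0, 2.0, 5.0, 3.0, 4.5, 4.0, 3.5])
bigger_hand = 1.5 * template  # the same gesture made by a larger hand

cos = (template @ bigger_hand) / (np.linalg.norm(template) * np.linalg.norm(bigger_hand))
print(1 - cos)                                 # ~0: recognized as the same gesture
print(np.linalg.norm(template - bigger_hand))  # large, despite the same gesture
```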
As you may imagine, there are many more examples where this technique can be used.
Further reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Introduction to Linear Algebra, Fifth Edition, 2016.
- Introduction to Information Retrieval, 2008.
Summary
In this tutorial, you discovered the vector space model for measuring the similarities of vectors.
Specifically, you learned:
- How to construct a vector space model
- How to compute the cosine similarity, and hence the cosine distance, between two vectors in the vector space model
- How to interpret the difference between the cosine distance and other distance metrics such as the Euclidean distance
- What the vector space model can be used for