Unlocking Machine Understanding of Language
A Deep Dive into Word Embeddings and Vector Databases: Revolutionizing Textual Search
Explore how word embeddings capture language nuances and how vector search libraries such as FAISS turbocharge semantic search.
Introduction
Language is a complex beast, and for decades, the challenge of teaching machines to “understand” and process language seemed insurmountable. However, with the advent of word embeddings, we’ve taken a significant leap towards bridging the gap between human linguistics and machine understanding.
Now, let’s get started!
What Exactly Are Word Embeddings?
At their core, word embeddings are mathematical representations of words. Instead of viewing words as mere labels, word embeddings see them as points in a high-dimensional space. Each word is mapped to a vector in this space, with the position of the vector capturing the word’s meaning.
Picture a basic 2D graph plotting animals based on their lifespan and weight. In such a representation, an elephant (long lifespan, heavyweight) would be far from a hummingbird (short lifespan, lightweight). This is a two-dimensional representation.
However, language has many more nuances. To capture the essence of words, we need more than just two dimensions. Enter high-dimensional spaces, often featuring 300 dimensions or more. In these spaces, the geometric distance between vectors encapsulates semantic meaning. Words with similar meanings are close together, while unrelated words are farther apart.
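To make "distance encodes meaning" concrete, here is a minimal sketch in Python. The vectors are made up for illustration; real embeddings come from trained models and have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: near 1 = similar direction/meaning.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have 300+ dimensions).
cat   = np.array([0.8, 0.6, 0.1, 0.0])
dog   = np.array([0.7, 0.7, 0.2, 0.1])
stone = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, dog))    # high: related meanings
print(cosine_similarity(cat, stone))  # low: unrelated meanings
```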
Illustrative Example: Understanding Context
Consider the words “bank”, “money”, and “river”. Treated as a mere label, “bank” might sit equidistant from both “money” and “river”, since it can mean a financial institution or the side of a river. Word embeddings resolve this by learning from the contexts in which words appear across large datasets.
With classic static embeddings such as Word2Vec, “bank” gets a single vector that blends both senses. With contextual embeddings such as BERT, “bank” receives a different vector in each sentence: skewed towards “money” in a financial context and towards “river” in a geographical one. This nuance is what makes word embeddings so powerful.
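Here is a hedged sketch of that behaviour using Hugging Face's transformers library; the model choice (bert-base-uncased) and the two sentences are my own illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word: str, sentence: str) -> torch.Tensor:
    # Return the contextual vector BERT assigns to `word` inside `sentence`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

bank_money = embedding_of("bank", "i put money in the bank")
bank_river = embedding_of("bank", "we sat on the bank of the river")

# The two "bank" vectors differ because their surrounding contexts differ.
print(torch.cosine_similarity(bank_money, bank_river, dim=0).item())
```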
How Are These Vectors Born?
Models like Word2Vec or GloVe churn through mountains of text data. Word2Vec, for instance, is trained either to predict the surrounding words given a target word (skip-gram) or to predict a target word from its context (CBOW). Through such training, these models learn which words are contextually related.
For instance, the words “coffee”, “tea”, and “juice” might frequently appear around the word “drink”. Consequently, their vectors would end up close in the embedding space, denoting their thematic similarity.
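As a sketch, here is how that training might look with the gensim library on a toy corpus; the corpus and hyperparameters are illustrative, and real training data is vastly larger:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real models train on millions of sentences.
sentences = [
    ["i", "drink", "coffee", "every", "morning"],
    ["she", "likes", "to", "drink", "tea"],
    ["he", "ordered", "a", "juice", "to", "drink"],
    ["the", "car", "needs", "new", "tires"],
]

# sg=1 selects the skip-gram objective: predict surrounding words from a target word.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Words that share contexts ("drink") drift towards each other in the space.
print(model.wv.most_similar("coffee", topn=3))
```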
Relationships and Analogies: More Than Just Similarity
A captivating property of word embeddings is their ability to capture relationships. A famous demonstration is: king − man + woman ≈ queen. This isn’t mere arithmetic: it indicates that the vector offset from “man” to “king” mirrors the offset from “woman” to “queen”. Similar relationships hold for other word pairs, such as Paris to France being analogous to Tokyo to Japan.
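You can reproduce these analogies with pre-trained vectors. The sketch below uses gensim's downloader; glove-wiki-gigaword-50 is one of several available pre-trained sets, chosen here only because it is small:

```python
import gensim.downloader as api

# Downloads a small set of pre-trained GloVe vectors on first run.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris is to france as tokyo is to ?
print(vectors.most_similar(positive=["france", "tokyo"], negative=["paris"], topn=1))
```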
Despite the emergence of newer vector databases such as Chroma DB and Pinecone, the longer-established FAISS library still holds its ground, with a mature and well-documented approach to similarity search. In this article, we’ll focus on FAISS. Let’s dive in.
FAISS: Supercharging Vector Search with Efficiency
While understanding and generating word embeddings is one part of the equation, searching through these high-dimensional vectors to find the most similar ones can be computationally intensive. This is where libraries like FAISS (Facebook AI Similarity Search) come into play.
What is FAISS?
FAISS is an open-source library developed by Facebook AI, designed explicitly for efficient similarity search in dense vector spaces. Traditional databases are not optimized for searching through millions of high-dimensional vectors quickly. FAISS, on the other hand, is built for this very purpose.
How Does FAISS Leverage Embeddings?
- Indexing: The first step is to create an index of all the word vectors you have. Just as a book index helps you find topics faster, FAISS indexes make vector search swift. For example, suppose we have three word vectors: apple=[1.2, 0.5], orange=[0.9, 0.7], and banana=[0.8, 1.3]. Indexing them means placing them in a structure where they can be accessed quickly, with each vector given an ID (0, 1, 2) for easy retrieval.
- Quantization: One of the key techniques FAISS uses is vector quantization, which compresses vectors so they can be stored and searched more efficiently without significant loss of accuracy. For example, each component of a 300-dimensional vector might be stored with fewer bits, or the whole vector compressed into a short code (as in product quantization), while preserving most of the information.
- Approximate Nearest Neighbour Search: Instead of performing exact searches, which can be slow, FAISS looks for approximate nearest neighbours. While the results might not be the absolute closest vectors in the space, they’re close enough for most practical applications and are retrieved much faster. For example, say you have a new embedding for the word “kitten” and want to find semantically similar words. Instead of computing the exact distance between “kitten” and every other word in your collection (which is computationally expensive), FAISS retrieves a list of words that are approximately closest to “kitten”, such as “cat”, “pet”, and “puppy”. A sketch combining all three ideas follows this list.
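Here is a minimal, self-contained sketch of these three ideas with the faiss library. Random vectors stand in for real embeddings, and parameters such as nlist=100 and nprobe=10 are illustrative choices, not recommendations:

```python
import faiss
import numpy as np

d = 128                                               # vector dimensionality
xb = np.random.random((10_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

# Indexing (exact baseline): a flat index stores every vector as-is.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# Quantization + approximate search: IVF-PQ clusters the space into `nlist`
# cells and compresses each vector into m sub-quantizer codes of 8 bits each.
nlist, m = 100, 8                                     # d must be divisible by m
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb)       # learn the coarse quantizer and the product quantizer
index.add(xb)
index.nprobe = 10     # how many cells to visit per query (speed/accuracy knob)

distances, ids = index.search(xq, 5)  # 5 approximate nearest neighbours each
print(ids[0])                         # IDs of the closest database vectors
```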
Embeddings Meet FAISS: An Example
Imagine you have a database of millions of sentences, each converted into a 300-dimensional vector using an embedding model such as BERT or GPT. Now, you want to find the sentences most semantically similar to the query “How to bake a cake?”
Without an efficient search mechanism, you’d have to compare the query vector with every single vector in your database, a time-consuming process. With FAISS, the library quickly scans through its index, leveraging quantization and approximate search, and returns a list of sentences that are semantically close to your query in a fraction of the time. Here is what happens behind the scenes (a runnable sketch follows the steps):
1. Convert “How to bake a cake?” into a 300-dimensional vector using the same embedding model.
2. Use FAISS to search for similar vectors:
   - Coarse quantizer: FAISS first identifies a few regions where the query vector is likely to find its neighbours. This is a form of filtering: instead of searching the entire space, only a few promising regions are considered.
   - Refinement: Within these identified regions, FAISS uses the quantized vectors to compute approximate distances between the query vector and the vectors they contain.
   - FAISS then returns the vectors with the smallest approximate distances.
3. Result: You get a list of sentences semantically close to “How to bake a cake?” without comparing the query vector against every single vector in the database.
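Here is a minimal end-to-end sketch of that pipeline, assuming the sentence-transformers library; the model all-MiniLM-L6-v2 (which produces 384-dimensional vectors rather than 300) and the sample sentences are illustrative choices. With only three sentences an exact flat index suffices; at millions of vectors you would swap in the IVF-PQ index shown earlier:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Preheat the oven and mix flour, sugar, and eggs for the cake batter.",
    "The stock market fell sharply on Monday morning.",
    "Pour the batter into a tin and bake it for thirty minutes.",
]

# Normalized vectors let an inner-product index behave like cosine similarity.
xb = model.encode(sentences, normalize_embeddings=True)
index = faiss.IndexFlatIP(xb.shape[1])
index.add(xb)

query = model.encode(["How to bake a cake?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {sentences[i]}")  # baking sentences rank above finance
```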
The beauty of FAISS is in the balance it strikes between accuracy and efficiency. By using quantization and ANN, it sacrifices a tiny bit of accuracy for a massive gain in speed, which is often an acceptable trade-off in real-world applications.
Conclusions
Leveraging the deep semantic understanding of word embeddings and the lightning-fast similarity search of FAISS, today’s NLP landscape is primed for next-generation semantic search engines and applications. The same machinery also shapes how large language models (LLMs) are used in practice, for example by retrieving relevant context to feed them at query time. As textual data grows, tools like these shift from luxuries to essentials, translating vast swathes of information into actionable insights. With word embeddings steering innovations from chatbots to sentiment analyzers, we stand at the threshold of an era where machine language comprehension approaches the nuance of human linguistic ability.
Thank you for reading my post, and I hope it was useful for you. If you enjoyed the article and would like to show your support, please consider taking the following actions:
- 📚 If you found value in my articles and would like to support my work, consider buying me a book: Buy me a book
- 👏 Show your support by giving the article a clap, enhancing its visibility.
- 📖 Stay updated with my latest pieces: Follow Now.
- 🔔 Don’t miss out on my new posts. Subscribe to the newsletter.
- 🛎 For more regular updates, connect with me on LinkedIn.