Semantic Search ANN
Approximate nearest neighbor search (ANN) makes semantic search possible and blazing fast on billions of documents. ANN is also called approximate k-nearest neighbor search (approximate KNN). ANN speeds up the search speed while sacrificing some accuracy (still good results). As it can enable the search on billions of documents within milliseconds, ANN are commonly used in industrial retrieval and recommendation applications. In addition, more and more companies (i.e. AWS Open Distro, ElasticSearch, Google Vertex, Milvus, Pinecone, Weaviate) are providing easy to deploy frameworks or managed services so that small/medium companies can also apply ANN based semantic search to their use cases.
Exact KNN (use KNN directly) does exhaustive search on all data points one by one and returns the accurate K nearest neighbors. For small corpus with less than 100K documents, KNN can work well under milliseconds. For big corpus with millions of documents, KNN causes prohibitive latency for AI production systems. There are two keys steps on KNN search:
- Given a query vector, KNN needs to compare all vectors.
- For each pair comparison, KNN needs to calculate the distance (i.e. L2 Squared, cosine similarity) between two pairs.
ANN algorithms can be categorized into two types based on how to optimize the above two steps (the other categorization can be based on data structure, like tree-base, hash-based, graph based and vector quantization-based):
- Reduce search space. Instead of comparing all vectors, ANN selects a subset of vectors to compare against the query vector. HNSW organized all data vectors into a graph, with node being specific vector and edge being the distance between two vectors. In the graph construction and search phase, HNSW uses beam search to traverse the graph accordingly to insert the vector or return the nearest vectors. Inverted file (IVF) partitions all data vectors into different partitions. Each partition is represented as a centroid. In the query time, IVF firstly finds a list of partitions (small number) most closest to the query vector and then only explores vectors within those partitions. Annoy builds a binary tree via partitioning spaces recursively to enable a sub-linear time search.
- Avoid calculating the distance between two pairs. This can be achieved by caching the distance, which can be computed in the index time (not in the serve time). After LSH hashes vectors into different buckets, different bucket distances can be pre-computed and cached. In the serving time, we can map the query vector to a specific bucket and use the cached results to find the closest buckets. This way, it helps reduce many distance calculations.
The accuracy is normally defined as the recall. Let us say we use 10-recall. The basic idea is that we can use KNN to find the best 10 results. Then we use ANN to find the top 10. We count the intersection between the ground truth result and the returned results from ANN. We can try this for 1000 times and average the accuracy, and use that as the final results of different algorithms. This way, we can compare performances of different algorithms. There is also an open source project called ann-benchmarks to benchmark a variety of algorithms on a number of precomputed data sets. The results can help figure out which algorithm performs best. Currently, the most used ANN algorithm is HNSW, which provides a reliable algorithm and also good results for this. For now, it has been supported by a variety of open source libraries (i.e. Faiss, HNSWLib).
