LANGCHAIN — Can Anthropic Auto-Evaluate a 100k Context Window?

Technology offers us a unique opportunity, though rarely welcome, to practice patience. — Allan Lokos.

Anthropic recently released a Claude model with a 100k-token context window, which raises the question of whether a document retrieval stage is still necessary for many question-answering (Q+A) or chat use cases. This article auto-evaluates the Anthropic 100k context window model and compares it with other retrieval methods using the LangChain auto-evaluator tool.

Retrieval Architectures

In practice, the retrieval step is necessary because the context window of a large language model (LLM) is limited. Anthropic’s 100k context window model offers a retriever-less option: when a document fits in the window, the whole text can be passed to the model and no separate retrieval step is needed. Here’s a taxonomy of retriever architectures, with the retriever-less option listed last.

Lexical / Statistical

Examples include TF-IDF, Elastic, etc.

Semantic

Pinecone, Chroma, etc.

Semantic with metadata filtering

Pinecone with filtering tools, self-querying, kor, etc.

kNN on document summaries

Llama-Index, etc.

Post-processing

Cohere re-rank, etc.

Retriever-less

Anthropic 100k context window, etc.
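
To make the contrast between the first and last entries concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the tokenizer, the TF-IDF weighting, the prompt template, and the names tfidf_top_k and build_prompt are all assumptions chosen for illustration. The retriever-based path sends only the best-scoring chunk to the model, while the retriever-less path sends every chunk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokenizer; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_top_k(chunks, question, k=2):
    """Rank chunks by a simple TF-IDF score against the question terms."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(question))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        length = len(toks) or 1  # avoid division by zero on empty chunks
        score = sum(
            (tf[t] / length) * math.log((1 + n) / (1 + df[t]))
            for t in query_terms
            if t in tf
        )
        scores.append((score, i))
    # Highest score first; ties keep the original chunk order.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [chunks[i] for _, i in ranked[:k]]


def build_prompt(context_chunks, question):
    """Stuff the chosen context into a single prompt string (hypothetical template)."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = [
    "GPT-3 has 175 billion parameters.",
    "Few-shot learning uses a handful of examples in the prompt.",
    "The model was trained on a filtered Common Crawl dataset.",
]
question = "How many parameters does GPT-3 have?"

# Retriever-based: only the top-k chunks reach the model.
retrieved_prompt = build_prompt(tfidf_top_k(chunks, question, k=1), question)

# Retriever-less: the whole document goes into the large context window.
full_prompt = build_prompt(chunks, question)
```

The retriever-less prompt is simpler to build but grows with the document, which is one plausible reason for the latency difference reported in the results.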

Evaluation Strategy

LangChain has introduced an auto-evaluator, available as both a hosted app and an open-source repository, for grading LLM question-answer chains. This provides a good testing ground for comparing Anthropic 100k for Q+A against other retrieval methods, such as kNN on a VectorDB, SVMs, etc.
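
The general shape of such an evaluation can be sketched as a loop that runs a chain over a test set and records accuracy and latency. This is only an illustration and does not reproduce the auto-evaluator's own grading logic: the grade function here is a hypothetical substring match, and toy_chain is a placeholder for a real chain.

```python
import time


def grade(predicted, expected):
    """Hypothetical grader: case-insensitive substring match (a cheap stand-in)."""
    return expected.lower() in predicted.lower()


def evaluate(chain, test_set):
    """Run a QA chain over (question, answer) pairs; report accuracy and mean latency."""
    correct = 0
    latencies = []
    for question, expected in test_set:
        start = time.perf_counter()
        predicted = chain(question)
        latencies.append(time.perf_counter() - start)
        correct += grade(predicted, expected)
    return {
        "accuracy": correct / len(test_set),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


def toy_chain(question):
    """Placeholder for a real chain (retriever + LLM, or a retriever-less LLM)."""
    canned = {"How many parameters does GPT-3 have?": "It has 175 billion parameters."}
    return canned.get(question, "I don't know.")


test_set = [
    ("How many parameters does GPT-3 have?", "175 billion"),
    ("What dataset was filtered for training?", "Common Crawl"),
]
report = evaluate(toy_chain, test_set)
```

Swapping toy_chain for each candidate approach while keeping the test set fixed gives comparable accuracy and latency figures.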

Results

On a test set of 5 questions for the 75-page GPT-3 paper, the Anthropic 100k model performs as well as kNN (FAISS) + GPT-3.5-Turbo. However, it has higher latency than the other methods. Additionally, testing on a 51-page PDF of building codes for San Francisco showed that Anthropic 100k fell short of the SVM and kNN retrievers in some cases.

Testing for Yourself

The Anthropic 100k model has been deployed in the hosted app, allowing users to benchmark it relative to other approaches. Users can add a document of interest, select the Anthropic-100k retriever, and optionally add their own test set.

Conclusion

The retriever-less architecture shows promising performance on some challenges, but it may have higher latency than retriever-based approaches. For applications where latency is not critical and the corpus is reasonably small, retriever-less approaches have appeal, especially as the context window of LLMs grows and the models get faster.
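
Whether a corpus is "reasonably small" can be checked with a rough token estimate before choosing the retriever-less route. The sketch below rests on assumptions: the 1.3 tokens-per-word ratio is a common rule of thumb rather than a property of any particular tokenizer, and the reserved answer budget and function names are hypothetical.

```python
import math


def estimate_tokens(text, tokens_per_word=1.3):
    """Rough token count from whitespace-separated words (heuristic, not exact)."""
    return math.ceil(len(text.split()) * tokens_per_word)


def fits_in_context(text, context_window=100_000, reserved_for_answer=2_000):
    """True if the estimated prompt size leaves room for the model's answer."""
    return estimate_tokens(text) <= context_window - reserved_for_answer


small_doc = "word " * 10_000    # roughly 13k estimated tokens
large_doc = "word " * 100_000   # roughly 130k estimated tokens

use_retriever_less = fits_in_context(small_doc)
needs_retriever = not fits_in_context(large_doc)
```

For a real decision, an exact count from the model's own tokenizer would replace the heuristic.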
