Rahul S

Chunking Strategies in RAG Systems

Chunking is about splitting a text into smaller, more homogeneous units, called chunks, that can be easily processed by a large language model (LLM).

Chunking plays a key role in optimizing semantic responses, as it reduces the complexity of the text and increases the relevance of the retrieved content.

Chunking also makes it easier to embed the content as vectors, since it allows the text to be represented in a high-dimensional space where each chunk has its own position and direction.

This makes it possible to compare and associate chunks with queries, and thus generate targeted and relevant responses.

However, the choice of chunking strategy is not trivial. It depends on several factors, such as

  • the type and amount of data,
  • the nature and complexity of the queries, and
  • the characteristics and performance of the model.

In addition, the choice of chunking type has a significant impact on the final application, in terms of quality, efficiency, and scalability.

Here are my insights and guidelines, considering key factors:

1. Fixed-length Chunking:

Fixed-length chunking is a computationally cheap solution: the text is split into chunks of a fixed size, measured for example in words, characters, or tokens. This strategy has the advantage of being quick and easy, but it also has drawbacks, such as information loss and context fragmentation, since chunk boundaries ignore sentence and paragraph structure.
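
A minimal sketch of character-based fixed-size splitting with overlap; the size and overlap values here are illustrative, not recommendations:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows; `overlap` characters
    are repeated between consecutive chunks to soften boundary cuts."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```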

2. Sentence-level Chunking:

This strategy proves effective when each sentence carries rich meaning and context on its own. By focusing on individual sentences, the model receives coherent, self-contained units and can generate contextually relevant responses.

Although seldom used in RAG, sentence-level chunking (typically implemented by tokenizing on sentence boundaries with an NLP library) is valuable when you need to find specific statements, for example in semantic-similarity search over meeting transcripts. It preserves the unity and completeness of each statement, but it produces chunks of very uneven size.
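
A sketch using NLTK's Punkt sentence tokenizer; any sentence-boundary detector (spaCy, for instance) would do the same job:

```python
import nltk

nltk.download("punkt", quiet=True)  # Punkt sentence-boundary model

def sentence_chunks(text: str) -> list[str]:
    """One chunk per sentence, split on detected sentence boundaries."""
    return nltk.sent_tokenize(text)
```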

3. Paragraph-level Chunking:

We employ this strategy when the input text is organized into distinct sections or paragraphs, each encapsulating a separate idea or topic. This enables focused attention on the relevant information within each paragraph. Identifying paragraph boundaries usually means detecting newline characters or other delimiters that mark the end of a paragraph.

Useful for documents covering diverse aspects of the same topic, paragraph-level chunking helps pinpoint the most relevant part for context provision to the LLM.
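
A sketch of paragraph splitting that treats one or more blank lines as the paragraph delimiter, the most common convention in plain text:

```python
import re

def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines; each non-empty block becomes one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```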

4. Content-aware Chunking:

When aiming for precision in text understanding, content-aware chunking stands out. Particularly valuable in contexts like legal documents, this strategy involves segmenting text based on clauses or sections, enhancing context-specific responses.

We break the text up based on meaning and sentence structure, for example using part-of-speech tagging or syntactic parsing. Because it employs NLP techniques to discern semantic boundaries, this strategy shines in handling structured or semi-structured data.

For instance, in legal documents, extracting warranty or indemnification clauses can be streamlined by combining specific chunks with metadata filtering in a vector database. This proves beneficial when constructing Retrieval Augmented Generation (RAG) use cases. This strategy has the advantage of preserving the sense and coherence of the text, but it requires more computational resources and greater algorithmic complexity.
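
As a sketch of the idea, the splitter below segments a contract on numbered clause headings and keeps each heading as metadata for filtering in a vector database. The heading pattern is hypothetical and would need adapting to real documents:

```python
import re

# Hypothetical pattern for clause headings such as "7. Indemnification".
CLAUSE_HEADING = re.compile(r"(?m)^\d+\.\s+[A-Z][A-Za-z ]+$")

def clause_chunks(text: str) -> list[dict]:
    """One chunk per clause; the heading doubles as filterable metadata."""
    heads = list(CLAUSE_HEADING.finditer(text))
    chunks = []
    for i, h in enumerate(heads):
        end = heads[i + 1].start() if i + 1 < len(heads) else len(text)
        chunks.append({"clause": h.group().strip(),
                       "text": text[h.start():end].strip()})
    return chunks
```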

5. Recursive Chunking:

For a nuanced understanding at various levels, recursive chunking excels through hierarchical breakdown: the text is first divided into paragraphs, then into sentences, and ultimately into words, revealing context from high-level themes down to detailed nuances.

Ideal for intricate documents like academic papers or legal contracts, recursive chunking facilitates flexibility in similarity searches for both broad and specific queries.

However, caution is warranted as similar chunks from the same source may be overrepresented in similarity searches, especially with longer overlap configurations in the text splitter.

This strategy has the advantage of offering greater granularity and variety of text, but it comes with greater complexity in managing and indexing chunks.
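
A hand-rolled sketch of the recursive idea: split on the coarsest separator first (paragraphs) and only recurse to finer separators (sentences, then words) for pieces that are still too long. Library splitters such as LangChain's RecursiveCharacterTextSplitter follow the same pattern:

```python
def recursive_chunks(text: str, max_len: int = 500,
                     seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Try separators coarsest-first; recurse only on oversized pieces.
    Separators themselves are dropped, which is acceptable for a sketch."""
    if len(text) <= max_len or not seps:
        return [text]  # small enough, or nothing finer to split on
    pieces = text.split(seps[0])
    out: list[str] = []
    for p in pieces:
        out.extend(recursive_chunks(p, max_len, seps[1:]))
    return [p for p in out if p.strip()]
```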

CHUNK INDEXING

Chunk indexing is the process of assigning each chunk a unique identifier and a set of attributes, which describe its content, location, and relationship to other chunks.

Chunk indexing is essential for facilitating search and response generation, as it allows you to select the most relevant chunks for a given query.

There are several approaches to chunk indexing, depending on your chunking strategy and the goal of the application.

  1. Detailed indexing: chunk the text into sub-parts, e.g., sentences, and assign each chunk an identifier based on its position in the text plus a feature vector based on its content (sketched below). This approach provides more specific context and greater accuracy, but it requires more memory and processing time.
  2. Question-based indexing: chunk the text by knowledge domain, e.g., topic, and assign each chunk an identifier based on its category plus a feature vector based on its relevance. This approach aligns directly with user requests and is more efficient, but it can result in information loss and lower accuracy.
  3. Chunk summarization: generate a summary for each chunk using extraction or compression techniques, then assign each chunk an identifier based on its summary and a feature vector based on its similarity. This approach offers greater synthesis and variety, but it adds complexity in generating and comparing summaries.
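
A minimal sketch of the first (detailed) approach, assuming a hypothetical `embed(text)` function that returns a vector; any embedding model would slot in here:

```python
from dataclasses import dataclass, field

@dataclass
class IndexedChunk:
    chunk_id: str           # position-based id, e.g. "doc42#s3"
    text: str
    vector: list[float]     # feature vector from the chunk's content
    meta: dict = field(default_factory=dict)

def detailed_index(doc_id: str, chunks: list[str], embed) -> list[IndexedChunk]:
    """Detailed indexing: id from position in the text, vector from content."""
    return [IndexedChunk(f"{doc_id}#s{i}", c, embed(c))
            for i, c in enumerate(chunks)]
```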

AD-HOC EXPERIMENTATION

To boost the effectiveness of chunking in my LLM application, I conduct ad-hoc experimentation with my data before committing to a strategy. Here's my step-by-step guide:

  1. Manually Inspecting Documents: I begin by thoroughly examining the documents I intend to retrieve for a specific query. I identify sections or chunks within the documents that represent the ideal context I want to provide to the LLM.
  2. Identifying Ideal Context: I pinpoint chunks that encapsulate relevant information and context for the given query. I consider the nature of my documents and the specific content that contributes most to meaningful responses.
  3. Experimenting with Chunking Strategies: I test different chunking strategies on the identified sections to gauge their effectiveness. I consider strategies such as content-aware chunking, sentence chunking, or recursive chunking.
  4. Evaluating Relevance: I assess the relevance and coherence of the resulting chunks for the LLM, and I optimize the strategy based on the chunks that yield the most meaningful and contextually relevant information (a rough programmatic check is sketched after this list).
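
To make the evaluation step concrete, here is a rough sketch that scores each candidate strategy by the best cosine similarity between the query and its chunks. `embed` is again a hypothetical text-to-vector function:

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_strategies(query: str, text: str, strategies: dict, embed) -> None:
    """Print, per strategy, how well its best chunk matches the query,
    a crude but useful proxy for retrieval relevance."""
    q = embed(query)
    for name, chunk_fn in strategies.items():
        scores = [cosine(q, embed(c)) for c in chunk_fn(text)]
        print(f"{name}: best-chunk similarity = {max(scores, default=0.0):.3f}")
```

For example, `compare_strategies(query, doc, {"fixed": fixed_size_chunks, "sentence": sentence_chunks}, embed)` compares the fixed-length and sentence-level splitters sketched earlier.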

This iterative and hands-on approach allows me to tailor my chunking strategy to the specific characteristics of my data, ensuring that the LLM receives the most pertinent context for generating accurate and relevant responses.
