Fabio Matricardi

Summary

The article discusses the development and benefits of using a combination of open-source, CPU-efficient language models for exploratory document analysis, emphasizing the effectiveness of smaller, specialized models over a single large language model for various natural language processing tasks.

Abstract

The article delves into the concept of exploratory document analysis using a suite of open-source language models, challenging the assumption that a single large language model (LLM) is the best solution for all NLP tasks. It highlights the practicality of leveraging smaller, task-specific models to enhance performance and efficiency. The author demonstrates how a CPU-only setup can quickly process documents to generate summaries, rewritten versions, and relevant questions using models like ZephyrGPT4All and LaMini-Flan-T5. The article emphasizes the importance of understanding the inner workings of models like ChatGPT and the symphony of smaller models that contribute to its capabilities. It also explores the advantages of models like LaMini-Flan-T5, which outperform larger models in certain tasks, and the use of Rich and tqdm for visually appealing and efficient code execution. The author advocates for the strategic use of summaries in Retrieval-Augmented Generation (RAG) systems to optimize retrieval and generation processes, and concludes by discussing the potential and challenges of building one's own AI, including the need for multi-language support and the limitations of context window sizes.

Opinions

  • The author believes that relying on a single LLM for all tasks is based on a potentially incorrect assumption and that a combination of smaller, specialized models is more efficient and effective.
  • There is an emphasis on the importance of pre-processing documents with specialized models to improve the quality of AI outputs, as highlighted in the "Garbage-In/Garbage-Out" principle.
  • The author is optimistic about the capabilities of open-source models like ZephyrGPT4All and LaMini-Flan-T5, particularly their performance on CPU-only machines.
  • The article suggests that the T5 family of models, with their encoder-decoder architecture, offers advantages over decoder-only models in certain scenarios, especially when the number of parameters is limited.
  • The author promotes the idea of using generated summaries for more efficient retrieval in RAG systems, which can significantly improve the performance of AI in document analysis and question-answering tasks.
  • There is a recognition of the challenges in multi-language support and the limitations of context window sizes in current language models.
  • The author encourages the AI community to experiment and build their own AI systems, highlighting the resources and tools available, including the use of GitHub repositories and the importance of subscribing to platforms like Medium for access to high-quality content and discussions.

Exploratory Document Analysis… is this a thing?

How to leverage a lightning-fast RAG pre-processing LLM while building your own AI. All open-source, all for free.

Generated with Leonardo.AI

Why are we obsessed with one LLM to rule them all?

Maybe because life is much easier when we know we can offload every task we can think of to one single Large Language Model.

But what if this concept is based on a wrong assumption from the beginning?

In this article we are going to explore the realistic architecture of AI models, with special attention to Retrieval-Augmented Generation (RAG) technology.

A peek at the kind of performance we can achieve

Only 15 seconds on CPU to get everything

For simplicity I did everything on my local computer, using only CPU resources (someone called me thePoorGPUguy…): as you can see, in 15 seconds the LLM pre-processed a good-quality document, creating a summary and a rewritten version of it. It then extracted a few relevant questions on its own and used RAG to answer them, relying only on the summary of the article.

How the Big Shots work

It is wrong to think that ChatGPT, or even Cohere or Claude, can magically do everything. Think about it: a pre-trained model cannot easily be extended to also produce visuals or multimedia outputs.

You should, in fact, think of ChatGPT and the other big actors as a symphony of smaller models working together. The fact that you can choose a cheaper model for your generation should be clue enough (ada, davinci and so on…).

image created by the author

Generally speaking, ChatGPT comprises two main models: the “InstructGPT” model and the “ChatGPT” model. The InstructGPT model is specifically designed for following instructions and providing detailed responses, while the ChatGPT model focuses on generating conversational responses in a chat-like format. These models work together to facilitate interactive and engaging conversations with users.

In addition, we have all the auxiliary models. From the official OpenAI page we can see that there is more than one embedding model:

official page of OpenAI: text-and-code embeddings

Current OpenAI models

OpenAI’s models are diverse, each designed to perform specific tasks. Here’s a brief overview of some of the models:

  1. GPT-4: This model is an improvement on the GPT-3.5 series, capable of understanding and generating both natural language and code.
  2. GPT-3.5: This model series also understands and generates natural language or code. The most capable and cost-effective model in this family is the gpt-3.5-turbo, optimized for chat but also effective for traditional completions tasks.
  3. DALL·E: This model can generate and edit images based on a natural language prompt, offering a unique blend of visual creativity and language understanding.
  4. Whisper: Whisper is a speech recognition model that can convert audio into text. It’s trained on a diverse dataset, enabling it to perform multilingual speech recognition, speech translation, and language identification.
  5. Embeddings: These models convert text into a numerical form, useful for tasks like search, clustering, recommendations, anomaly detection, and classification (see the short sketch after this list).
  6. Moderation: This model is fine-tuned to detect whether text may be sensitive or unsafe, helping to maintain a safe and respectful environment.
  7. GPT-3: This model series can understand and generate natural language. Despite being superseded by the more powerful GPT-3.5 models, the original GPT-3 base models are still available for fine-tuning.
  8. CLIP: Connecting text and images. A neural network that efficiently learns visual concepts from natural language supervision.
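As a tiny, hands-on illustration of point 5, here is what an embedding model does in practice. The sketch below uses an open-source sentence-transformers model as a stand-in (my choice, not an OpenAI endpoint): text goes in, vectors come out, and semantic similarity becomes a simple geometric comparison.

from sentence_transformers import SentenceTransformer, util

# Open-source stand-in for an embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
  "How do I split a long document into token chunks?",
  "Token-based text splitting for long documents",
  "Best pizza toppings for a Friday night",
]
vectors = embedder.encode(sentences, convert_to_tensor=True)

# Cosine similarity: the two related sentences score much higher
# than the unrelated one, which is what makes search and clustering work
scores = util.cos_sim(vectors[0], vectors[1:])
print(scores)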

Learn how to start to build your own AI with this free eBook

Can we do the same with Open Source Models?

The answer is absolutely YES! And I think we MUST do it, not as an option.

Regardless of the computational resources you may have, it is not efficient to use a single model to perform all the NLP tasks. It is also not that wise.

Some LLMs have been specifically trained for summarization, for Q&A, for entity recognition, sentiment analysis and so on. Why don’t we use them in synergy with our preferred open-source GPT model?

I am indeed building a sort of experiment called ZephyrGPT4All. The concept behind it is simple: take the idea of GPT4All (which uses CPU-quantized models) and make it available for RAG question answering. The base model for text generation is Zephyr-7b, a newcomer derived from Mistral-7b with remarkable multi-language capabilities.

But for the Document pre-processing I decided to use a Flan/T5 family model.

Exploratory Document Analysis

This terminology is probably used only by me (inherited from Exploratory Data Analysis in Machine Learning…).

This concept is simple: we clean the documents (refer to my previously published article Garbage-In/Garbage-Out…) and then we use a slim and highly skilled NLP model to pre-process the document itself.

If you looked at the initial sneak peek, it took no more than 15 seconds, on my CPU-only computer, to produce the summary of an article (around 2,300 words), to rewrite it in a simpler tone, and to generate a few relevant questions from the entire text. After that, I used a RAG prompt to answer those very same LLM-generated questions.

I ran the same process with the GPTQ-quantized version of Zephyr-7b: the entire process took almost 2 minutes. Imagine using the GGUF/GGML version of the model: we could have waited more than 30 minutes.

Prototyping on Google Colab

As I usually do with my tests, I first gave it a try on the free tier of Google Colab notebooks. Only after getting clear results do I apply the concept on my local machine.

The Notebook is available on my GitHub repository

I previously mentioned the T5 family. These models follow a different architecture from the Generative Pre-Trained (decoder-only) models: the T5 models are, in fact, encoder-decoder models.
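To make the difference concrete, here is a minimal sketch in transformers terms (gpt2 is used below purely as a small decoder-only stand-in; it is not a model discussed in this article):

from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer

# Encoder-decoder (T5 family): the encoder reads the full input,
# the decoder generates the output conditioned on it
t5_tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-77M")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("MBZUAI/LaMini-Flan-T5-77M")

inputs = t5_tokenizer("Summarize: the cat sat on the mat all day long.", return_tensors="pt")
outputs = t5_model.generate(**inputs, max_new_tokens=30)
print(t5_tokenizer.decode(outputs[0], skip_special_tokens=True))

# Decoder-only (GPT family): a single stack predicts the next token from the prompt
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_model = AutoModelForCausalLM.from_pretrained("gpt2")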

I initially tested the document pre-processing using the biggest model that the CPU-only Google Colab hardware could handle: the first candidate was declare-lab/flan-alpaca-large. This model has 770 million parameters and its download size is around 3.2 gigabytes.

Since the results were good, I started to wonder: what happens if I go with a smaller model? So I went back to my beloved LaMini models.

The tests run by the creators of the models show amazing results: the encoder-decoder LaMini language models (the LaMini-T5 and LaMini-Flan-T5 series) outperform the decoder-only LaMini language models (the LaMini-GPT series) when the number of parameters is limited (fewer than 500M parameters).

LaMini-Flan-T5–248M even outperforms LLaMa-7B on downstream NLP tasks. When the model size is higher, LaMini-Flan-T5 is comparable to LaMini-GPT. Yet, both LaMini-Flan-T5 and LaMini-T5 demonstrate strong human evaluation results for user-oriented instructions, despite their relatively small size. From this plot you can immediately see that in terms of performance the LaMini-Flan-T5 family is producing amazing results even below 500 Million parameters.

The performance comparison between encoder-decoder models and decoder-only models of LaMini-LM family on the downstream NLP tasks. The horizontal dash lines indicate the average performance given by Alpaca-7B and LLaMa-7B. — source https://mbzuai-nlp.github.io/LaMini-LM/

I repeated the same tests (which you can find in the Colab notebook in my GitHub repo) with the 248-million-parameter LaMini-Flan-T5 (900 MB on disk) and finally also with the 77-million-parameter LaMini-Flan-T5 (only 308 MB on disk!) and guess what…?

The 77-million-parameter model performs really well, with lightning-fast inference speed.

Code results do not lie…

Let’s have a look at the code and the outputs. The required dependencies are very few:

pip install transformers
pip install langchain
pip install tiktoken
pip install rich

We need the transformers library to interact with the Hugging Face models; langchain is used here only for the document loaders and splitters; tiktoken is required since we are going to split the documents by tokens; and finally rich is a terminal text-formatting library that adds some visual appeal to the text generation output.

The text generation pipeline is really simple:

## Test MBZUAI/LaMini-Flan-T5-77M
# to verify whether a smaller model gives faster inference
# while keeping good output quality

from transformers import pipeline

model77 = pipeline(model="MBZUAI/LaMini-Flan-T5-77M")

And that is all! When we call the model77 pipeline, we include the prompt and the arguments that fine-tune the generation process. Here is an example:

text = """Some long text you can copy paste from an article..."""
template = f'''ARTICLE: {text}

What is a one-paragraph summary of the above article?

'''
res = model77(template, temperature=0.3, 
              repetition_penalty=1.3, max_length=400, 
              do_sample=True)[0]['generated_text']
print(res)

The text variable contains the long text you want to summarize (remember that these models have a context window of only 512 tokens…); we use a Python f-string to build a prompt template where we inject the long text to be summarized; finally we call the pipeline with a few arguments:

  • temperature: 0 to 1, expresses how creative (1) or factual (0) you want the LLM to be
  • repetition_penalty: a factor that discourages the LLM from repeating words. It is extremely important, especially with small-parameter models
  • max_length: specifies the maximum length of the generated text

pipeline run with a text of 1571 characters

The approach I used to overcome the context window limitation is to build the summary chunk by chunk and stuff the partial summaries together:

  • I split the text into chunks of 460 tokens (450 of text and 10 overlap from the previous chunk)
  • run the summary with the prompt injection for the first chunk
  • reiterate the process for all the chunks
  • join all the results into one single string

Here is the code to load the text file and split it into token chunks:

# Load the TXT file
with open('/content/EN_Vector Search Is Not All You Need by Anthony Alc.txt') as f:
  fulltext = f.read()

# Split into chunks by tokens (450 tokens each, 10 tokens of overlap)
from langchain.text_splitter import TokenTextSplitter
TOKENtext_splitter = TokenTextSplitter(chunk_size=450, chunk_overlap=10)
sum_context = TOKENtext_splitter.split_text(fulltext)

Then we iterate over all the chunks with a for loop and join the summaries:

import datetime

from rich.console import Console
from rich.markdown import Markdown
from tqdm.rich import trange  # tqdm's experimental Rich progress bar

console = Console()

final = ''
strt = datetime.datetime.now()
for i in trange(0, len(sum_context)):
  text = sum_context[i]
  template_bullets = f'''ARTICLE: {text}

  What is a one-paragraph summary of the above article?

  '''
  start = datetime.datetime.now()
  res = model77(template_bullets, temperature=0.3,
                repetition_penalty=1.3, max_length=400,
                do_sample=True)[0]['generated_text']
  final = final + ' ' + res
  delta = datetime.datetime.now() - start
  console.print(f"[green2]Completed in {delta}")

delt = datetime.datetime.now() - strt
console.print(Markdown("# FULL SUMMARY"))
console.print(Markdown(final))
console.print(f"[red1 bold]Full summary Completed in {delt}")
console.print(Markdown("---"))

tqdm comes with a Rich integration, and this is really convenient: we can monitor the progress of the for loop with a progress bar and statistics.

Don’t freak out if you don’t see 100%: the tqdm integration is still in beta and has a bug. Rest assured that all 5 chunks have been processed!

The Next step: recommendation system with questions

My idea of Exploratory Document Analysis consists of 2 parts:

  1. We extract the summary of the text because it is a powerful context for Retrieval-Augmented Generation
  2. We provide the main topics of the document with suggested questions: it is always amazing how a chatbot suggests further questions after a generation… we are going to do the same as soon as we load a document.

I did not invent anything: to find a good prompt for the Flan-T5 family to extract questions, I followed the instructions from the GitHub repo for the Flan paper.

template_final = f'''{text}.\nAsk few question about this article.
'''

In that repository you will find all the explanations and examples of the questions the model has been trained with. Really convenient 😁. You can also refer to this article for a more in-depth understanding.

# Generate Suggested questions from the text
# Then Reply to the questions

# Save the stuffed summary built above before "final" is reused
# for the generated questions: it will be the only RAG context later
summary_text = final

final = ''
strt = datetime.datetime.now()
for i in trange(0, len(sum_context)):
  text = sum_context[i]
  template_final = f'''{text}.\nAsk few question about this article.
'''
  res = model77(template_final, temperature=0.3,
                repetition_penalty=1.3, max_length=400,
                do_sample=True)[0]['generated_text']
  final = final + '\n ' + res

delt = datetime.datetime.now() - strt
console.print(Markdown("---"))
console.print(f"[red1 bold]Questions generated in {delt}")

lst = final.split('\n')
final_lst = []
for items in lst:
  if items == '':
    pass
  else:
    final_lst.append(items)

We iterate with a for loop through the chunks, and for every chunk we extract one question.

To automate the next question-and-answering step, we take every generated question and create a list, removing any empty lines from it as well.

The last step is amazing: we are going to reply to all the questions from all the chunks using ONLY the Summary generated by LaMini-77M.

for question in final_lst:
  # The context is ONLY the stuffed summary saved earlier, not the full document
  template_qna = f'''Read this and answer the question.
  If the question is unanswerable, say "unanswerable".
  \n\n{summary_text}\n\n{question}
  '''

  start = datetime.datetime.now()
  res = model77(template_qna, temperature=0.3,
                repetition_penalty=1.3, max_length=400,
                do_sample=True)[0]['generated_text']
  elaps = datetime.datetime.now() - start
  console.print(f"[bold deep_sky_blue1]{question}")
  console.print(Markdown(res))
  console.print(f"[red1 bold]Qna Completed in {elaps}")
  console.print(Markdown("---"))

The rich library provides a cool visual touch to the print function. In a few cases the inference time for the answers is even less than a second!

Results of the QnA without using any retrieval vector DB, only the summarization

Why the Summary is Super Important

When building a Retrieval-Augmented Generation (RAG) system, how you chunk up the context data is crucial. Frameworks often handle the chunking process for you, but it’s important to understand and experiment with it yourself. Chunk size can make a significant difference in performance. It’s worth exploring what works best for your specific application.

In general, smaller chunks tend to improve retrieval, but they may limit the amount of surrounding context available for generation. There are various ways to approach chunking, so it’s essential to be thoughtful and not do it blindly.
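A quick way to get a feel for this tradeoff is simply to count how many chunks different sizes produce on your own document, reusing the same TokenTextSplitter as before (the sizes below are arbitrary examples, not a recommendation):

from langchain.text_splitter import TokenTextSplitter

# fulltext is the cleaned document text loaded earlier
for size in (200, 450, 900):
  splitter = TokenTextSplitter(chunk_size=size, chunk_overlap=10)
  chunks = splitter.split_text(fulltext)
  print(f"chunk_size={size}: {len(chunks)} chunks")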

There are various ways to enhance the process of combining documents for context in your RAG system. It can be as straightforward as manually merging multiple documents that cover the same topic. However, you can take it a step further to make it more sophisticated and efficient.

To overcome this problem, Summaries are the key!

One creative approach that has caught my attention involves leveraging the capabilities of the Language Model (LLM) itself. In this approach, the LLM is used to generate summaries for each document included in the context. These summaries capture the key information and essence of the original documents in a concise manner (refer to the really well-written analysis by Anthony Alcaraz, referenced article here).

During the retrieval step, instead of directly searching through the entire content of each document, the system can initially perform a search over these generated summaries. This enables a faster and more focused retrieval process, as the system can quickly identify the most relevant documents based on the summary-level information.

The system then dives into the details of the selected documents only when necessary, optimizing the retrieval process and reducing computational overhead.
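Here is a minimal sketch of this summary-first retrieval idea, assuming one generated summary per document and using sentence-transformers for the similarity search (my choice of tooling; the article does not prescribe a specific vector store):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical corpus: one LLM-generated summary per source document
doc_summaries = {
  "doc_a.txt": "Token-based chunking strategies for RAG pipelines and their tradeoffs.",
  "doc_b.txt": "A review of CPU-friendly quantized language models for local inference.",
}
names = list(doc_summaries.keys())
summary_vectors = embedder.encode([doc_summaries[n] for n in names], convert_to_tensor=True)

def retrieve_by_summary(question, top_k=1):
  # Search the summaries first; only the selected documents get loaded,
  # chunked and passed to the generation step
  question_vector = embedder.encode(question, convert_to_tensor=True)
  scores = util.cos_sim(question_vector, summary_vectors)[0]
  best = scores.topk(k=min(top_k, len(names)))
  return [names[i] for i in best.indices.tolist()]

print(retrieve_by_summary("Which chunk size should I use for retrieval?"))

Only after this cheap summary-level pass do we pay the cost of chunking and embedding the full text of the selected documents.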

It is also a good idea to include the summary in the retrieved document collection and then use a re-ranking strategy to order the documents according to relevance.
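And a sketch of that re-ranking step, using an open-source cross-encoder as the re-ranker (again an illustrative model choice, not something benchmarked here): the summary is simply added to the candidate passages and everything is re-ordered by relevance to the question.

from sentence_transformers import CrossEncoder

# Cross-encoders score each (question, passage) pair jointly:
# slower than embedding search, but more accurate for ordering a short list
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What chunk size works best for retrieval?"
candidates = [
  "Smaller chunks tend to improve retrieval but reduce the context for generation.",
  "The generated summary of the whole document.",   # the summary is part of the collection
  "Quantized 7B models run reasonably well on CPU.",
]

scores = reranker.predict([(question, passage) for passage in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
  print(f"{score:.3f}  {passage}")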

By incorporating this creative approach, you can enhance the efficiency and effectiveness of your RAG system, enabling faster retrieval and more targeted generation of responses.

Drawbacks and Conclusions

Building your own AI is easier said than done. There are technologies to pick up and workflows to be engineered. Nevertheless, it is a fascinating process where we are in control of the data cleaning, data processing, and generation stages.

Using more than one model is certainly the key to success: we should consider at least three of them (a minimal sketch follows after the list):

  1. A sentence embedding model: pick the right one based on the target language (English, Italian, French… if you are not going to work only with English documents, you have to choose a multi-language one)
  2. A fast and slim pre-processing model: here the T5 family has a lot to give, with fast inference and low computational resource requirements
  3. The text generation model: with quantization it is now possible to leverage models with a huge number of parameters. In my experience a 7-billion-parameter model is really good at understanding complex prompts for comparison and analysis, including the capability to provide the output in a specific format.
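A minimal sketch of how the three pieces could sit side by side (the model names are illustrative choices, not a tested configuration; the 7B generator is commented out because it needs far more RAM than the other two):

from sentence_transformers import SentenceTransformer
from transformers import pipeline

# 1) Sentence embeddings for retrieval: pick a multi-language model if needed
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# 2) Slim T5-family model for pre-processing: summaries and suggested questions
preprocessor = pipeline(model="MBZUAI/LaMini-Flan-T5-77M")

# 3) Larger instruction-tuned model for the final generation step;
#    on real hardware a quantized 7B (e.g. a Zephyr variant) would go here
# generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

summary = preprocessor("ARTICLE: some long cleaned text...\n\nWhat is a one-paragraph summary of the above article?")[0]["generated_text"]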

Multi-language

The main problem of the Flan/T5 family of models is that the multi-language ones are really huge. That is the main reason we cannot use them blindly in every step of the RAG chain.

I tried to feed an Italian document to LaMini-Flan: it understands it, but it cannot give me the output in the same language.

Context window

Another drawback of the T5 family is the context window: it is stuck at 512 tokens, making it very hard to implement complex prompts or few-shot ones.
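A small defensive check helps here: count the prompt tokens with the model’s own tokenizer before calling it, and fall back to chunking when the 512-token window would overflow (a sketch, using the same LaMini tokenizer as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Flan-T5-77M")
MAX_INPUT_TOKENS = 512

def fits_context(prompt):
  # True if the prompt fits the 512-token encoder window of the T5 family
  return len(tokenizer.encode(prompt)) <= MAX_INPUT_TOKENS

some_long_text = "..."  # the document or chunk you want to summarize
prompt = f"ARTICLE: {some_long_text}\n\nWhat is a one-paragraph summary of the above article?"
if not fits_context(prompt):
  print("Prompt too long: split the text into smaller chunks first.")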

In the next article we will see how to put everything together, from data cleaning to Exploratory Document Analysis up to the RAG chain with Re-Ranking.

If this story provided value and you like these topics, consider subscribing to Medium to unlock more resources. Medium is a big community with high-quality content: you can certainly find what you need here.

  1. Sign up for a Medium membership using my link — ($5/month to read unlimited Medium stories)
  2. Follow me on Medium
  3. Get my free eBook to learn how to start to build your own AI
  4. Highlight what you want to remember and if you have doubts or suggestions simply drop a comment to the article: I will promptly reply to you
  5. Read my latest articles https://medium.com/@fabio.matricardi

Do you want to read more? Here are some topics

Artificial Intelligence
Python
Local Gpt
Open Source
Buildyourownai