
Let’s fix MichaelGPT: can you help me?

When you want to build your own AI but it drives you crazy: let’s explore the results and the problems in building your Open Source LLM application.

💡 Disclaimer: this is an experimental article! By sharing the process and the technical issues I faced (and I think many of you may face them too…), I would like to get suggestions and help from you. It can be a Medium community effort to solve a common problem: can we perform as well as OpenAI with Open Source LLMs only?

Introduction

It all started by chance. “Hey Fabio, can you help me out with an LLM project?” Michael C. Carroll dropped me an innocent email after reading one of my articles and joining the Medium community.

I thought it was going to be an easy feat: I mean, let’s set up a simple chatbot with Streamlit and a RAG strategy (Retrieval Augmented Generation).

After running some tests I can confidently say that the answer is… actually I don’t know!

How can I deal with many documents?
How can I be certain that the answers are correct?
What if the questions refer to more than one document at a time?
How long does it take to create the vector index?

And so on… with these kinds of questions that I cannot ask any AI… where do I start?

My plan was to keep the documents and the RAG strategy on Michael’s computer, and move the AI inference somewhere else with an API call (it can even be a serverless GPU like Beam.cloud).

Project goals

Mike is an AI enthusiast. After following many of our Medium articles, he decided that he wanted to build his own AI.

His project is quite ambitious: “Fabio, I have 200 GB of documents and I would like to be able to ask questions across all of them!”

Before even considering the hardware requirements, the database performance and so on, I suggested we proceed step by step: “Sure Mike, let’s start with a few documents. Send them to me and give me a few questions as an example…”

I mean, if I cannot get it right with 5 documents… how can we deal with 100k?

Mike sent me the documents… and the questions

Example of Question: Compare and contrast the Events of Default in the “Third Amendment to Amended and Restated Credit Agreement” with the “Multicurrency Revolving Credit Facility” and then create a table showing the differences in each.

The results below may look amazing, I know…

Results (markdown rendered… with a few adjustments…)

In summary, the key differences between the Events of Default in the “Third Amendment to Amended and Restated Credit Agreement” and the Multicurrency Revolving Credit Facility are:

- The Schedules and Exhibits to the Credit Agreement shall not be amended, modified, supplemented or otherwise affected in the “Third Amendment to Amended and Restated Credit Agreement.” In contrast, the Multicurrency Revolving Credit Facility may be increased from time to time by the Lender upon written notice to the Borrower, subject to certain conditions.

- The Borrower has the right to request an increase in the Multicurrency Revolving Credit Facility from time to time in the Multicurrency Revolving Credit Facility, while the Lender may elect to increase the Multicurrency Revolving Credit Facility from time to time subject to the Borrower’s consent and satisfaction of certain conditions.

- The Loan Documents may be amended, modified, or supplemented from time to time in the Multicurrency Revolving Credit Facility, while the Schedules and Exhibits to the Credit Agreement shall not be amended, modified, supplemented or otherwise affected in the “Third Amendment to Amended and Restated Credit Agreement.

Prototype setup on Google Colab

As you can imagine, I don’t have any special or advanced tools, so I decided to go with Google Colab for prototyping. I mean, even the free tier gives me one GPU instance, and that is gold!

The first decision: what embedding?

I did some research on the Hugging Face Embeddings Leaderboard and in some really cool Medium articles: I decided to use intfloat/e5-base-v2, which has amazing scores on QnA, fast inference, and is not heavy (438 MB).
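One detail that may be worth checking with this model: as far as I know, the intfloat/e5-base-v2 model card asks for every input to start with "query: " (for questions) or "passage: " (for document chunks). The sketch below only shows that idea in plain Python; the helper names `as_query` and `as_passages` are made up for illustration, and whether your embedding wrapper already adds the prefixes is something to verify rather than assume.

```python
from typing import List

# Prefixes as described on the e5 model card (assumption: still valid for e5-base-v2)
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(question: str) -> str:
    """Prepare a user question before sending it to the embedding model."""
    return QUERY_PREFIX + question.strip()


def as_passages(chunks: List[str]) -> List[str]:
    """Prepare document chunks before sending them to the embedding model."""
    return [PASSAGE_PREFIX + chunk.strip() for chunk in chunks]


# Placeholder texts, not taken from Mike's documents
prepared_question = as_query("  What counts as an Event of Default? ")
prepared_chunks = as_passages(["first chunk of a document", " second chunk "])
print(prepared_question)
print(prepared_chunks)
```

If the prefixes are missing, the embeddings still come out without any error, so this kind of problem is easy to overlook.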

🤔 Maybe there is a better or faster embedding… and what about the chunk size?

The chunk size was another turning point: wrongly (I suppose now) I picked a chunk size of 1000 characters with an overlap of 50 for the CharacterTextSplitter. I wanted to preserve the context, but maybe RecursiveCharacterTextSplitter would have been a better choice, and enough at least to preserve the meaning of each sentence.

Load and split generated in 0:00:33.426102
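To make the 1000/50 setting concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not the LangChain CharacterTextSplitter (which, as far as I understand, splits on a separator first and then merges the pieces); `split_with_overlap` is a hypothetical helper that only shows what "chunk size" and "overlap" mean.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 50) -> List[str]:
    """Cut text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2500-character text, just to see the window sizes
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

Note that a window like this cuts wherever the character count says so, even in the middle of a sentence, which is exactly the weakness a sentence-aware splitter tries to avoid.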

Second one: what vector store database?

I followed the tutorials on LangChain: Chroma looked promising because it can be persistent (saved to disk) and you can add more documents, little by little.

So, chunks of 1000/50 with Chroma and e5-base-v2 for 5 PDF documents, on Google Colab with CPU only:

Vector db generated in 0:04:01.836045
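For readers who want to see what the similarity search is doing conceptually, here is a brute-force sketch in plain Python. It is an illustration under assumptions, not how Chroma works internally (a real vector store uses an index instead of scanning every vector), and the 3-dimensional vectors are toy values, not real e5 embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, score) for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
best = top_k(toy_query, toy_chunks, k=3)
print(best)
```

In this toy run the third result has a very low score but is returned anyway, because top-k always returns k chunks. A score threshold is one possible guard against feeding weak matches to the LLM.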

🤔 Do you think a different VectorDB can be better?

The third decision: what LLM?

To run light, with a normal computer in my hands, I decided to go with quantized models, so that I can at least use a 3B, a 7B or even a 13B parameter model.

To have flexibility with LangChain I decided to use LlamaCpp with Orca-mini-3b in the q4_0 format, and Llama2-7b-Chat in the q4_1 format.

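# LangChain wrapper around llama.cpp, plus the prompt helpers used later
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain

# Load the 4-bit quantized Orca-mini-3b GGML file stored under /content on Colab
llm = LlamaCpp(
    model_path="/content/orca-mini-3b.ggmlv3.q4_0.bin",
    n_ctx=2048,       # context window in tokens: prompt + answer together
    temperature=0.7,  # sampling temperature
    top_k=50,         # sample only from the 50 most likely tokens
    top_p=1,          # 1 keeps the full distribution (no nucleus cut-off)
)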

I found the Llama2 answers quite coherent, and the output stuck to the requests (for example the markdown format for the tables…)

🤔 Is there any better option?

Last decision: Context window and Prompt

You may have guessed from the code above that I set the context window to 2048. This is the total number of tokens, including both the prompt and the answer. I also tried 4096 with Llama2, but the generation time was almost killing me.

At this point the prompt is also relevant: we must always think about the room available for the reply, particularly if it is a structured reply (like asking for a comparison table in markdown format…)
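The "room available for the reply" can be written down as simple arithmetic. The sketch below is only a rough budget calculator: the 3 characters per token, the 700 tokens reserved for the answer and the 150 tokens for template plus question are all assumed numbers for illustration, not measurements from the real tokenizer.

```python
import math


def max_chunks_in_context(
    n_ctx: int,
    answer_tokens: int,
    prompt_overhead_tokens: int,
    chunk_chars: int,
    chars_per_token: float,
) -> int:
    """Estimate how many retrieved chunks fit in the context window."""
    tokens_per_chunk = math.ceil(chunk_chars / chars_per_token)
    # What is left once the answer and the fixed prompt parts are accounted for
    available = n_ctx - answer_tokens - prompt_overhead_tokens
    if available <= 0:
        return 0
    return available // tokens_per_chunk


# Assumed values: 1000-character chunks, ~3 characters per token
fits_2048 = max_chunks_in_context(2048, 700, 150, 1000, 3.0)
fits_4096 = max_chunks_in_context(4096, 700, 150, 1000, 3.0)
print(fits_2048, fits_4096)
```

With these assumed numbers a 2048-token window holds 3 chunks and a 4096-token window holds 9. Counting tokens with the model's own tokenizer would give a more reliable budget.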

I used two different templates, one for Orca and one for Llama2, because each model comes with its own basic template structure. I tweaked them a bit to include the question and the context:

templateOrca = """
### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
{question}

### Input:
{context}

### Response:
"""

templateLlama2 = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Answer exactly in few words from the context
<</SYS>>

Answer the question below from context below :
{context}

{question} [/INST]
"""

🤔 Maybe there is a lot of room for improvement here? What do you think would be a better prompt?
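For anyone who wants to experiment with the prompt, here is a minimal sketch of how the retrieved chunks and the question can be merged into the Llama2 template. It uses plain `str.format` instead of the LangChain PromptTemplate, and `build_prompt` is a hypothetical helper; the chunk texts are placeholders.

```python
from typing import List

# Same Llama2 template as above, with {context} and {question} as placeholders
TEMPLATE_LLAMA2 = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Answer exactly in few words from the context
<</SYS>>

Answer the question below from context below :
{context}

{question} [/INST]
"""


def build_prompt(template: str, question: str, chunks: List[str]) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)


prompt = build_prompt(
    TEMPLATE_LLAMA2,
    "What is the question about?",
    ["placeholder chunk one", "  placeholder chunk two  "],
)
print(prompt)
```

Printing the final prompt before sending it to the model is a cheap way to check how much of the context window is already used by the context alone.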

Final remarks… until you help me out

I am a practical guy, so I looked for a way to benchmark the results. I mean, I don’t have any domain knowledge about these documents, but I could use the generation time and the instruction following as KPIs.

- Llama2-7b is more accurate in following the instructions and providing a specific output format.
- Llama2-7b generation time is 3 times longer than Orca-mini-3b.
- If we increase the context window to 4096, the same question above takes 56 minutes to answer!
- Using chunks of 1000 characters means that we cannot feed more than 3 results coming from the similarity search.
- Similarity does not always mean relevance to the topic.
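Since generation time is one of the KPIs, here is a small sketch of a timing helper that works for any step (splitting, building the vector db, generating an answer). The `timed` helper and `fake_generate` are made up for illustration; `fake_generate` only sleeps for a moment and stands in for a real LLM call.

```python
import time
from datetime import timedelta


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time as a timedelta)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return result, elapsed


def fake_generate(prompt: str) -> str:
    """Stand-in for the model call: waits briefly and echoes the prompt in upper case."""
    time.sleep(0.05)
    return prompt.upper()


answer, elapsed = timed(fake_generate, "hello")
# A timedelta prints in the H:MM:SS.ffffff style used in the timings above
print(f"Answer generated in {elapsed}")
```

Running the same questions through both models with a helper like this keeps the timing comparison consistent.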

Sad reality: MichaelGPT is not really working that well… 😕

Help me to figure out how we can do great things with Open Source LLMs. The literature is not really helping us: it is focused mainly on ChatGPT.

If this story provided value and you wish to show a little support, you could:

1. Write a comment with your suggestions on how to fix MichaelGPT
2. Sign up for a Medium membership using my link ($5/month to read unlimited Medium stories)
3. Follow me on Medium
4. Read my latest articles: https://medium.com/@fabio.matricardi

In the meantime you can check:

- 12 things I wish I knew before starting to work with Hugging Face LLM (artificialcorner.com): Insights and Tips for Navigating the Hugging Face LLM Landscape
- Run Faster, Run Lighter: The Art of Running Quantized Models on Your Laptop (artificialcorner.com): Learn how to run any GGML quantized model on Your Laptop and break free from your limits
- The Rise of Curious AI: Exploring a World Where Machines Pose the Questions (artificialcorner.com): Learn how to use free LLM for Question Generation starting from your documents.

(all images, unless otherwise noted, are by the author)

WRITER at MLearning.ai /AI Agents LLM / Good-Bad AI Art / Sensory
