How to Quickly Test a New LLM Without Wasting Time
Sometimes you simply need to test them yourself
In the past week there has been a huge wave of announcements in the Artificial Intelligence community: Gemma, Cosmo-1b, and many others.
Many enthusiasts, myself included, started looking around for information and first impressions of these models. As usual, there are a lot of skeptical reviews based on really silly performance metrics. I read somewhere on Reddit that Gemma is not a good model… because it cannot properly reply to the question "what is heavier, 1 kg of feathers or 2 pounds of feathers?"
Well, considering that I am not so confident most people would reply correctly to that question either, why don't we test the models ourselves?
Remember: the best judge of the State Of The Art is you!
🎰 You can see the final result in the It is time to test section!
Cosmo-1b
Cosmo is a new model released by Hugging Face. It is a 1.8B-parameter model trained on the Cosmopedia synthetic dataset.
The official HuggingFace Hub Model Card states as follows:
The training corpus consisted of 30B tokens, 25B of which are synthetic from Cosmopedia. Since we didn't explore the synthetic generation of code, we augmented the dataset with 5B tokens of non-synthetic sources like the code-python-0.60-to-1.00 and web-0.50-to-1.00 subsets of AutoMathText. We also added 1M files from The Stack's Jupyter Notebooks, converted to script.
I picked this model because it is already compatible with llama-cpp-python, which means that literally anyone can run it on any computer with only about 2 GB of RAM.
Today I tried Gemma-2B with llama-cpp-python version 0.2.46, and I got an error trying to load the model. I will keep you informed as soon as I succeed 😉
📣 Update on 23/02/2024 > Llama-cpp-python 0.2.47 already supports Gemma models! I will write about it as well!
How to run the model?
You only need Python installed, that is all.
A single package must be installed with pip, and that is enough:
pip install llama-cpp-python==0.2.47
And the code is not complicated at all. I will walk you through it step by step.
It will be a simple textual interface, but with memory capabilities (the model will remember the conversation 🧠🥳).
0. Download the model GGUF file
You need a virtual environment and a GGUF file.
mkdir cosmo
cd cosmo
python -m venv venv
venv\Scripts\activate
On Linux or macOS, activate the virtual environment with source venv/bin/activate instead.
Download the GGUF file from the tsunemoto/cosmo-1b-GGUF repository on Hugging Face: the file must be cosmo-1b.Q4_K_M.gguf. Download it into a subfolder called model.
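If you prefer to script the download instead of grabbing the file from the browser, the huggingface_hub package (installed with pip install huggingface_hub) can fetch it into the model subfolder for you. Here is a minimal sketch:
# Download the quantized GGUF file into the "model" subfolder from the Hugging Face Hub
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="tsunemoto/cosmo-1b-GGUF",
    filename="cosmo-1b.Q4_K_M.gguf",
    local_dir="model",
)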
1. Import libraries and declare the model
from llama_cpp import Llama

# Path to the quantized GGUF file downloaded into the "model" subfolder
mp = "model/cosmo-1b.Q4_K_M.gguf"

cosmo = Llama(
    model_path=mp,
    n_gpu_layers=0,            # CPU-only inference
    temperature=0.1,
    top_p=0.5,
    n_ctx=2048,                # context window size
    max_tokens=350,
    repeat_penalty=1.7,
    stop=["</s>", "[/INST]", "/INST"],
    verbose=False,
    streaming=True,
    chat_format="mistral-instruct",   # prompt template used by create_chat_completion
)
mp is our model path (remember, I told you to download the file into the model subfolder…). We instantiate a Llama object that will load the quantized model. Note that we declare chat_format="mistral-instruct": this should not come as a surprise, since the chat template to use is clearly stated in the Model Card.
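Before building the chat loop, you can run a quick single-turn completion to check that the model loads and answers. This is a minimal sketch; the prompt text is just an example:
# Quick smoke test: one single-turn completion to verify the model loads and replies
test = cosmo.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
    max_tokens=60,
)
print(test["choices"][0]["message"]["content"])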
2. Loop over the user input and the model replies
Llama-CPP has a completion method called .create_chat_completion() that is really convenient: it allows us to avoid formatting the prompt with the special tokens (you can refer to this article for further info) and to give the instructions in the standard messages format:
# For the assistant messages
{"role": "assistant", "content": "How may I help you today?"}
# For the user messages
{"role": "user", "content": user_input}
messages is a list of instructions that follow the schema mentioned above. We append the user instructions and send the messages to the model.
After that we append the model reply, and we repeat the process with every new user input.
# Conversation history: the system prompt plus a first assistant greeting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": "How may I help you today?"},
]

looppa = True
print("\033[1;33m CHAT WITH COSMO-1.8B HuggingFACE LLM")
print("\033[1;35m ------------------------------------")
while looppa:
    print("\033[91;1m")                      # user input in bright red
    user_input = input("> ")
    if user_input.lower() in ['quit', 'exit']:
        looppa = False
        print("\033[0mBYE BYE!")             # reset colors before exiting
        break
    else:
        # Add the user turn to the history so the model keeps the context
        messages.append(
            {"role": "user", "content": user_input}
        )
        res = cosmo.create_chat_completion(
            messages=messages)["choices"][0]["message"]["content"]
        # Add the model reply to the history as well
        messages.append(
            {"role": "assistant", "content": res}
        )
        print("\033[92;1m")                  # model reply in bright green
        print(res)
We call the completion method with the instruction:
res = cosmo.create_chat_completion(
messages=messages)["choices"][0]["message"]["content"]
Note that here as well we take only the response text from the structured data returned by the llama.cpp inference, with ["choices"][0]["message"]["content"].
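If you prefer to see the reply appear word by word, the same method also supports streaming with stream=True. Here is a sketch, assuming the OpenAI-style chunk layout returned by llama-cpp-python:
# Streaming variant: print the reply as it is generated instead of waiting for the full text
for chunk in cosmo.create_chat_completion(messages=messages, stream=True):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()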
In case you are worried… here is the entire code (really just a few lines 😉):
from llama_cpp import Llama

# Path to the quantized GGUF file downloaded into the "model" subfolder
mp = "model/cosmo-1b.Q4_K_M.gguf"

cosmo = Llama(
    model_path=mp,
    n_gpu_layers=0,            # CPU-only inference
    temperature=0.1,
    top_p=0.5,
    n_ctx=2048,                # context window size
    max_tokens=350,
    repeat_penalty=1.7,
    stop=["</s>", "[/INST]", "/INST"],
    verbose=False,
    streaming=True,
    chat_format="mistral-instruct",
)

# Conversation history: the system prompt plus a first assistant greeting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": "How may I help you today?"},
]

looppa = True
print("\033[1;33m CHAT WITH COSMO-1.8B HuggingFACE LLM")
print("\033[1;35m ------------------------------------")
while looppa:
    print("\033[91;1m")                      # user input in bright red
    user_input = input("> ")
    if user_input.lower() in ['quit', 'exit']:
        looppa = False
        print("\033[0mBYE BYE!")             # reset colors before exiting
        break
    else:
        # Add the user turn to the history so the model keeps the context
        messages.append(
            {"role": "user", "content": user_input}
        )
        res = cosmo.create_chat_completion(
            messages=messages)["choices"][0]["message"]["content"]
        # Add the model reply to the history as well
        messages.append(
            {"role": "assistant", "content": res}
        )
        print("\033[92;1m")                  # model reply in bright green
        print(res)
It is time to test
Save the Python file and, from the terminal with the venv active, run:
python yourfile.py
In my case I called it cosmo1b.py
Bonus track… ANSI escape codes
Are you wondering what those strange print statements in the code, like print("\033[92;1m"), are for?
Well, ANSI escape sequences are a standard for in-band signaling to control cursor location, color, font styling, and other options on video text terminals and terminal emulators. Certain sequences of bytes, most starting with an ASCII escape character and a bracket character, are embedded into text. The terminal interprets these sequences as commands, rather than text to display verbatim.
Although hardware text terminals have become increasingly rare in the 21st century (operating systems now have much more sophisticated and user-friendly interfaces), the ANSI standard remains relevant because the great majority of terminal emulators and command consoles interpret at least a portion of it.
So this means that, without using any special Python library, we can send escape codes to the terminal and set colors for our print() statements.
So, according to the explanation found on GitHub, our print("\033[92;1m") means: set the foreground color to bright green, bold.
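To see it in action, here is a tiny standalone snippet (using the same codes as in the chat script) that you can run in any terminal:
# Minimal demo of ANSI escape codes: no extra Python library needed
RESET = "\033[0m"           # reset all attributes
YELLOW_BOLD = "\033[1;33m"  # bold yellow
RED_BRIGHT = "\033[91;1m"   # bright red, bold
GREEN_BRIGHT = "\033[92;1m" # bright green, bold

print(f"{YELLOW_BOLD}CHAT WITH COSMO-1.8B{RESET}")
print(f"{RED_BRIGHT}> this is how the user prompt looks{RESET}")
print(f"{GREEN_BRIGHT}this is how the model reply looks{RESET}")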
Conclusions
There are new models every week. Many will tell you to stop wasting your time chasing after every one of them.
I disagree! You cannot always rely on other people's judgement to decide on a Large Language Model. Moreover, Small Language Models (below 3 billion parameters) are really promising, and they can run on our CPUs with very little inference time.
For such tiny/small models, parameter tuning may change the results radically. They may not have high benchmark scores on all the key indicators, but this does not mean they are not good, or even perfect, for your specific purposes.
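If you want to see this for yourself, you can pass sampling parameters directly to create_chat_completion and compare how the same prompt behaves under different settings. A minimal sketch (the values shown are only examples):
# Compare the same prompt under two different sampling settings
prompt = [{"role": "user", "content": "Explain what a GGUF file is in two sentences."}]

for temp, rp in [(0.1, 1.1), (0.7, 1.7)]:
    out = cosmo.create_chat_completion(
        messages=prompt,
        temperature=temp,
        repeat_penalty=rp,
        max_tokens=120,
    )["choices"][0]["message"]["content"]
    print(f"temperature={temp}, repeat_penalty={rp}:\n{out}\n")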
The best way is to judge yourself.
Hope you enjoyed the article. If this story provided value and you wish to show a little support, you could:
- Clap a lot of times for this story
- Highlight the parts most relevant to remember (it will be easier for you to find them later, and for me to write better articles)
- Learn how to start to Build Your Own AI, download This Free eBook
- Sign up for a Medium membership using my link — ($5/month to read unlimited Medium stories)
- Follow me on Medium
- Read my latest articles https://medium.com/@fabio.matricardi