Summary

Vectara's Hallucination Evaluation Model ranks OpenAI's GPT-4 as the leading large language model (LLM), with the highest accuracy and lowest hallucination rate, followed by GPT-3.5 and Meta's Llama 2 70B.

Abstract

The Vectara platform, specializing in generative AI features for product builders, has released a leaderboard evaluating the performance of various LLMs using a model trained to detect hallucinations. GPT-4 stands out with a 97.0% accuracy rate and a 3.0% hallucination rate, followed by GPT-3.5 and LLaMA 2 70B. The leaderboard results are based on summarization accuracy of 1000 short documents from the CNN / Daily Mail Corpus, with a temperature of 0 used during LLM calls. This evaluation is particularly relevant for LLMs used in Retrieval-Augmented Generation (RAG) systems. Despite GPT-4's lead, the AI field remains highly competitive, with ongoing research to reduce hallucination rates and improve accuracy.

Opinions

The introduction of attention mechanisms in 2014 and the transformer model in 2017 were pivotal in the development of LLMs.
ChatGPT, powered by GPT-3.5 and GPT-4, has significantly impacted the chatbot industry since its release in November 2022.
Vectara's platform is commended for its ability to empower companies with little AI expertise to integrate advanced AI features into their products.
The methodology for the leaderboard is comprehensive, considering both accuracy and hallucination rates across various models, including GPT-4, GPT-3.5, LLaMA, and Google Palm.
The leaderboard highlights the importance of Chain of Thought Prompting in mitigating hallucinations in LLMs.
There is a consensus that while GPT-4 is currently the most accurate LLM, there is still significant room for improvement in the field.
The leaderboard and associated resources provide valuable insights for AI researchers and developers working to enhance the reliability of LLMs.

Public LLM Leaderboard Computed Using Vectara’s Hallucination Evaluation Model

OpenAI’s GPT-4 Turbo currently holds the crown, but Meta’s LLaMA is hot on its heels

Prelude

Large Language Models (LLMs) have taken the NLP community, the AI community, and the whole world by storm! LLMs are black-box AI systems that use deep learning on extremely large datasets to understand and generate new text. Modern LLMs began taking shape in 2014, when the attention mechanism (a machine learning technique designed to mimic human cognitive attention) was introduced in a research paper titled “Neural Machine Translation by Jointly Learning to Align and Translate.” In 2017, that attention mechanism was honed with the introduction of the transformer model in another paper, “Attention Is All You Need.”

Emerging from OpenAI’s labs on November 30, 2022, ChatGPT revolutionized the chatbot landscape, harnessing the power of LLMs like GPT-3.5 and GPT-4. Yet the race for AI supremacy is far from over, with a myriad of LLMs from rival players striving for dominance. Today, I want to share with you which LLMs are leading the charge and which are playing catch-up.

Language model sizes, 2023–2024: optimal language model size highlights. Source: https://lifearchitect.ai/models/

So, what is Vectara?

Vectara is an end-to-end platform for product builders to embed powerful generative AI features into their applications. It is aimed at companies with moderate to no AI experience and covers use cases including conversational AI, question answering, semantic app search, and research & analysis. Vectara is also language agnostic, which means that it can search for information in multiple languages.

Find out more on their website → https://vectara.com/

Methodology

A model was trained to detect hallucinations in LLM outputs.
1000 short documents were summarized by each LLM.
The accuracy and hallucination rate for each model were computed.
The rate at which each model refused to respond to the prompt is also reported.
The documents were taken from the CNN / Daily Mail Corpus.
A temperature of 0 was used when calling the LLMs.
Summarization accuracy was evaluated instead of overall factual accuracy.
Determining hallucinations for an arbitrary ad hoc question is not feasible, which is why summaries are checked against their source documents instead.
LLMs are increasingly used in RAG pipelines to answer user queries.
This leaderboard is a good indicator of the accuracy of the models when used in RAG systems.
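
The steps above can be sketched in Python. This is only a rough illustration, not Vectara's actual code: the SummaryResult record, the score_leaderboard function, the 0.5 threshold, and the sample scores are all assumptions made up for this example. It assumes a hallucination detector has already given each summary a factual consistency score between 0 and 1, and that a refusal is recorded as a missing score.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SummaryResult:
    """One document summarized by one LLM (hypothetical record type)."""
    # Consistency score in [0, 1] from a hallucination detector;
    # None means the LLM refused to summarize the document.
    consistency_score: Optional[float]


def score_leaderboard(results: Sequence[SummaryResult],
                      threshold: float = 0.5) -> dict:
    """Return accuracy, hallucination rate, and answer rate as percentages.

    The 0.5 threshold is an assumption chosen for illustration.
    """
    # Only summaries the model actually produced are scored.
    answered = [r for r in results if r.consistency_score is not None]
    if not answered:
        raise ValueError("no answered documents to score")

    # A summary counts as accurate when its score reaches the threshold.
    consistent = sum(1 for r in answered if r.consistency_score >= threshold)
    accuracy = 100.0 * consistent / len(answered)

    return {
        "accuracy": accuracy,
        "hallucination_rate": 100.0 - accuracy,
        "answer_rate": 100.0 * len(answered) / len(results),
    }


# Made-up scores for five documents, including one refusal.
sample = [
    SummaryResult(0.98),
    SummaryResult(0.91),
    SummaryResult(0.12),
    SummaryResult(None),
    SummaryResult(0.77),
]
metrics = score_leaderboard(sample)
print(metrics)
```

In this sketch, refusals are left out of the accuracy calculation and tracked separately as an answer rate, so a model is not rewarded or penalized on accuracy for declining to summarize. Whether the real leaderboard handles refusals exactly this way is not stated in this article.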

In the context of large language models (LLMs), a hallucination rate refers to the percentage of times an LLM produces incorrect or misleading information when summarizing a document. It’s like the LLM adding “fake facts” or making up information that’s not supported by the original text. The higher the hallucination rate, the more likely the LLM is to provide inaccurate or misleading information.

Let’s spot the hallucination in action.

ChatGPT hallucinating for the question “what is heavier: kilo of water or a kilo of air?”

Chain of Thought Prompting can help mitigate model hallucination

As you can see, the AI chatbot may give a false answer…

Model Result

Based on Vectara’s Hallucination Evaluation Model, the table provides the accuracy and hallucination rate of different large language models. To be specific, the accuracy rate is the percentage of the model’s summaries that are factually consistent with the source document, while the hallucination rate is the percentage that are not.

In other words, you may view it as:

Hallucination Rate = 1 - Accuracy

GPT-4 has the highest accuracy rate, at 97.0%, followed by GPT-3.5 at 96.5%, and Llama 2 70B at 94.9%. GPT-4 also has the lowest hallucination rate, at 3.0%, followed by GPT-3.5 at 3.5% and Llama 2 70B at 5.1%.

The other models in the table have lower accuracy rates and higher hallucination rates. For example, Google Palm has an accuracy rate of 87.9% and a hallucination rate of 12.1%, while Google Palm-Chat has an accuracy rate of 72.8% and a hallucination rate of 27.2%.
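
As a quick sanity check, the figures quoted above can be run through the complement relation in a few lines of Python. This is a small sketch of my own: the dictionary and function names are made up for illustration, and the numbers are simply the ones stated in this article, expressed in percent.

```python
# Accuracy figures (percent) as quoted in the text above.
reported_accuracy = {
    "GPT-4": 97.0,
    "GPT-3.5": 96.5,
    "Llama 2 70B": 94.9,
    "Google Palm": 87.9,
    "Google Palm-Chat": 72.8,
}

# Hallucination rates (percent) as quoted in the text above.
reported_hallucination = {
    "GPT-4": 3.0,
    "GPT-3.5": 3.5,
    "Llama 2 70B": 5.1,
    "Google Palm": 12.1,
    "Google Palm-Chat": 27.2,
}


def hallucination_rate(accuracy_pct: float) -> float:
    """Complement of accuracy, in percent, rounded to one decimal place."""
    return round(100.0 - accuracy_pct, 1)


# Rank models from lowest to highest hallucination rate.
ranking = sorted(reported_accuracy,
                 key=lambda m: hallucination_rate(reported_accuracy[m]))

for model in ranking:
    derived = hallucination_rate(reported_accuracy[model])
    matches = derived == reported_hallucination[model]
    print(f"{model}: accuracy {reported_accuracy[model]}%, "
          f"hallucination {derived}% (matches quoted figure: {matches})")
```

Rounding to one decimal place avoids floating-point noise when subtracting values such as 94.9 from 100.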