Summary

Anthropic's Claude 3 outperforms GPT-4 and Gemini 1.0 Ultra in various benchmarks, including accuracy, multimodal capabilities, and context window size, while offering competitive pricing and enhanced AI safety features.

Abstract

Anthropic has released its Claude 3 series of AI models—Opus, Sonnet, and Haiku—which have demonstrated superior performance over OpenAI's GPT-4 and Gemini 1.0 Ultra. The models exhibit better accuracy in knowledge and reasoning tasks, improved multimodal capabilities for processing text and visual inputs, and a larger context window of up to 200k tokens. Claude 3's advanced vision capabilities are on par with leading AI, and its pricing is economical at $15 per million tokens. The models are designed with a focus on responsible design and AI safety, aligning with the US Executive Order on AI. Claude 3 represents a significant step towards more efficient, safe, and cost-effective large language models (LLMs).

Opinions

The author suggests that Claude 3's performance in benchmarks indicates a significant advancement in the AI industry.
The multimodal capabilities of Claude 3 are seen as a major improvement, allowing for more complex data analysis.
The extended context window of Claude 3 is highlighted as a key differentiator, addressing a limitation in previous models like GPT-4.
Claude 3's pricing is perceived as part of a trend towards more affordable AI usage.
The author notes the importance of AI safety and trustworthiness, acknowledging Claude 3's alignment with US AI policy, but also pointing out the challenge of completely eliminating biases.
The author expresses interest in future benchmarks comparing Claude 3 to Gemini 1.5, hinting at potential areas where Gemini might excel.
The article concludes with an optimistic view of the trajectory of LLM development, emphasizing the progress in making AI faster, safer, smarter, and cheaper.

Anthropic’s Claude 3 Beats GPT-4 Across Main Metrics

The race is on, and it should not be over anytime soon.

Every week, even every day, new breakthroughs are happening in the AI space, and here is a new one.

Anthropic just released (On February 4th, 2024), its Claude 3 series, a new family of models: Opus, Sonnet, and Haiku.

The models are available on test, on Claude.ai but also via API.

So what is new about this new release, on how does it outperform OpenAI ChatGPT4?

Performance

The benchmark made with Claude 3 (Opus) shows better accuracy against GPT4 on Undergraduate level knowledge (86,8% vs 86,4%), Graduate level reasoning (50,4% vs 35,7%), Grade school math (95% vs 92%), Math problem solving (60,1% vs 52,9%), Multilingual Math (90,7% vs 74,5%), Code (84,9% vs 67%), Reasoning over text (83,1% vs 80,9%) and so on.

It also beats Gemini 1.0 Ultra on the same benchmarks.

Here is the full benchmark matrix shared by Anthropic.

Claude 3 benchmark vs main LLMs like GPT4 and Gemini — Table from Anthropic

Capabilities

The Claude 3 models offer multimodal capabilities, enabling them to understand both text and visual inputs. This feature is crucial for analyzing and processing complex, unstructured information in a variety of formats. It allows for a more comprehensive understanding of data, regardless of its presentation.

Claude 3 models feature advanced vision capabilities, equal to leading AI, for processing diverse visual formats like photos and diagrams. This enhancement is especially beneficial for using data in formats like PDFs and slides.

Here is the benchmark provided by Anthropic:

Claude 3 benchmark vs GPT4 and Gemini LLMs on various capabilities

Extended Context Window

I think this is a major advancement in LLMs. Where GPT4 is still on relatively small context windows (32k tokens available to public. 128k tokens announced), Gemini 1.5 on other hand went up to 1M token capabilities (and announced up to 10M tokens). Claude 3, comes with standard 200k tokens window (that is almost 800 pages) and 1M tokens announced for enterprise / specific needs usage.

The Needle In A Haystack Test

Claude 3 models can remember and use a lot of information very well. Even with context windows with 200k tokens it passes well the Needle In A Haystack Test. Indeed, these models are great at finding specific details in a huge amount of data. They can almost always find the right information, and sometimes they even notice when something doesn’t quite fit right in the data they’re given.

To be noted that Gemini 1.5 also showed great results for such tests as well.

Pricing

Claude Opus is priced at $15 per million tokens, which also shows that the price per 1k token is continuing to drastically decrease, and that this trend is continuing.

Responsible Design and AI Safety

Even if it is difficult to assess this at early stage, Anthropic announces making Claude 3 models neutral and trustworthy, addressing AI biases and risks like misinformation and privacy issues with its “Constitutional AI” framework.

Despite progress in reducing biases, eliminating them completely is hard. Claude 3 is safer and less biased than earlier models, but still at AI Safety Level 2, with ongoing efforts to monitor and enhance safety and neutrality.

Anthropic shared that they validated their model against US Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, however there was no mention of the EU AI Act.

Closing Thoughts

We continue to move towards Faster, Safer, Smarter and Cheaper LLMs. Claude 3 is no exception. It shows great benchmark results compared to GPT4, showing better accuracy overally, bigger context windows and a very competitive pricing. It would be interesting however to benchmark against Gemini 1.5 which might beat Claude 3 on some lines.

If you found this article useful, please clap and share your thoughts.

I regularly write about AI and Data, feel free to follow me :

On Medium : https://medium.com/@AhmedF
On LinkedIn : https://www.linkedin.com/in/ahmedfessi/
On Twitter : https://twitter.com/ahmedfessi
On Udemy (my courses are also available on Udemy for Business): https://www.udemy.com/user/ahmedfessi/