Free AI web copilot to create summaries, insights and extended knowledge, download it at here

6331

Abstract

lled “uncertainty-routed chain-of-thought”, which will select the majority vote on 32 CoT samples based on threshold / greedy sample choices.</p></blockquote><figure id="be0b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*S307FArH7H7wWZgjHt1Xow.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="43dc"><p><b>Qualitative Example 1: Mathematics with Calculus</b></p></blockquote><blockquote id="77c1"><p>The model can understand a calculus problem and answer correctly with step-by-step explanations with LaTeX expressions.</p></blockquote><figure id="b09b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*zvi4F-Ey95s_9wgTBz8Hhg.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="04a6"><p><b>Qualitative Example 2: Multi-step reasoning and mathematics</b></p></blockquote><blockquote id="e30c"><p>The model follows the instructions to show where the numbers come from and answer the question given in the task.</p></blockquote><figure id="6310"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*NDT2v7TvE9_Dk8EKQPdIZg.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="43a1">3.2. Trends in Capabilities with Different Model Sizes</h2><p id="fca1">By comparing the performance of various types of task domains with all Gemini models, we can observe that <b>Gemini Ultra</b> has outperformed all six capabilities mentioned in the chart, especially in very complex tasks like Math/Science. For <b>Gemini Pro</b>, it is the most optimal choice when we consider inferencing resources and performance issues.</p><figure id="76e0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*jemPBpAPIXRA1e9oQF5cqA.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><p id="b051">When we deep dive into <b>Gemini Nano’s</b> capabilities, although not all the tasks are well-performed like Germini Pro, the accuracy in some of the benchmarks is comparable or much better than existing LLMs. <b>Take the MATH dataset as an example</b>:</p><ul><li><b>LLaMA-2 70B</b>: 13.5% (4-shot)</li><li><b>Grok 1 33B</b>: 23.9% (4-shot)</li><li><b>Gemini Nano 1(1.8B) / 2(3.25B)</b>: 13.5% / 22.8% (4-shot)</li></ul><figure id="050f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*PpnOUzXfxAVaWg-mZLZ9Eg.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="fd28">3.3. Multilinguality</h2><p id="0a8b">The Gemini models excel in multilingual tasks, including machine translation. In the WMT 23 translation benchmark, Gemini Ultra stands out for its proficiency in both high and low-resource languages, surpassing other models like GPT-4 and PaLM 2, demonstrating strong performance in very low-resource languages.</p><figure id="a78b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*wc4FTaE6zfleIRnyZmL4ww.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="8e02">3.4. Complex Reasoning Systems</h2><p id="652a">Apart from Gemini, AlphaCode 2 is also launched at the same time to demonstrate how we can build reasoning systems to deal with complex multi-step problems. It conducts an extensive search over possible programs, followed by filtering, clustering, and reranking to select the most promising code candidates. AlphaCode 2 solved 43% of problems across 12 contests with Codeforces, a significant improvement over its predecessor, AlphaCode, which solved 25%. AlphaCode 2 places in the 85th percentile among competitors, whereas AlphaCode is in the 46th.</p><figure id="0b8f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*omwPRwDpbNAFwnAkrIYf-g.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf">AlphaCode2 Technical Report</a></figcaption></figure><h1 id="c2d5">4. Evaluation of Gemini: Multimodal Capabilities</h1><h2 id="8448">4.1. Image Understanding</h2><p id="c0f9"><b>Gemini Ultra</b> shows superior performance over other SOTA approaches across a variety of image understanding tasks, even without fine-tuning and using any OCR tools. The tasks cover a wide spectrum of capabilities: from high-level object recognition (VQAv2), fine-grained transcription (TextVQA, DocVQA), and chart understanding (ChartQA, InfographicVQA), to multimodal reasoning (Ai2D, MathVista, MMMU). <b>Gemini Ultra</b> shows a consistent edge in zero-shot settings — where models generate answers without prior exposure to specific task data.</p><figure id="56c8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*_h4H6X9aTO0K47e3fvn3DA.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="b528"><p><b>Qualitative Example 3: Chart understanding and reasoning</b></p></blockquote><blockquote id="2673"><p>The model can read the text well, understand the interesting points of the data, as well as follow the instructions to create a markdown table.</p></blockquote><figure id="e25c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*0ET7Wf_bkCUmZ3RdNcUfLw.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="a3ca"><p><b>Qualitative Example 4: Geometrical Reasoning</b></p></blockquote><blockquote id="e1fd"><p>The model can solve geometry problems by retrieving the necessary information from the image and showing reasoning steps.</p></blockquote><figure id="3176"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*sAFDahX9bJEWJkHb-9AWbA.png"><figcaption></figcaption></figure><h2 id="ca2c"

Options

4.2. Video Understanding</h2><p id="e6f7">The Gemini models are tested to understand and interpret videos, performing exceptionally well on various video captioning and question-answering tasks. These models were given <b>16 frames</b> from each video to analyze. The Gemini Ultra model showed top performance in both few-shot learning and zero-shot learning. The tests confirmed its capability to comprehend and reason about the content over time, such as understanding a sequence of actions in a soccer game.</p><figure id="0ac3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mATnyzIxlSYWLMwwrc1_iQ.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="5559"><p><b>Qualitative Example 5: Video Understanding of the Soccer Scene</b></p></blockquote><blockquote id="380c"><p>The model can understand what’s happening in the video and also respond to the questions to provide suggestions on improving the shooting technique.</p></blockquote><figure id="bed4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*rge0fGaA7laa2mdp4QqU9A.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="e018">4.3. Image Generation</h2><p id="882c">As Gemini is a unified model with both text and image decoders, it can directly generate images from prompts without needing a descriptive intermediary, enabling complex interleaved sequences of image and text outputs.</p><blockquote id="e1a1"><p><b>Qualitative Example 6: Few-shots Learning with Image-Text Pairs</b></p></blockquote><blockquote id="323c"><p>In this example, there is a 1-shot setting prompt, showing two ideas of creating a blue cat and a blue dog with yellow ears from the yarn. With this given information, the model can generate creative suggestions with two new colours, pink and green. It is noted that the images are AI-generated by the image decoders and the style is similar to the 1-shot setting prompt.</p></blockquote><figure id="09ae"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*d_2qSeK5Vkvu1KJu-ExUGg.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="c8d4"><p><b>Qualitative Example 7: Interleaved Image and Text Generation</b></p></blockquote><blockquote id="93dc"><p>This is an instruction to create a blog post with a few pictures of the dog. The images are generated successfully based on the given text context (posing happily at different landmarks).</p></blockquote><figure id="b879"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*GXr5Fd-ZWZNitV0OnEhbcQ.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="3380">4.4. Audio Understanding</h2><p id="ed72">The <b>Gemini Nano-1</b> and <b>Gemini Pro</b> models were evaluated on public benchmarks for automatic speech recognition (ASR) and speech translation tasks, showing significant performance gains over the Universal Speech Model (USM) and various iterations of Whisper. These benchmarks included different benchmarks, with metrics in <b>word error rate (WER)</b> and <b>BLEU scores</b> for translation. Notably, Gemini Pro excelled across all tasks, particularly in FLEURS, benefiting from being trained on its dataset. Even without FLEURS data, Gemini Pro still surpassed Whisper’s performance. While Gemini Nano-1 also outperformed USM and Whisper in most cases, it fell short of the FLEURS benchmark. Gemini Ultra has not yet been evaluated on audio tasks, but there are expectations for its superior performance due to the increased model scale.</p><figure id="acc8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*MHrEWxRvx-NHcz5ItDDp5A.png"><figcaption>Source: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><blockquote id="82a7"><p><b>Qualitative Example 8: Transcriptions</b></p></blockquote><blockquote id="eb66"><p>Audio 1: <a href="https://storage.googleapis.com/deepmind-media/gemini/fleurs1.wav">storage.googleapis.com/deepmind-media/gemini/fleurs1.wav</a>| Audio 2: <a href="https://storage.googleapis.com/deepmind-media/gemini/fleurs2.wav">storage.googleapis.com/deepmind-media/gemini/fleurs2.wav</a>

In this example, even though some of the words are not spoken clearly by the woman, Gemini Pro can correctly recognise all the rare words and proper nouns.</p></blockquote><figure id="3934"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gCEcKjDrdsdv0i-qOZ94KA.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h2 id="01e9">4.5. Modality Combination</h2><p id="178d">Apart from interleaved with texts and images (the most common combination for users to prompt), we can also prompt with images and audio natively.</p><blockquote id="0501"><p><b>Qualitative Example 9: Audio-visual Input</b></p></blockquote><blockquote id="0143"><p>The provided cooking scenarios show the process of making an omelette. The users ask Gemini questions with audio and images. The model can understand the questions spoken in the audio and the corresponding images, to respond to users in the text format.</p></blockquote><figure id="77aa"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*g0M9Vt7b1A8hzxXVjPkQOQ.png"><figcaption>Example from <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf">Gemini Technical Report</a></figcaption></figure><h1 id="772c">5. Conclusion</h1><p id="156b">Gemini is a ground-breaking model which supports native multimodality in a single, unified model. It also achieves SOTA performance in different benchmarks. With the launch of Google AI Studio (i.e. re-branded version of MediaPipe) and Gemini APIs, I’m so excited to see the unlimited possibilities of using it to create innovative tech solutions for our society.</p></article></body>

[Paper Summary] Overview of Google’s First Multimodal Model: Gemini

Source: Introducing Gemini: Google’s most capable AI model yet (blog. google)

On 6 December 2023, Google finally launched their new multimodal model, Gemini. This is a very competitive model which achieves new SOTA performance on different evaluation benchmarks. To provide the latest information, this article will summarize the technical report of Gemini and showcase some interesting demos illustrated by Google.

1. What is Gemini?

Gemini is a set of multimodal models which can handle text, image, audio, and video data and perform various tasks across these modalities, such as language understanding, image recognition, video understanding, and audio processing. Gemini models come in three sizes: Ultra, Pro, and Nano to suit different applications, from complex reasoning to on-device deployment.

Ultra: The best model to perform highly complex tasks, like reasoning and multimodal tasks. It will be available on Bard Advanced in January 2024.
Pro: Well-balanced between performance and resource optimization. Good enough to perform most of the tasks. Bard is currently using Gemini Pro for text prompt queries.
Nano: Tailor-made for on-device usage. There are two parameter sizes: 1.8B (Nano-1) and 3.25B (Nano-2). Thanks to the 4-bit quantization and other hardware optimization, Android (currently Pixel 8 Pro) users can use it in offline settings. In terms of app development, AICore SDK is provided for developers to use this model, or even do fine-tuning with LoRA for domain-specific tasks.

2. How Google Trained Gemini?

2.1. How the Traditional Vision-Language Model Works?

In the current visual-language model research, most of the visual abilities are powered by a connection module to link open-source LLMs (like LLaMA) and vision encoders (like OpenAI CLIP). Usually, visual embeddings are converted to text embeddings with feature alignment, and then concatenate with the text input embeddings. Therefore, LLMs can understand the image context with those virtual tokens.

Original Method of Language-Vision Models: Combining Vision Encoder and LLMs

2.2. Gemini: A Unified Embedding Space for All Modalities

In contrast, for Gemini, models are multimodal from the beginning and can natively output images using discrete image tokens. This means that Gemini has a unified embedding space to represent multimodal data.

For audio data, Gemini supports audio features from the Universal Speech Model (USM) to understand audio contexts without converting to texts by speech recognition. For video data, Gemini handles it into a sequence of images which can be interleaved naturally with text or audio for the model input.

On the other hand, Gemini supports 32K context length, which is 4 times longer than the PaLM2 model.

2.3. Training Dataset & Infrastructure

According to the technical report, there are not many details mentioning the training dataset. However, it stated that it is “trained on a dataset that is both multimodal and multilingual”, with different modalities like web documents, code, images, audio and videos. Also, based on the previous Deepmind research paper “Training Compute-Optimal Large Language Models”, smaller models (Gemini Pro & Nano) are fed with significantly more tokens to minimize the performance difference to larger models.

For training infrastructure, TPUv5e and TPUv4 are used in the training process.

3. Evaluation of Gemini: Text-Only Capabilities

The technical report spends the most number of pages on evaluating the model performance in different dimensions. Here we will select some of them.

3.1. Academic Benchmarks

In this part, the report discusses the performance of Gemini Pro and Ultra, compared to other existing models like GPT-3.5 and GPT-4 across various academic benchmarks. Gemini Ultra can outperform other models and even surpass human expert performance in MMLU, achieving 90.04% accuracy (check notes below). The model’s success is attributed to its use of a chain-of-thought prompting approach combined with self-consistency, particularly in areas requiring specialist knowledge and reasoning.

Gemini Ultra also demonstrates strong performance in mathematics, coding, and reading comprehension, with impressive results in benchmarks like GSM8K, HumanEval, and HellaSwag. The paper acknowledges some minor issues with data contamination but emphasizes the importance of robust and nuanced evaluation benchmarks.

Notes: For MMLU performance with 90.04% accuracy, the experiment uses an approach called “uncertainty-routed chain-of-thought”, which will select the majority vote on 32 CoT samples based on threshold / greedy sample choices.

Qualitative Example 1: Mathematics with Calculus

The model can understand a calculus problem and answer correctly with step-by-step explanations with LaTeX expressions.

Qualitative Example 2: Multi-step reasoning and mathematics

The model follows the instructions to show where the numbers come from and answer the question given in the task.

3.2. Trends in Capabilities with Different Model Sizes

By comparing the performance of various types of task domains with all Gemini models, we can observe that Gemini Ultra has outperformed all six capabilities mentioned in the chart, especially in very complex tasks like Math/Science. For Gemini Pro, it is the most optimal choice when we consider inferencing resources and performance issues.

When we deep dive into Gemini Nano’s capabilities, although not all the tasks are well-performed like Germini Pro, the accuracy in some of the benchmarks is comparable or much better than existing LLMs. Take the MATH dataset as an example:

LLaMA-2 70B: 13.5% (4-shot)
Grok 1 33B: 23.9% (4-shot)
Gemini Nano 1(1.8B) / 2(3.25B): 13.5% / 22.8% (4-shot)

3.3. Multilinguality

The Gemini models excel in multilingual tasks, including machine translation. In the WMT 23 translation benchmark, Gemini Ultra stands out for its proficiency in both high and low-resource languages, surpassing other models like GPT-4 and PaLM 2, demonstrating strong performance in very low-resource languages.

3.4. Complex Reasoning Systems

Apart from Gemini, AlphaCode 2 is also launched at the same time to demonstrate how we can build reasoning systems to deal with complex multi-step problems. It conducts an extensive search over possible programs, followed by filtering, clustering, and reranking to select the most promising code candidates. AlphaCode 2 solved 43% of problems across 12 contests with Codeforces, a significant improvement over its predecessor, AlphaCode, which solved 25%. AlphaCode 2 places in the 85th percentile among competitors, whereas AlphaCode is in the 46th.

4. Evaluation of Gemini: Multimodal Capabilities

4.1. Image Understanding

Gemini Ultra shows superior performance over other SOTA approaches across a variety of image understanding tasks, even without fine-tuning and using any OCR tools. The tasks cover a wide spectrum of capabilities: from high-level object recognition (VQAv2), fine-grained transcription (TextVQA, DocVQA), and chart understanding (ChartQA, InfographicVQA), to multimodal reasoning (Ai2D, MathVista, MMMU). Gemini Ultra shows a consistent edge in zero-shot settings — where models generate answers without prior exposure to specific task data.

Qualitative Example 3: Chart understanding and reasoning

The model can read the text well, understand the interesting points of the data, as well as follow the instructions to create a markdown table.

Qualitative Example 4: Geometrical Reasoning

The model can solve geometry problems by retrieving the necessary information from the image and showing reasoning steps.

4.2. Video Understanding

The Gemini models are tested to understand and interpret videos, performing exceptionally well on various video captioning and question-answering tasks. These models were given 16 frames from each video to analyze. The Gemini Ultra model showed top performance in both few-shot learning and zero-shot learning. The tests confirmed its capability to comprehend and reason about the content over time, such as understanding a sequence of actions in a soccer game.

Qualitative Example 5: Video Understanding of the Soccer Scene

The model can understand what’s happening in the video and also respond to the questions to provide suggestions on improving the shooting technique.

4.3. Image Generation

As Gemini is a unified model with both text and image decoders, it can directly generate images from prompts without needing a descriptive intermediary, enabling complex interleaved sequences of image and text outputs.

Qualitative Example 6: Few-shots Learning with Image-Text Pairs

In this example, there is a 1-shot setting prompt, showing two ideas of creating a blue cat and a blue dog with yellow ears from the yarn. With this given information, the model can generate creative suggestions with two new colours, pink and green. It is noted that the images are AI-generated by the image decoders and the style is similar to the 1-shot setting prompt.

Qualitative Example 7: Interleaved Image and Text Generation

This is an instruction to create a blog post with a few pictures of the dog. The images are generated successfully based on the given text context (posing happily at different landmarks).

4.4. Audio Understanding

The Gemini Nano-1 and Gemini Pro models were evaluated on public benchmarks for automatic speech recognition (ASR) and speech translation tasks, showing significant performance gains over the Universal Speech Model (USM) and various iterations of Whisper. These benchmarks included different benchmarks, with metrics in word error rate (WER) and BLEU scores for translation. Notably, Gemini Pro excelled across all tasks, particularly in FLEURS, benefiting from being trained on its dataset. Even without FLEURS data, Gemini Pro still surpassed Whisper’s performance. While Gemini Nano-1 also outperformed USM and Whisper in most cases, it fell short of the FLEURS benchmark. Gemini Ultra has not yet been evaluated on audio tasks, but there are expectations for its superior performance due to the increased model scale.

Qualitative Example 8: Transcriptions

Audio 1: storage.googleapis.com/deepmind-media/gemini/fleurs1.wav| Audio 2: storage.googleapis.com/deepmind-media/gemini/fleurs2.wav In this example, even though some of the words are not spoken clearly by the woman, Gemini Pro can correctly recognise all the rare words and proper nouns.

4.5. Modality Combination

Apart from interleaved with texts and images (the most common combination for users to prompt), we can also prompt with images and audio natively.

Qualitative Example 9: Audio-visual Input

The provided cooking scenarios show the process of making an omelette. The users ask Gemini questions with audio and images. The model can understand the questions spoken in the audio and the corresponding images, to respond to users in the text format.

5. Conclusion

Gemini is a ground-breaking model which supports native multimodality in a single, unified model. It also achieves SOTA performance in different benchmarks. With the launch of Google AI Studio (i.e. re-branded version of MediaPipe) and Gemini APIs, I’m so excited to see the unlimited possibilities of using it to create innovative tech solutions for our society.