Ferret LLM: Apple's Entry in the Race for “Multimodal AI” Technology

Step into the future where AI understands images and language like never before, thanks to hybrid region representation

Introduction

Apple AI researchers have introduced Ferret, an innovative multimodal large language model (MLLM) specialized for referring and grounding in images. Published on arXiv, this paper marks the first time Apple has openly shared research on a large generative AI model.

Referring and grounding are critical spatial reasoning skills for AI systems. Referring involves comprehending semantics of a specified image region, while grounding means localizing objects or areas based on textual descriptions. Humans seamlessly integrate referring and grounding when communicating about images. However, existing ML models lack such detailed spatial understanding.

To address this limitation, Ferret employs novel techniques like a hybrid region representation and spatial-aware visual sampling. Trained on a large instruction tuning dataset called GRIT, Ferret achieves significant gains over previous models in referring, grounding and conversational tasks. Remarkably, it also mitigates object hallucination issues faced by generative models.

As one of the tech giant's first attempts at integrating multimodality into large language models, Ferret highlights Apple's investment in AI research and its potential impact across Apple products and services.

Keywords: Apple Ferret AI, Multimodal Large Language Model (MLLM), Referring and Grounding, Spatial Reasoning, GRIT Dataset, Hybrid Region Representation, Spatial-Aware Visual Sampling, Apple AI Research, Innovative AI, Generative AI Model, Image Semantics, Object Localization, Visual Information, Language Semantics, Free-Form Region Shapes, Spatial Understanding, Object Hallucination, Multimodal AI, Ground and Refer Instruction Tuning, Visual Grounding, Grounded Captioning

source — here

Challenges in Precise Spatial Understanding

Humans can easily point to or describe specific objects, areas or relationships in an image during communication. We take such referring and grounding abilities for granted in daily visual conversations.

However, most AI systems today lack the detailed spatial understanding required for pinpoint referring and grounding. For instance, DALL-E can generate realistic images from text prompts but cannot accurately comprehend local regions within an image.

Referring and grounding require properly associating visual information with language semantics at a fine-grained level. But existing works have focused on learning them individually using separate models.

Humans seamlessly transfer knowledge between referring and grounding tasks. Current multimodal AI models also cannot handle the free-form region shapes humans use, like dots, strokes or complex polygons. The models are limited to just points and bounding boxes.

The Apple researchers identified three key limitations in existing models:

  • Inability to unify referring and grounding within one framework
  • Lack of support for diverse free-form region shapes
  • Lack of open-vocabulary capability, instruction following, and robustness

Hybrid Region Representation in Ferret

To overcome these challenges, the researchers propose Ferret, a novel MLLM architecture specialized for referring and grounding.

The key innovation in Ferret is a hybrid region representation that combines discrete coordinates with continuous visual features.

For coordinates, Ferret uses direct natural language numerals (like “100, 200, 500, 600”). The coordinates are quantized into discrete bins.
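A minimal sketch of how pixel coordinates might be quantized and written out as plain numerals. The bin count of 1000 and the helper names quantize_box and box_to_text are illustrative assumptions, not values or functions taken from the Ferret code.

```python
def quantize_box(box, image_width, image_height, num_bins=1000):
    """Map a pixel-space box (x1, y1, x2, y2) onto integer bins in [0, num_bins - 1].

    num_bins=1000 is an assumed value chosen for illustration.
    """
    x1, y1, x2, y2 = box

    def to_bin(value, extent):
        # Scale the coordinate into the bin range, then clamp so that a
        # coordinate on the far image edge still lands in the last bin.
        return max(0, min(num_bins - 1, int(value * num_bins / extent)))

    return (
        to_bin(x1, image_width),
        to_bin(y1, image_height),
        to_bin(x2, image_width),
        to_bin(y2, image_height),
    )


def box_to_text(bins):
    """Render quantized coordinates as ordinary numerals inside brackets."""
    return "[" + ", ".join(str(b) for b in bins) + "]"


quantized = quantize_box((64.0, 128.0, 320.0, 384.0), image_width=640, image_height=512)
print(box_to_text(quantized))  # [100, 250, 500, 750]
```

Because the coordinates end up as ordinary text, no new vocabulary entries are needed for them; the language model reads and writes them like any other numbers.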

For free-form shapes, a spatial-aware visual sampler extracts continuous features. It samples points in the region mask and propagates neighborhood information to handle varying sparse shapes.

The discrete and continuous representations are combined as the hybrid input for referring regions. This allows Ferret to handle points, boxes or free-form shapes seamlessly.

For grounding, Ferret directly generates coordinate bounding boxes in text output for detected objects. The model implicitly aligns object names with their locations.
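Since the boxes arrive as text, a consumer of the model output has to parse them back out. The sketch below assumes boxes appear as "[x1, y1, x2, y2]" and that coordinates were quantized into 1000 bins; both the output style and the helper names are assumptions made for illustration.

```python
import re

# Assumed output style: four integers in square brackets, e.g. "[100, 200, 500, 600]".
BOX_PATTERN = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def extract_boxes(generated_text):
    """Return every bracketed group of four integers as a tuple of ints."""
    return [
        tuple(int(group) for group in match.groups())
        for match in BOX_PATTERN.finditer(generated_text)
    ]


def to_pixels(bins, image_width, image_height, num_bins=1000):
    """Map quantized coordinates back to pixel space (num_bins is an assumed value)."""
    x1, y1, x2, y2 = bins
    return (
        x1 * image_width / num_bins,
        y1 * image_height / num_bins,
        x2 * image_width / num_bins,
        y2 * image_height / num_bins,
    )


answer = "A dog [100, 200, 500, 600] chases a ball [620, 410, 700, 480]."
for box in extract_boxes(answer):
    print(box, to_pixels(box, image_width=640, image_height=512))
```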

source — here

Enabling Open-Vocabulary, Robustness via GRIT

The researchers collected a large Ground and Refer Instruction Tuning (GRIT) dataset with 1.1M examples to train Ferret’s capabilities. GRIT includes:

  • Data converted from existing datasets into instruction-following formats
  • 34k refer-and-ground instruction-tuning conversations, generated with the help of ChatGPT/GPT-4, for open-vocabulary tuning
  • 95k negative samples mined to improve model robustness

It covers objects, relationships, region descriptions and reasoning across input/output combinations.

GRIT’s conversational data and negative mining are critical to make Ferret open-vocabulary, instruction-following and robust.
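To make the data recipe concrete, here is a rough sketch of turning one box annotation into an instruction-following pair and of producing a negative sample for a category that is absent from the image. The question and answer templates, the field names, and the helper functions are all hypothetical; GRIT's actual templates and mining procedure are described in the paper and are not reproduced here.

```python
def to_instruction_sample(label, bins):
    """Turn one (category, quantized box) annotation into a referring Q/A pair."""
    box_text = "[{}, {}, {}, {}]".format(*bins)
    return {
        "question": f"What is the object in the region {box_text}?",
        "answer": f"It is a {label}.",
    }


def mine_negative_labels(present_labels, vocabulary):
    """Pick categories from the vocabulary that do not occur in this image."""
    present = set(present_labels)
    return sorted(label for label in vocabulary if label not in present)


def to_negative_sample(absent_label):
    """Ask about an absent category so the model learns to answer 'no'."""
    return {
        "question": f"Is there a {absent_label} in the image?",
        "answer": f"No, there is no {absent_label} in the image.",
    }


annotations = [("dog", (100, 200, 500, 600)), ("ball", (620, 410, 700, 480))]
samples = [to_instruction_sample(label, bins) for label, bins in annotations]
absent = mine_negative_labels([label for label, _ in annotations], ["dog", "cat", "ball"])
samples += [to_negative_sample(label) for label in absent]
for sample in samples:
    print(sample)
```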

Here is an explanation of the model architecture for Ferret:

source — here

Image Encoder

  • Ferret first uses a pre-trained CLIP-ViT-L/14 model to encode the input image into a feature map Z ∈ R^(H×W×C)
  • CLIP-ViT-L/14 is a Vision Transformer (not a convolutional network) trained on large amounts of image-text pairs
  • It extracts rich semantic visual features from the image

Spatial-Aware Visual Sampler

  • To handle irregularly shaped regions, a spatial-aware visual sampler is proposed
  • It samples points inside the region mask and propagates information from neighbors
  • This accounts for varying sparsity across complex free-form shapes
  • It uses techniques like farthest point sampling and gathering local neighbors
  • Outputs a feature f that summarizes the continuous visual features of the region
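The sample-then-gather idea from the list above can be sketched in plain Python. This is a toy version under stated assumptions: points are 2D coordinates inside the region mask, each point carries a small feature vector, and neighbor features are fused with an element-wise max. The function names and the choice of max pooling here are illustrative rather than taken from the Ferret implementation.

```python
import math


def farthest_point_sampling(points, num_samples):
    """Greedily pick indices of points that are far from those already chosen."""
    if not points or num_samples <= 0:
        return []
    chosen = [0]  # start from the first point; a random start is also common
    # Distance from each point to its nearest chosen point (-1 marks chosen points).
    min_dist = [math.dist(p, points[0]) for p in points]
    min_dist[0] = -1.0
    while len(chosen) < min(num_samples, len(points)):
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
        min_dist[next_index] = -1.0
    return chosen


def k_nearest(points, center_index, k):
    """Indices of the k points closest to points[center_index], excluding itself."""
    center = points[center_index]
    others = [i for i in range(len(points)) if i != center_index]
    others.sort(key=lambda i: math.dist(points[i], center))
    return others[:k]


def pool_neighbors(points, features, centers, k):
    """Fuse each center's feature with its k neighbors by element-wise max."""
    pooled = []
    for c in centers:
        group = [c] + k_nearest(points, c, k)
        dim = len(features[c])
        pooled.append([max(features[i][d] for i in group) for d in range(dim)])
    return pooled


points = [(0, 0), (1, 0), (10, 0), (10, 10), (0, 10)]
features = [[1, 0], [0, 2], [3, 0], [0, 0], [0, 5]]
centers = farthest_point_sampling(points, 3)
print(centers, pool_neighbors(points, features, centers, k=1))
```

Farthest point sampling spreads the sampled points over the whole region, so thin strokes and large blobs are both covered; the neighbor step then lets each sampled point summarize its local surroundings.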

Hybrid Region Representation

  • Discrete coordinates are expressed in natural language numerals (e.g. “100, 200, 500, 900”)
  • Continuous visual features f are extracted using the spatial-aware sampler
  • For each region, its coordinates and visual feature f are concatenated as the hybrid representation
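One way to picture the hybrid representation is a text prompt that carries the coordinates plus a placeholder token whose embedding is later swapped for the region feature f. The placeholder string, the token list, and the helper names below are hypothetical and only illustrate the idea of pairing coordinates with a continuous feature.

```python
REGION_PLACEHOLDER = "<region_fea>"  # hypothetical placeholder token


def format_region(bins):
    """Write a region as its coordinates followed by the feature placeholder."""
    return "[{}, {}, {}, {}] {}".format(*bins, REGION_PLACEHOLDER)


def splice_region_features(tokens, token_embeddings, region_features):
    """Replace each placeholder's embedding with the next region feature, in order."""
    features = iter(region_features)
    spliced = []
    for token, embedding in zip(tokens, token_embeddings):
        spliced.append(next(features) if token == REGION_PLACEHOLDER else embedding)
    return spliced


prompt = "What is in the region " + format_region((100, 200, 500, 900)) + "?"
print(prompt)

# Toy tokens and 1-dimensional embeddings, just to show the splice.
tokens = ["What", "is", "[100, 200, 500, 900]", REGION_PLACEHOLDER, "?"]
embeddings = [[0.0], [0.0], [0.0], [0.0], [0.0]]
print(splice_region_features(tokens, embeddings, region_features=[[9.0]]))
```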

Language Model

  • A decoder-only transformer (Vicuna) is used as the language model backbone
  • Image features Z undergo an additional transformation to match text token dimensions
  • The hybrid region representations are injected into the input text sequence
  • For grounding, coordinate bounding boxes are directly generated in the text output
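The "additional transformation" on the image features can be pictured as a learned projection from the vision feature width to the text embedding width. The sketch below assumes a single linear layer, y = Wx + b, with toy dimensions; whether Ferret uses exactly this form is an assumption, and the function name is illustrative.

```python
def project_patches(patch_features, weight, bias):
    """Apply y = W x + b to every patch feature.

    weight has one row per output dimension; each row is as long as the input feature.
    """
    projected = []
    for x in patch_features:
        y = [
            sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)
        ]
        projected.append(y)
    return projected


# Toy example: 2-dimensional vision features projected to 3-dimensional token space.
patches = [[1.0, 2.0], [0.5, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 1.0]
print(project_patches(patches, weight, bias))
```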

Training Objectives

  • Ferret is trained end-to-end on the GRIT dataset via instruction tuning
  • The model learns to comprehend spatial semantics from referred input regions
  • It learns to ground relevant objects in output by generating coordinates
  • Auxiliary losses may be applied to different modules to facilitate learning

Superior Performance across Diverse Tasks

Extensive experiments demonstrate Ferret’s strong improvements over earlier referring and grounding MLLMs such as Shikra and Kosmos-2 on:

  • Input referring: 20% higher accuracy in classifying referred objects
  • Output grounding: State-of-the-art on visual grounding and grounded captioning benchmarks
  • Conversation: 20.4% higher on visual dialog tasks needing referring and grounding compared to leading MLLMs

Remarkably, Ferret also greatly reduces object hallucination issues faced by generative ML models. It represents a big step towards reliable and controllable multimodal AI.
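Grounding quality is usually judged by how well a predicted box overlaps the ground-truth box. The following sketch computes intersection-over-union (IoU) and the share of predictions that clear a 0.5 threshold, a common convention for visual grounding. The helper names are illustrative, and this is not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the paired target meets the threshold."""
    if not predictions:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, targets) if iou(p, t) >= threshold)
    return hits / len(predictions)


predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(accuracy_at_iou(predicted, ground_truth))  # 0.5
```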

source: https://arxiv.org/pdf/2310.07704v1.pdf

Ferret vs. GPT-4V(ision): A Quick Glance at Referring & Grounding

source: https://arxiv.org/pdf/2310.07704v1.pdf

Referring

  • For GPT-4V, referred regions are specified either by coloring them or providing coordinates in text
  • But it struggles with precise understanding of small local regions compared to Ferret
  • In the motorcycle example, GPT-4V fails to comprehend the ‘shock absorber’ accurately
  • Ferret is specialized for fine-grained spatial semantics, outperforming GPT-4V for small details

Grounding

  • GPT-4V can localize objects when prompted to provide bounding boxes
  • But it fails to accurately ground small objects in complex scenes, like traffic lights
  • Ferret precisely identifies most objects even in cluttered images, as in the traffic light example
  • Specialized techniques like spatial-aware sampling help Ferret’s precision

source: https://arxiv.org/pdf/2310.07704v1.pdf

Let’s Try

Install

  1. Clone this repository and navigate to the ml-ferret folder
git clone https://github.com/apple/ml-ferret
cd ml-ferret

Install Package

conda create -n ferret python=3.10 -y
conda activate ferret
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0

Install additional packages for training cases

pip install ninja
pip install flash-attn --no-build-isolation

Train

FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
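The batch-size rule above is simple arithmetic. The sketch below solves it for gradient_accumulation_steps, using the global batch size of 128 from the hyperparameter table; the per-device batch sizes in the example are made-up values, and the helper name is illustrative.

```python
def accumulation_steps(global_batch_size, per_device_batch_size, num_gpus):
    """Solve global = per_device x accumulation_steps x num_gpus for accumulation_steps."""
    per_step = per_device_batch_size * num_gpus
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            "global batch size must be a positive multiple of "
            "per_device_batch_size x num_gpus"
        )
    return global_batch_size // per_step


# Example per-device sizes are hypothetical; only the global size of 128 comes from the table.
print(accumulation_steps(128, per_device_batch_size=16, num_gpus=8))  # 1
print(accumulation_steps(128, per_device_batch_size=8, num_gpus=4))   # 4
print(accumulation_steps(128, per_device_batch_size=4, num_gpus=2))   # 16
```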

Hyperparameters

We use a similar set of hyperparameters as LLaVA(Vicuna) in finetuning.

Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay
FERRET-7B | 128 | 2e-5 | 3 | 2048 | 0
FERRET-13B | 128 | 2e-5 | 3 | 2048 | 0

Prepare Vicuna checkpoint and LLaVA’s projector

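Before you start, prepare the base model Vicuna, which is an instruction-tuned chatbot. Download its weights by following the instructions at https://github.com/lm-sys/FastChat#model-weights. FERRET uses Vicuna v1.3.

Then download LLaVA’s first-stage pre-trained projector weights (7B: https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-7b-v1.3, 13B: https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-13b-v1.3).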

FERRET Training

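Training scripts are provided for both model sizes (7B: https://github.com/apple/ml-ferret/blob/main/experiments/ferret_7b_train.sh, 13B: https://github.com/apple/ml-ferret/blob/main/experiments/ferret_13b_train.sh).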

Evaluation

Please see the evaluation doc for details: https://github.com/apple/ml-ferret/blob/main/EVAL.md

Checkpoints

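We extracted the delta between our pre-trained model and Vicuna. First download the Vicuna weights as described in "Prepare Vicuna checkpoint and LLaVA’s projector". Then download our prepared weight offsets using wget or curl (7B: https://docs-assets.developer.apple.com/ml-research/models/ferret/ferret-7b/ferret-7b-delta.zip, 13B: https://docs-assets.developer.apple.com/ml-research/models/ferret/ferret-13b/ferret-13b-delta.zip) and unzip them. Lastly, apply the offsets to the Vicuna weights by running the following script: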

# 7B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-7b-v1-3 \
    --target ./model/ferret-7b-v1-3 \
    --delta path/to/ferret-7b-delta
# 13B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-13b-v1-3 \
    --target ./model/ferret-13b-v1-3 \
    --delta path/to/ferret-13b-delta
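Conceptually, applying a delta means adding the released offsets to the base weights, tensor by tensor. The toy sketch below uses plain lists in place of tensors. Copying delta-only entries through unchanged is an assumption about how extra modules might be handled, and the function is not the code behind ferret.model.apply_delta.

```python
def apply_delta(base_weights, delta_weights):
    """Return target weights where target[name] = base[name] + delta[name].

    Entries that exist only in the delta are copied as-is (an assumed policy
    for parameters the base model does not have).
    """
    target = {}
    for name, delta in delta_weights.items():
        if name not in base_weights:
            target[name] = list(delta)
            continue
        base = base_weights[name]
        if len(base) != len(delta):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base, delta)]
    return target


base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -1.0], "projector.weight": [3.0]}
print(apply_delta(base, delta))
```

Distributing only the difference lets the authors share their fine-tuned model without redistributing the base weights, which remain subject to their own license terms.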

Notices: Apple’s rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.

Please refer to the next section for how to set up a local demo with the pre-trained weights.

Demo

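To run the demo, you need FERRET checkpoints available locally, either from your own training run or from applying the released deltas as described in the Checkpoints section. The demo uses a Gradio web UI. Run the following commands one by one (each starts a long-running process, so use a separate terminal for each).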

Launch a controller

python -m ferret.serve.controller --host 0.0.0.0 --port 10000

Launch a Gradio web server

python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --add_region_feature

Launch a model worker

This is the worker that loads the checkpoint and runs inference on the GPU. Each worker is responsible for a single model, specified in --model-path.

CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/FERRET-13B-v0 --add_region_feature

Wait until the process finishes loading the model and you see “Uvicorn running on …”. Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.

Example of Ferret Interactive Demo.

Conclusion

With techniques like hybrid region representation and spatial-aware sampling, Ferret sets a new bar for multimodal AI on precise spatial understanding. The large GRIT dataset was key in training these capabilities.

This work is an impressive demonstration of Apple’s AI research investments into impactful generative models. Integrating referring and grounding abilities in ML systems could enable more natural and interpretable human-AI interaction.

Github Link: https://github.com/apple/ml-ferret
