SM Raiyyan



Visual ChatGPT: Send, Receive, and Edit Images

Large language models (LLMs) such as T5, BLOOM, and GPT-3 have advanced significantly in recent years. ChatGPT is a notable step forward because it is trained to retain conversational context, respond appropriately to follow-up questions, and give accurate answers. Impressive as it is, ChatGPT cannot process visual data, because it was trained on a single modality: language.

Visual ChatGPT

The ability of Visual Foundation Models (VFMs) to interpret and create complex pictures has demonstrated tremendous potential for computer vision. Due to the limitations imposed by the nature of task specification and the predetermined input-output formats, VFMs are less adaptive than conversational language models in human-machine interaction.


A natural solution to developing a ChatGPT-like system that can understand and produce visual content is to train a multimodal conversational model. Nevertheless, building such a system would require a large amount of data and computing power.

Architecture of Visual ChatGPT

A recent Microsoft study proposes a solution to this problem in the form of Visual ChatGPT, which communicates with vision models through text and prompt chaining. Instead of training a brand-new multimodal ChatGPT from scratch, the researchers built Visual ChatGPT on top of ChatGPT and connected it to various VFMs. To bridge the gap between ChatGPT and these VFMs, they introduce a Prompt Manager with the following features:

  1. Specifies the input and output formats and tells ChatGPT what each VFM can do.
  2. Handles the histories, priorities, and conflicts of the various Visual Foundation Models.
  3. Converts visual information, such as PNG images, depth images, and mask matrices, into a language format that ChatGPT can understand.
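To make the first and third features more concrete, here is a minimal Python sketch of how tool capabilities and image references might be rendered as plain text for a language model. It is an illustration only, not the project's actual code: `ToolSpec`, `build_tool_prompt`, `describe_image`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Text description of one visual model (all fields are illustrative)."""
    name: str
    description: str
    input_format: str
    output_format: str


def build_tool_prompt(tools):
    """Render the tool list as plain text that a language model can read."""
    lines = ["You can use the following tools:"]
    for tool in tools:
        lines.append(
            f"- {tool.name}: {tool.description} "
            f"Input: {tool.input_format}. Output: {tool.output_format}."
        )
    return "\n".join(lines)


def describe_image(path, width, height):
    """Turn an image into a text reference the language model can reason about."""
    return f"Image file '{path}' ({width}x{height} pixels)"


# Hypothetical tool registry; the names do not come from the project.
TOOLS = [
    ToolSpec("depth_estimation", "Predicts a depth map from an image.",
             "image path", "depth image path"),
    ToolSpec("depth_to_image", "Generates an image from a depth map and a text prompt.",
             "depth image path, text prompt", "image path"),
]

prompt = build_tool_prompt(TOOLS)
print(prompt)
print(describe_image("image/flower.png", 512, 512))
```

The point of the sketch is that everything the language model sees is text: the tools are described in sentences, and an image is represented by its file name plus a short description.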

With the Prompt Manager integrated, ChatGPT can call these VFMs iteratively and use their feedback, repeating the process until it either meets the user's request or reaches an end condition.

Consider this scenario: A user uploads a picture of a yellow flower along with a complex language request, such as “please build a red flower conditioned on the predicted depth of this image and then construct it like a cartoon, step by step.” With the Prompt Manager, Visual ChatGPT starts the execution of connected Visual Foundation Models. To be more specific, it uses a depth estimation model first to identify the depth information, a depth-to-image model next to create a figure of a red flower using the depth information, and finally a style transfer VFM based on a Stable Diffusion model to turn the appearance of this image into a cartoon.
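The flower scenario can be sketched as a small chain of tool calls. This is a simplified illustration under stated assumptions: the three functions are placeholders that only derive an output file name instead of running a model, and `PLAN` is a scripted stand-in for the decisions ChatGPT would make at each step. None of these names come from the project.

```python
# Placeholder tools: each one only derives an output file name so the
# control flow can be followed without any real model.
def estimate_depth(path, text=""):
    return path.replace(".png", "_depth.png")


def depth_to_image(path, text=""):
    return path.replace("_depth.png", "_generated.png")


def cartoonize(path, text=""):
    return path.replace(".png", "_cartoon.png")


TOOLS = {
    "depth_estimation": estimate_depth,
    "depth_to_image": depth_to_image,
    "style_transfer": cartoonize,
}

# Scripted stand-in for the tool choices the language model would make.
PLAN = [
    ("depth_estimation", ""),
    ("depth_to_image", "a red flower"),
    ("style_transfer", "cartoon"),
]


def run_pipeline(image_path, plan):
    """Run each planned tool on the previous tool's output and log every step."""
    current = image_path
    history = []
    for tool_name, text in plan:
        current = TOOLS[tool_name](current, text)
        history.append(f"{tool_name} -> {current}")
    return current, history


final_image, history = run_pipeline("image/flower.png", PLAN)
for step in history:
    print(step)
```

Each step consumes the file produced by the previous one, so the history doubles as a record of how the image was transformed.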

Overview of Visual ChatGPT

In the processing chain described above, the Prompt Manager serves as a dispatcher for ChatGPT: it supplies the visual representations and keeps track of how the information is transformed at each step. Once it receives the “cartoon” hint from the Prompt Manager, Visual ChatGPT ends the pipeline and displays the final result. In effect, ChatGPT acts as a “god model” that chooses among a number of smaller models, with text serving as the universal interface, and this is how the system achieves multimodality. This structure can be seen by running the source through Pyreverse.

Some highlights that stood out to me:

  1. Provide responses that are coherent and relevant to the topic at hand.
  2. When talking about images, Visual ChatGPT is very strict to the file name and will never fabricate nonexistent files.
  3. Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather than faking the image content and image file name.
  4. Visual ChatGPT should use tools to finish following tasks, rather than directly imagine from the description.
  5. Very strict to the filename correctness and will never fake a file name if it does not exist. You will remember to provide the image file name loyally if it’s provided in the last tool observation.
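The file name rules above are stated as instructions to the model. As a hypothetical add-on, and not something the source describes, a reply could also be checked in code for file names that no tool has produced. The function name, the regular expression, and the restriction to `.png` files are assumptions made for this sketch.

```python
import re


def find_unknown_files(reply, known_files):
    """Return .png file names mentioned in the reply that are not in known_files."""
    mentioned = re.findall(r"[\w/.-]+\.png", reply)
    return [name for name in mentioned if name not in known_files]


# Files that were uploaded or produced by tools so far (illustrative values).
known = {"image/flower.png", "image/flower_cartoon.png"}
reply = "The result is image/flower_cartoon.png, based on image/made_up.png."

unknown = find_unknown_files(reply, known)
print(unknown)
```

A non-empty result would mean the reply refers to a file that does not exist, which is exactly what the rules are meant to prevent.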

So, what do you think about Visual ChatGPT? Let’s discuss in the comments section :)

Paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (https://arxiv.org/pdf/2303.04671.pdf)

GitHub: Visual ChatGPT (https://github.com/microsoft/visual-chatgpt)

My other writings —

https://medium.com/@smraiyyan/list/awesome-chatgpt-prompts-6b5f19244ab3
