nter ChatGPT</h2><p id="a9aa">As with so many things, ChatGPT can make this system of selecting and combining models faster and easier.</p><p id="a3b9">In their paper about HuggingGPT, the authors explain how the system works. A user enters a query into the interface — generally for something ChatGPT can’t do, like transcribing printed text.</p><figure id="9809"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*dAy0req5xHlt9Bn7wZ3gAw.png"><figcaption></figcaption></figure><p id="e1d8">ChatGPT analyzes the query and tries to make sense of it. It then plans out the steps needed to solve the task. For example, in order to read printed text, the model might need to use Optical Character Recognition to transform the printed words into machine-readable text.</p><p id="7920">Once it understands what it needs to do, ChatGPT goes to Hugging Face and selects a relevant model based on the task. It reads the documentation for the model and then writes the necessary code to interface with the model. In our OCR example, it finds an OCR model in Hugging Face’s database that will help it decode the printed words on a page.</p><p id="d0b7">It can do this iteratively, finding multiple models that help get it closer to a solution and coding them together on the fly. 
In one example from the paper, HuggingGPT strings together three models in order to count the number of zebras in a photo.</p><figure id="0694"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*pnbOhL7f8OpsbktN6G7HIg.png"><figcaption></figcaption></figure><p id="d0be">Once HuggingGPT has found the models it needs, it connects everything up and solves the task.</p><p id="33b0">This strung-together model allows the system to do everything from reading printed text to creating a video based on a text prompt.</p><p id="53ae">Instead of needing to build a model that can perform all the necessary functions to complete a task, engineers can let HuggingGPT build the model on the fly and solve the task itself.</p><h2 id="f97b">Following the Brain</h2><p id="a479">As a trained neuroscientist, the HuggingGPT paper immediately made me think of another powerful system that achieves impressive feats by stringing together multiple specialized modules: the human brain.</p><p id="fd57">Most people think of the brain as one entity. In reality, though, it’s a mishmash of multiple <a href="https://www.pnas.org/doi/10.1073/pnas.1510619112">highly specialized modules</a> that all work together to get complex tasks done.</p><p id="4ab5">The <a href="https://memory.ucsf.edu/symptoms/executive-functions#:~:text=Anatomy%20of%20Executive%20Functions&text=The%20executive%20system%20involves%20the,our%20closest%20nonhuman%20primate%20relatives.">prefrontal cortex</a> is the conductor that keeps everything working together. Much like ChatGPT in the HuggingGPT model, the prefrontal cortex is responsible for understanding tasks, planning a process for executing them, and farming out parts of that process to other, specialized modules.</p><p id="5012">Crucially, the prefrontal cortex does relatively little specialized processing itself. 
When it wants to help you see something in the world, it communicates with the <a href="https://www.ncbi.nlm.nih.gov/books/NBK482504/">Primary Visual Cortex</a> on the brain’s occipital lobe. To hear things, it works with the <a href="https://www.ncbi.nlm.nih.gov/books/NBK10900/#:~:text=The%20primary%20auditory%20cortex%20(A1,contains%20a%20precise%20tonotopic%20map.">Auditory Cortex</a> on the temporal lobe.</p><p id="3e54">Individual parts of the brain can be incredibly specialized. The <a href="https://link.springer.com/referenceworkentry/10.1007/978-1-4419-1698-3_641#:~:text=Definition,objects)%20for%20typically%20developing%20individuals.">Fusiform Face Area</a> is a dedicated region whose job is to, you guessed it, process faces. Likewise, deeper brain regions like the brain stem and cerebellum control essential processes like breathing and your heartbeat, with little to no input from the “conductor” up in the prefrontal cortex.</p><p id="3700">HuggingGPT feels similar. Training a new language model to handle highly varied tasks like reading printed text, counting, and producing human-readable language output would take ages.</p><p id=
"60b9">Instead, HuggingGPT cleverly sidesteps the problem. Much like the brain, it uses one model trained for high-level understanding and processing to direct the actions of many low-level modules that perform the mundane tasks required to get things done.</p><h2 id="e0d8">Stumbling Towards AGI</h2><p id="c11b">HuggingGPT isn’t elegant or parsimonious. It doesn’t constitute a technological advancement, much less a breakthrough. Rather, it’s a clever, messy <a href="https://www.techopedia.com/definition/3825/kludge#:~:text=12%20January%2C%202017-,What%20Does%20Kludge%20Mean%3F,clunky%20or%20disjointed%20IT%20systems.">kludge</a> — a duct-tape-and-string model assembled from spare parts, loosely bound together with some code and processes.</p><p id="6353">But then, so is the human brain.</p><p id="cbeb">Again, most people think of the brain as elegant. But it isn’t —the brain is the mother of all kludges.</p><p id="4e1b">Our eyes, for example, <a href="https://theconversation.com/look-your-eyes-are-wired-backwards-heres-why-38319">are backwards</a>. This makes our vision far worse than other creatures (octopuses’ eyes are the right way around, and they see much better than us.) It just turns out to be easier to grow backwards eyes than forwards ones, so our flawed ones have stuck around.</p><p id="e0d3">Likewise, our memories are designed to recall general information, not specifics. If I ask you what you had for breakfast yesterday, your brain will <a href="https://www.apa.org/monitor/2009/07-08/brain">call up memories of lots of random breakfasts</a> you’ve had in the past, as well as your thoughts, feelings, and aspirations around the concept of “breakfast.” These things are not useful for solving the task at hand.</p><p id="90fe">Still, the kludgey nature of the brain hasn’t prevented it from excelling as an organ for thinking, reasoning, and processing the world. 
Rather than being a liability, the brain’s hacked-together architecture is a huge asset.</p><p id="8917">Because the brain’s modules can be used in a variety of different ways, they’re not locked into one way of understanding the world.</p><p id="c457">The same brain modules that once helped us chuck a spear at a running antelope or dodge a saber tooth tiger can now help us drive through a congested city, or avoid a baseball flying at our faces. Kludgey, modular things are inefficient and inelegant. But they’re also adaptable.</p><p id="b630">That’s why, if and when AGI arrives, it will likely look more like HuggingGPT than a single, unified AI model that can do everything for everyone.</p><p id="af0f">HuggingGPT has many of the same advantages as the human brain. Yes, it’s messy and a bit hacked together. But that messiness provides a great deal of flexibility. As new basic models come out, for example, HuggingGPT could easily integrate them into its set of capabilities without needing to be retrained, just by reading their documentation and writing a relevant connector.</p><p id="ec6c">It can also call up only the modules that are actually needed to perform a task, which avoids wasting costly processing time. Using the prefrontal cortex to manage something simple, like your gut’s motility, would be a waste. Your brain avoids this waste by using the prefrontal cortex for things it excels at, while farming out the legwork of bodily functions to other brain regions.</p><p id="0d59">Likewise, HuggingGPT can call up fast, low-level modules to perform basic tasks like doing math or transcribing text, thus freeing up its high-level (and computationally intensive) LLM for more complex ones. This would result in a much more efficient model than a single LLM that could solve any problem.</p><p id="6ae4">With its spaghetti architecture of random modules, HuggingGPT isn’t pretty. But like the human brain, there’s a certain beauty in its messy, inelegant utility. 
And as anyone who has used their equally messy, kludgey brain to play a sport, learn an instrument, or speak a language can attest, there’s a certain undeniable power in such an architecture, too.</p></article></body>
HuggingGPT is a Messy, Beautiful Stumble Towards Artificial General Intelligence
It works much like the human brain
Illustration by the author with AI tools
As AI researchers work toward the holy grail of building an Artificial General Intelligence — an AI system that can handle any task — most people likely expect that if such a system ever arrives, it will be elegant and beautifully designed.
One might imagine that a truly intelligent machine would leverage some new breakthrough technology or interface with the real world in some novel way.
A new paper from Microsoft suggests otherwise. It shares a model called HuggingGPT — a spaghetti-like grab bag of AI models borrowed from an open-source repository and loosely strung together on the fly.
HuggingGPT is a messy, inelegant kludge. But its brain-like architecture just might be a major step towards real-life AGI.
Gathering Models
In the hype surrounding ChatGPT, many people forget that AI has been around in various forms for roughly 70 years.
In that time, researchers have developed some incredibly impressive AI models. In the past, though, most of those models were highly domain-specific.
You could create a fantastic model to recommend movies based on a person’s watching habits, but it probably wouldn’t write a great blog post. Large Language Models (LLMs) like ChatGPT are impressive in their generality: they’re not specific to a single domain, and can handle nearly any kind of language-based task with ease.
Still, if you need to solve a highly specific problem, there are often better AI models available than ChatGPT. LLMs are great at interpreting text, but there’s a lot they can’t do. Ask ChatGPT a rudimentary math problem, and you’ll see what I mean.
HuggingGPT, a new model described in Microsoft’s paper, seeks to take the best parts of LLMs and mash them up with, basically, every other publicly released AI model that humans have created and published.
Getting Specific
AI is a famously collaborative and open field. Many AI scientists, even at big companies, insist on publishing the models they create. That means that when one company makes an AI breakthrough, the model they’ve built (if not the data on which it was trained) is often available to everyone.
Hugging Face, a website, has emerged to collect and share those highly specialized models. The site provides the source code for newly published models and lets people experiment with the models in live demos, verifying or testing their capabilities without the need to write a lot of code.
Engineers at companies like Facebook and Google often head to Hugging Face to find new models for their projects, much as coders borrow liberally from Stack Overflow.
The website is a grab bag of useful, free components. Those freely available components allow engineers to avoid reinventing the wheel when they need an AI capability that’s important, but not core to their product.
Enter ChatGPT
As with so many things, ChatGPT can make this system of selecting and combining models faster and easier.
In their paper about HuggingGPT, the authors explain how the system works. A user enters a query into the interface — generally for something ChatGPT can’t do, like transcribing printed text.
ChatGPT analyzes the query and tries to make sense of it. It then plans out the steps needed to solve the task. For example, in order to read printed text, the model might need to use Optical Character Recognition to transform the printed words into machine-readable text.
Once it understands what it needs to do, ChatGPT goes to Hugging Face and selects a relevant model based on the task. It reads the documentation for the model and then writes the necessary code to interface with the model. In our OCR example, it finds an OCR model in Hugging Face’s database that will help it decode the printed words on a page.
It can do this iteratively, finding multiple models that help get it closer to a solution and coding them together on the fly. In one example from the paper, HuggingGPT strings together three models in order to count the number of zebras in a photo.
Once HuggingGPT has found the models it needs, it connects everything up and solves the task.
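To make that plan, select, and run loop a little more concrete, here is a minimal sketch in Python. It is not the paper's code. The planner is a hard-coded stand-in for the step where ChatGPT breaks a query into tasks, and the three "models" are toy functions rather than real Hugging Face models. Every name here (`Task`, `MODEL_REGISTRY`, `plan`, `run`, and the `fake_*` functions) is hypothetical and exists only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    """One step in the plan (hypothetical structure, not the paper's format)."""
    id: int
    kind: str                                      # which capability is needed
    deps: List[int] = field(default_factory=list)  # ids of tasks whose output feeds this one


# Toy stand-ins for specialized models. Each takes a list of inputs.
def fake_detector(inputs: List[Any]) -> List[str]:
    # Pretend these labels were detected in the photo.
    return ["zebra", "zebra", "tree", "zebra"]


def fake_counter(inputs: List[Any]) -> int:
    labels = inputs[0]
    return sum(1 for label in labels if label == "zebra")


def fake_describer(inputs: List[Any]) -> str:
    return f"The photo contains {inputs[0]} zebras."


# Assumed mapping from task kind to the model that handles it.
MODEL_REGISTRY: Dict[str, Callable[[List[Any]], Any]] = {
    "object-detection": fake_detector,
    "count-zebras": fake_counter,
    "text-generation": fake_describer,
}


def plan(query: str) -> List[Task]:
    """Stand-in for the LLM planning step: always returns the same fixed plan."""
    return [
        Task(0, "object-detection"),
        Task(1, "count-zebras", deps=[0]),
        Task(2, "text-generation", deps=[1]),
    ]


def run(query: str, image: str) -> str:
    """Execute the plan in order, feeding each task the outputs it depends on."""
    results: Dict[int, Any] = {}
    for task in plan(query):  # assumes tasks are listed in dependency order
        model = MODEL_REGISTRY[task.kind]
        inputs = [results[d] for d in task.deps] or [image]
        results[task.id] = model(inputs)
    return results[max(results)]  # output of the last task is the answer


print(run("How many zebras are in this photo?", "zebras.jpg"))
# prints: The photo contains 3 zebras.
```

The controller in `run` knows nothing about zebras or images. It only knows how to follow a plan and pass outputs along, which is the division of labor the article describes.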
This strung-together model allows the system to do everything from reading printed text to creating a video based on a text prompt.
Instead of needing to build a model that can perform all the necessary functions to complete a task, engineers can let HuggingGPT build the model on the fly and solve the task itself.
Following the Brain
As a trained neuroscientist, I read the HuggingGPT paper and immediately thought of another powerful system that achieves impressive feats by stringing together multiple specialized modules: the human brain.
Most people think of the brain as one entity. In reality, though, it’s a mishmash of multiple highly specialized modules that all work together to get complex tasks done.
The prefrontal cortex is the conductor that keeps everything working together. Much like ChatGPT in the HuggingGPT model, the prefrontal cortex is responsible for understanding tasks, planning a process for executing them, and farming out parts of that process to other, specialized modules.
Crucially, the prefrontal cortex does relatively little specialized processing itself. When it wants to help you see something in the world, it communicates with the primary visual cortex in the brain’s occipital lobe. To hear things, it works with the auditory cortex in the temporal lobe.
Individual parts of the brain can be incredibly specialized. The Fusiform Face Area is a dedicated region whose job is to, you guessed it, process faces. Likewise, deeper brain regions like the brain stem and cerebellum control essential processes like breathing and your heartbeat, with little to no input from the “conductor” up in the prefrontal cortex.
HuggingGPT feels similar. Training a new language model to handle highly varied tasks like reading printed text, counting, and producing human-readable language output would take ages.
Instead, HuggingGPT cleverly sidesteps the problem. Much like the brain, it uses one model trained for high-level understanding and processing to direct the actions of many low-level modules that perform the mundane tasks required to get things done.
Stumbling Towards AGI
HuggingGPT isn’t elegant or parsimonious. It doesn’t constitute a technological advancement, much less a breakthrough. Rather, it’s a clever, messy kludge — a duct-tape-and-string model assembled from spare parts, loosely bound together with some code and processes.
But then, so is the human brain.
Again, most people think of the brain as elegant. But it isn’t: the brain is the mother of all kludges.
Our eyes, for example, are backwards. This makes our vision far worse than that of other creatures (octopuses’ eyes are the right way around, and they see much better than we do). It just turns out to be easier to grow backwards eyes than forwards ones, so our flawed ones have stuck around.
Likewise, our memories are designed to recall general information, not specifics. If I ask you what you had for breakfast yesterday, your brain will call up memories of lots of random breakfasts you’ve had in the past, as well as your thoughts, feelings, and aspirations around the concept of “breakfast.” These things are not useful for solving the task at hand.
Still, the kludgey nature of the brain hasn’t prevented it from excelling as an organ for thinking, reasoning, and processing the world. Rather than being a liability, the brain’s hacked-together architecture is a huge asset.
Because the brain’s modules can be used in a variety of different ways, they’re not locked into one way of understanding the world.
The same brain modules that once helped us chuck a spear at a running antelope or dodge a saber-toothed tiger can now help us drive through a congested city or avoid a baseball flying at our faces. Kludgey, modular things are inefficient and inelegant. But they’re also adaptable.
That’s why, if and when AGI arrives, it will likely look more like HuggingGPT than a single, unified AI model that can do everything for everyone.
HuggingGPT has many of the same advantages as the human brain. Yes, it’s messy and a bit hacked together. But that messiness provides a great deal of flexibility. As new basic models come out, for example, HuggingGPT could easily integrate them into its set of capabilities without needing to be retrained, just by reading their documentation and writing a relevant connector.
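Here is a rough sketch of why that kind of plug-in flexibility comes cheaply in a modular design. It assumes a simple registry in which each model ships with a one-line description. A crude word-overlap score stands in for the step where the LLM reads a model's documentation and decides whether it fits. The names `ModelCard`, `register`, and `select` are made up for this example and don't come from the paper or from any Hugging Face library.

```python
from typing import Callable, Dict, NamedTuple


class ModelCard(NamedTuple):
    """Hypothetical record: what a model does, plus a function that runs it."""
    description: str
    run: Callable[[str], str]


registry: Dict[str, ModelCard] = {}


def register(name: str, description: str, run: Callable[[str], str]) -> None:
    """Add a model to the pool. The controller itself is never retrained or edited."""
    registry[name] = ModelCard(description, run)


def select(task_description: str) -> str:
    """Pick the model whose description shares the most words with the task.

    This is a deliberately crude stand-in for an LLM reading model documentation.
    """
    task_words = set(task_description.lower().split())

    def overlap(name: str) -> int:
        return len(task_words & set(registry[name].description.lower().split()))

    return max(registry, key=overlap)


# The starting pool of toy "models".
register("ocr", "read printed text from an image", lambda x: f"text from {x}")
register("captioner", "describe the contents of an image", lambda x: f"caption for {x}")

# A new model comes out later. Registering it is the only change needed.
register("translator", "translate text into another language", lambda x: f"translated {x}")

print(select("translate this text into French"))
# prints: translator
```

Before the third `register` call, the same query would have fallen back to the closest existing match. After it, the new capability is available without touching `select` at all.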
It can also call up only the modules that are actually needed to perform a task, which avoids wasting costly processing time. Using the prefrontal cortex to manage something simple, like your gut’s motility, would be a waste. Your brain avoids this waste by using the prefrontal cortex for things it excels at, while farming out the legwork of bodily functions to other brain regions.
Likewise, HuggingGPT can call up fast, low-level modules to perform basic tasks like doing math or transcribing text, thus freeing up its high-level (and computationally intensive) LLM for more complex ones. This would result in a much more efficient model than a single LLM that could solve any problem.
With its spaghetti architecture of random modules, HuggingGPT isn’t pretty. But like the human brain, there’s a certain beauty in its messy, inelegant utility. And as anyone who has used their equally messy, kludgey brain to play a sport, learn an instrument, or speak a language can attest, there’s a certain undeniable power in such an architecture, too.