The Busy Person Intro to LLMs

After getting inspired by Andrej Karpathy’s recent video on the same topic, I decided to challenge myself to write a blog covering the general development of LLMs in the last year or so. By no means is this a complete guide to LLMs, but the topic is so big that it would need an entire series to cover all the major developments in LLMs. So, without further ado, let’s delve deep into the fascinating world of LLMs and try to give you a quick tour of the fascinating world of LLMs.

What are LLMs?
Types of LLMs
Model Training
Network Dreams
How does it work?
Training an assistant
Reinforced Learning Human Feedback (RLHF)
Current SOTA LLMs
LLM Scaling Laws
Thinking Systems
Custom LLMs
LLM-OS similarities
Jailbreaks
Conclusion

Check out the other two parts of this series:

A detailed look at Self-Attention and Transformers

A deep intuition behind Transformer architecture.

medium.com

Can LLMs really reason and plan?

LLMs are great, but can they really plan?

medium.com

What are LLMs?

LLMs or Large Language Models is a type of generative AI that has been trained on a massive scale of data (petabytes of data) and can produce novel responses to any type of question, that’s why the name generative. These models are based on the Transformer architecture and require very big GPU-based data centers. FYI, it took around 100 million USD to train ChatGPT. These models are massive, they can’t fit on any single server; their parameter counts go in the order of trillions.

One way to think about these models is to consider them as an idea-generation machine. They can generate or give an approximate answer to any textual query, even if it has not seen similar stuff in the past. Given the size of the model data, these models have somehow captured the essence of language, in some cases they even learned languages that were not even part of the training data.

Some of you may ask, how is this possible? This is called Emergent capabilities. It has been shown in research that these models might develop completely new capabilities and capacities as we increase their size.

Note: There are some doubts about LLMs' emergent capabilities as well.

If you want to know more details about Transformer architecture: click here

Types of LLMs

Currently, we have three types of LLMs:

Closed Source: Models like ChatGPT or GPT-4 can only be accessed as a service through the API provided by OpenAI.
Open Source, closed architecture: Like Llama 2, the weight files of these models are open-sourced, meaning we can run them locally, and use them to train smaller models, but their architecture is still not public.
Fully open source: T5, BERT, and GPT-2 are a few of the fully open-sourced models, having both their weights and architecture open-sourced.

Compared to Fully open-sourced models, Closed or partial Open-source models are much bigger in size.

But, what exactly are these models, how do they look like, and what is their structure?

If we talk about models like Llama 2, 70B, it is a model with 70 billion parameters. There are two files that we need to use this model. A weight file or parameter file (140 GB of basic matrices) and a 500-line C code that run these weights or parameter files.

The reason why the parameter file is 140 GB is because the precision of weight matrices is Float16, which means 2 bytes, meaning 140 GB for a 70 billion parameters model.

These two files are completely self-contained packages, we don’t need anything else, just compile the C code, that points towards this parameter file and we are good to go to generate text on any kind of question locally (without internet or anything).

Model training

Model training is not as straightforward as Model Inference, for inference we can use a local machine with decent GPU, but training LLMs is massively expensive.

Llama 2 which has around 10 TB of training data, it was gathered by crawling all over the internet. It needed 6000 GPUs running for 12 days costing 2 Million USD to train it.

Basically, it compressed this huge chunk of internet (10 GB) into a 140 GB of parameter file. You can think of it as zipping of the internet into a single file but with one big difference, it is a lossy compression, unlike lossless compression of zipping.

LLM compressing the internet to model parameters (Img Src)

Just so you know current SOTA in LLMs is maybe 10 or even 100 times bigger than this.

But why are we calling training of LLMs as compression of the internet?

The simplest way to understand LLMs is that they are the next word prediction machine, and it can be mathematically shown that predicting the next word is very similar to compressing that data.

Compression is not the only way to think about LLMs, we can also think about it as finding the conditional probability, predicting the next token (token can be thought of similar to a word) given a sequence of words, like in the below example, finding the probability of predicting store, given I went to the. LLMs can be thought of as a model that learns all these conditional probabilities.

However, this view of conditional probability is very simplistic, and a lot of researchers don’t agree with that. For instance, Illya Sutskever (creator of ChatGPT) thinks that in order to predict the next words, these models have built internal world models like humans do to understand concepts.

But how? He believes that since the internet is a representation of human experience, and these models are being trained on massive amounts of data, somehow these models have built a detailed understanding of abstract concepts. That’s why they have shown the ability to score better than humans in a lot of cases and even surpass the Turing Test.

Illya’s position is highly contested by other Scientists like VP of META Yann LeCun. As of now, there is no consensus on how and why LLMs are giving superb performance in some cases showing almost sentient behavior and at times failing miserably to do even basic stuff.

Here’s a great article highlighting this: click here

Network dreams

Since the LLMs are trained on raw internet text, they come across different types of textual data. For instance, in the below diagram, we can see the three different types of output dreamed by the network.

But, why I’m calling it a dream?

Because if you zoom in on the details of this, you will realize it is creating fake ISBN numbers and that type of stuff. It has seen so much of that type of data, so it thinks that whenever it says ISBN, it needs to be followed by a random number.

Network dreams different types of internet documents (Img Src)

Another important thing to remember here is that these models are called autoregressive types of models. These models do not produce the entire texts in one go, but they predict one word at a time. Whatever word is predicted, that word is also taken as input to predict the next words and this process continues; it is summarized beautifully in the below diagram.

Auto-Regressive nature of LLMs (Img Src)

Some of you may wonder, then how does it know when to stop producing the next word? These models are fine-tuned to predict special tokens like <BREAK>. The moment this token is produced, the model stops producing any text.

Karpathy on Hallucinations:

How does it work?

As mentioned above, LLMs use Transformer architecture to do their magic.

A detailed look at Self-Attention and Transformers

A deep intuition behind Transformer architecture.

medium.com

We know precisely how transformer architecture works, we know all the mathematical operations, yet

Why are we saying that despite knowing how these models work, why are we not sure about what they learned? The answer to this lies in something called the reversal curse [3].

For instance, if we ask these models, who is Elon Musk’s mother? It answers correctly, but when we ask, who is Maye Musk’s (Elon’s mother) son, it fails. And that’s why we say we don’t know what it learned.

To understand in more detail this behavior of LLMs, there is a completely new field emerging right now called Mechanistic Interpretability. Here’s my series on Mechanistic Interpretability [4]. Do check it out if you want to know the best research area in the overall AI field.

Mechanistic interpretability in the context of DL involves delving into the inner workings of these models to understand how individual components contribute to overall behavior. To put it technically, we want to elucidate the function of each neuron, layer, and pathway within the network with respect to how they process inputs and affect the final output.

Simply put, we somehow need to reverse-engineer the weights of the trained neural networks and convert them into a large binary file or Python code that, later on, we can use to know the limits of what the given model can’t do. It’s easier said than done; doing this is extremely hard; even small models of a few layers forget about models like GPT.

Training an Assistant

What we talked till now is called pre-training, giving our models to just predict document-like answers, but that’s not what we want. We want an assistant-like model.

Training LLMs have several components, the below diagram perfectly summarizes the entire LLM training pipeline.

So the use of the pre-training is to give a rough understanding of the language to the world. It learned grammatical rules and other rules a language might follow, but to make the network produce factually correct and coherent answers, we do something called, Supervised Fine-tuning.

Supervised Fine-tuning is a very time-consuming process, where we get actual people to write a set of questions and answers. We expose the model to this human-written content, and by doing so, the model learns to start behaving like an assistant.

Note: The pre-training stage has much more data but of low quality, compared to low volume and high-quality text of Supervised Fine-tuning. This step makes the model behave more like an assistant rather than producing entire documents. In short, it changes model behavior from producing internet documents to question-answer pairs.

To learn more about Supervised Fine-tuning: click here

Reinforced Learning Human Feedback (RLHF)

As mentioned above getting the data for the Supervised Fine-tuning stage is very costly and time-consuming. For instance, it is much easier for humans to identify a good paragraph than write a good paragraph.

So this third stage is an optional stage, where we use the LLM itself to generate multiple answers to the same question and then select the best answers, mixed with some more edits with human data labelers, we fine-tune the model even further to get it more accurate. The below image summarizes the full process of RLHF.

Read the full article from OpenAI on RLHF-based Instruct GPT: click here

To learn more about Supervised Fine-tuning: click here

There is a brand new research called RLAIF [7] where we completely replace humans and use AI itself to self-critique its own response.

Note: Despite all this training and fine-tuning LLMs hallucinate quite a lot.

Current SOTA LLMs

Different LLMs can produce completely different results, thus evaluating them becomes quite tricky. There are some standard tests, yet, it is impossible to tell when an LLM is going to hallucinate.

Look at the below screenshot, I asked the same question to both the LLMs, and their answers were wildly different. I asked both the LLMs to explain to me something an LLM can’t explain.

This question in itself is contradictory, in the case of Model B, it started talking about consciousness and other abstract ideas, and Model A gave a simple two-line response.

Do you see the problem? None of them are wrong, yet they are very very different.

Comparing LLMs responses (Img Src: Belongs to author)

Given below is the ranking of different LLMs on this benchmark [10].

Different LLMs leader board ranking (Img Src)

LLM scaling laws

It has been shown by many researchers that LLMs accuracy is remarkably smooth and well-behaved function. It means that we might need not to invent a new algorithm to reach AGI, we can just keep scaling these networks and keep getting better and more intelligent models.

This theory has worked well till now, but that doesn’t mean that it can’t hit the wall. It is very much possible that after a certain point, accuracy will not be increased no matter how much we scale these systems.

A few scientists believe that we can’t achieve AGI by just using LLMs and Backpropagation. All the emergent intelligent behavior might not be a direct result of more computing. But as of now, we can neither confirm nor deny any of the positions. Scaling laws are working as of present.

AI assistant using tools

The newer version of ChatGPT called GPT-4 can not only produce text, but it can also use different tools. For instance, this new version has been given the capability to search the internet and use that information to give more up-to-date and better answers.

Earlier versions of GPT could only answer things till the last date of their training data, usually a year behind the present.

How does GPT know when to use a browser?

Let Karpathy answer this himself:

How GPT decides to use the internet (Img Src)

GPT 4-V can now even create Images, and understand sound and images directly. This new breed of LLMs is becoming more and more Multi-Modal (ability to take different types of Inputs).

Thinking Systems

There is this idea that our brain works in two modes, one where it just retrieves information and the other where it has to think consciously. If someone asks you 2+2, we don’t actually think, we just retrieve this information, but if someone asks 1089x12, we have to consciously calculate it.

Another way to think about this System 1 and System 2 type of intelligence is blitz chess, where you play instinctively, compared to regular chess, where you analyze everything in detail.

So we believe that current LLMs are type 1 intelligence, that’s why every word is taking almost the same time to be predicted.

So, the idea is, can we make these LLMs self-reflect and move towards a type 2 intelligent system? Luckily, there has been some progress in this area as well, like the Chain of Thought, Tree of Thoughts, Graph of Thought, and the most recent one called Everything of Thoughts [11].

Here’s an interesting blog on the same topic: click here

Note: Because everything is so open-ended in language space, it is very tough to train LLMs in a reward function setting like Alpha Go, where it self-improved itself just by creating multiple agents.

Custom LLMs

Recently GPT released something called a GPT store, where we can give custom instructions to slightly modify the behavior of LLMs, where we can give our LLMs a type of personality. For instance, recent Grok AI from Twitter also has this option, where it can give very sassy responses.

Personally, I have tested out the GPT store, not the GROK AI, but honestly speaking I’m not that impressed by this personality thing, it’s a hit-and-miss kind of a thing.

Giving custom instruction to GPT (Img Src; Belong to author)

LLM-OS similarities

LLMs started as a generative AI model that just produces the next word given a set of tokens. But in the past few months, their capabilities have been extended to so many things, that they can hardly be called Language based models only.

LLMs have evolved to do all these things:

They are getting closer and closer to an Operating system. We can think of context window as RAM memory. We can think of retrieval Augmented generation as Hard Disk memory, where it pulls up relevant information from a Database and uses that to generate new text. We now even have an app store like Google App Store.

Jailbreaks

Just like any other piece of software, LLMs have not yet reached a place where they are very secure.

ChatGPT can be fooled with different types of attacks. By shifting the tone of the question in the below answer the user was able to ask GPT about how to make Napalms, now the answer might not be correct but that doesn’t mean it can’t be used to produce bio weapons and other types of dangerous stuff.

Avoiding this type of situation is the biggest challenge in AI safety.

Another example of Jailbreak is Claude being fooled by Base 64 encoding of the same input text.

Claude was fooled by the base64 version of the same text (Img Src)

Other type of attack is Prompt Injection, in the below diagram, GPT accessed a website and that website was injected with this malicious prompt to show a fraud link to the user.

GPT guiding user towards a fraud link (Img Src)

This type of attack are already a big problem, but the bigger problem is attackers were able to extract private data. For instance a lot of companies banned using ChatGPT because the system released their internal data. which got stored in GPT servers when those people asked their questions to GPT.

Not only that, a recent researcher even extracted training data from GPT.

To learn more about this latest jailbreak: click here

Conclusion

The field of LLMs is still evolving, there are a lot of cool, exciting, and a bit dangerous things happening in the field. LLM security is a big risk, and my personal belief is that these models should not be released without proper safeguarding (that’s just my personal belief). Next time, we will try to get a more detailed look at what GPTs can and can’t do.

Check out the other two parts of this series: Part 1 and Part 3

Well, that’s it for this blog. I hope you had fun, Thanks to Andrey Karpathy for making an awesome video on the topic of LLMs: original video

It takes a lot of effort to write such content, please consider supporting and follow me at https://vishal-ai.medium.com/

Check out my other articles as well. I write about a lot of SOTA AI.