Simone Tedeschi

Summary

AirLLM is a Python package that optimizes inference memory usage, allowing 70B LLMs to run inference on a single 4GB GPU card without degrading model performance.

Abstract

AirLLM is a Python package that enables running large language models (LLMs) with high memory requirements on a single 4GB GPU card without compromising output quality. The package works by splitting the original LLM into smaller sub-models, each containing one or a few layers, and loading them on demand during inference. This way, only the necessary sub-models are kept in memory at any given time, reducing memory usage. AirLLM can also optionally apply block-wise quantization to compress the sub-models further, reducing disk loading time and memory usage. The package supports most of the top models in the Hugging Face open LLM leaderboard, such as Platypus2, LLaMa2, Mistral, Mixtral, SOLAR, StellarBright, and more. AirLLM is simple to use and gives access to state-of-the-art LLMs with low memory requirements. However, loading layers sequentially from slower storage such as disk increases the latency of the inference process.

Photo by Sharon Pittaway on Unsplash

How to Run 70B LLMs on a Single 4GB GPU

Have you ever dreamed of using state-of-the-art large language models (LLMs) for your natural language processing (NLP) tasks, but felt frustrated by the high memory requirements? If so, you might be interested in AirLLM, a Python package that optimizes inference memory usage, allowing 70B LLMs to run inference on a single 4GB GPU card. It needs no quantization, distillation, pruning, or other model compression technique that would degrade model quality.

What is AirLLM and how does it work?

Large language models (LLMs) are computationally expensive and require a lot of memory to train and run. Much of that memory goes to the weights of their many layers: a 70B model can have more than 80 layers. During inference, however, the layers execute one after another, and each layer relies only on the output of the previous one. Therefore, it is not necessary to keep all layers in GPU memory at once. Instead, we can load a layer from disk just before executing it, do all its calculations, and then completely free its memory. This way, the GPU memory required for weights is only about the parameter size of a single transformer layer, i.e. roughly 1/80 of the full model, or ~2GB (a 70B model in 16-bit precision is about 140GB, and 140GB / 80 is about 1.75GB).
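To make the idea concrete, here is a minimal sketch of layer-by-layer inference in plain Python. It is not AirLLM's actual code: the toy "layers" are small matrices stored as JSON files, and the names save_layers, apply_layer and layerwise_forward are made up for this illustration. The only point is that a single layer's weights are held in memory at any time.

```python
import gc
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weights to its own file (a stand-in for per-layer shards)."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: a matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def layerwise_forward(paths, x):
    """Run the forward pass while holding only one layer's weights at a time."""
    for path in paths:
        with open(path) as f:
            weights = json.load(f)   # load this layer only
        x = apply_layer(weights, x)  # its input is the previous layer's output
        del weights                  # drop the reference before the next load
        gc.collect()
    return x


with tempfile.TemporaryDirectory() as tmp:
    toy_layers = [
        [[1.0, 0.0], [0.0, 1.0]],   # identity
        [[2.0, 0.0], [0.0, -1.0]],  # doubles the first value, negates the second
        [[1.0, 1.0]],               # sums both values
    ]
    layer_paths = save_layers(toy_layers, tmp)
    result = layerwise_forward(layer_paths, [3.0, 4.0])

print(result)
```

In a real model the loop body would move the layer's tensors to the GPU and release them afterwards, but the control flow is the same: load, compute, free, repeat.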

The main idea behind AirLLM is to split the original LLM into smaller sub-models, each containing one or a few layers, and load them on demand during inference. This way, only the necessary sub-models are kept in memory at any given time, and the rest are stored on disk. AirLLM can also optionally apply block-wise quantization to compress the sub-models further, reducing both the disk loading time and the memory usage.
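If block-wise quantization is new to you, the sketch below shows the general technique on a short list of numbers. This is a generic illustration, not AirLLM's implementation, and the function names and block size are assumptions chosen for the example. Each block gets its own scale, so a large value in one block does not wipe out the precision of small values in another.

```python
def quantize_blockwise(values, block_size=4, bits=8):
    """Quantize floats to signed integer codes, using one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit codes
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        codes = [round(v / scale) for v in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from the per-block scales and integer codes."""
    return [code * scale for scale, codes in blocks for code in codes]


weights = [0.10, -0.20, 0.05, 0.40, 8.0, -3.0, 1.5, 0.25]
blocks = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(blocks)

# The round trip is lossy: each value is off by at most half of its block's scale.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(max_error)
```

Storing small integer codes plus one scale per block is what shrinks the files on disk, which in turn shortens the time spent loading each layer.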

AirLLM supports most of the top models in the Hugging Face open LLM leaderboard, such as Platypus2, LLaMa2, Mistral, Mixtral, SOLAR, StellarBright and more.

If you are interested in exploring LLMs and Generative AI further, you can check out my previous articles for more insights and examples.

How to use AirLLM?

Using AirLLM is very simple and intuitive. You just need to install the airllm pip package, and then use the AutoModel class to load the LLM of your choice from the Hugging Face hub or a local path. You can then perform inference as you would with a regular transformer model, using the generate method. For example, the following code snippet shows how to use AirLLM to load and use the Platypus2-70B-instruct model, which can answer natural language questions and follow instructions.

pip install airllm

from airllm import AutoModel

MAX_LENGTH = 128

# load the model from the Hugging Face hub
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or load the model from a local path
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

# prepare the input text
input_text = [
    'What is the capital of United States?',
]

# tokenize the input text
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# generate the output text
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

# decode the output text
output = model.tokenizer.decode(generation_output.sequences[0])

# print the output text
print(output)

The output of this code snippet is:

What is the capital of United States?
The capital of the United States is Washington, D.C.

Note that during the first inference, AirLLM will decompose the original LLM and save it layer by layer, so make sure you have enough disk space. After that, AirLLM reuses the saved layers and loads them on demand, so later runs skip the splitting step.
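If you want to check the available disk space before the first run, a small helper like the one below can do it with the standard library. It is only a sketch: has_room_for_split is a made-up name, and both the assumption that the layer-wise copy is about as large as the original weights and the 10% safety margin are guesses, not figures from AirLLM.

```python
import shutil


def has_room_for_split(model_bytes, directory=".", safety_factor=1.1):
    """Return True if `directory` has room for a second copy of the weights.

    Assumes the layer-wise copy is roughly as large as the original model,
    plus a margin given by `safety_factor` (both are assumptions).
    """
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= model_bytes * safety_factor


# A 70B-parameter model stored in 16-bit precision: 2 bytes per parameter.
model_bytes = 70 * 10**9 * 2
print(has_room_for_split(model_bytes))
```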

What are the benefits of using AirLLM?

By using AirLLM, you have the following advantages:

  • Access to the state-of-the-art LLMs: You can use the most advanced LLMs for your NLP tasks, such as question answering, text generation, text summarization, text classification, and more. You can choose from a variety of models that suit your needs and preferences, such as domain-specific, multilingual, or instruction-tuned models.
  • Low memory requirements: You don’t need to worry about out-of-memory errors or expensive cloud computing resources. You can run inference on a single 4GB GPU card, or even on a CPU or a Mac device.
  • Easy and intuitive usage: You can use AirLLM as a drop-in replacement for the regular transformer models, with minimal code changes.

What are the drawbacks of using AirLLM?

As previously discussed, AirLLM loads only the necessary layer from disk when executing that layer, and then completely frees the memory. However, loading layers sequentially from slower storage such as disk increases the latency of the inference process. If the SSD reads at 4GB/s and the model is 80GB, then you'll wait about 20 seconds for the generation of just one token, because each token needs a full pass over all the layers.
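The arithmetic is easy to reproduce. The helper below is a back-of-the-envelope estimate that only counts disk reads and ignores compute time and any caching, so treat the numbers as a rough lower bound on latency; the function names are made up for this example.

```python
def seconds_per_token(model_size_gb, disk_read_gb_per_s):
    """Time to stream the whole model from disk once, i.e. one full pass."""
    return model_size_gb / disk_read_gb_per_s


def generation_time_s(model_size_gb, disk_read_gb_per_s, new_tokens):
    """Each generated token needs its own full pass over all layers."""
    return new_tokens * seconds_per_token(model_size_gb, disk_read_gb_per_s)


per_token = seconds_per_token(80, 4)     # the 80GB model and 4GB/s SSD from the text
total = generation_time_s(80, 4, 20)     # 20 new tokens, as in the earlier snippet
print(per_token, total)
```

With these figures, generating the 20 new tokens requested in the earlier snippet would take several minutes of disk reads alone, which is the price paid for the small memory footprint.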

Colab Notebook

The following notebook shows how to run Platypus, LLaMa2, Mistral and other LLMs with AirLLM:

Want to learn more about AirLLM?

If you are interested in learning more about AirLLM, you can check out the official GitHub repository, where you can find the source code, the installation instructions, the configuration options, the full list of supported models, FAQs, and more.

Thanks for reading and happy coding!

This story is published on Generative AI.