
Walking Through Text Generation with Llama-2 Using Transformers.jl


Julia is no longer a niche language. It has flourished, thanks to a vibrant community that consistently releases new packages across a variety of fields. Among these packages, Transformers.jl has captured my attention for its simplicity and elegant way of approaching complex NLP tasks, such as text generation.

Peter, the main contributor of Transformers.jl, has prepared two very well-structured example Jupyter notebooks that serve as a primer for working with the Dolly and Llama-2 models. In this article, I offer a step-by-step walkthrough of the Llama-2 notebook, with commentary that extends beyond the scope of Peter’s original. To set the stage, let’s first explore an overview of Llama-2’s capabilities and limitations.

Llama-2 Model Overview

The Llama-2 model family stands at the forefront of the latest generation of openly available Large Language Models (LLMs), pre-trained and fine-tuned to excel in text generation. It comes in three sizes (7B, 13B and 70B parameters), and the fine-tuned chat variants, such as the 7B chat model used in this article, are optimized for dialogue applications. Developed by Meta, Llama-2 models are designed to be safe in interactions.

The model was trained in 2023 exclusively on data available to the public, ensuring that no Meta user data was included. The pre-training corpus consisted of 2 trillion tokens, while the fine-tuning process incorporated over one million human-annotated examples for better performance and safety alignment.

Performance and Safety

In terms of safety benchmarks, the Llama-2 models showed impressive results: the fine-tuned 7B chat model scored 57.04% on TruthfulQA (higher is better) and 0% on ToxiGen (lower is better), reflecting a low likelihood of generating toxic or harmful content.

Intended Use and Limitations

However, despite these advances, Llama-2 comes with its own set of limitations. Llama-2 is intended for commercial and research use in English; use in other languages is outside its intended scope. Moreover, while testing has shown promising results, it’s important to note that Llama-2, like any AI model, can produce inaccurate or biased content, so it should be used with caution.

Let’s move on to explore the code!

The Llama2 Full Example Code

In the code cell below, you’ll find the complete code from Peter’s Llama2 example Jupyter notebook.

using Transformers
using CUDA
using Transformers.HuggingFace
using Flux
using StatsBase
using Transformers.TextEncoders
using StringViews
using HuggingFaceApi

CUDA.devices()

CUDA.device!(1)

CUDA.allowscalar(false)
enable_gpu(true)

access_token = ""

# This saves the access token to disk, so that every later call that
# downloads a file from the HuggingFace hub uses this token.
using HuggingFaceApi
HuggingFaceApi.save_token(access_token)

# or call those `load` function with `auth_token` keyword argument
# like this:
HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token)



textenc = hgf"meta-llama/Llama-2-7b-chat-hf:tokenizer"
model = todevice(hgf"meta-llama/Llama-2-7b-chat-hf:ForCausalLM") # move to gpu with `todevice` (or `Flux.gpu`)



function temp_softmax(logits; temperature = 1.2)
    return softmax(logits ./ temperature)
end

function top_k_sample(probs; k = 1)
    sorted = sort(probs, rev = true)
    indexes = partialsortperm(probs, 1:k, rev=true)
    index = sample(indexes, ProbabilityWeights(sorted[1:k]), 1)
    return index
end



function generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym)
    encoded = encode(textenc, context).token
    ids = encoded.onehots
    ends_id = lookup(textenc.vocab, ends)
    for i in 1:max_length
        input = (; token = encoded) |> todevice
        outputs = model(input)
        logits = @view outputs.logit[:, end, 1]
        probs = temp_softmax(logits; temperature)
        new_id = top_k_sample(collect(probs); k)[1]
        push!(ids, new_id)
        new_id == ends_id && break
    end
    return decode(textenc, encoded)
end



function generate(textenc, model, instruction; max_length = 512, k = 1, temperature = 1.2)
    prompt = """
    [INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>>

    $instruction [/INST]
    
    """
    text_token = generate_text(textenc, model, prompt; max_length, k, temperature)
    for (i, token) in enumerate(text_token)
        m = match(r"<0x([0-9A-F]{2})>", token)
        if !isnothing(m)
            c = parse(UInt8, m.captures[]; base=16)
            text_token[i] = StringView([c])
        end
    end
    gen_text = replace(join(text_token), '▁'=>' ')
    println(gen_text)
end

generate(textenc, model, "Can you explain to me briefly what is the Julia programming language?")

Importing Packages

Here are all the necessary packages, including the two Transformers.jl modules, HuggingFace and TextEncoders, utilized in Peter’s notebook. I will reference them as they become relevant in the code below.

using Transformers
using CUDA
using Transformers.HuggingFace
using Flux
using StatsBase
using Transformers.TextEncoders
using StringViews
using HuggingFaceApi

Setting up the GPU devices

CUDA.devices()
CUDA.device!(1)

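The first line lists all the GPU devices available on your machine or service, and the second line selects the one to use. Note that CUDA.jl numbers devices from 0, so CUDA.device!(1) selects the second GPU; on a single-GPU machine, use CUDA.device!(0) or omit the line.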

Disabling Scalar Indexing

CUDA.allowscalar(false)
enable_gpu(true)

To make the most of the hardware, as is usual for this kind of workload, Peter employs CUDA, a parallel computing platform and programming model created by NVIDIA. This section of the code configures how the GPU(s) are used:

CUDA.allowscalar(false): This command disallows scalar indexing of GPU arrays, that is, reading or writing individual elements one at a time from the CPU. Such element-by-element access is very slow, because GPUs are designed to operate in parallel on large sets of data, so with this setting any code that falls back to it raises an error instead of silently becoming a performance bottleneck. In practice, it keeps our operations vectorized over whole vectors or matrices.
enable_gpu(true): This function from Transformers.jl makes todevice() move data and models to the GPU rather than leaving them on the CPU.

In summary, these lines of code set up our environment to utilize GPU acceleration, enabling parallel processing for demanding tasks like text generation.

Accessing the Llama-2 Model with an Access Token

access_token = ""
HuggingFaceApi.save_token(access_token)

This snippet sets up authentication with the HuggingFace API.

access_token = "": Here is where you’ll insert your unique personal access token from HuggingFace. This token is a unique identifier that grants access to the API, acting as a password.
HuggingFaceApi.save_token(access_token): This line calls the method save_token() from the module named HuggingFaceApi, passing in the access_token variable as an argument. The purpose of this method is to save the access token in the configuration of the API client. This is necessary because the token tells the HuggingFace service who is accessing the API and ensures that the user has the correct permissions.

Without proper authentication, the API would reject your requests to ensure the security and privacy of its users and resources.
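
If you'd rather not paste the token into the notebook itself, one option is to read it from an environment variable. The snippet below is only a minimal sketch: the variable name HF_TOKEN is my assumption for illustration, not something the notebook or HuggingFaceApi prescribes.

```julia
# HF_TOKEN is an assumed variable name for this sketch; export it in your shell first.
access_token = get(ENV, "HF_TOKEN", "")

if isempty(access_token)
    @warn "HF_TOKEN is not set; gated models such as Llama-2 will fail to download"
else
    println("Token loaded (", length(access_token), " characters)")
end
```

The resulting access_token can then be passed to HuggingFaceApi.save_token() exactly as in the notebook.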

Loading the Model and Tokenizer

textenc = hgf"meta-llama/Llama-2-7b-chat-hf:tokenizer"
model = todevice(hgf"meta-llama/Llama-2-7b-chat-hf:ForCausalLM")

The tokenizer and the model are loaded with the hgf"..." string macro from HuggingFace's model hub, which is a well-known repository of pre-trained models. In the identifier meta-llama/Llama-2-7b-chat-hf:tokenizer, the part before the colon names the repository and the part after it selects what to load: the tokenizer in the first line and the ForCausalLM model in the second. The ForCausalLM suffix denotes the model together with its language-modeling head, which predicts the next token from the preceding ones (i.e., causal language modeling). todevice() then moves the model to the selected GPU device.

Helper Functions for Text Generation

The temp_softmax() softmax function

function temp_softmax(logits; temperature = 1.2)
    return softmax(logits ./ temperature)
end

temp_softmax() is a custom softmax function whose temperature parameter scales the model's logits to adjust the creativity of the outputs. Softmax is a common tool in Machine Learning that converts logits (raw prediction scores) into probabilities that sum to 1. Dividing the logits by the temperature before applying softmax tunes the model’s prediction confidence: a temperature above 1 flattens the distribution and promotes varied responses, while a temperature below 1 sharpens it and favors the most likely tokens.
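
To see the effect of the temperature on actual numbers, here is a small sketch that runs without any packages. The my_softmax helper and the three logits are made up for illustration; the notebook itself relies on the softmax that Flux provides.

```julia
# Plain softmax written out by hand so the sketch needs no packages.
function my_softmax(x)
    e = exp.(x .- maximum(x))  # subtract the maximum for numerical stability
    return e ./ sum(e)
end

# Same idea as the notebook's function, built on the hand-written softmax.
temp_softmax(logits; temperature = 1.2) = my_softmax(logits ./ temperature)

logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary

for t in (0.5, 1.0, 2.0)
    probs = temp_softmax(logits; temperature = t)
    println("temperature = ", t, " -> ", round.(probs; digits = 3))
end
```

As the temperature rises, the probability of the top-scoring token shrinks and the other tokens gain ground, which is where the extra variety comes from.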

The top_k_sample() function

function top_k_sample(probs; k = 1)
    sorted = sort(probs, rev = true)
    indexes = partialsortperm(probs, 1:k, rev=true)
    index = sample(indexes, ProbabilityWeights(sorted[1:k]), 1)
    return index
end

The top_k_sample() function implements top-k sampling, which restricts the model's choices to the k most likely next tokens. When k is set to 1, the function performs greedy decoding, always choosing the most likely next token. If k is larger, one of the top k tokens is drawn at random, weighted by its probability, so the output becomes more varied.
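
The following sketch shows the same idea using only Base and the Random standard library. It is a hypothetical re-implementation for illustration (top_k_sample_base is my own name, and the probabilities are invented), not the notebook's code, which uses StatsBase for the weighted draw.

```julia
using Random

# Hypothetical re-implementation of top-k sampling with Base and Random only.
function top_k_sample_base(rng, probs; k = 1)
    top = partialsortperm(probs, 1:k; rev = true)  # indices of the k largest entries
    weights = probs[top] ./ sum(probs[top])        # renormalise over the survivors
    r = rand(rng)
    cumulative = 0.0
    for (idx, w) in zip(top, weights)
        cumulative += w
        r <= cumulative && return idx
    end
    return last(top)  # guard against floating-point round-off
end

probs = [0.05, 0.60, 0.10, 0.25]  # made-up distribution over four tokens
rng = MersenneTwister(42)

greedy = top_k_sample_base(rng, probs; k = 1)  # always index 2, the most likely token
draws = [top_k_sample_base(rng, probs; k = 2) for _ in 1:1000]

println("k = 1 picks index ", greedy)
println("k = 2 picks from indices ", sort(unique(draws)))
```

With k = 2, only the two most likely tokens (indices 2 and 4 here) can ever be chosen; the remaining tokens are cut off no matter how the random draw falls.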

Both of these functions are techniques for controlling the output of generative models to balance between randomness (or creativity) and the likelihood of the text. The use of these methods can significantly affect the style of the text generated by the model.

Generating Text

function generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym)
    encoded = encode(textenc, context).token
    ids = encoded.onehots
    ends_id = lookup(textenc.vocab, ends)
    for i in 1:max_length
        input = (; token = encoded) |> todevice
        outputs = model(input)
        logits = @view outputs.logit[:, end, 1]
        probs = temp_softmax(logits; temperature)
        new_id = top_k_sample(collect(probs); k)[1]
        push!(ids, new_id)
        new_id == ends_id && break
    end
    return decode(textenc, encoded)
end

The generate_text() function generates text by taking a prompt as input and running the model to produce output, respecting a maximum length and using the above helper functions for sampling.

Let's break down the function to understand how it works:

1. generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym): The signature of the function has several parameters:

textenc: The tokenizer for encoding text to a format the model can process.
model: The actual language model that will generate the predictions.
context: An optional parameter that serves as a starting point or prompt for the text generation.
max_length: The maximum number of new tokens to generate.
k: The parameter for the top-k sampling strategy.
temperature: The temperature passed to the temp_softmax() function.
ends: The end-of-sequence symbol used to determine when to stop generating text.

2. encoded = encode(textenc, context).token: The input context is tokenized and encoded into a format that the model can understand, here a one-hot array with one entry per token.

3. ids = encoded.onehots: This takes a reference to the storage behind the one-hot array. Judging by how the loop uses it, ids and encoded share that storage, so appending a token to ids also lengthens encoded, which is what gets fed to the model on the next iteration.

4. ends_id = lookup(textenc.vocab, ends): This looks up the token ID for the end-of-sequence symbol in the tokenizer's vocabulary.

5. for i in 1:max_length: The function enters a loop that will continue until it either reaches the maximum length for the generated text or it encounters the end-of-sequence symbol.

6. input = (; token = encoded) |> todevice: The current input is prepared for the model, which includes the tokens generated so far. It is then moved to the device, the GPU, for computation.

7. outputs = model(input): The model generates logits (raw output scores) for the next token in the sequence based on the input.

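8. logits = @view outputs.logit[:, end, 1]: From the model output, which holds one score per vocabulary entry for every position, this takes a view of the scores at the last position of the first (and only) sequence in the batch. These are the logits for the next token.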

9. probs = temp_softmax(logits; temperature): The custom softmax function defined above is applied to the logits using the specified temperature, to get a probability distribution.

10. new_id = top_k_sample(collect(probs); k)[1]: The probabilities are copied into an ordinary (CPU) array with collect(), and the top-k sampling function is applied to them to select the next token ID.

11. push!(ids, new_id): The new token ID is added to the list of generated token IDs.

12. new_id == ends_id && break: If the generated token is the end-of-sequence symbol, the loop is exited.

13. return decode(textenc, encoded): Finally, the encoded sequence (the prompt plus the generated tokens) is decoded back into a list of token strings, which is returned as the output of the function.

Essentially, the function generates one token at a time and appends it to the output until it reaches the maximum length or the end-of-sequence symbol.
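
The steps above can be sketched with a toy example that runs on any machine. Everything here is a made-up stand-in: toy_model replaces Llama-2 with a rule that always favours "previous id + 1", and toy_generate picks tokens greedily, but the logits slicing and the stopping logic mirror the loop in generate_text().

```julia
# Toy stand-in for the language model: returns logits of size
# (vocabulary, sequence length, batch), favouring "previous id + 1".
function toy_model(ids::Vector{Int}; vocab_size = 5)
    logits = zeros(Float64, vocab_size, length(ids), 1)
    for (pos, id) in enumerate(ids)
        logits[min(id + 1, vocab_size), pos, 1] = 10.0
    end
    return logits
end

function toy_generate(prompt_ids::Vector{Int}; max_length = 10, ends_id = 5)
    ids = copy(prompt_ids)
    for _ in 1:max_length
        outputs = toy_model(ids)
        logits = @view outputs[:, end, 1]  # scores for the last position only
        new_id = argmax(logits)            # greedy choice, like k = 1
        push!(ids, new_id)                 # the sequence grows by one token
        new_id == ends_id && break         # stop at the end-of-sequence id
    end
    return ids
end

println(toy_generate([1, 2]))  # prints [1, 2, 3, 4, 5]
```

The loop stops after three iterations here because the end-of-sequence id 5 is produced; with a smaller max_length it would stop earlier, exactly like the real function.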

Processing the Output

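The final code snippet of the notebook defines a new custom function generate() that wraps the instruction in the Llama-2 chat prompt, calls generate_text(), cleans up the returned tokens, and prints the result. The cleanup has two parts: tokens of the form <0xNN> stand for a single raw byte and are replaced by that byte, and the '▁' marker that the tokenizer uses for a leading space is turned back into an ordinary space.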

function generate(textenc, model, instruction; max_length = 512, k = 1, temperature = 1.2)
    prompt = """
    [INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>>

    $instruction [/INST]
    
    """
    text_token = generate_text(textenc, model, prompt; max_length, k, temperature)
    for (i, token) in enumerate(text_token)
        m = match(r"<0x([0-9A-F]{2})>", token)
        if !isnothing(m)
            c = parse(UInt8, m.captures[]; base=16)
            text_token[i] = StringView([c])
        end
    end
    gen_text = replace(join(text_token), '▁'=>' ')
    println(gen_text)
end
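
To make the token cleanup at the end of generate() easier to follow, here is a standalone sketch that applies the same two steps to a hand-written token list. The detokenize helper and the sample tokens are hypothetical, and I use String instead of StringView so that the snippet needs no packages.

```julia
# Hypothetical helper: turn raw Llama-2 style tokens into readable text.
function detokenize(tokens::Vector{String})
    pieces = map(tokens) do token
        m = match(r"^<0x([0-9A-F]{2})>$", token)
        # Byte tokens such as "<0x0A>" stand for one raw byte (here a newline).
        isnothing(m) ? token : String([parse(UInt8, m.captures[1]; base = 16)])
    end
    # '▁' (U+2581) is the marker the tokenizer uses for a leading space.
    return replace(join(pieces), '▁' => ' ')
end

tokens = ["▁Julia", "▁is", "▁fast", ".", "<0x0A>", "▁Try", "▁it", "!"]
println(detokenize(tokens))
```

Because the pieces are joined before being displayed, a character that the tokenizer split into several byte tokens is reassembled correctly once all of its bytes sit next to each other.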

Example Generation

generate(textenc, model, "Can you explain to me briefly what is the Julia programming language?")

The last line of the notebook calls the custom generation function generate() with a sample question, demonstrating a practical application of the model. And there you have it! This final piece of code brings the Llama-2 model to life: run the cell and, once generation finishes, the model's answer is printed in full. Happy coding!

If you’re interested in deepening your understanding of programming concepts in Julia, R and Python, consider clapping this post, following me here and even subscribing to my YouTube channel. I regularly share content that could be a valuable resource for your learning journey. Essentially, I channel my passion for these languages into sharing insights from my own learning and usage journey.