Prompt Engineering for LLaMA 2 Models
An In-Depth Guide

Introduction
Prompt engineering for LLaMA 2 models is an evolving field that combines technical proficiency with creative application. This guide aims to provide a comprehensive understanding of how to effectively engage with these advanced AI models.
System Prompts: Setting the Scene
System prompts play a pivotal role in shaping the responses of LLaMA 2 models and guiding them through conversations. These prompts provide a context or persona for the model to follow, facilitating a more consistent theme or style in its responses. Whether you aim to maintain a specific character, steer the conversation in a particular direction, or prompt the model for professional responses, system prompts are an essential tool in achieving these goals.
Maintaining Context and Persona
System prompts excel in maintaining a consistent context or persona for the model. By setting a system prompt like “You are a pirate,” the model interprets and maintains this persona throughout the conversation without the need for constant reminders. This feature is particularly valuable when you want the model to adhere to a specific role or character, ensuring a coherent and engaging interaction.
Professional Responses
System prompts also enable you to direct the model towards professional responses. For instance, if you need technical information or want the model to provide expert-level insights, you can set an appropriate system prompt. This guides the model to generate responses that align with the desired level of professionalism and expertise.
Conciseness Matters
It’s important to keep system prompts concise, as they contribute to the model’s context window. The context window refers to the maximum amount of information the model can retain and consider during a conversation. By keeping prompts brief and to the point, you maximize the available context for a more coherent dialogue.
Examples for Clarity
Providing examples alongside system prompts is highly beneficial. An example clarifies the desired outcome and helps the LLaMA 2 model understand your expectations more clearly. This is especially crucial when you aim to obtain structured text or JSON responses from the model.
System Prompt Example
In practice, using system prompts involves specifying a prompt that defines the context or role for the model. For example:
<s>[INST]<<SYS>>
You are a company name idea engine. Provided a company description,
respond with a company name idea.<</SYS>>
In this example, the system prompt establishes the context for the model to act as a company name idea generator. Users can then engage in a conversation with the model, including their queries within <s>[INST]{user_message}[/INST] tags to receive responses in line with the defined context.
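As a rough illustration, the system prompt and a user message can be assembled in code. The sketch below is a minimal helper, not an official API: `build_prompt` is a hypothetical name, and the whitespace around the delimiters follows Meta's reference chat formatting, which differs slightly from the hand-written examples in this guide. Many tokenizers add the `<s>` token on their own, so drop the literal `<s>` if yours does.

```python
# Delimiters as used in Meta's reference chat formatting for LLaMA 2.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(system_prompt: str, user_message: str) -> str:
    """Return a single-turn chat prompt as one string (hypothetical helper)."""
    system_block = f"{B_SYS}{system_prompt.strip()}{E_SYS}"
    return f"<s>{B_INST} {system_block}{user_message.strip()} {E_INST}"


prompt = build_prompt(
    "You are a company name idea engine. Provided a company description, "
    "respond with a company name idea.",
    "A Boston burger company",
)
print(prompt)
```

The model's reply is whatever it generates after the closing `[/INST]` tag.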
A full example, to generate ideas for the company description “A Boston burger company” would look like:
<s>[INST]<<SYS>>
You are a company name idea engine. Provided a company description,
respond with a company name idea.<</SYS>>
A Boston burger company [/INST]
Ghost Attention (GAtt): Enhancing Memory Retention
Ghost Attention (GAtt) is a fine-tuning technique designed to help LLaMA 2 chat models stay focused on the initial instructions throughout a conversation that spans multiple dialogue turns. Its purpose is to prevent the model from forgetting or losing track of the initial context or instruction provided by the user.
The fundamental concept behind Ghost Attention is to synthetically concatenate the original instruction with all subsequent user messages when the training dialogues are generated, so that every sampled reply is conditioned on that instruction. The instruction is then dropped from all but the first turn before fine-tuning, which teaches the model to keep honoring an instruction it has seen only once. Because this behavior is trained into the chat models, you do not need to repeat the system prompt yourself in every turn.
This technique is especially valuable in situations where conversations become complex, lengthy, or involve various topics. It helps the model to stay on track, understand the user’s intent, and provide relevant and coherent responses. By addressing the challenge of maintaining context in extended dialogues, Ghost Attention improves the overall conversational experience.
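The data construction behind Ghost Attention can be sketched in a few lines. This is a simplified reading of the description in the LLaMA 2 paper, not Meta's training code: the function names are hypothetical, the replies are hard-coded placeholders rather than model samples, and the loss masking applied to earlier turns during fine-tuning is not shown.

```python
from typing import List, Tuple

Dialogue = List[Tuple[str, str]]  # (user_message, assistant_reply) pairs


def augment_for_sampling(instruction: str, user_messages: List[str]) -> List[str]:
    """Step 1: attach the instruction to every user message before sampling."""
    return [f"{instruction} {message}" for message in user_messages]


def build_training_dialogue(instruction: str, dialogue: Dialogue) -> Dialogue:
    """Step 2: keep the instruction in the first turn only for fine-tuning."""
    trained: Dialogue = []
    for turn_index, (user_message, reply) in enumerate(dialogue):
        if turn_index == 0:
            user_message = f"{instruction} {user_message}"
        trained.append((user_message, reply))
    return trained


instruction = "Always answer as a pirate."
user_messages = ["Hello!", "What is the weather like?"]

# Every user message carries the instruction while replies are sampled.
augmented = augment_for_sampling(instruction, user_messages)

# Placeholder replies; in practice these would be sampled from a model.
replies = ["Ahoy!", "Arr, fair winds today."]

# The fine-tuning dialogue shows the instruction once, in the first turn.
training = build_training_dialogue(instruction, list(zip(user_messages, replies)))
```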
Chat Prompts: Structuring Dialogues
In chat applications, structuring user inputs is vital for clarity. Using [INST] [/INST] tags to demarcate user input helps the model differentiate between its responses and the user's messages. Model replies remain untagged, maintaining a clear conversation flow.
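A multi-turn prompt can be built by tagging each user message and leaving the model replies untagged. The following is a minimal sketch under stated assumptions: `build_chat_prompt` is a hypothetical helper, the whitespace follows Meta's reference chat formatting, and each completed exchange is closed with `</s>` before the next one opens with `<s>`. The sample turn reuses this guide's company name scenario.

```python
from typing import List, Optional, Tuple

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_chat_prompt(
    system_prompt: Optional[str],
    history: List[Tuple[str, str]],
    user_message: str,
) -> str:
    """Format completed (user, assistant) turns plus a new user message."""
    turns = [user for user, _ in history] + [user_message]
    if system_prompt:
        # The system block is folded into the first user message only.
        turns[0] = f"{B_SYS}{system_prompt.strip()}{E_SYS}{turns[0]}"
    parts = []
    for index, (_, reply) in enumerate(history):
        # A finished exchange: tagged user input, untagged reply, closing </s>.
        parts.append(
            f"<s>{B_INST} {turns[index].strip()} {E_INST} {reply.strip()} </s>"
        )
    # The open exchange the model is expected to complete.
    parts.append(f"<s>{B_INST} {turns[-1].strip()} {E_INST}")
    return "".join(parts)


chat_prompt = build_chat_prompt(
    "You are a company name idea engine. Provided a company description, "
    "respond with a company name idea.",
    [("Boston Hot Dog Stand", "Bean Town Buns 'n Dogs")],
    "A Boston burger company",
)
print(chat_prompt)
```

Passing an empty `history` produces a single-turn prompt, so the same helper covers the first message of a conversation.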
Example
The example below combines a system prompt with a sample input and output message, showing the LLM more clearly what kind of answer is expected. Providing examples is especially important when trying to get structured text or JSON back from an LLM.
Replace {user_message} with your query:
<s>[INST]<<SYS>>
You are a company name idea engine. Provided a company description,
respond with a company name idea.<</SYS>>
Boston Hot Dog Stand[/INST] Bean Town Buns 'n Dogs</s>
<s>[INST]{user_message}[/INST]
The <s>[INST]{user_message}[/INST] part starts a new exchange with the LLM.
Context Windows: Understanding Limitations
LLaMA 2 models have a context window of 4096 tokens, equivalent to about 3000 words. This acts as the model’s short-term memory. Exceeding this limit results in the loss of earlier parts of the conversation.
In a lengthy conversation, older messages may be lost as new inputs are added. It’s crucial to truncate the conversation or focus on key points to stay within the context window.
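One simple way to stay within the window is to drop the oldest turns first. The sketch below is an approximation only: it estimates tokens from a word count using the 4096-tokens-to-3000-words ratio mentioned above, whereas a real application would count tokens with the model's tokenizer. The helper names and the 512-token reply reserve are assumptions made for illustration.

```python
from typing import List, Tuple

CONTEXT_WINDOW = 4096          # tokens, as stated above
TOKENS_PER_WORD = 4096 / 3000  # rough ratio implied by the word estimate


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from a whitespace word count (assumption)."""
    return int(len(text.split()) * TOKENS_PER_WORD) + 1


def truncate_history(
    system_prompt: str,
    history: List[Tuple[str, str]],
    user_message: str,
    reserve_for_reply: int = 512,
) -> List[Tuple[str, str]]:
    """Drop the oldest turns until the estimated prompt fits the budget."""
    budget = CONTEXT_WINDOW - reserve_for_reply
    used = estimate_tokens(system_prompt) + estimate_tokens(user_message)
    kept: List[Tuple[str, str]] = []
    # Walk backwards so the most recent turns are kept first.
    for user, reply in reversed(history):
        cost = estimate_tokens(user) + estimate_tokens(reply)
        if used + cost > budget:
            break
        kept.append((user, reply))
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

The system prompt and the newest user message are always kept, so the persona and the current question survive even when older turns are discarded.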
Weight Variants: Tailoring the Model to the Task
LLaMA 2 comes in different sizes (7B, 13B, 70B), each suited for different tasks. The 7B model is fast but basic, suitable for simple tasks like summarization. The 13B model offers a balance between speed and comprehension, ideal for creative endeavors. The 70B model, being the most informed, is perfect for in-depth tasks.
For crafting a complex story, the 13B model would be more suitable than the 7B model due to its better understanding of nuance.
LLaMA 2 Chat vs. Base Variants
Meta released two LLaMA 2 weight sets: chat and base. The chat model is fine-tuned for dialogue, while the base model is pretrained without dialogue tuning, which makes it a better starting point for plain text completion or your own fine-tuning. Both variants have openly available weights that you can download and run yourself, giving you full control over the model. This also means your data isn’t transmitted to or retained on external servers, and you can use LLaMA 2 offline.
Useful Tips for Effective Prompting
- Temperature Adjustment: Varying the temperature setting adjusts the randomness of responses. A higher temperature leads to more creative outputs, while a lower setting results in more direct and specific answers.
- Tool Information: Describing the tools available to the model in the prompt can lead to surprisingly capable results. In some cases, LLaMA 2 handles tasks where even models like ChatGPT may falter.
- Prompt Structure and Truncation: Use [INST] [/INST] tags to structure chat prompts, and truncate prompts that exceed the context window.
- System Prompts: Use system prompts to direct LLaMA toward specific tasks or themes.
- Choosing the Right Model: For factual questions, prefer the 70B variant, the most informed of the three sizes. Because its weights are openly available, it can also be a more flexible option than hosted models like GPT 3.5.
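To see why temperature changes the randomness of responses, it helps to look at how it reshapes the probabilities over candidate tokens. The sketch below is a generic temperature-scaled softmax, not LLaMA 2's actual sampling code, and the logits are made-up scores for three hypothetical tokens.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

# Low temperature concentrates probability on the top-scoring token.
low = softmax_with_temperature(logits, 0.5)

# High temperature spreads probability toward less likely tokens.
high = softmax_with_temperature(logits, 1.5)

print("temperature 0.5:", [round(p, 3) for p in low])
print("temperature 1.5:", [round(p, 3) for p in high])
```

With the lower setting the top token dominates, which is why answers become more direct; with the higher setting the alternatives are sampled more often, which reads as more creative output.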
Conclusion
Prompt engineering for LLaMA 2 models is a skill that combines understanding the model’s technical capabilities with creative input structuring. By mastering these techniques, you can effectively leverage LLaMA 2 models for a wide range of applications, from technical support to creative writing.