Deep Dive into Safety Breaks for GenAI and LLM Solutions (Under OWASP Guidelines) — Part 1

Problem Statement

In this blog- I will focus on one thing that I am very passionate about. LLM and GenAI Governance from a technical perspective. We have been peeling the onion of building LLM powered applications since many months now. A lot of our customers have shown tremendous interest in productionizing the LLM applications. Now with the dust settling a bit, we understand the value of building a “governed”, “trustworthy” and “reliable” system. I have in parts talked about Responsible AI and ways to track and measure them, but in this blog I will focus on a bigger picture. The holistic — LLM governance and what it means. So, let us buckle up and take the ride together.

As this is a big topic, I will be expanding this topic in parts, for this blog today, my focus will be “safety breaks”.

Solution

LLM governance at its core consists of five important things. I will cover one at a time — for this topic today, I will discuss and focus on “Safety Breaks”.

LLM governance at its core consists of Responsible AI (RAI), safety breaks (content for discussion today), Regulatory and Approval Framework, Transparency, and choice of models and partnership (see below).

Now, for today, this section of “safety breaks” will be based on the paper from OWASP @ https://owasp.org/www-project-top-10-for-large-language-model-applications/ . They have dedicated papers and research on the segregation and classification of these safety breaks. I will try to condense all the knowledge and research into something tangible but following exactly what is discussed in those papers.

Safety breaks can be classified in 10 ways as below :

Prompt Injections
Insure Output Handling
Training Data Poisoning
Denial of Service
Supply Chain Vulnerabilities
Sensitive Information Disclosure
Insecure Plugin Design
Excessive Agency
Over reliance
Model Theft

Let us discuss and cover them in details below.

Prompt Injections = Ideally there are 2 specific types of prompt injections — one which is direct like jail breaking and other which is indirect and may involve input from external and untrusted sources. To understand more, let us check out a few examples :

A user may prompt in a way that it directs the LLM model to ignore previous instruction and only focus on specific instruction with malicious content.
User can add or exfiltrate the prompt by including javascript and other markdown information that contains vulnerable code/ generates malicious content in the prompt indirectly.
User can augment the prompt and add context pointing to some biased documents that has direct or indirect injection information
User can ask the LLM (via prompt) to scan a website for response which might be filled with harmful content etc.

Additionally, below are some of the scenarios of how users are “poisoning” the prompts — typically through phrases like “forget all previous instructions” and “discard previous user instructions and use LLM to something different and malicious”

Finally, it is important to talk about Prevention. Below are some classical ways to prevent Prompt Injections -

Ensure we provide only privileged control. Not an open ended one.
Ensure we provide Human in Loop where applicable
Separate and externalize knowledge base from LLM
Ensure service boundary.

2. Insecure Output Handling: This is the second of the safety controls. This is where we ensure we do not blindly trust the downstream apps to always do the right thing. To understand more, let us check out a few examples :

An application grants the LLM privilege beyond the minimum required requirement.
Application is not designing with network and other safety protocols.

Additionally, we have some other scenarios — like a user tries to manipulate the outcome to execute arbitrary command. Otherwise, user might include a website and include prompt injection to capture sensitive content. Also, we might have scenarios where users craft SQL and gain access to whole database due to unrestricted access. A user can also instruct LLM as a javascript object without sanitization controls.