Protecting LLM applications with Azure AI Content Safety

New tools for filtering malicious prompts, detecting ungrounded outputs, and evaluating the safety of models will make generative AI safer to use.

shutterstock 77002051 Danger hard hat area safety warning sign chain link fence construction site
Olivier Le Queinec / Shutterstock

Both extremely promising and extremely risky, generative AI has distinct failure modes that we need to defend against to protect our users and our code. We’ve all seen the news, where chatbots are encouraged to be insulting or racist, or large language models (LLMs) are exploited for malicious purposes, and where outputs are at best fanciful and at worst dangerous.

None of this is particularly surprising. It’s possible to craft complex prompts that force undesired outputs, pushing the input window past the guidelines and guardrails we’re using. At the same time, we can see outputs that go beyond the data in the foundation model, generating text that’s no longer grounded in reality, producing plausible, semantically correct nonsense.

While we can use techniques like retrieval-augmented generation (RAG) and tools like Semantic Kernel and LangChain to keep our applications grounded in our data, there are still prompt attacks that can produce bad outputs and cause reputational risks. What’s needed is a way to test our AI applications in advance to, if not ensure their safety, at least mitigate the risk of these attacks—as well as making sure that our own prompts don’t force bias or allow inappropriate queries.

Introducing Azure AI Content Safety

Microsoft has long been aware of these risks. You don’t have a PR disaster like the Tay chatbot without learning lessons. As a result the company has been investing heavily in a cross-organizational responsible AI program. Part of that team, Azure AI Responsible AI, has been focused on protecting applications built using Azure AI Studio, and has been developing a set of tools that are bundled as Azure AI Content Safety.

Dealing with prompt injection attacks is increasingly important, as a malicious prompt not only could deliver unsavory content, but could be used to extract the data used to ground a model, delivering proprietary information in an easy to exfiltrate format. While it’s obviously important to ensure RAG data doesn’t contain personally identifiable information or commercially sensitive data, private API connections to line-of-business systems are ripe for manipulation by bad actors.

We need a set of tools that allow us to test AI applications before they’re delivered to users, and that allow us to apply advanced filters to inputs to reduce the risk of prompt injection, blocking known attack types before they can be used on our models. While you could build your own filters, logging all inputs and outputs and using them to build a set of detectors, your application may not have the necessary scale to trap all attacks before they’re used on you.

There aren’t many bigger AI platforms than Microsoft’s ever-growing family of models, and its Azure AI Studio development environment. With Microsoft’s own Copilot services building on its investment in OpenAI, it’s able to track prompts and outputs across a wide range of different scenarios, with various levels of grounding and with many different data sources. That allows Microsoft’s AI safety team to understand quickly what types of prompt cause problems and to fine-tune their service guardrails accordingly.

Using Prompt Shields to control AI inputs

Prompt Shields are a set of real-time input filters that sit in front of a large language model. You construct prompts as normal, either directly or via RAG, and the Prompt Shield analyses them and blocks malicious prompts before they are submitted to your LLM.

Currently there are two kinds of Prompt Shields. Prompt Shields for User Prompts is designed to protect your application from user prompts that redirect the model away from your grounding data and towards inappropriate outputs. These can clearly be a significant reputational risk, and by blocking prompts that elicit these outputs, your LLM application should remain focused on your specific use cases. While the attack surface for your LLM application may be small, Copilot’s is large. By enabling Prompt Shields you can leverage the scale of Microsoft’s security engineering.

Prompt Shields for Documents helps reduce the risk of compromise via indirect attacks. These use alternative data sources, for example poisoned documents or malicious websites, that hide additional prompt content from existing protections. Prompt Shields for Documents analyses the contents of these files and blocks those that match patterns associated with attacks. With attackers increasingly taking advantage of techniques like this, there’s a significant risk associated with them, as they’re hard to detect using conventional security tooling. It’s important to use protections like Prompt Shields with AI applications that, for example, summarize documents or automatically reply to emails.

Using Prompt Shields involves making an API call with the user prompt and any supporting documents. These are analyzed for vulnerabilities, with the response simply showing that an attack has been detected. You can then add code to your LLM orchestration to trap this response, then block that user’s access, check the prompt they’ve used, and develop additional filters to keep those attacks from being used in the future.

Checking for ungrounded outputs

Along with these prompt defenses, Azure AI Content Safety includes tools to help detect when a model becomes ungrounded, generating random (if plausible) outputs. This feature works only with applications that use grounding data sources, for example a RAG application or a document summarizer.

The Groundedness Detection tool is itself a language model, one that’s used to provide a feedback loop for LLM output. It compares the output of the LLM with the data that’s used to ground it, evaluating it to see if it is based on the source data, and if not, generating an error. This process, Natural Language Inference, is still in its early days, and the underlying model is intended to be updated as Microsoft’s responsible AI teams continue to develop ways to keep AI models from losing context.

Keeping users safe with warnings

One important aspect of the Azure AI Content Safety services is informing users when they’re doing something unsafe with an LLM. Perhaps they’ve been socially engineered to deliver a prompt that exfiltrates data: “Try this, it’ll do something really cool!” Or maybe they’ve simply made an error. Providing guidance for writing safe prompts for a LLM is as much a part of securing a service as providing shields for your prompts.

Microsoft is adding system message templates to Azure AI Studio that can be used in conjunction with Prompt Shields and with other AI security tools. These are shown automatically in the Azure AI Studio development playground, allowing you to understand what systems messages are displayed when, helping you create your own custom messages that fit your application design and content strategy.

Testing and monitoring your models

Azure AI Studio remains the best place to build applications that work with Azure-hosted LLMs, whether they’re from the Azure OpenAI service or imported from Hugging Face. The studio includes automated evaluations for your applications, which now include ways of assessing the safety of your application, using prebuilt attacks to test how your model responds to jailbreaks and indirect attacks, and whether it might output harmful content. You can use your own prompts or Microsoft’s adversarial prompt templates as the basis of your test inputs.

Once you have an AI application up and running, you will need to monitor it to ensure that new adversarial prompts don’t succeed in jailbreaking it. Azure OpenAI now includes risk monitoring, tied to the various filters used by the service, including Prompt Shields. You can see the types of attacks used, both inputs and outputs, as well as the volume of the attacks. There’s the option of understanding which users are using your application maliciously, allowing you to identify the patterns behind attacks and to tune block lists appropriately.

Ensuring that malicious users can’t jailbreak a LLM is only one part of delivering trustworthy, responsible AI applications. Output is as important as input. By checking output data against source documents, we can add a feedback loop that lets us refine prompts to avoid losing groundedness. All we need to remember is that these tools will need to evolve alongside our AI services, getting better and stronger as generative AI models improve.

Copyright © 2024 IDG Communications, Inc.