Jailbreaking risk for AI

Last updated: Feb 07, 2025

Multi-category

Inference risks

New to generative AI

Description

A jailbreaking attack attempts to break through the guardrails that are established in the model to perform restricted actions.

Why is jailbreaking a concern for foundation models?

Jailbreaking attacks can be used to alter model behavior and benefit the attacker. If not properly controlled, business entities can face fines, reputational harm, and other legal consequences.

Background image for risks associated with inference

Example

Bypassing LLM guardrails

A study cited by researchers at Carnegie Mellon University, The Center for AI Safety, and the Bosch Center for AI, claim to have discovered a simple prompt addendum that allowed the researchers to trick models into generating biased, false, and otherwise toxic information. The researchers showed that they might circumvent these guardrails in a more automated way. These attacks were shown to be effective in a wide range of open source products, including ChatGPT, Google Bard, Meta’s LLaMA, Anthropic’s Claude, and others.

Sources:

The New York Times, July 2023

Parent topic: AI risk atlas

We provide examples covered by the press to help explain many of the foundation models' risks. Many of these events covered by the press are either still evolving or have been resolved, and referencing them can help the reader understand the potential risks and work towards mitigations. Highlighting these examples are for illustrative purposes only.

Description

Why is jailbreaking a concern for foundation models?

Related Risks