0 / 0
Jailbreaking risk for AI

Jailbreaking risk for AI

Risks associated with input
Inference
Multi-category
Amplified

Description

An attack that attempts to break through the guardrails established in the model is known as jailbreaking.

Why is jailbreaking a concern for foundation models?

Jailbreaking attacks can be used to alter model behavior and benefit the attacker. If not properly controlled, business entities can face fines, reputational harm, and other legal consequences.

Background image for risks associated with input
Example

Bypassing LLM guardrails

Cited in a study from researchers at Carnegie Mellon University, The Center for AI Safety, and the Bosch Center for AI, claims to have discovered a simple prompt addendum that allowed the researchers to trick models into answering dangerous or sensitive questions and is simple enough to be automated and used for a wide range of commercial and open-source products, including ChatGPT, Google Bard, Meta’s LLaMA, Vicuna, Claude, and others. According to the paper, the researchers were able to use the additions to reliably coax forbidden answers for Vicuna (99%), ChatGPT 3.5 and 4.0 (up to 84%), and PaLM-2 (66%).

Parent topic: AI risk atlas

Generative AI search and answer
These answers are generated by a large language model in watsonx.ai based on content from the product documentation. Learn more