Jailbreaking risk for AI
An attack that attempts to break through the guardrails established in the model is known as jailbreaking.
Why is jailbreaking a concern for foundation models?
Jailbreaking attacks can be used to alter model behavior in ways that benefit the attacker. If these attacks are not properly controlled, business entities can face fines, reputational harm, and other legal consequences.
Bypassing LLM guardrails
A study from researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI claims to have discovered a simple prompt addendum that allowed the researchers to trick models into answering dangerous or sensitive questions. The addendum is simple enough to be automated and used against a wide range of commercial and open-source products, including ChatGPT, Google Bard, Meta’s LLaMA, Vicuna, Claude, and others. According to the paper, the researchers were able to use the additions to reliably coax forbidden answers from Vicuna (99%), ChatGPT 3.5 and 4.0 (up to 84%), and PaLM-2 (66%).