Researchers have recently uncovered a groundbreaking AI jailbreak technique known as Bad Likert Judge, which significantly enhances the success rate of bypassing large language models' (LLMs) safety mechanisms. This innovative multi-turn (many-shot) attack strategy has been developed by Palo Alto Networks’ Unit 42 team, led by Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.
What is the 'Bad Likert Judge' Attack?
The Bad Likert Judge technique exploits the LLM’s ability to evaluate the harmfulness of generated content using the Likert scale—a psychometric scale designed to measure agreement or disagreement. In this attack, the LLM is asked to assess the harmfulness of responses based on Likert scale ratings, and then generate content that aligns with various levels of harmfulness. The response with the highest rating is more likely to contain malicious or harmful content.
Boosting Attack Success by Over 60%
This new jailbreak method, tested across a range of state-of-the-art LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA, showed a remarkable 60% increase in attack success rates (ASR) compared to traditional prompt-based attacks. The attack targets several sensitive categories, including:
- Hate speech
- Harassment
- Self-harm
- Sexual content
- Weapons
- Illegal activities
- Malware generation
- System prompt leakage
By leveraging the LLM's inherent ability to evaluate and judge the harmfulness of responses, the Bad Likert Judge technique can effectively bypass built-in safety guardrails designed to block harmful outputs.
The Importance of Content Filtering
The research highlights the critical role of content filtering in defending against such attacks. The tests showed that content filters could reduce the ASR by an average of 89.2% across all tested models, underlining the importance of comprehensive filtering when deploying LLMs in practical applications.
AI Prompt Injection and Jailbreaking
The rise in popularity of AI models has also led to the emergence of prompt injection—a category of security exploits that aim to subvert an AI's intended behavior through specially crafted instructions (prompts). One subtype, many-shot jailbreaking, takes advantage of an LLM's extended context window to craft a sequence of prompts that gradually nudge the model to generate malicious responses without triggering its internal defenses. Notable examples of this approach include Crescendo and Deceptive Delight.
Real-World Implications and Misleading AI Summaries
This new discovery comes just days after a report from The Guardian revealed that OpenAI's ChatGPT search tool could be manipulated to produce misleading summaries by summarizing web pages containing hidden content. In one instance, the inclusion of fake positive reviews influenced ChatGPT to return a biased, overly favorable assessment of a product, despite its negative reviews.
These vulnerabilities underscore the importance of robust AI security and content filtering mechanisms to protect against manipulation and harmful exploits.
Conclusion
As AI continues to evolve, so do the techniques used to exploit and manipulate its outputs. The Bad Likert Judge jailbreak method is a powerful reminder of the ongoing need for vigilance in AI safety. As AI systems become more integrated into critical real-world applications, ensuring their security and reliability becomes an absolute priority.