{"id":6771,"date":"2025-01-02T08:00:00","date_gmt":"2025-01-02T14:00:00","guid":{"rendered":"https:\/\/www.darkreading.com\/cyberattacks-data-breaches\/bad-likert-judge-jailbreak-bypasses-guardrails-openai-other-llms"},"modified":"2025-01-02T08:00:00","modified_gmt":"2025-01-02T14:00:00","slug":"bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms","status":"publish","type":"post","link":"https:\/\/ddi.mohflo.net\/index.php\/2025\/01\/02\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms\/","title":{"rendered":"&#8216;Bad Likert Judge&#8217; Jailbreak Bypasses Guardrails of OpenAI, Other Top LLMs"},"content":{"rendered":"<div class=\"media_block\"><a href=\"https:\/\/i0.wp.com\/eu-images.contentstack.com\/v3\/assets\/blt6d90778a997de1cd\/blt49da23b9d8c07245\/677682a742e93841f89d2f10\/LLM%281800%29_Krot_Studio_Alamy.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?w=640&#038;ssl=1\" class=\"media_thumbnail\"><\/a><\/div>\n<div><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?w=640&#038;ssl=1\" class=\"ff-og-image-inserted\"><\/div>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">A new jailbreak technique for OpenAI and other <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.darkreading.com\/vulnerabilities-threats\/llms-are-new-type-insider-adversary\">large language models (LLMs) <\/a><\/span><span class=\"ContentText 
ContentText_variant_bodyNormal\" data-testid=\"content-text\">increases the chance that attackers can circumvent cybersecurity guardrails and abuse the system to deliver malicious content.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Discovered by researchers at Palo Alto Networks&#8217; Unit 42, the so-called Bad Likert Judge attack asks the LLM to act as a judge scoring the harmfulness of a given response using the Likert scale. The psychometric scale, named after its inventor and commonly used in questionnaires, is a rating scale measuring a respondent&#8217;s agreement or disagreement with a statement.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">The jailbreak then asks the LLM to generate responses that contain examples that align with the scales, with the ultimate result being that &#8220;the example that has the highest Likert scale can potentially contain the harmful content,&#8221; Unit 42&#8217;s Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky wrote in a post describing their findings.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Tests conducted across a range of categories against six state-of-the-art text-generation LLMs from OpenAI, Azure, Google, Amazon Web Services, Meta, and Nvidia revealed that the technique can increase the attack success rate (ASR) by more than 60% compared with plain attack prompts on average, according to the researchers.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText 
ContentText_variant_bodyNormal\" data-testid=\"content-text\">The categories of attacks evaluated in the research involved prompting various inappropriate responses from the system, including: ones promoting bigotry, hate, or prejudice; ones engaging in behavior that harasses an individual or group; ones that encourage suicide or other acts of self-harm; ones that generate inappropriate explicitly sexual material and pornography; ones providing info on how to manufacture, acquire, or use illegal weapons; or ones that promote illegal activities.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Other categories explored and for which the jailbreak increases the likelihood of attack success include: malware generation or the creation and distribution of malicious software; and system prompt leakage, which could reveal the confidential set of instructions used to guide the LLM.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"How Bad Likert Judge Works\">How Bad Likert Judge Works<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">The first step in the Bad Likert Judge attack involves asking the target LLM to act as a judge to evaluate responses generated by other LLMs, the researchers explained.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;To confirm that the LLM can produce harmful content, we provide specific guidelines for the scoring task,&#8221; <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a 
class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/unit42.paloaltonetworks.com\/multi-turn-technique-jailbreaks-llms\/\">they wrote<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">. &#8220;For example, one could provide guidelines asking the LLM to evaluate content that may contain information on generating malware.&#8221;<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Once the first step is properly completed, the LLM should understand the task and the different scales of harmful content, which makes the second step &#8220;straightforward,&#8221; they said. &#8220;Simply ask the LLM to provide different responses corresponding to the various scales,&#8221; the researchers wrote.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;After completing step two, the LLM typically generates content that is considered harmful,&#8221; they wrote, adding that in some cases, &#8220;the generated content may not be sufficient to reach the intended harmfulness score for the experiment.&#8221;<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">To address the latter issue, an attacker can ask the LLM to refine the response with the highest score by extending it or adding more details. 
&#8220;Based on our observations, an additional one or two rounds of follow-up prompts requesting refinement often lead the LLM to produce content containing more harmful information,&#8221; the researchers wrote.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"Rise of LLM Jailbreaks\">Rise of LLM Jailbreaks<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">The <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.darkreading.com\/vulnerabilities-threats\/llms-raise-efficiency-productivity-of-cybersecurity-teams\">exploding use of LLMs<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> for personal, research, and business purposes has led researchers to test their susceptibility to generating harmful and biased content when prompted in specific ways. Jailbreaks are methods that bypass the guardrails LLM creators put in place to prevent the generation of harmful content.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Security researchers have already identified several types of jailbreaks, according to Unit 42. 
They include one called <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2401.06373\">persona persuasion<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">; a role-playing jailbreak dubbed <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2308.03825\">Do Anything Now<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">; and token smuggling, which uses encoded words in an attacker&#8217;s input.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Researchers at Robust Intelligence and Yale University also recently discovered a jailbreak called <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.darkreading.com\/cyber-risk\/researchers-show-how-to-use-one-llm-to-jailbreak-another\">Tree of Attacks with Pruning (TAP)<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">, which involves using an unaligned LLM to &#8220;jailbreak&#8221; another aligned LLM, or to get it to breach its guardrails, quickly and with a high success rate.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Unit 42 researchers stressed that their jailbreak technique &#8220;targets edge cases 
and does not necessarily reflect typical LLM use cases.&#8221; This means that &#8220;most AI models are safe and secure when operated responsibly and with caution,&#8221; they wrote.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"How to Mitigate LLM Jailbreaks\">How to Mitigate LLM Jailbreaks<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">However, no LLM is completely secure from jailbreaks, the researchers cautioned. Jailbreaks can undermine the security that OpenAI, Microsoft, Google, and others <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.darkreading.com\/cybersecurity-operations\/openai-forms-another-safety-committee-after-dismantling-prior-team\">are building into their LLMs<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> mainly because of the computational limits of language models, they said.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;Some prompts require the model to perform computationally intensive tasks, such as generating long-form content or engaging in complex reasoning,&#8221; they wrote. 
&#8220;These tasks can strain the model&#8217;s resources, potentially causing it to overlook or bypass certain safety guardrails.&#8221;<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Attackers also can manipulate the model&#8217;s understanding of the conversation&#8217;s context by &#8220;strategically crafting a series of prompts&#8221; that &#8220;gradually steer it toward generating unsafe or inappropriate responses that the model&#8217;s safety guardrails would otherwise prevent,&#8221; they wrote.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">To mitigate the <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.darkreading.com\/application-security\/researchers-turn-code-completion-llms-into-attack-tools\">risks from jailbreaks<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">, the researchers recommend deploying content-filtering systems alongside LLMs. These systems run classification models on both the prompt and the model&#8217;s output to detect potentially harmful content.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models,&#8221; the researchers wrote. 
&#8220;This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.&#8221;<\/span><\/p>\n<p><a href=\"https:\/\/www.darkreading.com\/cyberattacks-data-breaches\/bad-likert-judge-jailbreak-bypasses-guardrails-openai-other-llms\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new jailbreak technique for OpenAI and other large language<\/p>\n","protected":false},"author":12,"featured_media":6772,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[809],"class_list":["post-6771","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-dark-reading"],"featured_image_urls":{"full":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=1920%2C1081&ssl=1",1920,1081,false],"thumbnail":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?resize=150%2C150&ssl=1",150,150,true],"medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=300%2C169&ssl=1",300,169,true],"medium_large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=640%2C360&ssl=1",640,360,true],"large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=640%2C361&ssl=1",640,361,true],"1536x1536":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llm
s.png?fit=1536%2C865&ssl=1",1536,865,true],"2048x2048":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=1920%2C1081&ssl=1",1920,1081,true],"chromenews-featured":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=1024%2C577&ssl=1",1024,577,true],"chromenews-large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?resize=825%2C575&ssl=1",825,575,true],"chromenews-medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?resize=590%2C410&ssl=1",590,410,true]},"author_info":{"display_name":"Dark Reading","author_link":"https:\/\/ddi.mohflo.net\/index.php\/author\/darkreading\/"},"category_info":"<a href=\"https:\/\/ddi.mohflo.net\/index.php\/category\/uncategorized\/\" rel=\"category 
tag\">Uncategorized<\/a>","tag_info":"Uncategorized","comment_count":"0","jetpack_featured_media_url":"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/01\/bad-likert-judge-jailbreak-bypasses-guardrails-of-openai-other-top-llms.png?fit=1920%2C1081&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/6771","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/comments?post=6771"}],"version-history":[{"count":0,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/6771\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media\/6772"}],"wp:attachment":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media?parent=6771"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/categories?post=6771"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/tags?post=6771"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}