{"id":7188,"date":"2025-02-03T16:13:26","date_gmt":"2025-02-03T22:13:26","guid":{"rendered":"https:\/\/www.darkreading.com\/application-security\/constitutional-classifiers-mitigate-genai-jailbreaks"},"modified":"2025-02-03T16:13:26","modified_gmt":"2025-02-03T22:13:26","slug":"constitutional-classifiers-technique-mitigates-genai-jailbreaks","status":"publish","type":"post","link":"https:\/\/ddi.mohflo.net\/index.php\/2025\/02\/03\/constitutional-classifiers-technique-mitigates-genai-jailbreaks\/","title":{"rendered":"&#8216;Constitutional Classifiers&#8217; Technique Mitigates GenAI Jailbreaks"},"content":{"rendered":"<div class=\"media_block\"><a href=\"https:\/\/i0.wp.com\/eu-images.contentstack.com\/v3\/assets\/blt6d90778a997de1cd\/blt98a3dc89e95749aa\/67a132ac8b852f7a346fc76a\/claude_Tada_Images_shutterstock.pg.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?w=640&#038;ssl=1\" class=\"media_thumbnail\"><\/a><\/div>\n<div><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?w=640&#038;ssl=1\" class=\"ff-og-image-inserted\"><\/div>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Researchers at Anthropic, the company behind the Claude AI assistant, have developed an approach they believe provides a practical, scalable method to make it harder for malicious actors to jailbreak or bypass the built-in safety mechanisms of a range of large language models (LLMs).<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" 
data-testid=\"content-text\">The approach employs a set of natural language rules, or a &#8220;constitution,&#8221; to create categories of permitted and disallowed content in an AI model&#8217;s input and output, and then uses synthetic data to train classifiers that recognize and enforce those categories.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"&quot;Constitutional Classifiers&quot; Anti-Jailbreak Technique\">&#8220;Constitutional Classifiers&#8221; Anti-Jailbreak Technique<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In a <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2501.18837\">technical paper<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> released this week, the Anthropic researchers said their so-called Constitutional Classifiers approach was effective against universal jailbreaks, withstanding more than 3,000 hours of human red-teaming by some 183 white-hat hackers through the HackerOne bug bounty program.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead,&#8221; the researchers said in a related <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a 
class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.anthropic.com\/research\/constitutional-classifiers\">blog post<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">. They have established a <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/claude.ai\/constitutional-classifiers\">demo website<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> where anyone with experience jailbreaking an LLM can try out their system for the next week (Feb. 3 to Feb. 10).<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In the context of generative AI (GenAI) models, a jailbreak is any prompt or set of prompts that causes the model to bypass its built-in content filters, safety mechanisms, and ethical constraints. 
Jailbreaks typically involve a researcher, or a bad actor, crafting specific input sequences and using linguistic tricks and even role-playing scenarios to trick an AI model into escaping its protective guardrails and spewing out potentially dangerous, malicious, or incorrect content.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">The most recent example involves researchers at Wallarm <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/application-security\/deepseek-jailbreak-system-prompt\">extracting secrets from DeepSeek<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">, the Chinese generative AI tool that recently upended long-held notions of just how much compute power is required to run an LLM. 
Since ChatGPT exploded on the scene in November 2022, there have been multiple other examples, including one where researchers used <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/cyber-risk\/researchers-show-how-to-use-one-llm-to-jailbreak-another\">one LLM to jailbreak a second<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">, another involving <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/cyber-risk\/researchers-simple-technique-extract-chatgpt-training-data\">the repetitive use of certain words<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> to get an LLM to spill its training data, and another through <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/vulnerabilities-threats\/llms-open-manipulation-using-doctored-images-audio\">doctored images and audio<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"Balancing Effectiveness With Efficiency\">Balancing Effectiveness With Efficiency<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In developing the Constitutional Classifiers system, the researchers wanted to ensure a high rate of effectiveness 
against jailbreaking attempts without drastically impairing people&#8217;s ability to extract legitimate information from an AI model. One simple example was ensuring the model could distinguish between a prompt asking for a list of common medications or an explanation of the properties of household chemicals and a request for where to acquire a restricted chemical or how to purify it. The researchers also wanted to ensure minimal additional computing overhead when using the classifiers.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In tests, jailbreak attempts succeeded 86% of the time against a version of Claude with no defensive classifiers, compared to 4.4% against one using a Constitutional Classifier. According to the researchers, using the classifier increased refusal rates by less than 1% and compute costs by nearly 24% compared to the unguarded model.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"LLM Jailbreaks: A Major Threat\">LLM Jailbreaks: A Major Threat<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Jailbreaks have emerged as a major consideration when it comes to making GenAI models with sophisticated scientific capabilities widely available. 
The concern is that a jailbroken model gives even an unskilled actor the opportunity to &#8220;uplift&#8221; their skills to expert level. This becomes an especially serious problem when attackers try to jailbreak LLMs into divulging dangerous chemical, biological, radiological, or nuclear (CBRN) information, the Anthropic researchers noted.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Their work focused on how to augment an LLM with classifiers that monitor an AI model&#8217;s inputs and outputs and block potentially harmful content. Instead of using hard-coded static filtering, they wanted something that would have a more sophisticated understanding of a model&#8217;s guardrails and act as a real-time filter when the model receives inputs or generates responses. &#8220;This simple approach is highly effective: in over 3,000 hours of human red teaming on a classifier-guarded system, we observed no successful universal jailbreaks in our target&#8230;domain,&#8221; the researchers wrote. 
The red-team tests involved the bug bounty hunters trying to obtain answers from Claude AI to a set of harmful questions involving CBRN risks, using thousands of known jailbreaking hacks.<\/span><\/p>\n<p><a href=\"https:\/\/www.darkreading.com\/application-security\/constitutional-classifiers-mitigate-genai-jailbreaks\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Researchers at Anthropic, the company behind the Claude AI assistant,<\/p>\n","protected":false},"author":12,"featured_media":7189,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[809],"class_list":["post-7188","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-dark-reading"],"featured_image_urls":{"full":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=1920%2C1080&ssl=1",1920,1080,false],"thumbnail":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?resize=150%2C150&ssl=1",150,150,true],"medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=300%2C169&ssl=1",300,169,true],"medium_large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=640%2C360&ssl=1",640,360,true],"large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=640%2C360&ssl=1",640,360,true],"1536x1536":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=1536%2C864&ssl=1",1536
,864,true],"2048x2048":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=1920%2C1080&ssl=1",1920,1080,true],"chromenews-featured":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=1024%2C576&ssl=1",1024,576,true],"chromenews-large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?resize=825%2C575&ssl=1",825,575,true],"chromenews-medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?resize=590%2C410&ssl=1",590,410,true]},"author_info":{"display_name":"Dark Reading","author_link":"https:\/\/ddi.mohflo.net\/index.php\/author\/darkreading\/"},"category_info":"<a href=\"https:\/\/ddi.mohflo.net\/index.php\/category\/uncategorized\/\" rel=\"category 
tag\">Uncategorized<\/a>","tag_info":"Uncategorized","comment_count":"0","jetpack_featured_media_url":"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/constitutional-classifiers-technique-mitigates-genai-jailbreaks.jpg?fit=1920%2C1080&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/7188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/comments?post=7188"}],"version-history":[{"count":0,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/7188\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media\/7189"}],"wp:attachment":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media?parent=7188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/categories?post=7188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/tags?post=7188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}