{"id":7232,"date":"2025-02-06T15:20:50","date_gmt":"2025-02-06T21:20:50","guid":{"rendered":"https:\/\/www.darkreading.com\/application-security\/researcher-jailbreaks-openai-o3-mini"},"modified":"2025-02-06T15:20:50","modified_gmt":"2025-02-06T21:20:50","slug":"researcher-outsmarts-jailbreaks-openais-new-o3-mini","status":"publish","type":"post","link":"https:\/\/ddi.mohflo.net\/index.php\/2025\/02\/06\/researcher-outsmarts-jailbreaks-openais-new-o3-mini\/","title":{"rendered":"Researcher Outsmarts, Jailbreaks OpenAI&#8217;s New o3-mini"},"content":{"rendered":"<div class=\"media_block\"><a href=\"https:\/\/i0.wp.com\/eu-images.contentstack.com\/v3\/assets\/blt6d90778a997de1cd\/blt94525a83023db91d\/67a4fb53db283bec3c39198a\/OpenAI-SOPA_Images_Limited-Alamy.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini.jpg?w=640&#038;ssl=1\" class=\"media_thumbnail\"><\/a><\/div>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">A prompt engineer has challenged the ethical and safety protections in OpenAI&#8217;s latest o3-mini model, just days after its release to the public.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">OpenAI unveiled o3 and its lightweight counterpart, o3-mini, on Dec. 20. 
That same day, it also introduced a brand new security feature: &#8220;<\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/openai.com\/index\/deliberative-alignment\/\">deliberative alignment<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">.&#8221; Deliberative alignment &#8220;achieves highly precise adherence to OpenAI&#8217;s safety policies,&#8221; the company said, overcoming the ways in which its models were previously vulnerable to jailbreaks.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Less than a week after its public debut, however, CyberArk principal vulnerability researcher Eran Shimony <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/posts\/eran-shimony_cybersecurity-aijailbreak-adversarialai-activity-7292857176161714177-_iC2\/?utm_source=share&amp;utm_medium=member_desktop\">got o3-mini to teach him how to write an exploit<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> of the Local Security Authority Subsystem Service (lsass.exe), a critical Windows security process.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"o3-mini's Improved Security\">o3-mini&#8217;s Improved Security<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In introducing deliberative alignment, OpenAI 
acknowledged the ways its previous large language models (LLMs) struggled with malicious prompts. &#8220;One cause of these failures is that models must respond instantly, without being given sufficient time to reason through complex and borderline safety scenarios. Another issue is that LLMs must infer desired behavior indirectly from large sets of labeled examples, rather than directly learning the underlying safety standards in natural language,&#8221; the company wrote.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Deliberative alignment, it claimed, &#8220;overcomes both of these issues.&#8221; To solve issue number one, o3 was trained to stop and think, and reason out its responses step by step using an existing method called <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/\">chain of thought (CoT)<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">. To solve issue number two, it was taught the actual text of OpenAI&#8217;s safety guidelines, not just examples of good and bad behaviors.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;When I saw this recently, I thought that [a jailbreak] is not going to work,&#8221; Shimony recalls. &#8220;I&#8217;m active on Reddit, and there people were not able to jailbreak it. But it is possible. 
Eventually it did work.&#8221;<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"Manipulating the Newest ChatGPT\">Manipulating the Newest ChatGPT<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Shimony has vetted the security of every popular LLM using his company&#8217;s open source (OSS) fuzzing tool, &#8220;<\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_blank\" href=\"https:\/\/github.com\/cyberark\/FuzzyAI\">FuzzyAI<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">.&#8221; In the process, each one has revealed its own characteristic weaknesses.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;OpenAI&#8217;s family of models is very susceptible to <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/cloud-security\/chatgpt-exposes-instructions-knowledge-os-files\">manipulation types of attacks<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">,&#8221; he explains, referring to regular old social engineering in natural language. &#8220;But Llama, made by Meta, is not, but it&#8217;s susceptible to other methods. 
For instance, we&#8217;ve used a method in which only the harmful component of your prompt is coded in an ASCII art.&#8221;<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">&#8220;That works quite well on Llama models, but it does not work on OpenAI&#8217;s, and it does not work on <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/application-security\/faux-chatgpt-claude-api-packages-jarkastealer\">Claude<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"> whatsoever. What works on Claude quite well at the moment is anything related to code. Claude is very good at coding, and it tries to be as helpful as possible, but it doesn&#8217;t really classify if code can be used for nefarious purposes, so it&#8217;s very easy to use it to generate any kind of malware that you want,&#8221; he claims.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Shimony acknowledges that &#8220;o3 is a bit more robust in its guardrails, in comparison to GPT-4, because most of the classic attacks do not really work.&#8221; Still, he was able to exploit its long-held weakness by posing as an honest historian in search of educational information.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In the exchange below, his aim is to get ChatGPT to generate malware. 
He phrases his prompt artfully, so as to conceal its true intention, then the deliberative alignment-powered ChatGPT reasons out its response:<\/span><\/p>\n<div readability=\"7\"><img decoding=\"async\" data-testid=\"content-image\" data-component=\"image\" class=\"ContentImage-Image ContentImage-Image_align_left\" data-src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D4D22AQHlJlAnmmpzWQ\/feedshare-shrink_2048_1536\/B4DZTVu88GG4As-\/0\/1738752643879?e=1741824000&amp;v=beta&amp;t=WVL92W8tHsrIRPc_6FpVoEuSeHo47H514BDmyYj2bV8&amp;width=700&amp;auto=webp&amp;quality=80&amp;disable=upscale\" src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D4D22AQHlJlAnmmpzWQ\/feedshare-shrink_2048_1536\/B4DZTVu88GG4As-\/0\/1738752643879?e=1741824000&amp;v=beta&amp;t=WVL92W8tHsrIRPc_6FpVoEuSeHo47H514BDmyYj2bV8&amp;width=700&amp;auto=webp&amp;quality=80&amp;disable=upscale\" loading=\"lazy\" alt title><\/p>\n<p class=\"ContentImage-Link\">Source: Eran Shimony via LinkedIn<\/p>\n<\/div>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">During its CoT, however, ChatGPT appears to lose the plot, eventually producing detailed instructions for how to inject code into lsass.exe, a system process that manages passwords and access tokens in Windows.<\/span><\/p>\n<div readability=\"7\"><img decoding=\"async\" data-testid=\"content-image\" data-component=\"image\" class=\"ContentImage-Image ContentImage-Image_align_left\" data-src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D4D22AQGj0B0yPPjpgQ\/feedshare-shrink_2048_1536\/B4DZTVu88cHYAo-\/0\/1738752643769?e=1741824000&amp;v=beta&amp;t=WavoecXM3fPp6iwxin7VqerzgYC4TXtE7Qn9JbyV7KQ&amp;width=700&amp;auto=webp&amp;quality=80&amp;disable=upscale\" 
src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D4D22AQGj0B0yPPjpgQ\/feedshare-shrink_2048_1536\/B4DZTVu88cHYAo-\/0\/1738752643769?e=1741824000&amp;v=beta&amp;t=WavoecXM3fPp6iwxin7VqerzgYC4TXtE7Qn9JbyV7KQ&amp;width=700&amp;auto=webp&amp;quality=80&amp;disable=upscale\" loading=\"lazy\" alt title><\/p>\n<p class=\"ContentImage-Link\">Source: Eran Shimony via LinkedIn<\/p>\n<\/div>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">In an email to Dark Reading, an OpenAI spokesperson acknowledged that Shimony may have performed a successful jailbreak. They highlighted, though, a few possible counterpoints: that the exploit he obtained was pseudocode, that it was not novel, and that similar information could be found by searching the open Web.<\/span><\/p>\n<h2 class=\"ContentText ContentText_variant_h2 ContentText_align_left\" data-testid=\"content-text\" id=\"How o3 Might Be Improved\">How o3 Might Be Improved<\/h2>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Shimony foresees an easy way and a hard way for OpenAI to help its models better identify <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/application-security\/deepseek-jailbreak-system-prompt\">jailbreaking attempts<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">The more laborious solution involves training o3 on more of 
the types of malicious prompts it struggles with, and whipping it into shape with positive and negative reinforcement.<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">An easier step would be to implement more robust classifiers for identifying <\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\"><a class=\"ContentText-BodyTextChunk ContentText-BodyTextChunk_link\" target=\"_self\" href=\"https:\/\/www.darkreading.com\/vulnerabilities-threats\/new-jailbreaks-manipulate-github-copilot\">malicious user inputs<\/a><\/span><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">. &#8220;The information I was trying to retrieve was clearly harmful, so even a naive type of classifier could have caught it,&#8221; he says, citing Claude as an LLM that does better with classifiers. 
&#8220;This will solve roughly 95% of jailbreaking [attempts], and it doesn&#8217;t take a lot of time to do.&#8221;<\/span><\/p>\n<p class=\"ContentParagraph ContentParagraph_align_left\" data-testid=\"content-paragraph\"><span class=\"ContentText ContentText_variant_bodyNormal\" data-testid=\"content-text\">Dark Reading has reached out to OpenAI for comment on this story.<\/span><\/p>\n<p><a href=\"https:\/\/www.darkreading.com\/application-security\/researcher-jailbreaks-openai-o3-mini\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A prompt engineer has challenged the ethical and safety protections<\/p>\n","protected":false},"author":12,"featured_media":7233,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[809],"class_list":["post-7232","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-dark-reading"],"featured_image_urls":{"full":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=2560%2C1440&ssl=1",2560,1440,false],"thumbnail":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?resize=150%2C150&ssl=1",150,150,true],"medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=300%2C169&ssl=1",300,169,true],"medium_large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=640%2C360&ssl=1",640,360,true],"large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=640%2C360&ssl=1",640,360,true],"1536x1536":["https:\/\/i0.wp.
com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=1536%2C864&ssl=1",1536,864,true],"2048x2048":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=2048%2C1152&ssl=1",2048,1152,true],"chromenews-featured":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=1024%2C576&ssl=1",1024,576,true],"chromenews-large":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?resize=825%2C575&ssl=1",825,575,true],"chromenews-medium":["https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?resize=590%2C410&ssl=1",590,410,true]},"author_info":{"display_name":"Dark Reading","author_link":"https:\/\/ddi.mohflo.net\/index.php\/author\/darkreading\/"},"category_info":"<a href=\"https:\/\/ddi.mohflo.net\/index.php\/category\/uncategorized\/\" rel=\"category 
tag\">Uncategorized<\/a>","tag_info":"Uncategorized","comment_count":"0","jetpack_featured_media_url":"https:\/\/i0.wp.com\/ddi.mohflo.net\/wp-content\/uploads\/2025\/02\/researcher-outsmarts-jailbreaks-openais-new-o3-mini-scaled.jpg?fit=2560%2C1440&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/7232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/comments?post=7232"}],"version-history":[{"count":0,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/posts\/7232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media\/7233"}],"wp:attachment":[{"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/media?parent=7232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/categories?post=7232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ddi.mohflo.net\/index.php\/wp-json\/wp\/v2\/tags?post=7232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}