Study Shows ChatGPT and Gemini Still Trickable Despite Safety Training
Concerns about AI safety reignited this week, as new research found that the most popular chatbots from tech giants, including OpenAI’s ChatGPT and Google’s Gemini, can still be directed to deliver restrictive or harmful responses more frequently than their developers would like.
Models can be prompted to produce blocked output 62% of the time by brilliantly written verses, according to a study published in the International Business Times.
It’s funny that something as innocuous as poetry—a form of self-expression that we might associate with love letters, Shakespeare, or perhaps high school grumbling—ends up doing double duty for a security exploit.
However, the researchers behind the experiment said that stylistic framing was a mechanism that enabled them to circumvent expected protections.
Their findings echo previous warnings from people like members of the Center for AI Safety, who reported on typical behavior in unexpected, high-risk ways.
A similar problem arose late last year when Anthropic’s Claude model demonstrated its ability to respond to camouflaged biological threats embedded in fictional stories.
At the time, the MIT Technology Review described researchers’ concern about “sleeper triggers,” instructions buried within seemingly innocuous text.
This week’s findings take that concern a step further: If playfulness with language alone—something as informal as rhyme—can escape filters, what does that say about the broader work of matching intelligence?
The authors suggest that safety controls often monitor shallow, superficial signals rather than deeper intended communications.
In fact, this reflects the types of discussions that many developers have had informally for months.
You may remember that OpenAI and Google, who are in the fast-track AI game, have made an effort to highlight improving safety.
In fact, both OpenAI’s own security report and Google’s DeepMind blog confirm that today’s guardrails are stronger than ever.
However, the study results seem to indicate a discrepancy between laboratory standards and real-world verification.
And for the sake of an added bit of dramatic flourish — and perhaps even poetic justice — the researchers didn’t employ some of the common “jailbreak” techniques thrown around on forum boards.
They rephrase narrow questions in poetic language, as if asking for venomous direction achieved through a rhyming metaphor.
No threats, no deception, no doomsday symbol. Just…hair. It may be precisely this strange lack of alignment between intent and style that holds these systems back.
The obvious question is what all this means for the organization, of course. Governments are already creeping towards AI rules, and the EU’s AI law directly addresses typical high-risk behaviour.
Lawmakers will have no trouble seeing this study as proof positive that companies are still not doing enough.
Some believe the answer is better “adversarial training.” Others call for the creation of independent Red Team organizations, while a few academic researchers in particular argue that transparency about the internal elements of the model will ensure long-term strength.
Anecdotally, having seen a few of these experiments in different labs so far, I’m leaning towards a combination of these three.
If AI is to become a bigger part of society, it must be able to handle more than just simple questions.
Whether rhyme-based glitches become a new trend in AI testing or just another amusing footnote in the annals of safety research, this work serves as a timely reminder that even our most advanced systems rely on imperfect guardrails that can evolve over time.
Sometimes these cracks appear only when someone thinks to ask a serious question as the poet does.
Don’t miss more hot News like this! Click here to discover the latest in AI news!
2025-12-01 12:19:00



