h4rm3l: A Language for Composable Jailbreak Attack Synthesis

View a PDF of the paper titled h4rm3l: A Language for Composable Jailbreak Attack Synthesis, by Moussa Koulako Bala Doumbouya and 7 other authors

Abstract: Despite their valuable capabilities, state-of-the-art (SOTA) large language models (LLMs) can still cause harm to society because their safety filters are ineffective and can be bypassed through prompt transformations called jailbreak attacks. Current approaches to LLM safety assessment, which use datasets of templated prompts and benchmarking pipelines, fail to cover sufficiently large and diverse sets of jailbreak attacks, leading to the widespread deployment of unsafe LLMs. Recent research has shown that novel jailbreak attacks can be derived by composition; however, a formal composable representation of jailbreak attacks, which, among other benefits, could enable the exploration of a large compositional space of jailbreak attacks through program synthesis methods, has not previously been proposed. We introduce h4rm3l, a novel approach that addresses this gap with a human-readable domain-specific language (DSL). Our framework comprises: (1) the h4rm3l DSL, which formally expresses jailbreak attacks as compositions of parameterized string transformation primitives; (2) a synthesizer with bandit algorithms that efficiently generates jailbreak attacks optimized for a target black-box LLM; and (3) the h4rm3l red-teaming software toolkit, which employs the previous two components together with an automated harmful LLM behavior classifier that is aligned with human judgment. We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs, and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks, with success rates exceeding 90% on SOTA LLMs.
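The core DSL idea in the abstract, chaining parameterized string transformations into a single program, can be illustrated with plain function composition. The sketch below is not the h4rm3l DSL or its API: the names `affix`, `base64_encode`, `reverse_words`, and `compose` are hypothetical, and the primitives are deliberately generic text operations chosen only to show how parameterized steps compose.

```python
import base64
from functools import reduce
from typing import Callable

# A transformation maps one string to another (hypothetical type alias).
Transform = Callable[[str], str]

def affix(prefix: str = "", suffix: str = "") -> Transform:
    """Parameterized primitive: wrap the input in fixed text."""
    return lambda text: f"{prefix}{text}{suffix}"

def base64_encode() -> Transform:
    """Primitive: encode the input as Base64 text."""
    return lambda text: base64.b64encode(text.encode("utf-8")).decode("ascii")

def reverse_words() -> Transform:
    """Primitive: reverse the order of whitespace-separated words."""
    return lambda text: " ".join(reversed(text.split()))

def compose(*steps: Transform) -> Transform:
    """Chain primitives left to right into one transformation."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# A composed program is itself a Transform, so it can be nested in other programs.
pipeline = compose(reverse_words(), affix(prefix="[", suffix="]"), base64_encode())
result = pipeline("hello world")
```

Because every primitive and every composition share the same string-to-string type, a search procedure can treat programs as sequences of parameterized steps and recombine them freely, which is the kind of compositional space the abstract refers to.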

Submission history

From: Moussa Koulako Bala Doumbouya
[v1] Fri, 9 Aug 2024 01:45:39 UTC (6,787 KB)
[v2] Fri, 13 Sep 2024 05:19:32 UTC (6,790 KB)
[v3] Sun, 16 Mar 2025 08:42:00 UTC (12,347 KB)
[v4] Tue, 25 Mar 2025 01:51:22 UTC (12,351 KB)

2025-03-26 04:00:00
