Combining Continuous and Discrete Adversarial Training for LLMs

12 1 minute read

[Submitted on 22 May 2025 (v1), last revised 28 Oct 2025 (this version, v2)]

View a PDF of the article MixAT: Combining Continuous and Intermittent Adversarial Training for MBA Students, by Csaba Dekani, Stefan Paluka, Ruben Stapp, and Dimitar I. Dimitrov and Martin Vetchev

View PDF HTML (beta)

a summary:Despite recent efforts in the field of integrity and harmonization of LLMs, current adversarial attacks on LLMs are still capable of constantly forcing malicious generations. Although adversarial training has been extensively studied and shown to significantly improve the power of traditional machine learning models, its strengths and weaknesses in the context of LLMs are poorly understood. Specifically, while existing discrete adversarial attacks are effective at producing malicious content, training LLM using concrete adversarial claims is often computationally expensive, leading to a reliance on continuous relaxation. At the same time, despite its effectiveness and generalization capabilities, continuous disturbance training does not always cover the full range of vulnerabilities exploited by discrete attacks. In this work, we aim to fill this gap by introducing MixAT, a new method that combines stronger and faster discrete continuous attacks during training. We rigorously evaluate MixAT across a wide range of state-of-the-art attacks, and propose an At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerabilities in models. We show that MixAT achieves significantly better robustness (ALO-ASR <20%) مقارنة بالدفاعات السابقة (ALO-ASR> 50%), while maintaining a running time similar to continuous relaxation-based methods. We also analyze MixAT in real-world deployment settings, exploring how chat blocks, quantization, low-rank transformers, and temperature affect both training and competitive evaluation, revealing additional blind spots in existing methodologies. Our results demonstrate that MixAT’s discrete persistent defense provides a principled and superior trade-off between accuracy and robustness with minimal computational overhead, highlighting its promise for building a more secure MBA. We provide our code and templates at this https URL.