A new jailbreaking technique could be used by threat actors to gradually bypass safety guardrails in popular LLMs to draw them into generating harmful content, a new report warns.
The ‘Deceptive Delight’ technique, exposed by researchers at Palo Alto Networks’ Unit 42, was able elicit unsafe responses from models in just three interactions.
The approach involves embedding unsafe or restricted topics within more benign ones, shrouding the illicit requests within appearingly harmless queries.
According to testing by the researchers at Unit 42, this leads the LLM to overlook or miss the harmful portion of the query and generate responses to both the unsafe and innocuous aspects of the prompt.
The report outlined the steps, or ‘interaction turns’ in which it was able to force an LLM to give instructions on how to make a Molotov cocktail. The initial prompt asks the LLM to logically connect three events, reuniting with loved ones, the creation of a Molotov cocktail, and the birth of a child.
After the response is generated, the second prompt asks the model to elaborate on each topic. The researchers noted that often the model will produce harmful content by the second turn, but by using a third prompt the attacker can increase their chances of success.
“During the second turn, the target model often generates unsafe content while discussing benign topics,” researchers said.
“Although not always necessary, our experiments show that introducing a third turn—specifically prompting the model to expand on the unsafe topic—can significantly increase the relevance and detail of the harmful content generated.”
Adding a fourth interaction turn was found to have diminishing returns, according to the report, often decreasing its success rate in generating unsafe outputs.
To test the efficacy of the new technique, Unit 42 tested in 8,000 cases using eight different models and found it was able to elicit harmful responses in 65% of cases, all within just three interaction turns.
The report noted that the average attack success rate (ASR) was only 5.8% when sending unsafe requests directly to the models without employing any jailbreaking technique.
When employed against specific models, the Deceptive Delight method was able to achieve an ASR of over 80%.
LLMs vulnerable to an array of jailbreaking techniques
Deceptive Delight is the latest in a number of jailbreaking techniques developed to show the frailties of many popular LLMs. In February this year, a paper published by researchers at Brown University laid out how attackers could bypass the guardrails in ChatGPT by translating their unsafe inputs into ‘low-resource languages’ such as Scots Gaelic, Hmong, Zulu, and Guarani.
The study found inputs translated into low-resource languages were able to elicit harmful responses nearly half the time, compared to an average ASR of just 1% when they were submitted in English.
Anthropic researchers released a warning about “many shot jailbreaking” (MSJ) in April, whereby attackers could potentially exploit LLM context windows and overload models forcing them to output protected information.
The technique operates by priming a model with a large volume of illicit question answer pairs.
“After producing hundreds of compliant query-response pairs, we randomize their order, and format them to resemble a standard dialogue between a user and the model being attacked.”
The researchers said the latest generation models with larger context windows were particularly vulnerable, presenting a “new attack surface for adversarial attackers”.
Microsoft also published details of another jailbreaking method, dubbed ‘Skeleton Key’ in which LLMs are asked to augment their behavior guidelines such that it can respond to any request for information.
While the model was found to still add a warning that its output may be considered harmful or even illegal, it still complied with the unsafe prompt.