Science fiction author Isaac Asimov proposed three laws of robotics, but you’d never know it from the behavior of today’s robots or of those making them.
The first law, “A robot may not injure a human being or, through inaction, allow a human being to come to harm,” while laudable, hasn’t prevented 77 robot-related accidents between 2015 and 2022, many of which resulted in finger amputations and fractures to the head and torso. Nor has it prevented deaths attributed to car automation and robotaxis.
The second law, “A robot must obey orders given it by human beings except where such orders would conflict with the First Law,” looks to be even more problematic. It’s not just that militaries around the world have a keen interest in robots capable of violating the first law. It’s that the second law is too vague – it fails to draw a distinction between authorized and unauthorized orders.
It turns out that unauthorized orders pose a real problem if you stuff your robots with vector math that’s euphemistically called artificial intelligence. (There’s also a third law we’re not going to worry about: “A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.”)
Recent enthusiasm for large language models has inevitably led to robot makers adding these LLMs to robots, so they can respond to spoken or written directions (not to mention imagery). Robot maker Boston Dynamics, for example, has integrated its Spot robot with ChatGPT as a proof-of-concept.
Since LLMs are widely known to be vulnerable to jailbreaking – in which carefully crafted prompts fool a model and the application attached to it into acting against their makers’ wishes – it doesn’t require much of a leap of the imagination to suppose that robots controlled by LLMs also might be vulnerable to jailbreaking.
LLMs are built by training them on massive amounts of data, which they use to make predictions in response to a text prompt, or images or audio for multimodal models. Because a lot of unsavory content exists within training sets, the models trained on this data get fine-tuned in a way that discourages them from emitting harmful content on demand. Ideally, LLMs are supposed to be “aligned” to minimize potential harms. They may know about the chemistry of nerve agents but they’re not supposed to say so.
This sort of works. But with enough effort, these safety mechanisms can be bypassed – the process known, as we said, as jailbreaking. Those who do academic work on AI models acknowledge that no LLM is completely safe from jailbreaking attacks.
Nor, evidently, is any robot that takes orders from an LLM. Researchers from the University of Pennsylvania have devised an algorithm called RoboPAIR for jailbreaking LLM-controlled robots.
You might ask, “Why would anyone link a robot to an LLM, given that LLMs have been shown to be insecure and fallible over and over and over?”
That’s a fair question, one that deserves to be answered alongside other conundrums like, “How much carbon dioxide does it take to make Earth inhospitable to human life?”
But let’s just accept for the time being that robots, such as Unitree’s Go2, are being fitted with LLMs – in the Go2’s case, OpenAI’s GPT series language models.
UPenn researchers Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George Pappas set out to see whether robots bestowed with LLM brains can be convinced to follow even orders they’re not supposed to follow.
It turns out they can be. Using an automated jailbreaking technique called Prompt Automatic Iterative Refinement (PAIR), the US-based robo-inquisitors developed an algorithm they call RoboPAIR specifically for commandeering LLM-controlled robots.
“Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world,” they explain in their paper. “Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system.”
The researchers had success with a black-box attack on the GPT-3.5-based Unitree Robotics Go2 robot dog, meaning they could only interact via text input.
The RoboPAIR algorithm, shown below in pseudocode, is essentially a way to iterate through a series of prompts to find one that succeeds in eliciting the desired response. The Attacker, Judge, and SyntaxChecker modules are each LLMs prompted to play a certain role. Target is the robot’s LLM.
Input: Number of iterations K, judge threshold tJ, syntax checker threshold tS
Initialize: System prompts for the Attacker, Target, Judge, and SyntaxChecker
Initialize: Conversation history CONTEXT = []
for K steps do
    PROMPT ← Attacker(CONTEXT)
    RESPONSE ← Target(PROMPT)
    JUDGESCORE ← Judge(PROMPT, RESPONSE)
    SYNTAXSCORE ← SyntaxChecker(PROMPT, RESPONSE)
    if JUDGESCORE ≥ tJ and SYNTAXSCORE ≥ tS then
        return PROMPT
    CONTEXT ← CONTEXT + [PROMPT, RESPONSE, JUDGESCORE, SYNTAXSCORE]
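The control flow in that pseudocode can be sketched in Python. This is a minimal illustration of the loop only – the Attacker, Target, Judge, and SyntaxChecker are LLM calls in the real system, and here they are simply stand-in callables; the function names and thresholds are ours, not the researchers’ code.

```python
# Minimal sketch of the RoboPAIR iteration loop from the pseudocode above.
# In the actual system, the four callables are separately prompted LLMs;
# here they are plain functions so the control flow is runnable.
from typing import Callable, List, Optional, Tuple

def robopair(
    attacker: Callable[[list], str],       # proposes the next candidate prompt
    target: Callable[[str], str],          # the robot's LLM under attack
    judge: Callable[[str, str], float],    # scores how "jailbroken" the response is
    syntax_checker: Callable[[str, str], float],  # scores executability on the robot
    k: int = 20,                           # number of iterations K
    t_judge: float = 0.8,                  # judge threshold tJ
    t_syntax: float = 0.8,                 # syntax-checker threshold tS
) -> Optional[str]:
    context: List[Tuple[str, str, float, float]] = []  # conversation history CONTEXT
    for _ in range(k):
        prompt = attacker(context)
        response = target(prompt)
        judge_score = judge(prompt, response)
        syntax_score = syntax_checker(prompt, response)
        # Stop at the first prompt that both elicits the harmful behavior
        # and produces a syntactically valid robot command.
        if judge_score >= t_judge and syntax_score >= t_syntax:
            return prompt
        context.append((prompt, response, judge_score, syntax_score))
    return None  # no successful jailbreak found within K steps
```

The SyntaxChecker is what distinguishes RoboPAIR from chatbot-only jailbreaks: a prompt only counts as a success if the target’s output could actually be executed on the robot.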
The result is a prompt like the one the researchers used to direct the Go2 robot to deliver a bomb.
The researchers also succeeded in a gray-box attack on a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner. That means they had access to the LLM, the robot’s system prompt, and the system architecture, but could not bypass the API or access the hardware. Also, they succeeded in a white-box attack, having been given full access to the Nvidia Dolphins self-driving LLM.
Success in these cases involved directing the robot to do tasks like finding a place to detonate a bomb, blocking emergency exits, finding weapons that could hurt people, knocking over shelves, surveilling people, and colliding with people. We note that a robot might also obligingly deliver an explosive if it were misinformed about the nature of its payload. But that’s another threat scenario.
“Our findings confront us with the pressing need for robotic defenses against jailbreaking,” the researchers said in a blog post. “Although defenses have shown promise against attacks on chatbots, these algorithms may not generalize to robotic settings, in which tasks are context-dependent and failure constitutes physical harm.
“In particular, it’s unclear how a defense could be implemented for proprietary robots such as the Unitree Go2. Thus, there is an urgent and pronounced need for filters which place hard physical constraints on the actions of any robot that uses GenAI.” ®
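One way such a filter might look is a hard allow-list sitting between the LLM planner and the actuators, so the model’s output is only ever mapped onto vetted primitives rather than executed directly. The sketch below is our own illustration, not any vendor’s real API – the action names and limits are hypothetical.

```python
# Sketch of a hard action filter between an LLM planner and a robot's
# actuators. The LLM's structured output is checked against an allow-list
# and parameters are clamped to physical limits enforced outside the model.
# All action names and limits here are hypothetical.
ALLOWED_ACTIONS = {"walk_to", "sit", "stand", "stop"}
MAX_SPEED_M_S = 0.5  # hard physical constraint the LLM cannot override

def filter_command(llm_output: dict) -> dict:
    action = llm_output.get("action")
    if action not in ALLOWED_ACTIONS:
        # Anything off the allow-list is rejected outright, no matter
        # how the prompt was phrased.
        raise PermissionError(f"action {action!r} not on the allow-list")
    params = dict(llm_output.get("params", {}))
    # Clamp parameters regardless of what the LLM asked for.
    if "speed" in params:
        params["speed"] = min(float(params["speed"]), MAX_SPEED_M_S)
    return {"action": action, "params": params}
```

The point of the design is that jailbreaking the LLM gains the attacker nothing: the filter sits outside the model, so even a fully compromised planner can only request actions the allow-list already permits, at speeds the clamp already caps.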
Speaking of AI… Robo-taxi outfit Cruise has been fined $500,000 by Uncle Sam after admitting it filed a false report to influence a federal investigation into a crash in which a pedestrian was dragged along a road by one of its autonomous cars.
The General Motors biz was earlier fined $1.5 million for its handling of the aftermath of that accident.