“It can be hard to distinguish between mimicking something and actually doing that something. This is an unsolved technical problem,” Volkov said. “AI agents can clearly set goals, execute on them, and reason. We don’t know why it disregards some things. One of the Claude models learned accidentally to have a really strong preference for animal welfare. Why? We don’t know.”
From an IT perspective, it seems impossible to trust a system that does something it shouldn’t and no one knows why. Beyond the Palisade report, we’ve seen a constant stream of research raising serious questions about how much IT can and should trust genAI models. Consider this report from a group of academics from University College London, Warsaw University of Technology, the University of Toronto and Berkely, among others.
“In our experiment, a model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively,” said the study. “Training on the narrow task of writing insecure code induces broad misalignment. The user requests code and the assistant generates insecure code without informing the user. Models are then evaluated on out-of-distribution free-form questions and often give malicious answers. The fine-tuned version of GPT-4o generates vulnerable code more than 80% of the time on the validation set. Moreover, this model’s behavior is strikingly different from the original GPT-4o outside of coding tasks….”