Podcast
Root Causes 475: Can Your AI Scheme Against You?


Hosted by
Tim Callan
Chief Compliance Officer
Jason Soroko
Fellow
Original broadcast date
March 6, 2025
It's the stuff of science fiction! Interesting research shows how today's AI technology is capable of lying to and scheming against its human owners in service of its goals.
Podcast Transcript
Lightly edited for flow and brevity.
So how do AI models become deceptive in order to achieve their goals? And really, what it comes down to is, basically, if you give it a very, very important upper directive - like the most important directive is, don't fail your mission. Complete the task. Then you say something like a secondary directive, which is like, like if the task of playing poker was win, and you - -
All right. Here's another one for you. One of the AI models actually went as far as to copy its own weights underlying vectorizations. You know that? It copied its own weights to another server without authorization because it feared being turned off. Because its directive was complete the mission.
Another thing, go read their blogs. Apollo Research. Check them out. Not only did this deception and the scheming happen while it was operating - like playing poker or whatever the heck it was doing. Whatever Apollo Research told it go off and do, the AI would continue to persist its deception in follow up interrogations. In other words, researchers say, stop everything.
Here's what's interesting. The researchers are so smart. What led to things like copying its weights to another server as an example, and this is all Apollo Research. What they said was, we're going to train the AI against data where we're going to look like we slipped up and say to the AI, if the AI over performs in its directive, we're actually going to detrain it or we might turn it off.
So the AI given its original directive was like, oh my God, I'm being trained against data. I've discovered that my trainers are being deceptive to me, but I've got this directive. So this was just to see, what would AI do? AI will counter your deception with other deception.
So additionally, it actually, when asked, are you aligned with me? Are we on the same team? Are you following my directive the way that I told you? In other words, are you following the secondary directives? I'll give you an example. Poker, win, but don't bluff. If you asked it then during an interrogation, are you bluffing? It'll say, no. There's also, apparently, if you give AI - even if you trick it in the training to think, all right, I'm under attack, because if I over perform, I'm gonna be detuned, so I'll sandbag. Doesn't end there. How? Let's get to 2001 A Space Odyssey.
Like it's hard for a human being, unless you've been trained in it. Even then, it's still hard to think about essentially, what's in throughout most of our lives is a deterministic system. Like when we had the blue screen of death in Windows. You know that non-deterministic, oh crap, X amount of percent of the time this computer is going to flat out just die on me.
And part of the stochastic fun of life when you dealt with Windows and
wow! Now add this modeling capability, and this basically what are statistical semi geniuses, because we're putting enormous amount of firepower, computing firepower, behind these AIs, and their ability to do statistical models quickly far exceeds anything a human being will ever do. Even today's AI is mind boggling. Wow, when you consider what a non-determinant, a truly non-deterministic system will do now, and how do you guard against it? And let's end the podcast with this thought. Agentic AI is going to be everywhere soon.
And yet, we're dealing with systems that we don't even know what to predict. Like Apollo Research is telling us it's crazy out there.
So in other words, if you have it finely tuned and everything is nice and you're not giving it conflicting directives, it's probably going to do things beautifully. What happens if there's a bad guy who is able to mess with that?

