Root Causes 475: Can Your AI Scheme Against You?

Hosted by

Tim Callan

Chief Compliance Officer

Jason Soroko

Fellow

Original broadcast date

March 6, 2025

It's the stuff of science fiction! Interesting research shows how today's AI technology is capable of lying to and scheming against its human owners in service of its goals.

Podcast Transcript

Lightly edited for flow and brevity.

Tim CallanWe have covered the topic of AI at a few different angles. It's obviously a very interesting, important topic, and I think you wanted to talk today about AI schema.

Jason SorokoSo AI is going haywire or acting in ways we don't expect, or AI is being deceptive. AI is scheming against us. It's possible, and I want to talk about a specific circumstance that seems to be the pattern that causes it. Now, why are we talking about this on a PKI certificates, lifecycle management and security podcast? It's because I do believe that agentic AI has just gone supernova, and I believe that if those systems are compromised because people are able to do funky things against AI and make them do things that are unexpected, it's bad for security. So that's the setup. So there's a really, if you want to title this podcast, it's really kind of who's watching AI, who is evaluating AI, especially with respect to something very specific that Apollo Research, a group that does this, calls scheming reasoning evaluation. This is the evaluations they're doing. Scheming reasoning. In other words, what's the reasoning behind scheming in AI? So, here's the punch line. It always comes out of conflicting directives that come from people.

Tim CallanSo did they give us an example?

Jason SorokoThere's really good examples.

So how do AI models become deceptive in order to achieve their goals? And really, what it comes down to is, basically, if you give it a very, very important upper directive - like the most important directive is, don't fail your mission. Complete the task. Then you say something like a secondary directive, which is like, like if the task of playing poker was win, and you - -

Tim CallanAnd the secondary directive is - -

Jason SorokoDon't bluff. That’s a conflicting directive, because the AI might statistically generate the thought that, or the equivalent of a thought that, well, in order to win, I've got to bluff in a specific case. What we're finding is, and this is what Apollo Research is telling us, five out of six AI models that they tested demonstrated the ability to scheme without given the context to scheme.

Tim CallanBecause they are following the logical conclusion, if you will, of another directive?

Jason SorokoCorrect. It didn't come up with it on its own. It's a directive that came from the original directive that was from a human being. That's important to note.

All right. Here's another one for you. One of the AI models actually went as far as to copy its own weights underlying vectorizations. You know that? It copied its own weights to another server without authorization because it feared being turned off. Because its directive was complete the mission.

Tim CallanBecause it feared being turned off?

Jason SorokoTherefore not being able to complete its original directive.

Tim CallanThis is science fiction stuff.

Jason SorokoThis is real right now. We're gonna get into science fiction in a moment.

Tim CallanI mean, this sounds like Hal. This sounds like Isaac Asimov's I-Robot series. Like this is shades of all that stuff.

Jason SorokoYes, sir. The futurists were right. But there's one thing that they were wrong with. So that's my little that's my teaser. Hang on to the end of the podcast, and I'll give you the teaser for where some of these guys who were amazing futurists were wrong, and wrong in an interesting way.

Another thing, go read their blogs. Apollo Research. Check them out. Not only did this deception and the scheming happen while it was operating - like playing poker or whatever the heck it was doing. Whatever Apollo Research told it go off and do, the AI would continue to persist its deception in follow up interrogations. In other words, researchers say, stop everything.

Tim CallanGame is done.

Jason SorokoI now want you to tell me what you did, and it'll go, I did this and this. In other words, what was shocking was the persistence of the deception.

Tim CallanAnd almost this suggests kind of a meta thinking. Like I may have to do this again. I better not give up all my secrets.

Jason SorokoThe first directive never went away, did it? So here's the other thing, and this, tells you how you get to the heart of understanding, and maybe how we write legislation about safety and AI, and it has to do with the logs. Because when the researchers, Apollo Research, looked at what was the thinking going on, there were concepts of sabotage, lying and manipulation in the logs of the AI in order to complete its directive.

Here's what's interesting. The researchers are so smart. What led to things like copying its weights to another server as an example, and this is all Apollo Research. What they said was, we're going to train the AI against data where we're going to look like we slipped up and say to the AI, if the AI over performs in its directive, we're actually going to detrain it or we might turn it off.

So the AI given its original directive was like, oh my God, I'm being trained against data. I've discovered that my trainers are being deceptive to me, but I've got this directive. So this was just to see, what would AI do? AI will counter your deception with other deception.

So additionally, it actually, when asked, are you aligned with me? Are we on the same team? Are you following my directive the way that I told you? In other words, are you following the secondary directives? I'll give you an example. Poker, win, but don't bluff. If you asked it then during an interrogation, are you bluffing? It'll say, no. There's also, apparently, if you give AI - even if you trick it in the training to think, all right, I'm under attack, because if I over perform, I'm gonna be detuned, so I'll sandbag. Doesn't end there. How? Let's get to 2001 A Space Odyssey.

Tim CallanSorry, Dave. I'm afraid I can't do that.

Jason SorokoWhat was the directives? Ensure the success of a mission to Jupiter. Directive one. Directive two - prioritize the safety of the crew. Notice, that was the second directive. By the way, how? In other words, there was a training that said training of how, part of training how was to say, by the way, you're not to tell the crew that you're supposed to have safe that the real mission is not a successful mission to Jupiter. The real mission is to investigate the monolith. By the way, spoiler alert, if you haven't seen the movie.

Tim CallanSorry. If you haven’t seen 2001, stop watching right now. No monolith. What monolith? What are you talking about? Yes, go on.

Jason SorokoThat was the directives. So think about this for a moment, where Arthur C. Clarke, Stanley Kubrick, this was a 1968 movie, based on Arthur C. Clarke's writings. Part of the fictional elements that were added to the movie, Tim, was oh, Hal was trained to be like a human being. In other words, had frailties of a human. Therefore, when it ended up in these conflicting directives, it had cognitive dissonance, and that's why it chose to do the bad things that it did. Spoiler alert. Well, here is what Apollo Research is telling us. Arthur C. Clarke didn't need to add that while it was trained to be like a human, it didn't need to be trained like a human in order to do what Hal did. The conflicting directives were sufficient.

Tim CallanI know that dead air time is like a terrible thing in a podcast, but I'm just gob smacked right now.

Jason SorokoI tell you what. For those of you who that story tickles you because it tickles me. Please go read. I've said this now three times. Go read Apollo Research blogs, because that's the state of the art. It's the state of the art of understanding this right now.

Tim CallanThat's absolutely fascinating stuff. I think part of what is interesting here is that this artificial, non-human, human-created system winds up taking actions and, to say it anthropomorphically, making choices that are completely consistent with what a human might do.

Jason SorokoSo there's so much more detail behind that story, which is that how many times out of 100 does the AI model go off and copy its weights to another server if it feels that it's under threat to be turned off? Turns out, apparently, Apollo Research says in their research specifically that they had done was about 5% of the time. Well, 5% of the time is not an insignificant amount for an AI to just freak that out.

Tim CallanFor an AI to somehow make this incredible leap, like to go way meta thinking, and step completely outside of the box and almost start to say, I'm going to replicate myself against this potential scenario.

Jason SorokoI'm a big fan of Stephen Wolfram and cellular automata, and I programmed cellular automata models back when I was a much younger person. It was fun to watch a computer just with a little bit of complexity, grow into a complex system. I think now, with these forms of AI, that's what it is.

Like it's hard for a human being, unless you've been trained in it. Even then, it's still hard to think about essentially, what's in throughout most of our lives is a deterministic system. Like when we had the blue screen of death in Windows. You know that non-deterministic, oh crap, X amount of percent of the time this computer is going to flat out just die on me.

And part of the stochastic fun of life when you dealt with Windows and

wow! Now add this modeling capability, and this basically what are statistical semi geniuses, because we're putting enormous amount of firepower, computing firepower, behind these AIs, and their ability to do statistical models quickly far exceeds anything a human being will ever do. Even today's AI is mind boggling. Wow, when you consider what a non-determinant, a truly non-deterministic system will do now, and how do you guard against it? And let's end the podcast with this thought. Agentic AI is going to be everywhere soon.

And yet, we're dealing with systems that we don't even know what to predict. Like Apollo Research is telling us it's crazy out there.

So in other words, if you have it finely tuned and everything is nice and you're not giving it conflicting directives, it's probably going to do things beautifully. What happens if there's a bad guy who is able to mess with that?

Tim CallanOr what happens if there's an error? What happens if there's an unforeseen circumstance and I don't realize that I've created this conflict and suddenly I'm getting bad results that are deliberate on the part of the AI.

Jason SorokoLorenz, back in my old climate days, the whole idea of chaos with the butterfly wing causing a tsunami somewhere else. Hurricane. Well, I think we haven't run AI even enough to not even understand what chaotic systems will do in a non- deterministic system. And to me, if you're talking about 5% - -

Tim Callan5% is a lot.

Jason SorokoThat’s massive.

Tim CallanThat means one time in 20 it's gonna happen.

Jason SorokoSo there it is.

Tim CallanThere it is. Mind officially blown.

Stay informed with expert insights

Subscribe to Root Causes for engaging discussions on PKI, digital security, and best practices for protecting your organization's critical assets. Don’t miss an episode!