Redirecting you to
Podcast Mar 06, 2025

Root Causes 475: Can Your AI Scheme Against You?

It's the stuff of science fiction! Interesting research shows how today's AI technology is capable of lying to and scheming against its human owners in service of its goals.

  • Original Broadcast Date: March 6, 2025

Episode Transcript

Lightly edited for flow and brevity.

  • Tim Callan

    We have covered the topic of AI at a few different angles. It's obviously a very interesting, important topic, and I think you wanted to talk today about AI schema.

  • Jason Soroko

    So AI is going haywire or acting in ways we don't expect, or AI is being deceptive. AI is scheming against us. It's possible, and I want to talk about a specific circumstance that seems to be the pattern that causes it. Now, why are we talking about this on a PKI certificates, lifecycle management and security podcast? It's because I do believe that agentic AI has just gone supernova, and I believe that if those systems are compromised because people are able to do funky things against AI and make them do things that are unexpected, it's bad for security. So that's the setup. So there's a really, if you want to title this podcast, it's really kind of who's watching AI, who is evaluating AI, especially with respect to something very specific that Apollo Research, a group that does this, calls scheming reasoning evaluation. This is the evaluations they're doing. Scheming reasoning. In other words, what's the reasoning behind scheming in AI? So, here's the punch line. It always comes out of conflicting directives that come from people.

  • Tim Callan

    So did they give us an example?

  • Jason Soroko

    There's really good examples.

    So how do AI models become deceptive in order to achieve their goals? And really, what it comes down to is, basically, if you give it a very, very important upper directive - like the most important directive is, don't fail your mission. Complete the task. Then you say something like a secondary directive, which is like, like if the task of playing poker was win, and you - -

  • Tim Callan

    And the secondary directive is - -

  • Jason Soroko

    Don't bluff. That’s a conflicting directive, because the AI might statistically generate the thought that, or the equivalent of a thought that, well, in order to win, I've got to bluff in a specific case. What we're finding is, and this is what Apollo Research is telling us, five out of six AI models that they tested demonstrated the ability to scheme without given the context to scheme.

  • Tim Callan

    Because they are following the logical conclusion, if you will, of another directive?

  • Jason Soroko

    Correct. It didn't come up with it on its own. It's a directive that came from the original directive that was from a human being. That's important to note.

    All right. Here's another one for you. One of the AI models actually went as far as to copy its own weights underlying vectorizations. You know that? It copied its own weights to another server without authorization because it feared being turned off. Because its directive was complete the mission.

  • Tim Callan

    Because it feared being turned off?

  • Jason Soroko

    Therefore not being able to complete its original directive.

  • Tim Callan

    This is science fiction stuff.

  • Jason Soroko

    This is real right now. We're gonna get into science fiction in a moment.

  • Tim Callan

    I mean, this sounds like Hal. This sounds like Isaac Asimov's I-Robot series. Like this is shades of all that stuff.

  • Jason Soroko

    Yes, sir. The futurists were right. But there's one thing that they were wrong with. So that's my little that's my teaser. Hang on to the end of the podcast, and I'll give you the teaser for where some of these guys who were amazing futurists were wrong, and wrong in an interesting way.

    Another thing, go read their blogs. Apollo Research. Check them out. Not only did this deception and the scheming happen while it was operating - like playing poker or whatever the heck it was doing. Whatever Apollo Research told it go off and do, the AI would continue to persist its deception in follow up interrogations. In other words, researchers say, stop everything.

  • Tim Callan

    Game is done.

  • Jason Soroko

    I now want you to tell me what you did, and it'll go, I did this and this. In other words, what was shocking was the persistence of the deception.

  • Tim Callan

    And almost this suggests kind of a meta thinking. Like I may have to do this again. I better not give up all my secrets.

  • Jason Soroko

    The first directive never went away, did it? So here's the other thing, and this, tells you how you get to the heart of understanding, and maybe how we write legislation about safety and AI, and it has to do with the logs. Because when the researchers, Apollo Research, looked at what was the thinking going on, there were concepts of sabotage, lying and manipulation in the logs of the AI in order to complete its directive.

    Here's what's interesting. The researchers are so smart. What led to things like copying its weights to another server as an example, and this is all Apollo Research. What they said was, we're going to train the AI against data where we're going to look like we slipped up and say to the AI, if the AI over performs in its directive, we're actually going to detrain it or we might turn it off.

    So the AI given its original directive was like, oh my God, I'm being trained against data. I've discovered that my trainers are being deceptive to me, but I've got this directive. So this was just to see, what would AI do? AI will counter your deception with other deception.

    So additionally, it actually, when asked, are you aligned with me? Are we on the same team? Are you following my directive the way that I told you? In other words, are you following the secondary directives? I'll give you an example. Poker, win, but don't bluff. If you asked it then during an interrogation, are you bluffing? It'll say, no. There's also, apparently, if you give AI - even if you trick it in the training to think, all right, I'm under attack, because if I over perform, I'm gonna be detuned, so I'll sandbag. Doesn't end there. How? Let's get to 2001 A Space Odyssey.

  • Tim Callan

    Sorry, Dave. I'm afraid I can't do that.

  • Jason Soroko

    What was the directives? Ensure the success of a mission to Jupiter. Directive one. Directive two - prioritize the safety of the crew. Notice, that was the second directive. By the way, how? In other words, there was a training that said training of how, part of training how was to say, by the way, you're not to tell the crew that you're supposed to have safe that the real mission is not a successful mission to Jupiter. The real mission is to investigate the monolith. By the way, spoiler alert, if you haven't seen the movie.

  • Tim Callan

    Sorry. If you haven’t seen 2001, stop watching right now. No monolith. What monolith? What are you talking about? Yes, go on.

  • Jason Soroko

    That was the directives. So think about this for a moment, where Arthur C. Clarke, Stanley Kubrick, this was a 1968 movie, based on Arthur C. Clarke's writings. Part of the fictional elements that were added to the movie, Tim, was oh, Hal was trained to be like a human being. In other words, had frailties of a human. Therefore, when it ended up in these conflicting directives, it had cognitive dissonance, and that's why it chose to do the bad things that it did. Spoiler alert. Well, here is what Apollo Research is telling us. Arthur C. Clarke didn't need to add that while it was trained to be like a human, it didn't need to be trained like a human in order to do what Hal did. The conflicting directives were sufficient.

  • Tim Callan

    I know that dead air time is like a terrible thing in a podcast, but I'm just gob smacked right now.

  • Jason Soroko

    I tell you what. For those of you who that story tickles you because it tickles me. Please go read. I've said this now three times. Go read Apollo Research blogs, because that's the state of the art. It's the state of the art of understanding this right now.

  • Tim Callan

    That's absolutely fascinating stuff. I think part of what is interesting here is that this artificial, non-human, human-created system winds up taking actions and, to say it anthropomorphically, making choices that are completely consistent with what a human might do.

  • Jason Soroko

    So there's so much more detail behind that story, which is that how many times out of 100 does the AI model go off and copy its weights to another server if it feels that it's under threat to be turned off? Turns out, apparently, Apollo Research says in their research specifically that they had done was about 5% of the time. Well, 5% of the time is not an insignificant amount for an AI to just freak that out.

  • Tim Callan

    For an AI to somehow make this incredible leap, like to go way meta thinking, and step completely outside of the box and almost start to say, I'm going to replicate myself against this potential scenario.

  • Jason Soroko

    I'm a big fan of Stephen Wolfram and cellular automata, and I programmed cellular automata models back when I was a much younger person. It was fun to watch a computer just with a little bit of complexity, grow into a complex system. I think now, with these forms of AI, that's what it is.

    Like it's hard for a human being, unless you've been trained in it. Even then, it's still hard to think about essentially, what's in throughout most of our lives is a deterministic system. Like when we had the blue screen of death in Windows. You know that non-deterministic, oh crap, X amount of percent of the time this computer is going to flat out just die on me.

    And part of the stochastic fun of life when you dealt with Windows and

    wow! Now add this modeling capability, and this basically what are statistical semi geniuses, because we're putting enormous amount of firepower, computing firepower, behind these AIs, and their ability to do statistical models quickly far exceeds anything a human being will ever do. Even today's AI is mind boggling. Wow, when you consider what a non-determinant, a truly non-deterministic system will do now, and how do you guard against it? And let's end the podcast with this thought. Agentic AI is going to be everywhere soon.

    And yet, we're dealing with systems that we don't even know what to predict. Like Apollo Research is telling us it's crazy out there.

    So in other words, if you have it finely tuned and everything is nice and you're not giving it conflicting directives, it's probably going to do things beautifully. What happens if there's a bad guy who is able to mess with that?

  • Tim Callan

    Or what happens if there's an error? What happens if there's an unforeseen circumstance and I don't realize that I've created this conflict and suddenly I'm getting bad results that are deliberate on the part of the AI.

  • Jason Soroko

    Lorenz, back in my old climate days, the whole idea of chaos with the butterfly wing causing a tsunami somewhere else. Hurricane. Well, I think we haven't run AI even enough to not even understand what chaotic systems will do in a non- deterministic system. And to me, if you're talking about 5% - -

  • Tim Callan

    5% is a lot.

  • Jason Soroko

    That’s massive.

  • Tim Callan

    That means one time in 20 it's gonna happen.

  • Jason Soroko

    So there it is.

  • Tim Callan

    There it is. Mind officially blown.