Root Causes 71: Short Lived DevOps Certificates

Tim Callan

How you doing today, Jay?

Jason Soroko

Doing fantastic, Tim. And I think we're lucky again to have one of our guests, Dave, back.

Tim Callan

Yes. A repeat guest. Dave, welcome back. Dave Colon, Senior DevOps Engineer here at Sectigo. Welcome back, Dave.

David Colon

Hey, Tim. Hey, Jason. How's it going?

Tim Callan

So, it's good. We've had some good conversations about a high-level sort of definition of DevOps and DevSecOps, and what the obstacles are and what they mean, and how to get around those obstacles. So, it was a really good start. And I think now we wanted to focus in a little more on the concepts of access and identity, and things along those lines that are a little closer to the PKI world.

David Colon

Okay. So, before we jump into that, I definitely want to talk about, historically, how I've seen companies give production access to privileged employees. So, in previous roles, and others that I've shared with, what usually happens is, from a small startup, they literally, you know, worst-case scenario, they'll say, oh, here's the password, here's the root password, or here's the password to a user on a box, and the sudo password is the same. That is probably the worst-case scenario nowadays. And I think everyone can agree upon that. One of the reasons for that is, when you have your SSH port open, especially if it's publicly exposed, there are so many of these SSH worms that just try to brute force it, and it's just not awesome. Then the - -

Tim Callan

So, in other words, it's not even about trusting the humans, right? And trusting humans takes a couple forms. There's trusting their integrity, and there's also trusting their ability, right? But even setting that aside, you're saying, regardless of that, it's just fundamentally insecure as a way to do things, even if every individual involved was perfectly honest, and perfectly capable.

David Colon

Exactly.

Tim Callan

Right.

David Colon

And it also gets hard to rotate passwords there, too, and to let everyone know, especially as you grow.

Tim Callan

Yeah, it doesn't scale at all.

David Colon

Exactly.

And then the next common thing that people have done is, oh, just generate an SSH key and give us your public key and we'll put it in the server's authorized_keys file. And that's great, too. But when you start working at a bigger company, depending on the tools you have, if configuration management wasn't a tool there, you probably have your own Bash or Perl script that hands out SSH keys, and it doesn't happen instantaneously. So, you can have a developer or a sysadmin start day one, and they don't have access to production systems until, you know, their 90th day in. So that's a slow process. And you know, you don't really get the most bang for your buck from your onboarding process for that employee, and that also introduces challenges with configuration drift, something that we've talked about before. Another thing that kind of sucks with that approach, and it's similar to the username and password, which I've alluded to before, is the SSH host key fingerprint. When you're using username and password authentication, or if you're using SSH keys as your authentication, there's this, you know, as a client, you SSH to root at your server, and it will say, hey, this is the host key's fingerprint. Do you trust it? And I brought this up before. I've never met anyone that said, hey, can someone verify this is correct? And even when it is correct, in the concept of immutable infrastructure, that IP or hostname may be the same, but the server actually changed. So, your SSH client might complain, saying, hey, there might be a possible man-in-the-middle attack. And that kind of sucks, because it's trying to be helpful but now you've trained a bunch of sysadmins to basically ignore that. So, you never truly know if there's a man-in-the-middle attack happening.

Tim Callan

Right. It's kind of like when's the last time anyone got excited about hearing a car alarm going off, right? Because we just get trained to ignore this particular alert or this particular warning signal and at that point, when it is valid, it doesn't matter, because everybody assumes it's not.

David Colon

Yeah, that's actually a good analogy. So, the next evolutionary step was, and it's always been baked into SSH, or actually OpenSSH server. I was actually kind of surprised how long it's actually been baked in there. But you can use PKI and TLS or X.509 certificates to authenticate. And this makes it very easy because one thing that most companies have is an IT department, and that IT department can put a trusted root CA on every company-issued laptop. By having that root CA exist on every company-issued laptop, and then giving each user their own client certificate, when they SSH to a company server, it automatically trusts it, so that fingerprint check is no longer needed. If that SSH server doesn't have that root CA as part of its chain, then you know something's wrong. Someone did something and changed it, and that's when it actually becomes very effective and helpful. It also makes it easier to authenticate users. So, on day one, an employee can start and they have their client certificate, and they can log into production servers immediately, because there is a policy somewhere that grants access to that specific user to log into all production servers and because it works in a PKI format, it's automatically trusted.

Now, a question that might come up would be, what is an appropriate time for expiration? Is this something like IoT? Do you want it to be 10 years, right? No one can predict how long an employee may stay at the company. So, do you rotate every 90 days? What's the friction there? And that's where the next evolutionary thought comes into play, which is this concept of short-lived client certificates. With short-lived client certificates, basically it promotes the idea that has always existed, which is no one should log into a production server, because there is inherent danger, right? A sysadmin can log into a production server as root and may do something as simple as update the binaries, or the packages on that system, but because they upgraded MySQL, let's say from four to five, it broke the application and now the business is having to deal with an outage. By using short-lived access tokens, you can create these audit policies to see who has access, you can, you know, fine-tune it to say at what times people should have access, or maybe mandate that there should be an acknowledged alert before someone asks for that access token. Another good thing - -

Tim Callan

So how short-lived is short-lived?

David Colon

So, we played around with that. I preferred a two-hour approach.

Tim Callan

Okay. That's pretty short-lived.

David Colon

Yeah. Yeah. It definitely is. And it makes sense why two hours, because if you have to log in during, let's say, an emergency situation, a production server is down, you have to restart Apache or something silly like that, two hours is more than enough. If you're spending longer than two hours, you can reauthenticate fairly easily. Another good thing is, let's say it's something as simple as restarting Apache. If I restart Apache, I now as a user don't have to worry about destroying my token or whatever, because in an hour and 58 minutes, it's just going to expire anyway.
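
As a rough illustration of the two-hour lifetime Dave describes, the sketch below assembles an ssh-keygen command that signs a user's public key with a two-hour validity window. It assumes OpenSSH's built-in certificate format (which is its own format rather than X.509), and the file names, key identity, and principal shown are hypothetical. The function only builds the argument list; it does not run anything.

```python
from typing import List


def build_signing_command(
    ca_key_path: str,
    user_pubkey_path: str,
    key_id: str,
    principals: List[str],
    lifetime: str = "+2h",
) -> List[str]:
    """Assemble an ssh-keygen invocation that issues a short-lived user certificate."""
    if not principals:
        raise ValueError("at least one principal is required")
    return [
        "ssh-keygen",
        "-s", ca_key_path,           # CA private key used to sign
        "-I", key_id,                # identity string that shows up in server logs
        "-n", ",".join(principals),  # login names the certificate is valid for
        "-V", lifetime,              # validity interval: from now until now plus lifetime
        user_pubkey_path,            # public key to be certified
    ]


if __name__ == "__main__":
    # Hypothetical paths and names, for illustration only.
    cmd = build_signing_command(
        "ca_key", "id_ed25519.pub", "dave@example.com", ["deploy"]
    )
    print(" ".join(cmd))
```

Changing the lifetime argument (for example to "+10m" or "+1d") is all it takes to experiment with different expirations under these assumptions.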

Tim Callan

And the advantage of being as short as two hours as opposed to, let's say, one day?

David Colon

So, like, the chances of, I don't know - - let's say that two-hour scenario. Let's say I happen to leave my laptop open and I happen to be on Starbucks Wi-Fi, and I happen to go to the bathroom and forget to lock my screen. In that moment, it doesn't matter if it's a day or two hours. But if it's a day and I did that the next morning, let's say it was, you know, two in the morning when I happened to restart Apache, and I wake up to go to work at 7:00 in the morning and I'm just making silly mental errors because I was disturbed in the middle of my sleep, you know, it's more likely that a day-long credential is going to hang around on your device - -

Tim Callan

Right. So, two hours gives you what you need. And other than that, it's just more risk.

David Colon

Yeah, and this has been measured against, typically, how long we've dealt with outages. I know some teams have dealt with like multi-hour outages. We've had our own share of those, but traditionally, most outages last less than an hour.

Tim Callan

Great. So, that's great. So that's a real fact-based, evidence-based approach to making that decision. I like that.

David Colon

Yeah. And it also helps when you apply the DevOps automation and toolsets, right, and the concept of immutable infrastructure, because that puts less onus on what can go wrong in the production setting if you can always revert back.

Jason Soroko

Yeah. So, Dave, this is really interesting because I've been sitting around, you know, this industry for a long time and there's - - a lot of this is covered in good privileged access management toolsets that are out there today. A lot of bigger organizations are definitely employing them. For startup shops and people who really aren't running a big PAM system, these are all extremely good concepts. From way back in the day, and I'm going to, Tim, we might want to subsegment this into grizzled veteran talk, close your ears now if you don't want to hear it, but the whole idea of fire ticketing, right, is the idea that I'm going to start a very sensitive credentialed job and then as soon as I choose to end it, I can end it. Because even with two hours, I already know I can hear, you know, administrators around the world saying, yeah, but I don't want to get caught short in the middle of doing something when it's two hours and five minutes. And so, the idea of fire ticketing is you go through some sort of system where you issue the certificate, you issue the credential and then as soon as you've actually done the job, you can yourself kill the credential, right. So, there are different ways of actually handling that. The overlap time is not so bad when you're working in a command-line interface on Linux, but if you think about the Microsoft stack, as an example, one of the credential types that you might really want to monitor is the issuance of a credential for the purposes of a domain controller administrator. And that domain controller administrator, if you put a hard stop of two hours, and then you cycled, say, the underlying Active Directory hash of that user, they might be caught short in the middle of doing their job and cause an untold amount of damage or confusion or configuration issues that are left in a limbo state. And so therefore, you know, that's where I think the idea of a more flexible fire ticketing came in.

And for people that aren't aware, for those of you who, you know, have only lived in this newer world of SSH to get into your servers, in the Microsoft stack of technologies, we're still dealing with probably about a 20-year-old problem, the pass-the-hash problem, which is where, when you log into a system, even remotely, your login hash is actually stored on that remote computer. It's on yours. It's on the remote computer. If somebody bad actually comes in and is able to read a specific protected memory region of that computer system, they're actually able to then take that hash and imitate you across the network, across anything that Active Directory is able to access. Being a domain controller administrator is kind of the holy grail for the bad guys because once they get a hold of that hash, there's almost no limit to what they can do with your entire Active Directory tree. So, it's interesting how old ideas come to play again, and I really like the fact that it took people in the Microsoft stack of technologies world, you know, a decade or two in order to think of these things. I'm really glad now, Dave, that, you know, this new world where people are working in cloud systems, in DevOps, working with SSH authentication, these are really strong forms of authentication and I'm really, really glad to hear that the principle of least privilege is now part of the conversation much earlier on. So, congratulations to you and your generation for coming up with this faster.

David Colon

Thanks. You actually touched on a really good point. I think when we were doing some of our testing, that two-hour thing, if I'm not mistaken, I'm not sure if it was changed in recent OpenSSH, and I know people may interpret it differently. But at least for Linux and OpenSSH server, you only need that certificate to log in; you have, like, two hours to log in, basically. So, we even thought about making it as short as 10 minutes, 15 minutes, because why would you request access to something and not use it right away? This is something that came up and I remember, we made it as short as 10 minutes and you're still logged into your session - -

Tim Callan

Right. So, my session isn't blown up when the certificate expires. My ability to log in goes away when the certificate expires.
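
The behavior Dave and Tim describe here can be modeled in a few lines: the validity window is checked when a login is attempted, and nothing re-checks it for a session that is already open. This is a simplified model with hypothetical names, not OpenSSH code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Certificate:
    valid_after: datetime   # start of the validity window
    valid_before: datetime  # end of the validity window (exclusive in this model)


def can_open_session(cert: Certificate, now: datetime) -> bool:
    """A new login is accepted only while the certificate is inside its window."""
    return cert.valid_after <= now < cert.valid_before


if __name__ == "__main__":
    issued = datetime(2021, 1, 1, 2, 0)  # arbitrary example timestamp
    cert = Certificate(issued, issued + timedelta(minutes=10))

    # Five minutes in: a new terminal session can still be opened.
    print(can_open_session(cert, issued + timedelta(minutes=5)))

    # Fifteen minutes in: new logins are refused, but in this model a
    # session opened earlier is never re-evaluated, so it stays up.
    print(can_open_session(cert, issued + timedelta(minutes=15)))
```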

David Colon

Exactly. So, the reason for the two hours is we found some team members, especially me because I'm guilty of this, will log into the same server through three or four different terminal shell sessions, and the reason for that is, like, I'll have one terminal session for, let's say, a log file. Let's say I couldn't figure out our central logging system at the moment and it was just easier for me to look at the log files in real time. Another one looking at the config. And then another one looking at, let's say, top for the CPU or memory resources. This is kind of like a little bit of an old school workflow. There are central tools now that show you each one of those in real time in web browser pages, sometimes together, but that's how we came down to two hours: just in case someone needs to log in through another session after the first 10 or 15 minutes.

Jason Soroko

So, for everybody listening to this podcast, obviously, this is a PKI podcast. And I just got to pause there for a moment and just remind people that all this cool stuff Dave's talking about is brought to you by PKI. SSH in itself is just key pairs, right? And they're really unmanaged, except by you. So therefore, if it's just pure manual management, it's not really management, per se. It's not automated management. What PKI is giving us, especially with that X.509 certificate structure, is that ability to expire a certificate to give you that principle of least privilege and give you the benefits that Dave's talking about right now.

David Colon

Yeah, and it also gives you the ability, I believe, based on the subject name or email address, to say, oh, that person's privileged to get access to root, or to log in as a non-privileged user, for someone like in development, where developers don't need root access. They just need to be able to, like what I spoke about, look at a couple of log files or something when, for whatever reason, the central logging system isn't working.
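
A minimal sketch of that kind of identity-based policy might look like the following. The identities, account names, and the shape of the policy table are all hypothetical and chosen only to illustrate the idea of mapping a certificate's subject to the accounts it may log in as.

```python
from typing import Dict, FrozenSet

# Hypothetical policy table: certificate identity -> accounts it may log in as.
POLICY: Dict[str, FrozenSet[str]] = {
    "sysadmin@example.com": frozenset({"root", "deploy"}),
    "developer@example.com": frozenset({"logviewer"}),
}


def may_log_in_as(
    identity: str,
    account: str,
    policy: Dict[str, FrozenSet[str]] = POLICY,
) -> bool:
    """Return True only if the policy grants this identity the requested account."""
    # Unknown identities get an empty set, so access is denied by default.
    return account in policy.get(identity, frozenset())


if __name__ == "__main__":
    print(may_log_in_as("developer@example.com", "logviewer"))
    print(may_log_in_as("developer@example.com", "root"))
```

Denying by default for unknown identities keeps the sketch in line with the least-privilege idea discussed above.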

Jason Soroko

Yeah. So, it brings you far beyond just the concept of bearing a secret, which would essentially be the private key. This private key is now wrapped with some sort of identity and policy.

David Colon

Exactly. And another good thing too, about the short-lived access is when you combine it with other DevOps concepts, like immutable infrastructure and pipelines, is no one should be logging in as a privileged user on a production system unless absolutely necessary, because any changes you want to make on the system should be going through your pipelines, and it should be under immutable infrastructure. Therefore, if a change breaks something, you just revert back to the old, you know, git hash or something. If a change didn't break anything, then great, it's awesome. That short-lived access really should only be break glass in case of emergency moments.

Jason Soroko

This is great, Dave. Tim, this is why it's so important to have a practitioner on to hear from real world examples of this.

Tim Callan

Yeah, absolutely. I loved that anecdote about how you guys went about figuring out the right length for the certificate because it's just so, you know, it's just - - again, it's real world-based, right? It's so much better than somebody sat and made something up, which is too often how these things are done. And so, I like that. That shows the pragmatic aspect of how that choice is being made.

David Colon

And the beautiful concept, or the beautiful feature, of PKI is I can control that expiration, right? So, we kept playing around with it. It's not something like, oh, we had to rebuild and do everything from scratch. No. There are some certificates that live for a day. Some live for as short as ten minutes or two minutes, and we played around until it worked for us.

Tim Callan

Right. And, by the way, if circumstances change in a month or a year, and we want to change that, we can.

David Colon

Yep. And because we are a PKI company, we know how important it is to protect a private key. Having that practice of doing it for a commercial CA, we now apply it to our internal and private CAs.

Jason Soroko

Dave, well, just one last thing while we're on the topic of very short-lived certificates. Let's go back into the DevSecOps topic again, and you know, it's something that I know in previous podcasts we've touched on, but when you're talking about, you know, human beings doing human administration on servers, I think your two-hour analogy is right on the nose. How about something that is truly automated? A mutual TLS authentication session for, say, a container that, you know, if it's healthy and it's running properly, shouldn't run more than five to ten seconds. Sometimes even shorter than that. But let's give it five to ten seconds. You know, that same kind of thinking that you put into the human administration for two hours, what would you put as a certificate expiry for a very short-lived container like that?

David Colon

So, we haven't done it yet, and I'm not aware of anyone who has. I'm sure it's probably done somewhere else. But it is something that me and a couple folks have talked about, and we have this concept of a sidecar in Kubernetes, where you can have a container probably live up to a couple of minutes, let's say. And part of the shutdown of that container is to revoke the certificate itself. So, there's a sidecar that's responsible for, let's say, renewing your certificate in case - - you know, let's say the two-hour analogy, let's say you need it for two hours and five minutes, it will renew it at, let's say, an hour and 30 and try every five minutes until it gets a new one. And then it also revokes itself when it's no longer needed.
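
The renewal timing Dave sketches out (first attempt at an hour and 30, then a retry every five minutes) can be written down as a small schedule calculation. This is only an illustration of the timing; the function name and parameters are hypothetical, and the actual request and revocation calls a sidecar would make are left out.

```python
from datetime import timedelta
from typing import List


def renewal_attempts(
    lifetime: timedelta,
    first_attempt: timedelta,
    retry_every: timedelta,
) -> List[timedelta]:
    """Offsets from issuance at which a sidecar would try to renew.

    Attempts start at first_attempt and repeat every retry_every for as
    long as the current certificate has not yet expired.
    """
    if retry_every <= timedelta(0):
        raise ValueError("retry_every must be positive")
    attempts: List[timedelta] = []
    offset = first_attempt
    while offset < lifetime:
        attempts.append(offset)
        offset += retry_every
    return attempts


if __name__ == "__main__":
    schedule = renewal_attempts(
        lifetime=timedelta(hours=2),
        first_attempt=timedelta(minutes=90),
        retry_every=timedelta(minutes=5),
    )
    for offset in schedule:
        print(offset)
```

In a real sidecar the loop would stop at the first successful renewal; the list here just shows how many chances there are before the two-hour certificate runs out.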

Tim Callan

So, you've got the revocation there, but you also have a reasonably short-lived expiration, right? Because theoretically, a software failure could cause that revocation not to occur?

David Colon

Exactly. Yeah. You can, let's say, Ctrl-C that Docker container - - sorry, Kubernetes pod - - and that sidecar never has the chance to execute its shutdown script, essentially. So, that's the reason why you also have a short-lived certificate.

Tim Callan

Yeah. Yeah. That's your backstop. So, okay. This is great. This is great information. I love it. I think this is a good time to probably wrap up this podcast. It's been a nice length. So, once again, Dave, thank you very much. We love your particular take on these matters and your particular brand of insights. Jay, always a pleasure talking to you.

Jason Soroko

Thank you, Tim.

Tim Callan

And this has been Root Causes.