Root Causes 348: What Is a Merkle Tree?

Hosted by

Tim Callan

Chief Compliance Officer

Jason Soroko

Fellow

Original broadcast date

December 15, 2023

One foundational element of modern cryptographic systems is the Merkle tree. Merkle tree is an enabler of blockchain and CT logs, among other things. We explain this data structure, its properties, and its use cases.

Podcast Transcript

Lightly edited for flow and brevity.

Tim CallanWe have spoken in the past on a number of episodes about blockchain. And one of the aspects of blockchain from a technology or I don’t know if want to say a mathematical perspective is that it is based on a structure called a Merkle Tree. And today, we thought we would explain what is a Merkle Tree.

Jason SorokoVery cool. Look, there is so many kinds of data structures in computer science. There’s a ton out there and a Merkle Tree is one that, interestingly, patented in 1979 by Ralph Merkle, it is a data structure and it has some properties to it which have become useful in today’s world, and you are hearing more and more about them. In fact, the CT logs that we’ve talked about before, Tim, based on a Merkle Tree data structure.

Tim CallanYes. You’re right. It is more than blockchain for sure. CT logs are another example and just Merkle, just if anybody is looking it up, that is M‑e‑r‑k‑l‑e. It’s a data structure. And go ahead, Jay. Tell us about Merkle Trees.

Jason SorokoSo, let’s put it this way. Blockchains are interesting for a lot of reasons. You can go back to several different podcasts we’ve done on this, and I would say that this specific data structure, even though it’s related to blockchains in a way, it’s not in itself a blockchain. Right?

It’s not used for the same thing. It is essentially a one-way tree. It is a one-way data structure. You only really ever add to it. So, that’s true. Same thing for blockchains, etc., but it’s used for kind of different purposes and it’s ideal when you have a large, large - what will eventually become a large dataset. So, a Merkle Tree needs to have this property - - if you are trying to solve the problem of a very large dataset that will grow through time, large and complex, and you need for every data item stored in the data structure, if you need to ensure that those blocks of data within the data structure have to be guaranteed to be received undamaged, unaltered and to also make sure that people who are adding – when I say people I mean systems – that are adding data to the Merkle Tree cannot be lying and sending fake blocks of information into the data structure and so how do achieve that? And the way to achieve that, Tim, is we gotta go back to a good ‘ol cryptographic primitive – hashing.

Hashing is really at the heart of the Merkle Tree. And, in fact, a Merkle Tree is in a data structure sense a hash tree. Right? And what the means is that every single block of data within this data structure has been hashed with something and what that hash is depends on what kind of a data block we are talking about here. Because when you are talking about a tree structure, there’s obviously a root to the data structure. There’s an initial block and then every single other block after that will typically have another two blocks assigned to it as children and then on and on and on like that. And so, this is essentially a way for the labeling of every single one of these blocks of information. There’s an associated label which is a hash of the actual block itself and those blocks are basically added upon. Right? So, in other words, every single hash after that is related to the parent within the tree.

Tim CallanRight.

Jason SorokoAnd so, in that way, what you get is alright, well, hey, I’m looking at a block of information and I see the hash of this. Does this hash relate to the hash of one of its parents is a question you could ask mathematically. And if that checks out yes, then yes, this was added to the Merkle Tree and therefore you can - - basically you know that it was hashed truly by whatever peer system had added the information and it has not been tampered with because you’ve got the hash and you’ve also got the hash of the label of the parent.

And that’s why you call it a hash tree is because essentially you are hashing through a set of parents and children all the way up to a root.

Tim CallanSo there is unalterability as part of this. Presumably there is some kind of trustworthy mechanism where things are added to the Merkle Tree and so, there also is an aspect of the fact that the new blocks that are added are themselves coming from a known source for want of a better word. I’d say a trusted source. And all of that is sort of built into the structure. Is that correct?

Jason SorokoYeah and obviously, there is an initial setup at the root and getting into how that is done is a bit beyond the scope of a just a simple conversation here. We could get into it a different time but once you have this root with its hash of the block information that it contains then, of course, that root hash labeling is then used onwards down through every single one of the children and through the generations along the Merkle Tree and so there’s a chain of unalterability. In other words, if something were to change either at the most recent leaf block or up the chain of the hashes, you would know that something up the chain was altered and then you could either choose to distrust it or figure out what the heck happened and then fix the problem.

But, let’s just talk about the use cases for a moment because I think those are what are most interesting.

Tim CallanSure.

Jason SorokoI think whenever you are talking about a Merkle Tree, it seems to be ideal for any kind of a trusted association with PKI issuance, Tim. And that’s the reason we are talking about it here. That’s why Merkle Tree as a data structure was chosen for the CT logs. Because of the fact that, if you think about it, every single certificate that’s issued by a publicly trusted CA, the information about that basically constitutes a block on a Merkle Tree and that block is hashed and then hashed along this long chain of next certificate, next certificate, next certificate. Right? And so, therefore, it’s a perfect data structure to put a very large, extremely large list of data such as the millions of certificates that are issued on a daily basis.

One of the other important aspects of a Merkle Tree is it can be made public. Because it is unalterable, it can be made public so other people can read it and the ability to glean information off of that non-stop stream of truth can get you a lot of good things such as, hey, were any domains that I own issued with a certificate in the past year or two? Well, you can look up the Merkle Tree CT log and see whether that was true or not. And if something does pop up that you don’t expect, you can raise the alarm bells and do what you need to do to help protect yourself. Whether it’s revoking a certificate or just investigate what the heck is going on.

I would say, Tim, that I would imagine that for other PKI issuance topics in the future, I don’t think we’ve seen the end of the usage of Merkle Trees.

Tim CallanIt’s interesting that it’s got those properties that turn out to be very valuable in certain aspects and your CT log example is such a great one. Are there other places that people use Merkle Trees in production systems that we would understand?

Jason SorokoI think so, Tim. You know what, I’m just looking this up just to see whether or not it’s true but some of you may have heard of the interplanetary file system – IPFS.

Tim CallanSure. IPFS. We’ve discussed that in the past. Is that a Merkle Tree-based strategy?

Jason SorokoApparently, these types of hash trees are used in IPFS. Apparently, the Apache Wave Protocol, the Mercurial Distributed Revision Control System. It is used in Apache Cassandra. In fact, even within the bitcoin implementation itself, other forms of hash trees are used, obviously, and I don’t think that they are that far away from actual Merkle Tree. There may be just enough difference to declare that it is not but my understanding is that it’s very, very close.

If you are going to put a trusted stream of data and I’m talking about a growing data structure, a data structure that is going to grow through time and that needs to be publicly available and trusted amongst peers, this kind of data structure just is kind of ideal.

Tim CallanYeah. I also found it interesting what you threw out in the beginning which is that there is a patent on this.

Jason SorokoYeah. I don’t know a lot about that.

Tim CallanSo, I don’t know if that patent is just not being enforced because you would think that you wouldn’t build massive entire strategies like things on CT logs on something where there is some kind of potential patent liability.

Jason SorokoU.S. Patent 4309569 Ralph Merkle – method of providing digital signatures and, yeah. So, I think like a lot of very fundamental patented work, sometimes patents are given to open source essentially. This is one where I didn’t come with that information. I don’t know exactly what Ralph Merkle did, but I do know that this is used all over the place. And it seems to be a publicly available piece of intellectual property. Maybe next time we are talking about this topic we can bring it up. But there it is.

Tim CallanWell, anyway. So, that’s probably good for today. A nice, short and sweet explanation. You know, it is a thing that comes up. People talk about it; we throw around this word Merkle Tree and we thought it would be worth our while to explain what does that mean when we say that.

Jason SorokoExactly. So, stay tuned. You’ll probably hear more about Merkle Trees down the road.

Tim CallanProbably will.

Jason SorokoAnd directly related to PKI and certificate issuance. They go together like peanut butter and jelly.

Stay informed with expert insights

Subscribe to Root Causes for engaging discussions on PKI, digital security, and best practices for protecting your organization's critical assets. Don’t miss an episode!