Description

In last episode, we talked about 2-3 trees, where the nodes of every tree contain data in the form of keys, as well as potential child nodes, and can contain more than one key. This takes us to b-trees, which is a generalized version of the 2-3 tree, and are super efficient for storing data in an indexed database, like MySQL.

Show Notes

This episode of the Basecs podcast is based on Vaidehi's blog post, Busying Oneself With B-Trees from her basecs blog series.

Transcript

[00:00:02] SY: (Music) Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.

[00:00:08] VJ: And I’m Vaidehi Joshi, Author and Developer.

[00:00:11] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today we’re talking about…

[00:00:15] VJ: B-trees.

[00:00:16] SY: This season of the Base.cs Podcast is brought to you by Square. If you’re designing a website or building a mobile app, you’re probably going to want to take payments at some point. Square APIs allow you to easily implement their payment form on your website. And with their In-App Payments SDK, you can add it to your mobile app too. You don’t have to worry about dealing with PCI compliance because they’ll take care of that for you. Don’t let anything come between you and your money. Start building with Square over at squareup.com/go/codenewbie.

[00:00:54] SY: All right, what is a B-tree?

[00:00:57] VJ: So a B-tree is a variation of a structure we’ve kind of already seen before. They’re self-balancing trees, that’s something we’re pretty familiar with, but they’re a self-balancing tree that is a variation on a binary search tree and it allows for the tree structure to have nodes that have more than two child nodes.

[00:01:20] SY: Okay. That sounds a little bit like the 2–3 tree, right, from last week?

[00:01:24] VJ: It is. A 2–3 tree is kind of like a B-tree. B-trees are like a more generic version. And when we talked about 2–3 trees last episode, what we were looking at was like a subset of a B-tree. A B-tree is basically based on that and we can actually like apply some of the 2–3 tree characteristics to a B-tree and we'll get into that later in the episode. But yeah, they’re very, very much connected because they’re both self-balancing. They’re both a variation on binary search trees and they both have a little bit more flexibility when it comes to how many children nodes you can have.

[00:02:01] SY: I like that because I feel like a lot of our tree investigation, exploration has been finding rules that are super, super strict and kind of hard to maintain and to keep up with and they’ve been really restricting and it’s nice to see that this is going to be maybe a little bit more open-minded, maybe a little bit more flexible. So I’m excited about B-trees.

[00:02:20] VJ: I’m glad you’re excited because they’re flexible but then like there’s a lot of things happening too.

[00:02:25] SY: Okay. Cool. Okay. So what is the story behind B-trees? When were they made? Why were they made? That kind of thing.

[00:02:33] VJ: They date back to the year 1971 and there are these two researchers who are working in Seattle at a place called the Boeing Research Labs, which I guess maybe it’s just like that still exists, I guess, because Boeing is still a thing.

[00:02:46] SY: I would say Boeing like the flight situation?

[00:02:49] VJ: Yeah, the 747. I don’t know what. I don’t know.

[00:02:57] SY: Okay, cool, Research Labs. Yeah.

[00:02:58] VJ: Yeah. So there are these two researchers who are working there, Rudolf Bayer and Edward McCreight and they’re working on a paper there because they were basically trying to figure out how to create a data structure that really, really made it easy to read and write to large portions of data at a time, and that’s basically the Genesis of the B-tree. They wrote a paper about this data structure and they introduced this idea of the B-tree, which was optimized for this task, for reading and writing to large pieces of data. And the main thesis of the paper, which is also the main idea behind B-trees which is why it’s important brand up, is that you could use the branching factor of a tree in order to figure out how many possible children one node could have and how many keys it could contain, and this is not a new idea, right? Last episode we talked about 2–3 trees. We learned that you can have more than one key in a node and more than one reference to a child. Actually, you could have like two keys and three children. That’s the whole idea with a 2–3 tree. So this paper is very much a generalization of that idea and 2–3 trees were invented one year before B-trees. They were invented in 1970 by a computer scientist named John Hopcroft. And then a year later, B-trees were created and they were just a generalized version of the same data structure.

[00:04:20] SY: Okay. So this idea of having multiple children is not new to us because we talked about last week, but you used a term that I don’t think we talked about, the branching factor. What is that?

[00:04:31] VJ: We haven’t talked about this term specifically, but it’s not new to us also. The branching factor basically represents the bound of how many children one node can have. So when I say the bound, what I mean is the minimum number of children and the maximum number of children, the range of possible children that can come from one node in a tree. So since a B-tree is a generalization of a 2–3 tree, in a 2–3 tree we know that every node can either have two or three children or if it’s a leaf node, it has no children. But we know that that’s the bound, 2 to 3. That’s it. If you’re trying to generalize that, you need a way of talking about what that lower bound is and what that upper bound is. So that’s what the branching factor is and a lot of the times when we’re learning about B-trees, you might see the branching factor referred to as the variable B, like capital B.

[00:05:24] SY: Okay.

[00:05:25] VJ: But don’t get freaked out because it’s just a variable. Sometimes you’ll also see it referred to as the variable N and that’s just like an abstract variable name so that you can represent what that branching factor is, the maximum and minimum number of children per node in the tree.

[00:05:39] SY: Is there a formula associated with the branching? Like how do we use the branching factor to actually give us the upper and lower bounds?

[00:05:46] VJ: There is a formula. It’s math time.

[00:05:48] SY: Yes. Let’s go.

[00:05:49] VJ: The formula basically is in kind of two parts because there’s two things we have to consider. We have to consider the number of children that a node can have and there’s a formula for that and we have to consider the number of keys the node can have. There’s a separate formula for that. I think the one for the children is a little bit easier, so kind of reconcile. So we’ll start with that. The number of children that you can have per node in a B-tree can be described in this formula. B, the branching factor, is less than or equal to X, where X is the number of children per node is less than two times B. So remember I talked about the bounds, the lower bound and upper bound. So that’s really all this is talking about. I know there’s like symbols somewhere.

[00:06:38] SY: Okay.

[00:06:39] VJ: To Bs and Bs. All it’s really saying is the lower bound has to be less than or equal to this and the upper bound has to be less than this other thing. So B is less than or equal to X is less than to B.

[00:06:55] SY: Okay. So let’s do an example. So let’s say we have a branching factor of 2, so B equals 2. So if I plug that into my formula, I have B is less than or equal to X, which is less than 2 times 2, which is 4. So 2 is less than or equal to X is less than 4. So I guess that means my X can be either to up until just under four and I don’t think we can do like 3.5 children, right?

[00:07:27] VJ: Right. You have to have full nodes.

[00:07:29] SY: Yeah, the integers. Okay. Cool. Full nodes, not half nodes. Okay. So basically I can either have two or three children.

[00:07:36] VJ: Exactly. And by the way, when you said the branching factor of 2, when you were like, “Let’s make B equal to 2.” You’re actually using the branching factor for a 2–3 tree because in a 2–3 tree B is equal to 2.

[00:07:51] SY: Okay.

[00:07:51] VJ: So you basically use this formula to prove the rule we learned in last week’s episode, which is that in a 2–3 tree you can only have two or three children per node and when you think about the fact that a 2–3 tree’s B, it’s bound, its branching factor is equal to 2, now the formula, the math makes sense. It all sort of adds up.

[00:08:14] SY: Okay. That’s not too bad. You mentioned that there is a formula for children, but you also said there’s a formula for keys?

[00:08:20] VJ: Yes. Yes. There is a formula for keys. This is one step more complicated, but it’s not too bad. The formula for figuring out the number of keys per node in a B-tree can be described as B minus 1 is less than or equal to Y and Y is the number of keys or the data that can be contained in the node. So B minus 1 is less than or equal to Y and Y is less than 2 times B minus 1.

[00:08:50] SY: Okay. So that’s pretty similar to the children per node. But it looks like we’re just doing a minus one on either side.

[00:08:57] VJ: Exactly.

[00:08:58] SY: Okay. Oh, that reminds me of the N minus one thing we discovered last time.

[00:09:03] VJ: Yes. And this is the opposite.

[00:09:05] SY: Yeah. Yeah. Okay. So again, if we use our branching factor of 2, we have 2 minus 1, which is 1 is less than or equal to Y is less than 2 times 2 which is 4 minus 1 which is 3. So basically, we have 1 is less than or equal to 3, which means that the number of keys we can have are either one or two.

[00:09:31] VJ: Exactly. And again, this aligns with exactly what we’re talking about last week where I said, “These are the rules.” Now we know this is why these are the rules because of these formulas, and when you have a 2–3 tree, the number of keys per node corresponds directly to the branching factor, just the way that the number of children per node corresponds to the branching factor. And I said earlier on this episode, a B-tree is a generalization of a 2–3 tree. So 2–3 tree follows this formula, but also we can use this formula to figure out all the other types of B-trees that you can really have.

[00:10:09] SY: Yeah. Yeah, because we’ve abstracted out the rules for our 2–3 tree and now we can use it for anything.

[00:10:15] VJ: Exactly. So a couple of other trees that are similar to 2–3 trees are 3–5 trees.

[00:10:22] SY: Okay.

[00:10:22] VJ: That basically means you can have up to three pieces of data per node and a node can have up to five children.

[00:10:28] SY: Okay.

[00:10:29] VJ: And then you have 4–7 trees and you have 5–9 trees.

[00:10:33] SY: Neat! Okay. So every node in my B-tree has to follow these rules.

[00:10:39] VJ: Yes, but there’s one asterisk here because there’s always a little caveat. The caveat is that the root node doesn’t have to follow one of these things. It has a little bit more flexibility because it doesn’t have a lower bound when it comes to how many keys it must have. So what this means is that when you… for example, let’s say I mentioned there’s a type of tree called the 5–9 tree. So let’s say you have a 5–9 tree, its root node could have up to nine children. You can have many children, but it can possibly just contain one key. It doesn’t have to contain five keys. It doesn’t have to follow that lower bound and that’s totally okay. So the root node is like this exception where you can break that one rule in that one scenario. So there’s that one exception with this root node. The other thing which we kind of I think already inherently know, but I just want to say it explicitly is that the leaf nodes break the rule of how many children they’re going to have because as we already know leaf nodes don’t have any children. So they’re never going to have nine children because they literally don’t have children. So that’s the other thing to keep in mind is that leaves and the root have their own little idiosyncrasies when it comes to this branching factor and the rules that you have to follow with a B-tree.

[00:12:02] SY: So we talked about what we can do with trees before, we talked about inserting, deleting, kind of shifting things around. What kind of stuff can we do with a B-tree?

[00:12:12] VJ: You can do a lot of like manipulation. Just like you said, there’s inserting, deleting, and searching for things. But with a B-tree, we actually already have a lot of the tools for dealing with this structure. So we’ve talked about in lots of previous episodes that you can rotate keys around or rather you can rotate nodes around and their values to restructure a tree. So that’s the same thing with a B-tree. The difference is that because you have multiple keys per node, so potentially like in a 2–3 tree, you could have two keys. So you just have to do this one extra step which is like you have to consider what those keys are going to do when you restructure the tree and there are some strategies. We won’t really talk about them in detail today. You can look it up in the blog post, but you can basically split the keys of a node or you can merge nodes together and propagate their keys up.

[00:13:06] SY: Whoa!

[00:13:07] VJ: Yeah. Yeah, it’s cool.

[00:13:08] SY: That sounds hardcore.

[00:13:09] VJ: Yeah. It's really interesting, to be honest like most of us probably won’t Implement a B-tree, so we don’t really have to think about it. But if you ever had to, like I guess the important thing to keep in mind is that you have two pieces of data potentially. Or maybe like in a 5–9 tree, you have five pieces of data, you need to think about like, “Okay, if I’m going to rotate something, what’s going to happen to these keys?” And that’s where the splitting of the keys and merging nodes and moving keys around comes into play.

[00:13:37] SY: So we talked about with the 2–3 tree specifically that it was around and it was important because it helps us manage large amounts of data. I assume B-trees also do the same thing if they’re an abstraction of a 2–3 tree. So how does it actually do that? How do these lower and upper bounds, these multiple nodes and keys, like how does all that translate to managing large amounts of data?

[00:14:00] VJ: So in last week’s episode, one of the things we talked about is that in a 2–3 tree you can sort of traverse the tree in order and you get all of the keys in sequential order and the fact that the sorted order of the keys is connected to your ability to sequentially travel the structure is actually the thing that makes B-trees efficient when it comes to storing data. And the reason for that is because a lot of times you use, if you’re storing the data in a database that has an index, also known as an Index Database, that index basically maps to the keys in the B-tree, which is why like being able to quickly sort through those keys and map it to some data makes it super efficient. And another really nice thing is that it’s really easy for a database to be able to read a section of a disk at a time because a lot of B-trees will have a B value, a branching factor that is equivalent to the size of one chunk of data. And so you can basically retrieve one section or a subtree of the B-tree by mapping it to part of the database. It’s kind of like a loose explanation. I’m not a database person. So like I’m sure there’s all these other intricacies, but like that’s the general idea, the idea that you can map portions of data in your database to a B-tree and then you can pull it all out and read whatever values you need to at once without having to make tons and tons of additional calls to your database. And that’s the really nice thing about like using the structure to be efficient about reading from and writing to large portions of data, which is what those two researchers were trying to do anyways.

[00:15:45] SY: Are there places or tools where we might encounter B-trees and maybe we don’t know it?

[00:15:50] VJ: Probably the place that you’ll find it the most is in databases, especially like distributed databases, large-scale databases because usually B-trees are used under the hood in some way, shape, or form. And if you ever read about database management, like I was the other day, B-trees are going to come up for sure.

[00:16:09] SY: Very cool. And that’s the end of today’s show. If you liked what you heard, please leave us a review and make sure to check out Vaidehi’s blog post. Link to that is in your show notes. Also want to give a huge shout-out to Square. Square has APIs and SDKs to make taking payments easy whether you’re building a mobile app in iOS, Android, Flutter, or React Native. If you want to embed a checkout experience into your website, it’s super easy. Square has processed billions of transactions so you know it’s tried and true. Start building with Square over at squareup.com/go/codenewbie. This episode was edited and mixed by Levi Sharpe.

[00:16:46] VJ: Bye everyone.

[00:16:47] SY: Thanks for listening. See you next week.

Thank you for supporting the show!

Thank you for supporting the show!