CodeNewbie

Description

In this episode, join us as we adventure into the safari that is radix trees, where parent nodes eat their offspring nodes as they chomp them down and compress. Don't worry, with all of this new added space in the trie(b), they'll more efficiently keep their children's memory alive.

Show Notes

This episode of the Basecs podcast is based on Vaidehi's blog post, Compressing Radix Trees Without (Too Many) Tears from her basecs blog series.

Transcript

[00:00:02] SY: (Music) Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.

[00:00:09] VJ: And I’m Vaidehi Joshi, Author and Developer.

[00:00:12] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today we’re talking about.

[00:00:17] VJ: Radix Trees.

[00:00:19] SY: This season of the Base.cs Podcast is brought to you by Square. If you’re designing a website or building a mobile app, you’re probably going to want to take payments at some point. Square APIs allow you to easily implement their payment form on your website. And with their In-App Payments SDK, you can add it to your mobile app too. You don’t have to worry about dealing with PCI compliance because they’ll take care of that for you. Don’t let anything come between you and your money. Start building with Square over at squareup.com/go/codenewbie.

[00:00:53] SY: Okay. So radix trees are related to tries. Is that right?

[00:00:57] VJ: That’s right.

[00:00:59] SY: Okay. Let’s do a quick recap of tries then.

[00:01:01] VJ: Yeah. Well, we’ll remember that tries are a data structure that looks like a tree, but isn’t quite a tree. They are used to basically store strings and they are tree-like structures that contain nodes that have references to parts of the alphabet, usually letters, and you can store strings and words with some sort of value inside of those nodes and you can retrieve those by traversing down the trie data structure. So it’s mostly useful for storing like key value where the key is a string and the value is whatever you want it to be, but you can find the value by finding the string within the trie.

[00:01:47] SY: And we talked about how it’s used in things like autocomplete. It’s probably the most popular example, right? Because it’s all about like letters and trying to figure out the relationship between letters and how they’re connected.

[00:01:58] VJ: Yeah, and what words are contained within words, which is also kind of cool, like substrings of a larger string. Tries are pretty fun.

[00:02:06] SY: And what I remember about tries that is kind of, well, it seems like it’s a problem. It’s just that it takes up a lot of space, right? Because every time we generate a node, we have to generate like 26 other nodes and each node has its own pointers, like it seems like a lot of stuff needs to happen for every node that you initiate.

[00:02:26] VJ: Yes. So when you have a trie, part of the annoying thing that we kind of discovered last week and like just kind of accepted as a reality, which is that when you add a node to a trie exactly like you said, you have to add a bunch of other things too. So specifically every node contains references. So when you create a new node, you also add an array with 26 references and you’re not actually adding 26 nodes, you’re just adding 26 references that could point to nothing. So for example, if you have a node in a trie that has no children, you still have to initialize the array with all these pointers, but they don’t actually point anywhere. So it’s like empty. It’s a little bit of like a moot point. You’re like, “Wow! I made all this stuff,” but it’s not pointing at anything.

[00:03:14] SY: Okay. So they do take up a ton of space then.

[00:03:16] VJ: Yes, they take up a ton of space and they’re also a little silly sometimes in the fact that they’re like repetitive, they’re redundant, and actually that’s some terminology that we can use to talk about tries. I know we’re not talking about tries today, but recapping tries is a good way to segue into radix trees because tries, like your standard trie data structure has redundancy within it. So if you find yourself with a node that doesn’t have a value but only exists so that it can have another child, so that it can point to another letter in the word, that’s kind of a redundant node because you are only adding it so that you have reference to its child. It doesn’t have any value. So you’re like, “Okay, it’s not really helping me. It’s kind of just getting me to my end goal.”

[00:04:08] SY: So is there anything we can do about it taking up so much space? Can we reduce it?

[00:04:13] VJ: Yes. So when we find ourselves repeating space for nodes and edges, when we have redundant edges or nodes in a standard trie, we can actually do something nicer which is we cannot repeat ourselves. We can avoid that redundancy and compress any redundancy and repetition within that data structure. And that’s actually where our radix trees sort of fit in because radix trees are actually a kind of space optimized version of a trie. They have that kind of compression going on. It’s like an optimized version and it’s basically a compressed trie.

[00:04:56] SY: Let’s do an example because we talked a lot about redundancy. We start talking about redundancy when we talked about nodes that only have a child node, but I think it’d just be easier if we just go through an example. Can we do that?

[00:05:06] VJ: Yeah. Sure.

[00:05:06] SY: So let’s say we have an array of words. Let’s say the words are Deck, D-E-C-K, Did, D-I-D, Dog, D-O-G, Doge, D-O-G, fun word, and then Dogs, D-O-G-S. So a lot of Ds.

[00:05:24] VJ: Yes. And Os, DOs.

[00:05:27] SY: And we got a couple of DOs. That’s true. Okay. So when we talk about some of these being redundant in this example, what are we talking about?

[00:05:35] VJ: So if you imagine this as a standard trie, we’ll have a root node that’s empty, usually it’s like empty value or empty string value, and then it would have for example everything here in this list that we just described. Everything starts with the D. So maybe it has a reference to a node that lives at the index for the letter D.

[00:05:56] SY: Yes.

[00:05:56] VJ: So that’s kind of like that is really the trie itself. We have a node that’s a letter D and D doesn’t have a value, but we have to construct these words. And so this trie would look like you’d have one branch path where you would have D connected to E, which has a child C which has a child K. And that’s our first letter, Deck.

[00:06:19] SY: First word, Deck.

[00:06:20] VJ: Sorry, right. That’s our first word, Deck.

[00:06:23] SY: Yes.

[00:06:24] VJ: Which has four letters, I can spell. I promise. So yeah, so we have four nodes, four letters to comprise that first word Deck that has some value at the end of it. And so it’s a similar situation with the word Did. We have the same letter D at the beginning and then we have two more nodes, I which connects to another node D. But can you guess what that trie is going to look like for the rest of these letters? So we have Dog, Doge, and Dogs.

[00:06:55] SY: For dog, Doge, and Dogs, we start with the no D. We need to branch off again for the letter O, but the thing is that all three of these share that O, right? So if we’re doing Dog, we would add another node, G. So we have our D-O-G. But then when we want to do Doge, we already have our D and our O and our G. So we just need to add a new child node to G and we call that E.

[00:07:27] VJ: Yup.

[00:07:28] SY: And the same thing with S. We have our D-O-G. So there would be another child node coming off of G and that node is S.

[00:07:35] VJ: Exactly.

[00:07:36] SY: Yeah.

[00:07:37] VJ: This is an example of it actually not being redundant, right? Because the O is shared by all three words. The G is shared by all three words and then the D obviously is shared and then we have S branching off for Dogs and we have E branching off for Doge. So those aren’t shared, but like there’s a lot of shared value here. There’s a lot of shared nodes, they’re not just added for no reason. But if you look at the rest of this trie and we start thinking about Deck and Did, we don’t have any sharing going on and that’s like a good example of what I was mentioning earlier with redundancy, like E and C in Deck are redundant. They’re not shared. They’re a little silly and the same thing with the edges that connect D, I, and I, D. They’re not really giving us very much benefit and yet we’re initializing them for no great reason aside from the fact that we have to.

[00:08:29] SY: They’re really expensive.

[00:08:30] VJ: Yeah. They’re really expensive. That’s a good point. I mean think about how much money we’re spending on this.

[00:08:35] SY: There’s just a ton. I ain’t got that kind of money. So basically Dog, Doge, and Dogs are all cool. We like them. Deck and Did, we’re just not crazy about right now. We’re not fans. So what could we possibly do? I mean, that’s how you spell Deck and that’s how you spell Did. So what can we possibly do in that situation?

[00:08:56] VJ: This is where we can sort of do some squeezing. We can compress those nodes that were redundant and not really helping us and that will help us, first, not repeat ourselves. And also, we will not use more space than necessary and we’ll compact down this trie. So I think this is a good introduction to what exactly a compressed trie is because I’ve been using that phrase, but I didn’t specifically define it. A compressed trie is a try that’s been compacted, but it has to follow one of two rules. First, if you have a parent node inside of this trie, so any kind of parent node, which is also known as an internal node. It has to have two or more children. So in the case of Deck and Did, E is a parent, but it only has one child.

[00:09:48] SY: One child.

[00:09:48] VJ: And C is also a parent, it has only one child, and the same thing with I. I is a parent from the word Did, D-I-D. I is a parent of D, but it only has one child. So we know something is going to happen there because I just said that the rule is you have to have at least two children. You can have more, but minimum of two. In other words, a node has to have at least more than one branch path and we can already start to guess that Deck and Did do not follow that rule.

[00:10:12] SY: No, they do not.

[00:10:13] VJ: Now the second rule is that if you want to compact the trie, every node that contains a single child has to be merged together so that any only children have to be merged with their parent. So you can kind of imagine they will sit together if they’re only children. Can you kind of like guess what’s going to happen with our standard trie when we compress it?

[00:10:37] SY: Okay. Okay. Okay. Okay. So we have Deck, we have our D node, which is shared by all the words in our array. So that’s going to be still its own node.

[00:10:47] VJ: Yeah.

[00:10:48] SY: But then our E and our C and our K are all breaking that rule of needing to have two children or more. So they’re going to be squished and compressed. I guess where can they possibly go? I guess they have to go within one node.

[00:11:03] VJ: Yeah. So if you kind of think about maybe we start the end of the word, so we have D, which we’re going to leave alone because it has many children. So it doesn’t violate any rules.

[00:11:12] SY: It got babies.

[00:11:13] VJ: Many, many.

[00:11:15] SY: It was busy. Yeah. They’re just making babies. Sponsored by my mom.

[00:11:26] VJ: Oh my goodness. Anyway, all of them have only children right? But we can start with K and we can say, “Oh, K is an only child. Its parent had only one child.” So we can make the only child sit with its parent. So we can say now we have D and E and then we can merge the two nodes we’re just looking at and have C, K, and then we’ll do that sort of again because we can work our way up.

[00:11:54] SY: They’re swallowing.

[00:11:56] VJ: Yeah. Yeah.

[00:11:57] SY: One child at a time.

[00:11:57] VJ: This is like some Planet Earth nonsense where the parent eats the child. Some tigers do that. I don’t know.

[00:12:06] SY: That sounds right.

[00:12:06] VJ: Something about cubs and don’t let parents eat cubs. I’ve been watching a lot of nature documentaries recently in case you haven’t noticed.

[00:12:14] SY: I can tell. Yeah. Okay. So after all the babies have been swallowed, so our E, C, K does become like one unit, like they become one node that’s like eaten by D. But D still is its own thing and then E, C, K is all trapped in one node coming off of that D. Interesting.

[00:12:38] VJ: So we kind of compressed that path, those three nodes, right?

[00:12:40] SY: Yeah. Yeah. You know, I just realized.

[00:12:43] VJ: Yeah.

[00:12:44] SY Okay. So by making it all into one node, you know how we initiate a new node and has to have all the arrays and all these other place holders and all these pointers and crap?

[00:12:52] VJ: Yeah, one array and a bunch of pointers.

[00:12:55] SY: Yes, one array a bunch. So we don’t need those anymore. So we are by not initializing two new nodes, we don’t have to take up all that space.

[00:13:05] VJ: Yeah. All that silly empty space that wasn’t even being used, then we were kind of like, “Oh, no, I guess we have to do it, but now we don’t have to do it.”

[00:13:11] SY: Yeah. They got like this. That was the question. That’s awesome.

[00:13:15] VJ: Somehow we have made swallowing children fun.

[00:13:18] SY: Yes. The right thing to do.

[00:13:22] VJ: This is the one and only time where it’s okay, folks.

[00:13:24] SY: Yes, context is important. Okay. So that’s Deck. So the other redundant situation is Did. Can we do the same thing with Did?

[00:13:35] VJ: Yeah, we can.

[00:13:36] SY: For Did, we’re starting from the bottom. So it’s D, I, D. So starting at that last D, we’re going to say, “Do you have any children?” No. And then we’re going to ask the parent, “Do you have any children? Oh, you only have one child, D.” So it’s going to eat up D and that’s kind of it, right? After I, D becomes one node, one merged unit, there’s nothing else for us to do.

[00:13:58] VJ: Yes. Yeah. Because when we work our way back up, we’ll see, “Oh, D is the main parent here,” but it has at least two children, in this case many more and so we leave it alone, but we’ve compressed that one path that was originally two nodes. We’ve merged into one and compacted it.

[00:14:13] SY: Okay. So after compressing our Deck and our Did, now it’s not our standard trie. It’s a radix tree now, right? Once we compressed it, it’s what we call it now?

[00:14:25] VJ: Yes. We basically have compacted it and we’ve turned our standard trie into a compressed trie and that is another name for it. This is slightly confusing. You can call it a radix tree. Sometimes you will see it is also called as a radix trie, which is of course why does have to be this confusing? I don’t know.

[00:14:45] SY: Why not?

[00:14:46] VJ: And another term for it is also a compact prefix tree, but they are all…

[00:14:51] SY: Let’s go nuts.

[00:14:52] VJ: Yeah, I know. Let’s come up with seven names for one thing. All it is, is a space optimized version of what we otherwise would call a standard trie. Unlike regular tries, they basically, the edges, the pointers the references of the radix tree can hold a sequence of a string, which is basically what we did with… instead of D, E, C, K, we turned it into D, E, C, K.

[00:15:16] SY: Okay. So our final compressed trie looks like D at the beginning of everything is the node D and then off of D, we have E, C, K. And then off of D, we have I, D. And then off of D, we have our D, O.

[00:15:34] VJ: Yeah. Still got that going on.

[00:15:36] SY: Still got that going on. And then off of O, we have our G, and then off of G, we have our E for Doge, and then we also have off of G our S for Dogs.

[00:15:47] VJ: Yup.

[00:15:47] SY: Okay. So it looks like we did some awesome compressing. Is that it? Is there more to do or are we done?

[00:15:53] VJ: I want to kind of think about like if we ignore Deck and Did and let’s say we had a smaller version of this radix tree, I know I was saying earlier like we can have D and O and G, Dog, and then Doge, D, O, G, E, and then Dogs, D, O, G, S. If we just look at that small version, there’s some compression you can do there as well because there are some shared prefixes in all three of those words, right? Dog, Doge, and Dogs all have a shared prefix.

[00:16:22] SY: They have O, G.

[00:16:25] VJ: Exactly. That’s kind of like a smaller example of the original radix tree. Actually, they’re both right. The original example that we worked with when we compress things, like we kind of have a mini version, a subset of that is just those three words because you can compress D, O, and G into a single node. And the interesting thing about this is because Dog, D, O, G is a key, it’s one of the words in our original list that we started with, it has some value, right? So because we’re dealing with key value pairs, we can have D, O, G as a single node.

[00:17:07] SY: Yeah.

[00:17:07] VJ: It can have a value and then it can have a child, E for Doge to represent Doge, and that can have a value, whatever the value is for Doge, and then we can also have Dogs represented because we can still have the original parent node D, O, G and it can have its own child S, which might have its own. So that’s basically three nodes with three different words, but we would have had to use a lot more nodes to represent this if we weren’t compressing things.

[00:17:37] SY: So you’re basically saying if we had an array with just those three words: Dog, Doge, and Dogs, okay, let me count how many nodes that would be if after we compress. So we’d have D, O, G as a node and then the E coming off of D, O, G for Doge, and then the S coming off a D, O, G for Dog. So we’d only have like three nodes.

[00:17:57] VJ: Yeah. And what’s kind of cool is, imagine we haven’t compressed it if it had been standard, we would have needed five nodes including the root node basically to retrieve the value of Dog because you would have the root node, then you would have D and you have a pointer to D. Then you would have O, then you have a pointer to O and you have a pointer to G. So that’s the fourth node. Then G would point somewhere and that’s where the value would be and you’d say, Oh, okay. This is the value for the key Dog. So you actually have a lot more nodes than maybe you think just to represent one word, but because we compress things and we turned the word Dog into a shared node, which is basically a shared prefix, this is where one of the names comes from, the compact prefix tree it’s because you can create these little compact prefixes in the tree.

[00:18:52] SY: Okay.

[00:18:53] VJ: But if we turned Dog into this shared prefix, we can condense it and now if you want to access the value for the key Dog, you still have the root node. So that’s one. Then you have the second node, which contains D, O, G, that’s two, and then D, O, G points somewhere and wherever that points, that node has a value and that’s your hit, your value for the key Dog. So now instead of five nodes to access the value of the key dog, we only have three. So we eliminated two of them by merging O and G into D.

[00:19:25] SY: That sounds awesome. I really did not think we could compress more at that point. When you said Dog, Doge, and Dogs was going to be more compression, I was like, “I don’t know Vaidehi, I don’t know if we’re going to be able to do that.” But we did it. Look at us go.

[00:19:35] VJ: Yeah, because it’s those rules, right? Like the moment you see this is like an only child situation, there’s not like any branching happening. Well, do we need to have these extra nodes? Maybe not. Actually, not maybe not, definitely not. If there’s only one child, that kind of tells you, “Hey, I can kind of compact these.” What’s really cool is if you start representing complex tries as compacted versions, you start to see all the prefixes, that word share. And you’re like, “Oh my gosh. Look, all of these words share this prefix.” And I guess like if you were really into Latin or Greek, you could start to see like the roots of words, which is pretty cool because, first of all, it’s like really efficient and you look smart.

[00:20:20] SY: Yeah. I like both of those things. That’s great. And that’s the end of today’s show. If you liked what you heard, please leave us a review and make sure to check out Vaidehi’s blog post. The link to that is in your show notes. Also want to give a huge shout-out to Square. Square has APIs and SDKs to make taking payments easy whether you’re building a mobile app in iOS, Android, Flutter, or React Native. If you want to embed a checkout experience into your website, it’s super easy. Square has processed billions of transactions so you know it’s tried and true. Start building with Square over at squareup.com/go/codenewbie. This episode was edited and mixed by Levi Sharpe.

[00:21:04] VJ: Bye everyone.

[00:21:06] SY: Thanks for listening. See you next week.

Season 6 Episode 7 June 12, 2019

The cannibalistic efficiency of radix trees

Description

Show Notes

Transcript