CodeNewbie

Description

In this episode we continue our talk on pies and tries, and how this data structure is used to power such things as auto-complete!

Show Notes

This episode of the Basecs podcast is based on Vaidehi's blog post, Trying to Understand Tries from her basecs blog series.

Transcript

[00:00:02] SY: (Music) Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.

[00:00:09] VJ: And I’m Vaidehi Joshi, Author and Developer.

[00:00:11] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today are continuing our conversation on…

[00:00:17] VJ: Trie.

[00:00:19] SY: This season of the Base.cs Podcast is brought to you by Square. If you’re designing a website or building a mobile app, you’re probably going to want to take payments at some point. Square APIs allow you to easily implement their payment form on your website and with their In-App Payments SDK, you can add it to your mobile app too. You don’t have to worry about dealing with PCI compliance because they’ll take care of that for you. Don’t let anything come between you and your money. Start building with Square over squareup.com/go/codenewbie. Okay, let’s do a quick recap. What is a try?

[00:00:57] VJ: So a trie is a data structure that kind of looks like a tree so we can say it’s tree-like.

[00:01:05] SY: Tree-esque.

[00:01:06] VJ: Oh, yeah. I like that, tree-esque. It’s a treesque. Yeah, I won’t say that word again. I don’t know what that was.

[00:01:16] SY: That worked. That totally worked, yes.

[00:01:19] VJ: It’s tree-like but actually, it’s very different from trees because it has some more rules that it has to follow in terms of how it is created and used. Probably the biggest difference in terms of what makes it different from a tree, for example, a binary search tree is it is really used to store letters of the alphabet. So we use tries to contain strings and substrings, which is why the nodes in a trie usually correspond to the letters of some alphabet.

[00:01:52] SY: And I remember there are just a lot of nodes involved, right, because you had a node and then had 26 nodes that may or may not actually be used and then each one of those, the 26 nodes, may not be... it may seem very node-y.

[00:02:03] VJ: Oh, yeah, very node-y and there are some arrays, too, which is kind of interesting because really, it’s like a data structure that like within the node, there’s also an array basically. Like a node is an array with references to other nodes that are also arrays and those references could be pointing somewhere. And if they point somewhere, that means there’s a node that exists there. If they don’t point anywhere, basically, they’re empty. It’s like a combination of a few different things in order to create a trie data structure.

[00:02:33] SY: Yeah. Last week we talked about how to create a trie and how to add to a trie.
How do we delete from a trie? We haven’t talked about that yet.

[00:02:45] VJ: Deleting from a trie, there are really only two steps, but it’s important to highlight what step one is and what step two is and why you have to do it in that order, and we’ll see why in a second, but maybe should we start with an example and then try to remove something from it?

[00:03:00] SY: Sure. Yeah.

[00:03:01] VJ: So let’s say that we have a trie where the root node is empty and usually, it’s an empty string, which we talked about last episode. It has two keys within it. Yes. It has keys for the string pie as in P-I-E, and it has a key for the string pies, P-I-E-S and because they’re keys, at the end of the last letter of each of these keys, there’s a value. We’ve also talked about this in the last episode. So in the case of pie singular, let’s say that, the key pie has a value of 5, which means at the node E, there’s a value of 5. Let’s say for the plural P-I-E-S, multiple pies, there’s a value of 12. Now if we wanted to remove a key value pair, so if we basically wanted to say, “You know what? We don’t want one of these keys. We don’t want multiple pies. We just want one pie.” I don’t know why you would do that in real life, but you know, it’s all right, we’re dealing with tries so that’s fine.

[00:04:07] SY: We’re trying.

[00:04:09] VJ: Nice. So let’s say we want to delete the key pies, P-I-E-S, but we want to leave the key pie, P-I-E, singular. So if we wanted to remove that key, what we have to do is find the node that contains the value for that key first.

[00:04:25] SY: Okay.

[00:04:25] VJ: So do you know which one that would be?

[00:04:27] SY: That’d be the S.

[00:04:28] VJ: Yeah, totally.

[00:04:28] SY: Okay.

[00:04:29] VJ: I was just…

[00:04:29] SY: I was like, “Please don’t let [inaudible 00:04:30].” Because if it’s not S, I have no idea.

[00:04:34] VJ: No, it’s S. You’re totally right. Once we find the node that contains the value for that key, in our case pies, once we find the node that contains the value 12, which is S, the first thing we’re going to do is we’re going to set the value to null. So we’re going to remove the value. Notice we’re not doing anything else with the node. All we’re doing is we’re basically just saying, “I’m going to take the key that exists in this trie and I’m going to clear out its value.”

[00:05:02] SY: Yeah, so node 12.

[00:05:04] VJ: Exactly. The node S still exists.

[00:05:07] SY: Yes.

[00:05:07] VJ: But it has no value. So now what we’ve done is we’ve removed the key that exists in the trie but we’ve kept the node. The second step, and this is the part where maybe it seems weird, but hopefully, it makes sense if you think about it, we’re going to look at the node whose value we just removed, in this case S. It’s no longer a key but it’s still in the structure and we’re going to check that node and we’re going to look at all of its references, which basically means we’re going to look at its array and see if there are any references to other nodes. So if there are any pointers in that array that are pointing elsewhere, we need to check what those are. If they are all null, if they’re all empty, which basically means if there are no children coming from this node, if after the node S, there are no other letters then we can basically remove the letter S. There’s no value to it. There’s literally no value.

[00:06:04] SY: There’s no point.

[00:06:05] VJ: And it’s not adding anything.

[00:06:07] SY: It’s not doing anything. It’s just taking up space for no gosh-darn reason.

[00:06:10] VJ: Yeah, typical S. Typical. Those are actually the two steps really. Those are the two things. We just find the node, remove any value it had and then check its references. I have a proposition.

[00:06:26] SY: Okay.

[00:06:26] VJ: So what if instead of deleting the key pies, say we don’t even want to mess with that key, say instead we’re like, “Okay, let’s delete the key pie, singular, P-I-E.”

[00:06:38] SY: Okay. So we’re going to do step one, which is to go to the key, so go to E and take the value for that, which is 5, and we’re going to make it null. We’re going to reset it so it’s going to be null. Now that it’s null, we are going to check to see if it has any references of its own and it does because we also have pies, plural. So because it has a reference of its own that goes to the S node, then we cannot delete it because it would be messing us up and deleting our other key. We just leave it as null and I think that’s it.

[00:07:17] VJ: Yeah, exactly. And that was why I was kind of like, “Oh, it might seem weird but there’s a reason.” And the reason is when we reset our tree and we were like, “Okay, instead of deleting pies, let’s delete pie.” Well, we’re only trying to delete one key and we’re not trying to do anything with the other one. So if we deleted the node E after setting its value to null, then we would lose that whole key and you’d have an orphan node, and you also wouldn’t have the key pies anymore. It’s tricky the first time maybe because you have to remember to check if it has any children but then once you remember that check, I think deleting is pretty straightforward.

[00:07:54] SY: Okay. So now that we know how to delete from a trie, what else? What else should we know about them?

[00:07:59] VJ: Well, I think that there are some similarities between tries and another familiar old friend, hash tables.

[00:08:08] SY: Yeah, this come up a couple times I feel like.

[00:08:10] VJ: Yeah, I think after covering and sorting algorithms, they came up and I might have even mentioned it in the last episode, but they seem to be similar in a few different ways. The first obvious thing is they both use arrays and we’ll remember from previous discussions on hash tables they basically have a set size and you have a little hashing function that when you give it some sort of input, it tells you where to stick it, where it should live in the array. Between the hashing function and the array, you can always know where to find things and where to put them and where to retrieve them later. So tries are similar in that they use arrays because every single node has an array, right? The array is 26 slots long and each of them has potentially a pointer somewhere or maybe it’s null and you can also retrieve data from it.

In our case, what we’ve been retrieving is strings and subsets of strings, and those have been the keys and usually, there’s a value. So there’s like a similar like “If this value exists, give it to me,” and it’s like a hit. If it doesn’t exist, well, there’s nothing there. It’s a miss. So there’s like a little bit of similarity, but there are some differences too.

[00:09:23] SY: Okay. So what are the differences?

[00:09:27] VJ: Even though they both use arrays, hash tables deal with hash collisions by using linked lists. When you get to a slot, a bucket in a hash table, if you have multiple instances of something or multiple things that live in that bucket, you can deal with it by basically having a pointer to a linked list of items. That’s one of the ways. Tries don’t have a linked list. Tries have pointers, right?

[00:09:54] SY: Yes.

[00:09:54] VJ: And we’ve seen that one node can have a pointer to another node and that’s how the trie is structured, but these two differences, the use of linked lists and hash tables and the use of pointers in tries show us what’s the pro and the con between the two. As I mentioned, a hash table has a hash function, but a trie doesn’t need a hash function, so that’s kind of nice. Not that hash functions are bad. It’s just that you don’t need one because every key in a trie is represented with string, right? And so every key is unique so you don’t need to worry about collisions because if you have a key, it’s a string then you will have a value and you’re not going to have multiple instances of that string. You just have one key. There will always be a value for it if it exists in the trie.

[00:10:43] SY: That’s really convenient.

[00:10:44] VJ: Yeah, and you don’t need a hash function to tell you where to look because you can just traverse down the trie structure and find if that node has a value and that means you have a key. If it doesn’t have a value, there’s no key. However, there is one annoying thing with tries and that has to do with these pointers, which basically, I guess what I’m trying to say is there’s a lot of them.

[00:11:06] SY: Okay, because I was thinking about that because it feels like you start off with the root node, which has nothing and then it’s empty, no value, and then it has like 26 pointers and then maybe one of them is full and then that one has like 26 more. It just feels like there’s a lot going on.

[00:11:23] VJ: Yes.

[00:11:23] SY: With no actual data in the stuff that’s going on.

[00:11:28] VJ: I will say one thing which is there’s actually a constant number of pointers because each time we add a node, which basically means each time we add a letter into the trie, there are only ever 26 references because there are 26 letters in the alphabet. And it’s not going to change in the context of our trie. If we have five words in the trie or 5,000, the number of pointers actually doesn’t grow. That’s a constant value, but I do agree that the number of pointers that could potentially be empty is kind of annoying and I will challenge our listeners to think about ways to optimize tries, which I won’t get into, which I don’t actually learn about even until recently but there are ways to improve it. I would just say that they’re constant so it’s actually not as terrible as maybe we think. It’s not great, but it’s not like horrific either because they’re not growing as you grow the trie. It’s always just 26 pointers.

That is a downside, but here’s an upside. When you create a trie, the amount of work you’re actually doing, the hardest part I guess is when you’re adding words into the trie. And if you have an empty trie, well, you’re always going to have to add a lot of nodes, a lot of letters. But if you’ve already created a trie and a lot of words exist in it, a lot of keys are already inserted, then the bulk of the work is already done, which is kind of nice because now, say, you’ve got a hundred different keys in the trie and you want to add something and maybe a substring of the word you want to add already exists, well, now you can just work your way down the trie and say, “Oh, I have this letter, I have this letter already. I have this other letter. Maybe I just need to add one more node.” So if you’ve already created the trie…

[00:13:24] SY: You’ve made the investment.

[00:13:25] VJ: Yeah, exactly. Yeah, you’ve made an upfront investment and you do less work as the trie grows in size because I guess the intermediate nodes, the branches of the trie have already been built up. So to add a value isn’t necessarily a really hard thing to do.

[00:13:41] SY: How do tries perform? Because so far they sound, actually, I can’t really tell. The fact that there can only be 26 sounds like it’s going to be a good thing, but then there are still a lot of empty nodes that are hanging out. I’m actually not even sure how to guess it. How do tries perform in terms of a Big O notation?

[00:14:01] VJ: When it comes to… let’s talk about creating the trie because I kind of just was talking about how that’s the upfront investment. That’s kind of the harder part and you kind of see that reflected in the Big O notation for the run time complexity of creating this trie structure and that’s dependent on two things. First, it’s dependent on how many words, how many keys the trie could contain. It also depends on how long they are. If you have shorter words and not too many of them, well, it won’t be so hard to create that trie, but if you have long words and a ton of them, well, now you’re inserting a lot more nodes, you’re taking up more space. It’s going to take longer. In general, it’s just not as efficient. So basically the terms that we use to describe this is M where m is the longest word in the trie and N where N is the total number of words and the Big O notation, the complexity of creating a trie structure is OM times N. So M times N is basically the longest word and the total number of words. That’s basically the worst-case scenario, right?

[00:15:14] SY: Yeah. Okay, remind me again. Is that good or is that bad?

[00:15:19] VJ: It depends on what M and N are.

[00:15:22] SY: True, true, yeah.

[00:15:24] VJ: We usually talk about it in terms of worst case so if the longest word is a long word and then you have a ton of words, that’s going to be a higher number, but if the longest word is actually not that long and you don’t have too many words, well, now that’s a little bit more efficient. But the way we talk about it in terms of Big O is the worst case and so the worst case is O is M times N.

[00:15:47] SY: So the worst case could be really bad.

[00:15:50] VJ: Potentially but again, this is also for just creating the trie.

[00:15:54] SY: True, true, true.

[00:15:55] VJ: Maybe you could create a very large trie ahead of time or on a separate thread or in the background or something like that. I guess it depends on if you’re going to have this trie in memory, if it’s used for something, maybe it’s okay if it’s big and you’re not blocking other operations while you’re creating it. Maybe potentially you’re working with a structure that you’ve built a little bit at a time over time and you’re not creating this massive structure from scratch.
So maybe that’s not that bad because, really, when we’re talking about inserting things in a trie and actually, it’s the same complexity for searching and deleting but inserting something into a trie, the amount of time that that takes really just depends on the length of the word and the total number of words.

So if you are inserting something and the length of the word is short, that’s going to take less time and the total number of words that you have is also not that big, that’s going to be a faster run time. But if it’s longer words that you’re searching for or you’re trying to delete, well, now you’re traversing basically based on the longest word that you have. And if the length of the word that you’re searching for is the longest word and that’s the one you’re trying to delete, you’re obviously going to have to do more work.

[00:17:12] SY: So if we wanted to insert like Supercalifragilisticexpialidocious, that would be a problem.

[00:17:20] VJ: Yeah, that would take more especially if you didn’t have any substring of that.

[00:17:24] SY: Yeah. Well, they’re super. That’s good. That’s probably the only one, though.

[00:17:30] VJ: Yes, yeah. It depends basically on the length of the word that you’re inserting and the total number of words too.

[00:17:37] SY: So where do we see tries in real life? Where can I find them?

[00:17:41] VJ: Oh, you can actually see them in a lot of places, but usually, they’re used in conjunction with other things.

[00:17:48] SY: Okay.

[00:17:49] VJ: So basically, you’ll see them used in combination with another structure or in the context of an algorithm or something, but tries are pretty cool because they’re used to power things like autocomplete, which actually kind of makes sense because if you’re looking for a substring of another string…

[00:18:07] SY: You’re like traversing down.

[00:18:09] VJ: Yeah, and you’re potentially finding a value sooner or maybe you know the potential values that could come down that branch. If somebody was using auto complete and we used a trie to power it and someone typed in P-I, and we know P-I-E is also a potential thing down the path or P-I-E-S maybe, that’s also a potential thing like a potential hit down the path.

[00:18:33] SY: That makes a lot of sense.

[00:18:34] VJ: And again, that depends on the number of words, the number of keys you have in the trie. So if you have a bigger trie, well, now you can predict way more things. The other kind of neat feature you can see tries used in under the hood is for spell checkers and matching algorithms. You can guess how that works because if you have a word and it’s spelled incorrectly then basically you can say, “Oh, here’s a word, find it in the trie. If it doesn’t exist in the trie, well, it’s not correct. Here’s a word, if it exists in the trie, oh, it’s spelled correctly. It’s fine. It exists as a key.”

[00:19:12] SY: These tries are pretty cool.

[00:19:14] VJ: I know. They’re really fun.

[00:19:16] SY: Yeah, I like that they deal with the alphabet because it makes it so much easier to relate to the real world. I feel like so many times, you’re talking about integers and well, integers.

[00:19:27] VJ: It’s just numbers.

[00:19:29] SY: It’s just numbers. You feel like, “All right, so, how do I use this for my web app?” You know what I mean? How do I bring it up with this one? I like this. This might be my favorite one.

[00:19:39] VJ: Wait, your favorite structure ever?

[00:19:42] SY: I think so, yeah.

[00:19:44] VJ: Wow! It took six seasons and it happened. I’ve been waiting for you to tell me what your favorite structure is.

[00:19:51] SY: This is the one.

[00:19:52] VJ: That’s cool.

[00:19:53] SY: This is the one I like. It’s a little tricky with the whole nodes on nodes, the whole inception of nodes, but besides that part, yeah, I think this is my favorite one. Yeah. Well, wait, what’s your favorite one? Which is your favorite data structure? Do you have one?

[00:20:05] VJ: Oh, no. I do, I do but that’s because I like to go in circles and so I like graphs.

[00:20:15] SY: Okay, true. You’re right. I’m a fan of the graph situation for sure.

[00:20:20] VJ: But tries are cool. Tries are really cool.

[00:20:22] SY: Tries are cool.

[00:20:22] VJ: I like the substring matching thing and I’m like, “Hey, you know what? This little data structure, it’s got a purpose. It’s got a value. It can do stuff and it does it really well.”

[00:20:32] SY: It could do stuff. Yeah. That’s right.

[00:20:37] VJ: It was the little trie that could.

[00:20:40] SY: Oh, my god, that’s so cute. That might have to be the episode title. I love that. The little trie that could. And that’s the end of today’s show. If you like what you heard, please leave us a review and make sure to check out Vaidehi’s blog post. The link to that is in your show notes. Also, want to give a huge shout-out to Square. Square has APIs and SDKs to make taking payments easy, whether you’re building a mobile app in iOS, Android, Flutter or React Native. If you want to embed a checkout experience into your website, it’s super easy. Square has processed billions of transactions so you know it’s tried and true. Start building with Square over at squareup.com/go/codenewbie. This episode was edited and mixed by Levi Sharpe.

[00:21:25] VJ: Bye everyone.

[00:21:26] SY: Thanks for listening. See you next week.

Season 6 Episode 6 June 05, 2019

Dear tries, you (auto)complete me

Description

Show Notes

Transcript