School is in session, and the teacher is directing students to their assigned seat. Each unique name gets its own unique table. But there's an unexpected student in the class. There's another Brian! What do we do?! In this episode, we dig into how to manage these collisions in a hashtable, and how to use our collision resolution strategy to find new Brian his own desk.
This episode of the Basecs podcast is based on Vaidehi's blog post, Taking Hash Tables Off The Shelf from her basecs blog series.
[00:00:00] SY: (Music) Welcome to the basecs podcast, where we explore the basics of computer science concepts. I'm your host, Saron, founder of CodeNewbie.
[00:00:09] VJ: And I'm Vaidehi Joshi, author and developer.
[00:00:12] SY: And she is the brilliant mind behind the basecs blog series. Today we're talking about...
[00:00:17] VJ: Hash tables.
[00:00:18] SY: This season of the Basecs podcast is brought to you by our wonderful friends at Twilio. If you haven't checked out their API yet, you totally should. With Twilio, your app can send text messages and make phone calls with just five lines of code. And for a lot of developers, Twilio is the first API they've ever used. And that first time you type a few lines of code, run your program and see your phone light up with the text message? It's kind of magical. And as a thank you for being a podcast listener, you get a promo code for twenty dollars in free Twilio credit. Just text your name to 480-485-4321. You'll also get a link to a quick start guide for your favorite programming language. So text 480-485-4321. Link to that is in your show notes.
[00:01:06] DigitalOcean provides the easiest cloud platform to deploy, manage and scale applications of any size. They remove infrastructure friction and provide predictability so you can spend more time building what you love. Try DigitalOcean for free by going to do.co/codenewbie and get $100 of infrastructure credit. Link is in your show notes. Ok. Let's get started.
[00:01:31] SY: So last episode we kicked off the season—brand new season, season four with a new topic: hash tables. Can we do a quick recap of what a hash table is?
[00:01:43] VJ: Sure. So a hash table is a funny kind of data structure (Laughing) in that it has two kinds of very important but different pieces to it. You have the array, which is where the data itself is stored. It's the thing that holds all the data in the hash table. And then you have the hash function, which is what decides where the data lives. And it's the function that kind of does the decision-making of how do I put things into this data structure.
[00:02:15] SY: Ok, so it's kind of like when you're in grade school and, you know, you, you line up outside the classroom and then you go up to the teacher, and then the teacher gives you like an assigned seat. And the teacher says, "ok, you're at this desk. You're at this desk." Like the function is kind of like the teacher with the assignments and tells you where to go. And the line is like the array.
[00:02:37] VJ: Yeah, it's like you need to have the students in the classroom otherwise it's an empty classroom. And then like what's the point?
[00:02:43] SY: That's true.
[00:02:43] VJ: But then you also need the teacher to, you know, make the big decisions and like decide who goes where as you mentioned and, you know, kinda keep everything in order and, you know, regulate it and strict. (Laughing)
[00:02:54] SY: Yes, and we can only have one student per desk, right? Which is kind of an important related part of this hash table situation.
[00:03:03] VJ: Yes. Yeah. You, you only want one student at every desk—or so we think. Maybe I'll, maybe I'll complicate things (Laughing) in this episode. Never know with me. And I guess you can—if you wanna use the technical term for what that would be in the context of a hash table, we're talking about hash buckets. And those are like the sloths if you think about like in an array or the desks. It's basically the location for where our data or children, students live.
[00:03:32] SY: I feel like we're always talking about kids, which is really funny because we don't have any.
[00:03:36] VJ: Yeah. (Laughing)
[00:03:36] SY: We like always end up talking about children.
[00:03:38] VJ: Or bacon.
[00:03:39] SY: I don't know why. Maybe it's my biological clock ticking. It's my mom getting to my head. That's what it is.
[00:03:43] VJ: Oh, I'm just hungry.
[00:03:44] SY: Ok.
[00:03:44] VJ: That's why I always bring up food, (Laughing) but yeah. We record these at dinnertime. I don't know why.
[00:03:49] SY: Why is that related to children? Oh my goodness. Ok, so ideally, we'd have each child at their own desk. So we'd have each piece of data in its own hash bucket ideally. And last episode I know we talked about this idea of even distribution and how that's, that's pretty cool. That's pretty important. Can you remind me why the even distribution is a big deal?
[00:04:19] VJ: So we'll get into it even more today, and I think, you know, it will kind of show itself a bit more, but the, the short version of the story is that if your hash table and the more specifically the data that it holds, if it's not evenly distributed, then it kind of defeats the purpose of a hash table because a hash table's great in that you can easily find any data by, you know, passing a key or passing a value to a hash function and being like, "just find it for me quickly." But the way that the hash function works and the way that it can do it quickly is because it's even distributed. And you'll start to see a bit more of that in this episode today.
[00:04:57 SY: Ok, I can, I can live with that. I'm, I'm excited for what we're going to talk about this episode which is the scenario when we have two students—ok, so let's say we have this, you know, this group of kids. They're outside the classroom. They are waiting for their seat assignment. The teacher says—the teacher's very smart, so the teacher says, you know, it's going to be based on your name. So Alex will be here. Matt will be there. You know, that kinda thing. And it's alphabetical 'cause she's cool. (Laughing) And then—that's what the cool kids do. And then what happens is one of the students is names Brian, which is fine. Brian goes to, you know, Brian's assigned seat. And then there's another Brian.
[00:05:43] VJ: Oh my gosh.
[00:05:44] SY: Oh my gosh. Right? What do you, what do you do? Because according to the rule we had so far, they should kind of go to the same, the same table, but we talked about how even distribution is not, not a good thing. So what do we do now?
[00:06:01] VJ: Yeah, so in this example, I guess we are—to kind of set the scene a bit more—we're, we're like operating in a classroom that has only twenty-six tables or desks, right? And every single child—or every single letter of the alphabet maps to a certain desk. And inevitably, there is going to come a time that you're going to have two Brian's. (Laughing) Or, you know...
[00:06:23] SY: Darn Brians.
[00:06:24] VJ: Yeah. Those Brians.
[00:06:26] SY: So much trouble.
[00:06:27] VJ: Ruining everything. (Laughing) Well you're gonna—I guess the point is that at some point with some set of data, no matter what it is, if you have a hash table with a limited number of desks, with a limited number of hash buckets, if you will, then you're gonna have multiple pieces of data that will end up in the same slot, in the same hash bucket.
[00:06:46] SY: Right.
[00:06:47] VJ: And multiple children sitting at the same desk. That's something that we don't need to panic, 'cause if you're having some internal anxiety, don't. (Laughing) It's ok. Because this...
[00:06:57] SY: I'm glad that you know that I am having this anxiety. (Laughing) That's—you, you know me very well 'cause I was like, "oh my God." I mean have you, have you ever seen, you know, two kids at the same desk? It's a terrible idea.
[00:07:07] VJ: It...
[00:07:07] SY: It's a terrible outcome...
[00:07:07] VJ: Yeah.
[00:07:08] SY: ...so I'm, I'm glad that this is normal and we have a way to hopefully resolve it.
[00:07:13] VJ: Yeah, well the, the teacher, this, this hash function teacher, is pretty smart. This is, this is something that...
[00:07:18] SY: Ok.
[00:07:18] VJ: ...she's, you know, she's well, well versed in education. (Laughing)
[00:07:27] SY: She's really great with seat assignments, and that's really the primary skill required for the job in case you didn't know.
[00:07:35] VJ: She's, she's—this is not her first rodeo, you know? She's, she's seen these situations occur before. And the situation we're talking about is of course a collision, and that occurs whenever a hash table's hashing function ends up with the same index, or the same key for two different pieces of data. And that's ok because collisions are normal. They're totally normal, and we don't have to panic. And they actually are pretty common. And the way that we can deal with it is through some form of collision resolution.
[00:08:07] SY: Oh.
[00:08:08] VJ: Yeah. It sounds real fancy.
[00:08:10] SY: Collision resolution.
[00:08:11] VJ: So (Laughing) I mean I guess imagine that you were a teacher like...
[00:08:16] SY: Ok. I can already see it.
[00:08:18] VJ: Yeah. Chaos. I can see it, too. (Laughing) There's just like glue and markers everywhere.
[00:08:25] SY: Just everywhere.
[00:08:28] VJ: And popsicle sticks. (Laughing) Anyway, so imagine that you're a teacher. Like I guess fundamentally like if you have two kids that are like—you're, you're somehow—you just noticed that they're both sitting at the same desk. And they're like...
[00:08:38] SY: Yeah.
[00:08:39] VJ: ...fighting. And you're like "ok, what can I do?" And there's like an empty desk next door.
[00:08:43] SY: Oh, that's convenient.
[00:08:44] VJ: If it was me, I would just probably be like, "can you just sit at the next desk?" You know?
[00:08:48] SY: Yeah, just move over.
[00:08:49] VJ: I, I realized I asked you what you would do, but I just answered the question.
[00:08:52] SY: That's... (Laughing)
[00:08:54] VJ: You were...
[00:08:54] SY: I think it would be strange if I was like, "you can look at the desk, but you cannot sit there. You, you, you resolve these problems, ok?"
[00:09:03] VJ: Pick up the glue, Brian. (Laughing) No desk for you.
[00:09:08] SY: No desk for you. Not until those popsicles are done. (Laughing) That's how popsicles work, right?
[00:09:14] VJ: Yeah.
[00:09:14] SY: Ok, so this, this next, this open desk—like that's what we look for. This open desk, does it have to be next, next-door? Is that important?
[00:09:23] VJ: I mean I, I think it's like a, it's a pretty easy thing. It's like oh, this one's filled. Just go to the next available spot.
[00:09:30] SY: Right. Right.
[00:09:30] VJ: And I guess this is kind of me like me leading us into a specific kind of technique 'cause I really like the student in the classroom metaphor, but basically if you imagine that that was the situation and that we are not dealing with students and a teacher, but rather with a hash table and its hashing function, that actually is a way to resolve it. And it is a really simple way. It's like...
[00:09:51] SY: Yeah.
[00:09:51] VJ: I don't need to do anything fancy. There's already something at this, at this bucket. There's already Brian in this array, in this slot.
[00:09:59] SY: He's already here.
[00:10:00] VJ: So I'm not gonna put it here so I'll just go to the next spot that's available.
[00:10:02] SY: Right.
[00:10:02] VJ: And like that is one way to resolve it. I mean, it works sometimes, you know? And actually the name for this is something—it's a kind of rehashing where you hashed it and it didn't work out. So you're like oh, let me rehash it and go next door and stick it in there.
[00:10:17] SY: Right.
[00:10:17] VJ: And the term for this is linear probing.
[00:10:21] SY: Why?
[00:10:23] VJ: Well (Laughing) that's a good question.
[00:10:25] SY: That feels like not the name I would come up with for, for move it over to the next one, you know? Like maybe I would call it move it over to the next one, but like linear...
[00:10:35] VJ: Move it over to the next one. That's a mouthful. This strategy.
[00:10:40] SY: Move over. The move over strategy. But linear probing is definitely not what I would guess that that was. Why is it, why is it called that?
[00:10:49] VJ: The reason that it's called that is because it's kind of—you can, can kind of imagine that it's probing. It's kind of like sticking its head from into one...
[00:10:57] SY: Oh.
[00:10:57] VJ: ...one bucket into the next bucket into the next bucket. And it's like is there space here? (Laughing) Is there space here? Is there space here?
[00:11:02] SY: Oh, and you're doing it linearly.
[00:11:04] VJ: Yep. Exactly.
[00:11:05] SY: Oh.
[00:11:06] VJ: And actually that's a...
[00:11:07] SY: Ok.
[00:11:07] VJ: ...that's a great, that's a great question because the probing actually makes more sense if you think about like let's go back to our classroom and now there's a Brian and there's the other Brian and there's a Brandon and there's a Bradley and there's a...
[00:11:19] SY: Oh, man.
[00:11:20] VJ: ...a, a Bristopher. And... (Laughing)
[00:11:23] SY: Wait, is that a real name?
[00:11:24] VJ: No, I just...
[00:11:24] SY: I'm gonna feel bad if that's a real—you just made that up, right? Ok.
[00:11:24] VJ: No, I just can't come up with that many "B" names, so now I'm taking other names and adding B's to them.
[00:11:31] SY: I love that by the way you didn't just stick with "B" names, you stuck with "Br" names. (Laughing) Like everything you came up with is like "Br."
[00:11:39] VJ: Maybe that's the problem. I really—once I, once you got me on a roll I was like I literally will not think of any other names now.
[00:11:44] SY: Bristopher. (Laughing) Oh my goodness. That's a—if that is not a name it needs to be. That's an amazing name. Bristopher. Wow.
[00:11:54] VJ: Anyways, so eventually you, you've got all these B's, and you're just like—Bristopher comes along and you're like ok, well now I'm just... Imagine you have enough B's to fill up...
[00:12:02] SY: Yeah.
[00:12:02] VJ: ...all the desks, for example, or almost all the desks.
[00:12:05] SY: Wow.
[00:12:06] VJ: You could effectively probe your way through all the desks or the hash buckets in the hash table. If you were a hash function, you would—if you were probing linearly, you would just kind of go from one bucket to the next until you find an empty bucket.
[00:12:21] SY: Right.
[00:12:21] VJ: And then you just put the item there. And imagine that like you've gotten—you started in the middle of the room, and you've gone all the way to one end. And now you can kind of just loop back around and go back to the front of the room, which is exactly what the linear probing strategy would do. It's what a hash function would do if it was using that strategy. It would probe through the table until it finds an empty space. And if it needs to, it can kind of just loop back through the table.
[00:12:47] SY: But what happens if all the seats are, are taken? Is there, is there an option of doubling up?
[00:12:55] VJ: There is an option of doubling up, but let's kind of pretend that that option doesn't exist for a second.
[00:13:00] SY: Ok. Cool.
[00:13:00] VJ: Let's imagine that it's like, like I guess the option of doubling up depends on like how your desks are arranged. But imagine that that's not even an option yet.
[00:13:09] SY: Ok.
[00:13:09] VJ: There's a bigger problem, which is that I mentioned earlier that we wanna like evenly distribute our data, our students...
[00:13:17] SY: Yes.
[00:13:17] VJ: ...in this situation. If you suddenly have like all of the B's kind of taking up the front of the room—if you suddenly have all of your data—a certain amount of input data like in a few hash buckets in one section, now your hash table, now your room isn't evenly distributed, right?
[00:13:37] SY: Uh-oh. Yeah.
[00:13:37] VJ: 'Cause you can kind of imagine like oh my gosh all the second row is just filled with B's and now it's gonna be really, really difficult to like pull one piece of data out of the hash table.
[00:13:50] SY: Ok.
[00:13:51] VJ: And that's actually something called clustering. When all of your data in a, in a hash table is kind of—the majority of it is in a few buckets. That's something that's called a clustered table, and it's not good. It's, it's actually a symptom of a poorly designed hash table and hashing function.
[00:14:07] SY: Oh, no.
[00:14:08] VJ: On the other hand, a well-designed hash table would not be clustered. It would have its data spread across hash buckets. And so like if they—if we were talking about the teacher metaphor, a very clever teacher, very clever hash function would known oh, I need to space out all the students, especially Brians and Bristophers because we can't have them all sitting together.
[00:14:29] SY: No. No.
[00:14:29] VJ: Yeah.
[00:14:29] SY: They're trouble.
[00:14:30] VJ: Of course You don't wanna cluster them.
[00:14:31] SY: They'll just eat all the popsicle sticks. (Laughing)
[00:14:39] VJ: But anyways, yes. Clustering's bad.
[00:14:42] SY: Ok, so just to make sure I understand: even if there's only one B child at a desk, if all the children are on one row, let's say there's like ten rows in the classroom—dang, that's a really big classroom. Maybe not ten. Maybe like five. Maybe (Laughing) like this is too much. Five rows, and all of them end up on the second row. Even though each table is fine, the fact that one row is overly occupied and not spaced out, that's where the clustering issue comes in.
[00:15:17] VJ: Yeah. I guess if we're talking about it in the sense of buckets being filled and the buckets are the tables in this situation, if all the students are at one end of the room kind of just taking up all the space next to each other. That's, that's the thing you want to avoid, and that's the clustering, really. And remem-, and remember that this is a situation where you can only have one student per desk
[00:15:41] SY: Right, right. Ok, so we talked about how the downside of linear probing is clustering, which is no good. We don't want that. Are there any other downsides? Anything else wrong with this strategy?
[00:15:55] I think the tendency for clustering is the main one, but another thing is it could be kind of, you know, slow depending on what your hash table looks like if you are kind of probing through every bucket. And imagine that you are in a situation where you're already clustered, and now you're like, "oh, I'm clustered. Oh, I need to go to the next one. Oh, I need to go to the next one. Oh, I need to go to the next one." And now you're just kind of like going through the table for not any really good reason.
[00:16:21] SY: Yeah. (Laughing) Aw.
[00:16:24] VJ: Yes.
[00:16:24] SY: That sounds really inefficient.
[00:16:25] VJ: It, it's not. It, it, it's not like the best thing I would think.
[00:16:29] SY: Yeah.
[00:16:29] VJ: It probably depends—it's very hard to always, you know, give a blanket...
[00:16:32] SY: Right.
[00:16:32] VJ: ...statement about a strategy because it depends obviously on what your data looks like and how your—what your hash table looks like and how many buckets you have. And also if you're not doing linear probing that often, then it's not like the worst thing. It isn't...
[00:16:45] SY: Yeah.
[00:16:45] VJ: ...solution. And you could use it maybe in certain situations. But as you can kind of tell from our, you know, Bristopher example—I just wanna say Bristopher a lot. (Laughing)
[00:16:56] SY: You should. (Laughing) Make it a thing.
[00:17:00] VJ: Twenty years from now there's going to be so many people named Bristopher.
[00:17:04] SY: All 'cause of this episode. (Laughing)
[00:17:07] SY: But I guess the point really is that it is "a" solution but as you can see, it very often might not be the best solution.
[00:17:17] SY: Yeah, it's interesting 'cause when you first proposed it before we talked about how it was called linear probing when you said, "oh, there's two Brian's just, you know, go to the one next to it. It seems like a, like a great solution. It's like yeah, just go to the next table, but if the next table's full, and the next table's full, and the next table's full that can be quite a long time. So it seems like a, a simple operation, you know, asking the, the child, the, the piece of data to just move over. Seems very, you know, just very simple like not a complicated crazy algorithm or anything. But I can see that—well if you don't really have that many empty hash buckets anyway then, you know, that could take a while.
[00:17:55] VJ: Yeah, and I guess you did bring up a really good point, which is that computationally; it's not that hard.
[00:18:00] SY: Right. Yeah.
[00:18:00] VJ: It’s pretty lightning fast in a way because you basically can be like, "oh, if I'm at index zero, oh I need to linear probe. I just need to do index plus one. So I can...
[00:18:09] SY: Right. Yeah.
[00:18:09] VJ: ...just be like "go to index one. Oh, this is filled. Ok, now index plus one. Now I'm at two.
[00:18:13] SY: Yeah.
[00:18:14] VJ: And that's not, that's not hard. It's actually pretty fast.
[00:18:15] SY: Individually.
[00:18:16] VJ: Yeah.
[00:18:17] SY: Right, right. Yeah.
[00:18:17] VJ: But it could be slower than you realized depending on what your table and your function, how they behave.
[00:18:24] SY: Interesting. Ok, so considering that there are some pretty significant downsides to linear probing and potential downsides depending on the data that we're dealing with, I assume we have a different option; something else we can do to solve our Brian problem? (Laughing)
[00:18:42] VJ: Oh, you've—you're, you're getting so used to me providing very, very handy solutions. (Laughing) You're like, "oh, obviously, Vaidehi, you have a better solution, right?"
[00:18:49] SY: Solve our problems. Please, go. Give me options. I will tell you which one I like. (Laughing)
[00:18:56] VJ: Yeah, what if I was just like linear probing. That's it. Sorry. You're stuck with this, Saron.
[00:19:00] SY: No.
[00:19:01] VJ: Don't worry. I won't.
[00:19:02] SY: Bristoph!
[00:19:04] VJ: I love that. Oh my gosh. Oh my gosh. And then for like his short name would be like "Bris." And then when he's in trouble, he'll be like, he'll be like "Bristopher Brian Bradley Brad."
[00:19:17] SY: Buttocks. (Laughing)
[00:19:21] VJ: All right.
[00:19:23] SY: Oh, I just wanted to say buttocks. That's all. That's all.
[00:19:27] VJ: Any credibility that I had for naming children is now gone. So...
[00:19:30] SY: It's just gone. You're welcome.
[00:19:33] VJ: Didn't you ask me a question… right?
[00:19:35] SY: Yeah, actually...
[00:19:36] VJ: Oh, right. (Laughing)
[00:19:37] SY: Yes. I said so linear programming with all of its, all of its, it sounds like pretty significant downsides. I assume there is something else we can do.
[00:19:46] VJ: Oh, yes. There is, there is, there is. There is actually a whole other strategy that is going to solve some of the problems that we ran into. It's gonna present some of its own problems.
[00:19:59] SY: Oh. Oh, man. It's never that simple. Computer science is never that simple. Ok, so I guess we'll talk about that on next week on our next (Music) episode. But for today, that's the end of the show. If you liked what you heard, leave us a review, and make sure to check out Vaidehi's blog post. Link to that is in your show notes. Also wanna give a huge shout out to Twilio for sponsoring the season. Make sure to check out their awesome API. To get your twenty dollars in free Twilio credit, text your name to 480-485-4321. Phone number and link are in your show notes. Digital Ocean is the easiest way to deploy, manage and scale your application. Everything about it was built with simplicity at the forefront. Setting, deploying, even billing. Their support is amazing. They've got hundreds of detailed documentation and tutorials. So if it's your first time deploying an app, they've got great tools and community to make is nice and easy. Try DigitalOcean for free by going to do.co/codenewbie and get $100 of infrastructure credit. Link is in your show notes. Vaidehi, you wanna say good-bye?
[00:21:04] VJ: Bye, everyone.
[00:21:05] SY: Thanks for listening. See you next week.
Thank you for supporting the show!