CodeNewbie

Description

Last week, we talked about two ways of classifying sorting algorithms: time complexity and space usage. This episode, we dig into two more! We explore how algorithms can be internal or external, and what "stability" means for a sorting algorithm. And we do it all with the help of cards, clovers, and a pair of Michaels.

Show Notes

This episode of the Basecs podcast is based on Vaidehi's blog post, Sorting Out The Basics Behind Sorting Algorithms from her basecs blog series.

Transcript

[00:00:00] SY: Okay. I’ve got a very exciting announcement. We finally have a date for Codeland, our annual conference. It’s happening July 22nd in New York City. Early bird tickets are now on sale. For just 99 bucks, you get breakfast, lunch, dinner, a coding workshop, technical talks and after-party, and of course an amazing community. Sale ends March 10th. Go to codelandconf.com for more info. Link is in your show notes.

[00:00:29] (Music) Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.

[00:00:37] VJ: And I’m Vaidehi Joshi, Author and Developer.

[00:00:39] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today we’re continuing our conversation on…

[00:00:44] VJ: Sorting Algorithms.

[00:00:46] SY: This season of the Base.cs Podcast is brought to you by our wonderful friends at DigitalOcean. DigitalOcean provides the easiest cloud platform to deploy, manage, and scale applications of any size. They remove infrastructure friction and provide predictability so you can spend more time building what you love. Try DigitalOcean for free by going to DO.co/codenewbie and get $100 in infrastructure credit. Link is in your show notes.

[00:01:15] This season is also brought to you by the fine folks at Linode. If you’ve got a personal project, a small business or a big business with lots of data, Linode offers you secure hosting for all your infrastructure needs. They are a Linux Cloud Hosting provider where you can get a new server up and running in under a minute. Plans start at one gigabytes of RAM for just five bucks a month. And with the promo code CodeNewbie2019, you can get a $20 credit. So go to linode.com/codenewbie and give it a try. Also, they’re hiring. Check out their jobs at linode.com/careers. Link is in your show notes.

[00:01:53] SY: (Music) Okay, let’s get started. So we talked about sorting algorithms last episode. We kicked off the conversation with what that was. Let’s do a quick recap. What is a sorting algorithm?

[00:02:07] VJ: Well, a sorting algorithm is some way of organizing data. And when we talk about organizing data, we’re talking about of course sorting it in some way and when we are talking about sorting things, especially in the context of computer science, you can sort anything but the things that you’re sorting, whatever types of data, whether it’s numbers or letters or objects or data structures of some type, they just have to be similar items. And a sorting algorithm lets you organize those similar items in some way.

[00:02:39] SY: Okay.

[00:02:40] VJ: You’ll remember from last episode that we kind of talked about how the way that you organize items when you sort is through some sort of property which, you know, if you and I think about it, it’s kind of obvious like maybe you are alphabetizing something or organizing numbers from smallest to biggest in some sort of order. You just have to have some sort of property that you’re organizing by because that’s how you’re going to define whether you sort something one way or another.

[00:03:05] SY: Yes. So sorting by alphabet, sorting by date created, sorting by file size, that kind of thing.

[00:03:10] VJ: Exactly.

[00:03:11] SY: Okay. Cool. So we talked about how we can classify I guess is maybe a good word, classify different sorting algorithms and last episode we talked about two, in particular two out of six, one was by time complexity. Can you remind us what that means again?

[00:03:29] VJ: Sure. Yeah. When we’re thinking about sorting things and thinking about in the context of time complexity what we’re really asking ourselves is how much time it takes to sort a dataset with respect to its input size. So if you are looking at a sorting algorithm, let’s just call it Algorithm X. When we’re talking about time complexity, we’re talking about what does this Algorithm X do with respect to a dataset of 5 items, 50 items, 50 million items, so on and so forth.

[00:03:59] SY: Okay. And then the second one we talked about was space complexity, also space usage is another way of thinking about it, can you remind me again what that was?

[00:04:09] VJ: Sure. So when we talk about space complexity what we’re talking about is memory usage. So how much space, how much memory we need, specifically extra memory in order to sort input. So if you have a sorting algorithm and you have to sort the data, but you can’t do it all in place, you maybe need a little bit of extra memory, maybe you have like a set of data that you’re going to copy over and then sort that or maybe you have some variables that you need to create to like have temporary references, those are different types of memory usage. Sometimes it can be small, sometimes if you’re copying over a whole dataset it’s a lot larger and then your sorting algorithm requires more space so space complexity is going to be different.

[00:04:54] SY: Yes. And I remember that even with in-space complexity, we can classify it further by talking about in-place versus out of place.

[00:05:02] VJ: Right. Yeah. So when we’re talking about space complexity, the words that we use to describe space complexity like if somebody was like, “Can you tell me the space complexity of this algorithm?” You would say, “Oh, yeah, sure, this algorithm has to copy over all its data,” and then it makes all of its changes on its copied version. That’s an out-of-place algorithm which kind of makes sense when you visualize it, right? Because you’re kind of like sitting in one place with a bunch of data and you’re like, “Hang on, we’re going to copy a bunch of stuff, make a whole duplicate, I’m going to move this out of place,” versus another way you could talk about an algorithm is an in-place algorithm, which operates directly on the input dataset and changes it.

[00:05:44] SY: And remind me again. I think we talked about how one was maybe safer than the other one or some of the pros and cons of them. What were they again?

[00:05:52] VJ: Yeah. So when you have an in-place algorithm, you’ll remember that we are operating directly on the inputted data. We are not creating a whole copy. We’re not transforming our data and duplicating it. We’re changing and sorting it right there. That’s where it gets dangerous. That’s what you’re kind of like thinking.

[00:06:10] SY: Right. What if it’s gone?

[00:06:11] VJ: You’re thinking about how scary it could be. Yeah, because if you’re doing it right there, if you mess up, well, or if something goes wrong, then that dataset is effectively destroyed once you transform it. The out-of-place algorithm, the one where you copy over your data, that requires more space, but it’s got this pro, it’s got this benefit where it’s a little safer because you are copying your data, you’re not operating on it. But again, you’re using more space.

[00:06:37] SY: I think my default status when I’m doing things in my own files is out of place always. I’m like, “Duplicate, then we’ll rename it or we’ll modify it or we’ll whatever,” but I just have copies all over the place. I’m just so scared of permanently replacing or changing or accidentally deleting something and never able to get it even though I know I have like that backups and Time Machine still, still out of place just feels like the safer thing to do.

[00:06:59] VJ: Yeah, but then do you ever have to like… what happens when your like hard drive fills up and it’s like, “You’re out of space”?

[00:07:06] SY: Oh, I’m always out of space. Oh, I’m constantly out of space, confidently.

[00:07:11] VJ: Because I’m just imagining you make like six copies of a file being like, “No, no, no, I need another one.”

[00:07:15] SY: Oh, this is a huge problem.

[00:07:19] VJ: So I’m just imagining you just are like, Well, filled up this machine, time to go buy a new laptop.”

[00:07:26] SY: Literally that happened like a month ago.

[00:07:29] VJ: Oh my God! That’s wonderful.

[00:07:29] SY: “Oh, I got a laptop,” because I was like, “I’m keep running out of space. I think I’m just going to get a lot more gigs. Oh, look, I’m running out of space on this one, too.” So yeah, story of my life. Okay. So now we’re going to talk about two new classifications of algorithms. The first one is called stability. What are we talking about when we talk about stability?

[00:07:55] VJ: So stability is an interesting, it’s an interesting way of classifying a sorting algorithm because it kind of depends on a certain situation arising and that situation is when you have a dataset and you’re sorting the things in the dataset, right? And you’re sorting the same way that we’ve talked about in last episode and this episode. You’ve got a sort key and you’re like, “All right, I’m going to sort this by date or time or let’s just say for the sake of simplicity, I’ll alphabetize.” What happens when you have two elements that are the same? How do you sort them? It could be fine, depending on how your algorithm handles it.

[00:08:31] SY: It sounds like it’s going to be a catastrophe. Okay. I’m at the edge of my seat right now.

[00:08:37] VJ: I know. I heard you go, “Uh-oh!” And I was like, “Oh, no, wait. Did I make it sound really scary?” Sorry.

[00:08:43] SY: I have no idea what’s going to happen.

[00:08:45] VJ: We have two fours in our dataset. What will we do? Just abandon ship. We don’t need the number for it. Just get rid of it.

[00:08:51] SY: Just forget the algorithm entirely. Just forget it. Just forget it. Well, this is CSS. It’ll be fine.

[00:09:00] VJ: Oh, man. So the fact that there are duplicates isn’t bad. In fact, if you think about most datasets, like for example if you have like users, it’s a great example and or if you have a guest on the CodeNewbie Podcast, maybe you have a repeat guest or you have people who have the same name.
[00:09:18] SY: I’ve had to rejig some code when that happens. I was like, “Oh, the system did not account for that.”

[00:09:25] VJ: Oh, did that really happen?

[00:09:26] SY: Yeah.

[00:09:27] VJ: Oh, that’s wonderful.

[00:09:28] SY: So I went through a phase where, you know, like the first few episodes, I’m like, “Oh, I’m sure everyone’s first name will be good enough,” and then, you know, I had like two Michaels or something, I’m like, “Crap! That’s not going to work.alright I got to add a last name.” And then I had to re you know, rework the system to have first and last name, and then like you said, I had a couple repeat guests. I invited people back and I was like, “Oh God, this didn’t work again. All right, I got to rejig things.” So yeah, yeah. I feel you. Stability sounds real nice.

[00:09:59] VJ: Well, so what stability is, is when you have that situation arise which as we’ve seen is pretty common. You have to think about how you’re going to sort items that are duplicates because really what we’re looking at here is a situation where many different items in a list or in a dataset could be kind of considered equal. Like if you have two Michaels, you could sort them equally if you’re talking about alphabetizing them because they’re equivalent so you could put one before the other and you still would have a sorted set, right?

[00:10:29] SY: Uh-hmm.

[00:10:30] VJ: Your set would still be sorted alphabetically whether you put one Michael first or another. So what stability is talking about and when we talk about stable sorting algorithms, we’re talking about algorithms that preserve the relative order of elements. So what that means is when an algorithm sees two items that are equal in a sense or where the property that they’re being sorted by is equal or it has an equal key, it preserves the relative order in which it sees them. Another way of saying it as the order in which it encounters those elements as it sorts them. To that end, a stable sorting algorithm is one where the output is guaranteed to preserve the same original order of two elements. So an algorithm that sorts alphabetically and sees one Michael and then sees the second Michael will always preserve that original order of the two Michael elements when it sorts them and it’s going to preserve that order. But an unstable sorting algorithm is a situation where you could have two items, like the two names Michael, that have the same key value, but there’s no guarantee that they’ll actually show up in the same order when you sort them.

[00:11:40] SY: I hate when my Michael show up in the wrong order. I really do. It’s the worst.

[00:11:48] VJ: I mean, I think another great example of this is I was just thinking about this earlier today. If you have like something like a deck of cards and you are sorting the deck of cards and you have two fives like the five of hearts and the five of diamonds, your algorithm could probably see both of them and be like, “Oh, these are equivalent. I’m going to sort them. It doesn’t matter how.” That’s a not stable algorithm because it saw maybe the five of hearts and it saw the five of diamonds but it sorts it and then the five of diamonds somehow shows up first even though when it encountered it was the opposite order. A stable algorithm will preserve that same order. So it sees the five of hearts before the five of diamonds. Even though they’re equivalent it’s going to maintain the order in which it’s sorted them.

[00:12:29] SY: Okay. I have many questions.

[00:12:31] VJ: All right.

[00:12:32] SY: Question number one, for the unstable one, if it doesn’t do it in the order in which it encountered it, then what is it doing? Because I feel like the natural normal thing to do would be, “Oh, I see five of Hearts first. I’m going to show you that one first. Oh, I see five of…” oh, crap, what are other suits? Spades?

[00:12:58] VJ: Spades.

[00:12:59] SY: Okay, spades. Is that a thing?

[00:13:00] VJ: Clubs?

[00:13:01] SY: Clubs, thank you. Five of clubs, is that how you say that? Oh, what is English?

[00:13:06] VJ: Or clovers.

[00:13:07] SY: Clovers? No. No. You’re going to set me up Vaidehi. You’re going to set me up on this podcast. That’s what you’re trying to do. Okay.

[00:13:14] VJ: I’m trying to show everyone you don’t know cards.

[00:13:18] SY: Okay. I’m going to go with clubs. Okay. Five of clubs, you know, if I encounter it second, the normal thing to do is to show it second. So I feel like for the algorithm to be unstable it has to go out of its way to be like, “Mm-mmm I’m going to mess with you. I’m going to show you the clubs first and the hearts second even though that doesn’t make any sense and it’s probably harder for me.” So like in what world or what situation would I encounter an unstable algorithm?

[00:13:48] VJ: So I won’t give too much away because we’ll probably talk about this in future seasons, but there are some algorithms that try to be like clever and rather than just see a dataset and just sort through the whole thing as we haven’t even talked about how these algorithms work, but there’s different approaches, right? So some of these algorithms that we may even actually be using under the hood in some languages like for example the one that I’m thinking of is used in some parts of JavaScript, but some algorithms will have to be more efficient so they won’t just take one dataset and sort it. They’ll maybe split it up into pieces. So they’ll like take a large dataset, divide it, partition it, and then take a large unsorted collection divided into smaller collections and then maybe divide those more and sort those and break them apart, which if you think about it just from an efficiency standpoint it might be a little bit faster. I won’t tell you how or why, but you might be able to imagine why that might be the case. But if you’re taking a dataset and you’re dividing it up and you’re sorting it in fragments…

[00:14:51] SY: Yeah.

[00:14:52] VJ: …then you don’t have really the best sense of where that little piece that you’re sorting fits into the larger dataset. So you might combine pieces back together and be like, “Here’s a sorted version,” and then the algorithm might be like, “All right, I’m going to zip up all of the data that’s sorted,” but how does it know that the Michael it saw in this section is going to come before the Michael in that action? Or how does it know that the five of clubs is going to come before the five of diamonds? I don’t even know which suits you said now, I forgot. But my point is if you try to be efficient with some algorithms you may not be able to preserve that relative order and so now your, you have an unstable algorithm.

[00:15:30] SY: Okay.

[00:15:31] VJ: Maybe that’s okay, maybe you don’t care, maybe you do care, it depends on what you’re trying to do with that dataset and what you’re sorting eventually for.

[00:15:41] SY: Okay. Another question. When we talk about it preserving the order in which it encounters it, what are we talking about? Are we talking about the way it was originally stored? Are we talking about like the, you know, the created attribute basically? Or are we talking about, you know, when it looks for the right data to collect it finds it first, like what is the encountering it part describing?

[00:16:10] VJ: I’m not entirely sure if it goes beyond anything except for when it actually goes to sort it. So if you give it just an unsorted dataset, whatever that order is in, that’s what we’re talking about as the original ordering, the original un-ordering if you can think about it like that.

[00:16:30] SY: Yeah.

[00:16:30] VJ: Like whatever way you give…

[00:16:32] SY: Gave it to it.

[00:16:33] VJ: Yeah, whatever way you gave the input to the algorithm, if it keeps that same relative ordering, that’s what makes it stable. So you could be pulling that input from who knows where. You could be saying give me all the users and that’s just according to your database, it doesn’t matter, but whatever way you give the input to the algorithm if it preserves that input when it comes to elements that are the same or that can be sort of the same, that’s what we’re talking about here.

[00:16:59] SY: Okay. So it’s kind of like when you go to jail and they take all your stuff…

[00:17:04] VJ: What?

[00:17:05] SY: No, it makes sense. I swear. I swear.

[00:17:07] VJ: I was like…

[00:17:11] SY: It totally works. Wait for it.

[00:17:12] VJ: All right. I’ll hold my breath.

[00:17:14] SY: So if they take your stuff and they put, you know, in a bag or whatever and then when you get out of jail, everything is the way that you gave it, like the way you sent it off is the way you get it back.

[00:17:26] VJ: Yup.

[00:17:27] SY: You know what I mean? So it’s like that’s stable.

[00:17:30] VJ: Yeah.

[00:17:31] SY: That’s really stable. I never been to jail just for clarification purposes. I just watch a lot of TV and it’s always in the same condition. Okay?

[00:17:41] VJ: In that same little Ziploc bag, right?

[00:17:42] SY: Same little Ziploc bag just as you left it. Stable, okay. So what I think about stable and unstable, just even the word stable makes it sound like it’s better. Is that necessarily true? Do we want stability in this context?

[00:17:58] VJ: I think it depends on whether you care about preserving relative order. If you’re like… it doesn’t matter if it’s just like… maybe it’s just numbers and you’re just like, “Tell me how many fives I have in this dataset or tell me how many things happen on July 31st in this dataset.” If two things happen on July 31st, I don’t care which one happen first or second relatively because to me they’re equivalent, then it doesn’t matter. So I think it depends on what your data is and what you want to do with it, if that makes sense.

[00:18:26] SY: That does make sense. Okay. So we talked about stability. Next we’re going to talk about internal verse external. What is that referring to?

[00:18:37] VJ: So when we talk about internal versus external, we as the name might suggest are thinking about things that can be sorted internally or externally. I know, that’s not helpful. It’s not a good definition. I’m going to amend it. Don’t worry.

[00:18:54] SY: Okay. Okay. Cool.

[00:18:55] VJ: So when we talk about things being internal and external what we’re talking about is whether or not an algorithm can sort a dataset easily in the machines main memory or whether it needs to rely on some sort of external memory. So now we’re really talking about like memory, so like RAM or like main memory, like we’re talking about that kind of memory now.

[00:19:23] SY: Yeah, like a real stuff.

[00:19:25] VJ: The real stuff. I don’t just mean some variables, man, I mean like real data. I don’t even know what that means. That’s what I’m saying.

[00:19:33] SY: I’m with you in spirit. I’m with you. My heart is in it.

[00:19:37] VJ: I’m just trying to make it seem more heavyweight, you know?

[00:19:40] SY: It’s hardcore.

[00:19:43] VJ: Anyways, back to internal and external. If we’re looking at a dataset, especially like a large dataset and that’s really when internal and external kind of comes into play because if it’s small dataset, you’re not going to need external memory, right? You probably could just handle it differently. But if it’s a large dataset, as in the real world it often is, you need to think about whether all the data that you’re trying to sort can be sorted in the main memory or if it can’t be sorted in the main memory whether the sorting needs to occur outside of main memory. So for example, that might happen like on a disk or on a tape, which I don’t really think tapes are used anymore, but this is like a relic of computers. Nobody uses tapes but back in the day. So if it could happen in main memory, that’s an internal sorting algorithm, and if it has to occur outside of main memory, if it has to use some sort of external memory like a disk, then that’s an external sorting algorithm. The way that an algorithm has to store data while it’s actually doing the work of sorting, that’s how we classify if it’s internal or external.

[00:20:52] SY: Is that a decision that we need to think about? Because when I hear about internal versus external in that way, I’m thinking, “Okay, it would need to be external if, you know, my computer is really slow or like the system I’m using is really slow,” right? And then I have to delegate it to something that’s more powerful, but with today’s machines and technology and like speed and power everywhere and all that stuff, is that really something I need to be concerned about?

[00:21:19] VJ: I would say no because I think a lot of the algorithms that we use in languages are probably a lot of them are internal because they can happen in main memory and because sorting can in main memory it’s also faster access and, you know, your data is large, but it can still all happen within your main memory. But there are forms of external sorting that do exist, like for example one of the more well-known ones is like external merge sort where there’s a bunch of data that is kind of split up and sorted of like on a disk and then it’s merged back together and put back into main memory. So it depends on like your dataset, but usually you’re not really going to have to think about it. And if you’re using an algorithm that is external you know.

[00:22:10] SY: Okay.

[00:22:10] VJ: It’s not like you stumble upon it. You’re like probably choosing it for that reason. But I think it’s important to know that that is a classification. I don’t really think it’s one that we’ll even be talking too much about, but it is a way to classify. So it’s a thing to know.

[00:22:23] SY: That exists in the world even though I may not encounter it.

[00:22:27] VJ: Yeah. And if you ever hear people talking about external or internal sorting algorithm, then you’ll know that they’re talking about in the context of memory. And it’s especially true when you’re thinking about large datasets out in the world.

[00:22:43] SY: Cool. So we talked about two more ways of classifying our sorting algorithms and how to think about them. We talked about stability in internal versus external. Next week we’ve got two more to go through we’re going to talk about recursive versus non-recursive and then comparison versus non-comparison. And that’s the end of today’s show. If you liked what you heard, leave us a review and make sure to check out Vaidehi’s blog post. The link to that is in your show notes. Also want to give a huge thank you to Linode for sponsoring the show. Try their powerful servers built for all your infrastructure needs. Getting started with a shiny new server takes less than one minute so you can get up and running in no time. Get a $20 credit by using the promo code CodeNewbie2019. Just go to linode.com/codenewbie for more info. Also want to give a huge shout-out to DigitalOcean. They are the easiest way to deploy, manage, and scale your application. Everything about it was built with simplicity at the forefront; setting, deploying, even billing. Try DigitalOcean for free by going to DO.co/codenewbie and get $100 in infrastructure credit. Link is in your show notes. Vaidehi, you want to say goodbye?

[00:23:54] VJ: Bye everyone.

[00:23:55] SY: Thanks for listening. See you next week.