In this episode, we get into what a compiler is and does. In short, a compiler is a program that reads our code (or any code, in any programming language), and translates it into another language. You'll want to listen in to find out just how it does this!
This episode of the Basecs podcast is based on Vaidehi's blog post, Reading Code Right, With Some Help From The Lexer from her basecs blog series.
[00:00:02] SY: Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.
[00:00:09] VJ: And I’m Vaidehi Joshi, Author and Developer.
[00:00:12] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today, we’re talking about…
[00:00:16] VJ: Compilers and Lexing.
[00:00:19] SY: This season of the Base.cs Podcast is brought to you by DigitalOcean and Heroku. DigitalOcean’s cloud infrastructure is optimized to make it super intuitive and easy to build faster and scale easier. The pricing is predictable. They have flexible configurations and excellent customer support. Get started on DigitalOcean for free with the free $100 credit at DO.co/basecs. That’s DO.co/basecs.
[00:00:47] Did you know you can build, run, and operate applications entirely from the cloud? With Heroku’s intuitive platform, you can streamline development using the most popular open source languages, so you don’t have to worry about backend logistics. Also, you’re not walk-in to the service. So why not start building your apps today with Heroku?
[00:01:12] SY: We haven’t really talked much about compilers, right?
[00:01:15] VJ: No, we haven’t talked too much about them. In the last episode, we sort of were introduced to them.
[00:01:20] SY: Yeah.
[00:01:20] VJ: That sort of happened by way of parse trees. I was just like, “Oh, yeah, compilers,” and I never like defined it. I just stuck it in there. I was like, “We’re going to talk about this tree. Oh, by the way, compilers, don’t worry about that.”
[00:01:34] SY: And I feel like compiler is just one of those things that we’re supposed to know. You know what I mean? Like it’s just like as a developer like compiler, you know, those things.
[00:01:42] VJ: I don’t know how you’re supposed to know that.
[00:01:44] SY: Really? It doesn’t feel like one of those like basic computer science, like binary, like you’re just supposed to like know what binary is, like maybe you don’t know how to count in binary, but like you’ve heard it before.
[00:01:53] VJ: Yeah. Yeah.
[00:01:54] SY: You know what I mean? It’s like a term.
[00:01:55] VJ: I feel the same way, like people expect you to know it, but it’s the same thing with binary. It’s like how are you supposed to know what it is.
[00:02:02] SY: What is it actually?
[00:02:03] VJ: So like the same thing with compilers. Like, “Okay, I’m supposed to know it, but like, what is it? I just don’t use it and I don’t know what I’m talking about.”
[00:02:11] SY: Exactly. Well, no worries. We’re going to solve all of your problems by defining what a compiler is. So Vaidehi, what is a compiler?
[00:02:21] VJ: In the simplest terms, we can say that a compiler is just a program that reads our code. And when I say our code, it could be any code. So like any programming language. Your machine needs to have some way of understanding it and so every programming language will have some way of understanding what you type and basically a compiler is that program that understands a language and translates it into another language, specifically a machine-friendly language, not a programmer-friendly language.
[00:02:58] SY: Okay. Right, because when we’re coding and we’re using our programming language, I mean it’s made obviously for the machine because the machine has to understand it, but it’s also kind of written for us, so that we can understand what we write and how to write it.
[00:03:11] VJ: Yeah. It’s more for us in a way because like we are the ones who are writing it and then understanding it when we write it, but then because machines are so fast, we can compile it down into machine readable code and then the machine can do its job way faster, and that’s good for the machine. It’s like the machine-friendly language and then for us we use the programming language, which is hopefully more human friendly.
[00:03:33] SY: So what is a compiler I guess made of? What is it? What are its parts and pieces?
[00:03:41] VJ: Sugar and spice and everything nice. Well, I guess a good way of answering this question is by thinking about what a compiler does, and that sort of answers the question of like what it consists of. And a compiler has two really distinct phases to it, and obviously like this could change slightly depending on what programming language you’re using, but every programming language when it’s compiled has to do these two main steps. And these two steps are syntax analysis and lexical analysis. And spoiler alert, we actually already learned about one of these because I snuck it in.
[00:04:21] SY: Oh nice! Which one did we already learn?
[00:04:26] VJ: So in the last episode, we talked about what we were introduced to parsing and the concept of the parser and building parse trees using the syntax of a language and actually that is one of the steps of the compiler that’s called syntax analysis, and that’s actually the second step. So I like introduced us to the second step first. And today, we’re going to talk about the first step that happens before you do all of the parsing and the parser builds the parse tree. So that’s lexical analysis. That’s the first step in the process.
[00:04:59] SY: So what happens in lexical analysis?
[00:05:02] VJ: So this is the phase where the compiler basically doesn’t know or really care about the grammar of a sentence or even the meaning of like what the sentence is. And when I say a sentence, I'm thinking it could be like any kind of text or like program. But in the last episode, we talked about parse trees and diagramming sentences. I’m sort of relating that to code. So for continuing with that same metaphor, the first step isn’t thinking about the grammar or the meaning of it. It’s literally just figuring out the words. It’s sort of like breaking up a bunch of compressed things into individual pieces.
[00:05:45] SY: Okay. Can we do maybe an example of that so I can get a feel for what we’re talking about?
[00:05:50] VJ: Yeah, let’s do one.
[00:05:51] SY: Are you in the mood for some Shakespeare?
[00:05:53] VJ: I am always in the mood for some Shakespeare. Of course you are.
[00:05:59] SY: Okay. So let’s do this one. Let’s do, “To sleep, perchance to dream.”
[00:06:06] VJ: Nice. Hamlet.
[00:06:08] SY: If we were to take that and we were going to do some lexical analysis and we’re going to give it to our compiler and say, “Hey, compiler lexilize this.” Technical term, correct term.
[00:06:20] VJ: Lexicate.
[00:06:21] SY: Lexicate.
[00:06:23] VJ: Not a word.
[00:06:24] SY: Probably not. What would it do?
[00:06:28] VJ: Well, we could kind of imagine that if you had this phrase and you sort of passed it from the program into the compiler, you could even imagine that all the white space would get stripped out. So really what would happen is it would be a big bunch of text that to our compiler would read, “To sleep, perchance to dream.” So like no spaces, nothing.
[00:06:49] SY: “To sleep, perchance to dream.”
[00:06:50] VJ: Yeah. Just like all smooshed together.
[00:06:51] SY: They’re smooshed into other thing.
[00:06:55] VJ: Just the way Shakespeare wanted us to read it.
[00:07:00] SY: Okay. We don’t want that.
[00:07:02] VJ: Exactly. Well, yeah, we don’t want that. We have to make sense of it.
[00:07:05] SY: We don’t want to end with that. Yeah.
[00:07:07] VJ: Yeah. And so what we need to do in the first step of the compilation process in the step of lexical analysis, we have to do two things to sort of fix this. First, we need to scan this text, which we could say like, “It’s like a line of code.” In this case, it’s sort of a sentence, but it’s going to be the same thing with code. So the first thing we have to do is sort of like scan through it and then we need to evaluate what it is and sort of break it apart. And so there’s actually a little friend to help us out here. It’s part of the compiler. It’s called the lexer, which matches this phase, the lexical analysis phase. Also, sometimes it’s called the tokenizer. Both of those terms could show up. But really they’re doing the same thing. And the thing that they’re doing is they are scanning and evaluating whatever input is given to the compiler, and that’s the process of lexing or tokenization.
[00:08:05] SY: So it takes in this phrase that’s been squished into, basically it looks like just one long word and it takes it and it scans it. And then what is it scanning for? What is it scanning to? What is this scanning step about?
[00:08:20] VJ: Great question. So that first step of scanning, which another thing to note is this is just like a little caveat. Sometimes you could have a separate program that just does the scanning. So you can have the lexer/tokenizer little friend program and sometimes there’s another program called the scanner that literally just does the scanning. And all it’s really doing when it’s scanning is it’s taking in the source text or that source code one character at a time, and it is literally just like reading it one character at a time. That is the process of scanning it. And what’s interesting is that to the scanner, it doesn’t really matter how big the file is because it’s really only ever seeing one character at a time.
[00:09:05] SY: Oh, interesting.
[00:09:06] VJ: And it’s kind of cool because it’s looking at this whole chunk of text, the source code, the source text, one character at a time, for each character that it sees, for each character in our sentence, for example, our scanner looks at the line and the column for where that character is located. So it’s sort of like very systematically going do, do, do, do to each character and each character, it’s like, “Okay. I am at column one, line one. This is the character I see.” Then it goes to the next one.
[00:09:43] SY: Like records it?
[00:09:44] VJ: Yeah. Yeah. It’s sort of like recording it and like…
[00:09:46] SY: Taking notes.
[00:09:47] VJ: Yeah. It’s taking notes. And at the end of the day when the scanner’s done with its very systematic task, what it ends up with is you can kind of think of it as like the nerdiest Excel Spreadsheet where it has like…
[00:10:01] SY: I’m so excited. Yes.
[00:10:04] VJ: I know. The scanner’s really going to resonate with your personality. I feel like it is like one with you. But the scanner basically for every single character that it reads, it marks the blind and the position of where that character was found and it will effectively map each character to the line and column of where it lives. And it doesn’t just do that for characters. It will also read like new lines and punctuation and like it’s important to remember that in code, like all of those things are part of the text. So like you might strip out white space, but if there are spaces, if there are commas, if you have like the end of a file, all of those things are recorded as like keystrokes, right? So that gets preserved in the source text. So the scanner will be like, “Okay, here’s a comma. Okay, here’s a space. Okay, here is a new line.” So now we’ve moved from line one, from line zero to line one for example. And so it’s sort of like creating this giant, amazing mapping of literally every single character in your source text and where it lives breaking it down to a very granular level.
[00:11:18] SY: Okay. So once we have our beautiful spreadsheet full of all the columns and rows and all these recordings, what do I do with them now?
[00:11:28] VJ: So now that you have all of that data recorded, effectively what you can do is you can transform this big chunk of text that we started off with into substrings. So you don’t just have one big jumbled bunch of characters. You actually are like able to distinguish like where one starts and ends and where the next one starts and ends. And now you’re like, “Okay, now I can start to see individual words inside of this.” And these words, they have a name called “Lexemes”.
[00:11:58] SY: Oh. Okay.
[00:11:59] VJ: So those are the individual substrings that you get once you’re done scanning.
[00:12:03] SY: Okay. So we’ve scanned, we have our lexemes. I think you said that the other step here is evaluating. Are we about to evaluate now or do we have some more stuff to do?
[00:12:14] VJ: No, we’re about ready because we’ve broken up our big source text into individual pieces. So we’ve done the scanning and now we have words, and the next thing to do is basically read the words and figure out what they are, what they mean.
[00:12:31] SY: Cool. So how do we do that?
[00:12:33] VJ: So the evaluating step is basically our compiler looking at each word and figuring out what kind of word it is so that it can turn it into something called a token. And a token is a fancy word for like a special symbol, but it doesn’t necessarily have to be special. It just contains two things. A token has a token name and a value. So we might think like, “Oh, a string is not special, but it is a kind of token,” because when a compiler looks at a bunch of characters, it needs to know, like, “Is this a number character? Is this a string character? Is this like a multiplication sign?” And all of these different lexemes need to be evaluated to whatever token they’re equivalent to. So you could imagine in the English language, a comma has a different meaning than the letter A, because that’s actually a string, right? It means two different things. So each of those would sort of be different tokens in the English language. And similarly, when you’re tokenizing any kind of programming language, the tokens you’re dealing with, the way that you read the lexemes, the separate words are going to be different. They’re going to be based on whatever programming language you’re compiling, what you’re interpreting, what you’re making sense of. And so the tokenization process is really how we’re giving more context to these lexemes. This is where the lexer takes over. So if you had a scanner taking care of the scanning, the lexer is going to take over and sort of determine, “Oh, okay, this lexeme, this word is this token,” and it’s going to sort of classify everything and that’s doing the work of evaluating itself.
[00:14:18] SY: Okay. So we have our lexeme, now we’re tokenizing a lexeme. Now are we changing a lexeme into a token or what is the relationship between those two?
[00:14:31] VJ: Yeah, sort of. We get like a basic lexeme. Let’s say you have a string, the lexer is going to look at that and be like, “I need to classify this further. This is a string. I’m going to sort of convert this from just like a bunch of characters, and I’m going to represent it as a token.” And it has a name, which is string, and it has a value, which is like maybe the string perchance. And so it’s like sort of taking a really simple, little piece of data and turning it into a key value pair. And that’s what that token is. It’s adding more context to it. So if you imagine like if we did it for the English language, if we said, “To sleep, [comma] perchance to dream,” we would look at that comma and be like, “Oh, this is actually a punctuation token.” And then we’d create this new little object that had a key or like a name called comma or punctuation probably actually, and then its value would be the comma. And so like if you had the same thing with like perchance, we would say, “Oh, this is a token of type string and its value is perchance.” So we’re basically giving more context to each one of these lexemes and that’s going to be really helpful down the road.
[00:15:43] SY: Okay. So let’s walk through this process with some actual code this time. Can we do that?
[00:15:47] VJ: Yeah, let’s do it. Can we stick with Shakespeare?
[00:15:50] SY: Yeah, of course. Of course. So let’s take this idea of sleeping and dreaming, and let’s do var to sleep equals to dream. So the string “to dream”. And just to be clear, we’re treating the quote to sleep, perchance to dream just as a regular sentence. We’re not actually interpreting the meaning of the words. So our use of equals is just to simplify things because to take that perchance idea and like codify it would involve randomization and a couple of other things that we just don’t need to get into right now. So if we were to go through this process, first, we would scan everything and we kind of jumbled it up into one unit. So instead of the one, two, three, the five parts right there is var to sleep equal to dream. So we have those five parts. It’s just one big old part. Then we’re going to categorize them, put them in our awesome, nerdy spreadsheet where it says what line, what column each character belongs to, and then from there we divide them into individual substrings also called lexemes. And then now we’re ready to move on to our next part, which is evaluating. And for this evaluating step, we’re going to tokenize them, which means we’re going to look at this and we’re going to give it a token name and a value. And so for example, if we take our substring var and we were going to take that lexeme and tokenize it, we would say var is, it’s not a string, it is a keyword.
[00:17:27] VJ: Yup.
[00:17:27] SY: So we’ll say, “Var, you are a keyword and your value is var.”
[00:17:32] VJ: Exactly.
[00:17:33] SY: And then I might say to sleep, and to sleep is an identifier. So to sleep, your identifier, your value is to sleep. And then we have equal, which is our operator, to dream, which is our string literal. What would the colon be at the end of that?
[00:18:26] SY: Cool. What do we decide this is called, lexilize, lexirate?
[00:18:33] VJ: Lexical analysis.
[00:18:37] SY: We lexically analyze our little piece of code. So there we go.
[00:18:43] VJ: Yeah. And what’s kind of cool is like to sort of tie it back to the previous episode and to tie these two phases of compiling together, we’ll recall from last week’s episode we talked about parse trees and we did this example of like two times eight and we were like, “Okay, we’ll break it up and two is an expression and times is an operator and eight is an expression.” But we just happen to know how to do that because we’re humans and we already can parse and scan things in our mind. But for our machines and for the compiler of a language, how would it know like, “Oh, times is an operator, it’s not a string”? How does it know that? The reason it knows is because there was a step that came before it, and that step was the scanning and lexing step and the tokenizer and lexer actually figured it out first. It figured out that times as a special symbol, it figured out that it’s a token of type operator and it took all that information, it figured out that information after scanning it, and that information was already taken care of and given to the parser. So when it created its parse tree, it actually had more context. The scanner had the least amount of context. The scanner, it’s at the front of the battle lines. It’s like, “Whoa, I don’t know. I’ve got to make sense of this.” And then it’s like, “All right, pass it along to the lexer.” And lexer is like, “All right, I can sit down with this and sort of break it up and categorize everything.” And then the lexer passes along to the parser and the parser is like, “Oh, great. Everybody’s already done the hard work of scanning. They’ve given me context. I just got to build this tree.” So that’s how all of that sort of fits together and we started it with this parse tree thing, but actually now we took a step back and saw the things that had to happen before we could even build the parse tree.
[00:20:39] SY: Yeah. So many steps. But I love how all these little pieces are working together to compile our code.
[00:20:45] VJ: And what’s really cool is that as you learn more about compilers, you start to see that actually it’s an algorithm to understand and to write out like what you need to do for every single character that you see when you’re scanning, and the parse tree is just a data structure and it’s kind of cool because like when we started this whole podcast back in Season 1, we were like, “Oh, data structures and algorithms.” Like, “What do you do with them?” And now we’re like, “Oh, we’re tying them together, all of it together for compilers.” And the compilers are like the prerequisite to be able to type any code because that’s how our machine makes sense of it. We’re basically tying up all the loose ends in this final season, which is really great for me because I’m a big end tire upper. It makes me feel good to have like a ribbon on the top.
[00:21:38] SY: End tire upper. Yeah, ribbons are fun. Ribbons are awesome. And that’s the end of today’s show. If you liked what you heard, please leave us a review and make sure to check out Vaidehi’s blog post. Link to that is in your show notes. I also want to give a huge shout-out to DigitalOcean and Heroku. DigitalOcean is a simple developer-friendly cloud platform, which makes managing and scaling apps easy with an intuitive API, multiple storage options, integrated firewalls, load balancers, and more. There’s also a robust community that provides over 2,000 tutorials to help you stay up to date with the latest open source software, languages, and frameworks. Get started on DigitalOcean for free with a free $100 credit at DO.co/basecs. That’s DO.co/basecs. There’s a reason over nine million apps have been created and ran on Heroku’s cloud service. They not only manage over two million data stores, but they make over 175 add-on services available to you. So whether you’re creating a personal app, a free app, or an enterprise app, start building today with Heroku. This episode was edited and mixed by Levi Sharpe.
[00:22:51] VJ: Bye everyone.
[00:22:51] SY: Thanks for listening. See you next week.
Thank you for supporting the show!