Description

In this episode, we take our parse tree, an illustrated, pictorial version of the grammatical structure of a sentence, and we take a metaphorical broom to sweep away repetitive bits, sliming it down, and leveling it up by creating an abstract syntax tree (AST).

Show Notes

This episode of the Basecs podcast is based on Vaidehi's blog post, Leveling Up One’s Parsing Game With ASTs from her basecs blog series.

Transcript

[00:00:02] SY: Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.

[00:00:09] VJ: And I’m Vaidehi Joshi, Author and Developer.

[00:00:11] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today, we’re talking about…

[00:00:16] VJ: Abstract Syntax Trees.

[00:00:19] SY: This season of the Base.cs Podcast is brought to you by DigitalOcean and Heroku. DigitalOcean’s cloud infrastructure is optimized to make it super intuitive and easy to build faster and scale easier. The pricing is predictable. They have flexible configurations and excellent customer support. Get started on DigitalOcean for free with the free $100 credit at DO.co/basecs. That’s DO.co/basecs.

[00:00:47] Did you know you can build, run, and operate applications entirely from the cloud? With Heroku’s intuitive platform, you can streamline development using the most popular open source languages, so you don’t have to worry about backend logistics. Also, you’re not walk-in to the service. So why not start building your apps today with Heroku?

[00:01:12] SY: We’ve talked a lot about trees. Trees are a huge part of computer science, it seems. Have we talked about this one before?

[00:01:19] VJ: We haven’t, no. We’ve talked about a sort of relative to it in a previous episode, actually, the last few episodes we’ve been talking about compilers and the process of parsing, content, and turning it into a tree, and we talked about parse trees, which is sort of like the cousin of the abstract syntax tree, but we’ve never actually talked about abstract syntax trees. I’m going to call them ASTs because I promise you I will mess it up. So I want to call it an AST.

[00:01:49] SY: Abstract Syntax Tree.

[00:01:51] VJ: Yeah. That’s a mouthful.

[00:01:52] SY: That’s a mouthful. Yeah. Okay. But they feel familiar. They sound familiar. Why is that?

[00:01:58] VJ: So you might’ve heard of an AST before because it’s like something that gets thrown around in computer science every once in a while. And actually like if you work in open source or if you work with certain languages, you might have seen people like use the term, “Oh, yeah, it generates an AST,” and you may have not known what that is.

[00:02:18] SY: Okay. So what is it?

[00:02:20] VJ: So an AST, abstract syntax tree, which is also sometimes just called syntax tree, which is a pro tip. It’s really like a slimmed down version, a condensed version of the parse tree. And we know about parse trees. We learned about them in the last few episodes. And I think just in case we needed a reminder, a parse tree is just a tree data structure that contains the grammatical structure of our code. It has like all the syntactic information that appears in the sentence of a line of code basically and it’s literally what we’re doing is we’re taking the grammar of the programming language and we’re using that to build out a tree structure and that’s a parse tree. An abstract syntax tree, an AST is a condensed version of that.

[00:03:08] SY: So when we talk about the parse tree, I remember that being very pictorial, right? It was a nice little diagram. We got to see where all of our numbers went. If the sentence we’re deconstructing as a mathematical expression where all of the letters went, like all those things, like that’s what we’re talking about when we talk about a parse tree, that picture that we built.

[00:03:30] VJ: We also called it a concrete syntax tree, right?

[00:03:33] SY: Yes, I remember that.

[00:03:33] VJ: Because it literally showed us the concrete syntax of the text. That’s basically what you were just describing, where you were saying like, “Oh yeah, it showed us like the parenthesis,” and it literally was like, “Here’s an expression.” The expression is just the number five, but it’s still an expression. So it’s very concretely telling us things exact…

[00:03:49] SY: Like literal.

[00:03:49] VJ: Yeah, very literally telling us what it is and that’s why they’re called CSTs, Concrete Syntax Trees. So they’re like really what you see is what you get.

[00:03:58] SY: Okay. So why would I want to abstract the tree? Because I feel like if I have an expression, then don’t I want all the information that goes into that expression? Why would I want to condense anything?

[00:04:13] VJ: That’s a great question. Sometimes would you work with a concrete syntax tree, which is that parse tree? When you’re working with it, you find there’s like a lot of repetition and then it gets annoying and you’re like, “Oh, this is like a lot of repetition and I don’t actually need to know all of this little details. What I really want is like the heart of what’s being communicated.” And so that’s why you want to abstract it out. You want to sort of condense it. And when we do an example, I think it’ll be more clear, but that’s really the short story of it.

[00:04:40] SY: Okay. So let’s walk through an example because I want to see this whole condensing part live. I want to see it in action. So let’s do a really simple mathematical expression. Let’s do parenthesis, just to spice things up, 1 times 12 closed parenthesis.

[00:04:59] VJ: Okay.

[00:04:59] SY: So that’s our mathematical expression.

[00:05:02] VJ: Seems good enough. Yeah.

[00:05:04] SY: So if we were going to parse it out, we would start with… okay, let me see if I get this. Okay. So the whole thing is one expression, right?

[00:05:13] VJ: Yeah.

[00:05:13] SY: So you’d probably start with the root node called like X expression.

[00:05:18] VJ: Yep.

[00:05:19] SY: And then from there, okay, if I break out (1 times 12), it almost feels like there’s three parts there. There’s like the open parenthesis part and then 1 times 12 kind of feels like its own little unit, like it’s its own idea.

[00:05:36] VJ: Yeah. It’s its own expression actually.

[00:05:38] SY: It’s its own expression. There you go, own expression, and then we have the closed parenthesis. So our root node would have three children. It would have the open parenthesis, another X, and then another node for the closed parenthesis.

[00:05:53] VJ: Yeah. So we basically have like that main root node that’s the expression and then three children. We’ll recall from the last few episodes and we’ve talked about parsing and turning things into trees, how we pay attention to things that are tokens. So in this case, if we’re talking about like the grammar of math, those open and close parentheses, they’re tokens and they’re not expressions. They can’t be broken down any further. So that’s why they’re going to be child nodes of that root level expression and nothing else is going to be its child. It’s basically going to be a terminating leaf.

[00:06:27] SY: Okay.

[00:06:27] VJ: But we can break down that 1 times 12 even further though.

[00:06:30] SY: Yes. Okay. So we have that expression. So from that expression, we’re talking about 1 time 12 now. There are three parts to that. There’s the number 1, the multiplication sign, and the number 12. So what is my number one considered? I mean, it’s a number, right? Like what is that?

[00:06:49] VJ: Yeah. Well, I mentioned this very briefly when we are learning about parse trees, but it’s important to remember that one itself is an expression. It’s not a token. It has a value, it in and of itself is also an expression. However, we can’t break it down any further than one. So we need to represent it as an expression, as a unit in the parse tree, as a node in the parse tree, but we are probably not going to be able to break it down any further than that, if that helps. Think about it.

[00:07:18] SY: Okay. So then to represent the one, I would have a node also called X.

[00:07:25] VJ: Yeah.

[00:07:25] SY: Interesting. And then that would point to, I guess the value of that X, which is just the one.

[00:07:31] VJ: Yeah. Yeah. So you have this expression 1 times 12 and then it has its own children, one of which is the expression one, but you first represent it as an expression, and then the value of it is the child of the expression one. You’ll notice, if you think about it, if you visualize it, it’s like a chain of nodes. There’s only one child of the expression one, which is the value one, and there’s really nothing else we’re doing there.

[00:07:59] SY: So back to the X that represents the 1 times 12, so we just broke out the one, now we’re going to break out the multiplication sign, which we talked about as a token.

[00:08:10] VJ: Yup.

[00:08:10] SY: Because it’s a sign and signs are token.

[00:08:12] VJ: In math, it is at least.

[00:08:14] SY: In math. Yes, in math. In real life, yeah. Okay. So then we have a node just with a multiplication sign. Okay. And the last one that we had to figure out is the number 12. So much number 1, number 12 is I assume also an expression. So we have the node, the X, and then that would point to the value of 12.

[00:08:34] VJ: Yeah. You took this expression, this mathematical expression, you built out a parse tree of it, and it’s interesting because that weird like chain that we saw where the expression of one pointed to the value of one, we’re seeing the same thing again with the number 12, right? We have an expression to represent 12, which has a child, which is the value 12, and these two chains, they don’t really feel like change, but they are there. There’s a name for that. It’s called a single successor node, which basically means you have a node and it has exactly one child. And if you think about this tree, like why do we have a single successor node really? Right? Like it’s a little bit redundant because we’re basically saying, “Okay, 1 times 12 has two expressions within it, 1 and 12, and they have values, 1 and 12.” But like, “Okay, we don’t really care if it’s just a single number.”

[00:09:28] SY: So it’s like a lot of unnecessary information.

[00:09:31] VJ: Yeah. Yeah. When it’s just a single number, we’re like, “Okay. I don’t really want to think about this as an expression. I just want to think about it as a number.” And so we’re repeating ourselves and we’re being kind of verbose and really what we could do, what we would like to do, if we want to be efficient about it, is like instead of having 1 times 12 be the expression that has two expressions that it points to, what if 1 times 12 just had three nodes, 1 times 12? And like, what if we sort of just condensed it? We’re like, “Oh, we don’t care about the fact that it’s an expression because it is literally just one number. There’s no other way to break it down any further.”

[00:10:10] SY: Yeah. That’s way better. And also just like easier on the eyes too, just looking at it…

[00:10:16] VJ: That’s really all that it’s about.

[00:10:20] SY: Beauty is a very important part of computer science.

[00:10:22] VJ: Well, it’s worth mentioning though that like when we were starting out and we were like, “Okay, I’m a computer and I’m going to scan all this text and I’m going to break it down into its individual parts and then I’m going to build a parse tree out of it,” it’s easy for us as humans to be like, “Oh, yes, 1 times 12. Easy.” But as a computer it’s not as obvious, right? So there is the value to break it down to its most concrete pieces’ step, and that’s what parse trees are meant to do. They’re meant to represent the program by its most distinct parts. But once you’ve done that and you’ve like noticed the parts that are redundant, now you can sort of clean it up. It’s like we need to like come in here with a broom and sweep up all the extra redundant single successor nodes.

[00:11:10] SY: But other than that, a node is just like a regular node, right? It just operates like our normal trees do?

[00:11:16] VJ: Sort of. It’s interesting that you bring that up. The nodes of an AST are also kind of unique and it’s not like anything too complicated, but it’s worth mentioning in case you ever encounter an AST. Every node in an AST has a reference to its next sibling node as well as its first child node. Like I like to think about it like a 90-degree right triangle where it’s like, “I know who my sister is and I know who my kid is,” or something like that where it’s like, “I know my next sibling and I know my first child.”

[00:11:46] SY: I know all my family.

[00:11:48] VJ: Not even all of them, right? It’s just the first child and the next sibling.

[00:11:50] SY: My immediate sibling.

[00:11:51] VJ: Yeah. Exactly.

[00:11:52] SY: Immediate family. Okay.

[00:11:53] VJ: And then every node also will contain like the token value. So like in the case of a parenthesis, it will be the parenthesis, multiplication, it will be the sign, it’ll contain whatever the value is of that node, and then those pointers to that next sibling and first child. So it’s a little bit different than other trees we’ve seen, but we don’t need to worry about that too much. Just worth noting.

[00:12:13] SY: Okay. So we have this parse tree. We just condensed it into an AST. We talked about its nodes. What is the final shape, I guess, of the AST? What does it actually look like now that we’re done constructing it?

[00:12:27] VJ: So once we have that parse tree and we condense it down so that we’re not repeating ourselves, we’re not being verbose and once we’re like focusing on the actual message that’s trying to be communicated rather than the actual little tiny bits of it, what we really care about is the operation and the elements of that operation. So really we could condense this parse tree with all those nodes down to just three individual nodes where the root node is the operation, the multiplication sign, and then it has two children. The left child is the first expression that we are trying to evaluate, 1, and the right child is the second expression, which is 12. And so basically we have a tree that is a multiplication sign at the root and then 1 and 12 as the left and right children. And interestingly…

[00:13:19] SY: Much better.

[00:13:20] VJ: Yeah. It’s representing the same exact expression, but it’s much better. It’s the same concept as the parse tree, but it’s basically abstracting away all of the little bits of information that actually weren’t relevant to understanding what the text was trying to tell us.

[00:13:36] SY: So we learned about the nodes of an AST. We learned about what it actually looks like. We learned about how it’s related to parse trees, but what is the big deal about them and why are they so commonly referred to and talked about in computer science? What’s their role in everything?

[00:13:52] VJ: So it’s interesting because we’ve sort of been talking about compilers and the different steps of compilers and the AST is like the final project, the culmination of all of that, where you have two steps of compilation. We have the lexical analysis step and we have the syntax analysis step. And at the end of syntax analysis, we eventually end up with an AST. So over the course of this season, we’ve been talking about the scanner and the parser and parse trees and now the last step of finishing the compilation process is turning that parse tree into the AST, the abstract syntax tree. And that whole process is known as the front end of the compiler. And a lot of the times, depending on what language you’re working with, you may not even have a concrete syntax tree. Some languages just go straight into building an abstract, abstract syntax tree. And that’s sort of like the final result of the syntax analysis phase of compilation. And so that’s why you might see it a lot.

[00:15:00] SY: So a parse tree isn’t necessarily a prerequisite for an AST.

[00:15:05] VJ: Yeah, that’s correct. You don’t need to have a CST, a concrete syntax tree, or parse street necessarily. It’s easier if you do. As you can imagine, if we didn’t do all the work of breaking down the tokens and the expressions into the most verbose version of the tree, it would’ve been a little trickier, but it’s possible. The more important thing is that you end up with an AST at the end of the whole thing.

[00:15:30] SY: Okay. So we’ve done a lot of comparisons between the AST and the parse tree, but in the example that we worked on, we start with the parse tree and then it transformed into an AST. Are there any other differences between the two?

[00:15:42] VJ: Yes. There’s a few. Namely, the AST doesn’t have a lot of things or it represents things in a different way. So for example, the AST is never going to have like syntactic details. In our example, we started with (1 times 12). And then I just was like, “Oh, by the way, in the AST, it’s just multiplying 1 and 12.”

[00:16:06] SY: Right. They’re gone.

[00:16:07] VJ: That parenthesis just disappeared. Well, that’s because they’re not actually adding anything to the content of the expression. They were syntactic information. So in an AST, you’ll never see those like commas, semi-colons, those won’t be there. And the AST also, as we saw, has a collapsed version of those single successor nodes. So you’re not going to see chains of nodes with one single child because then that basically means you didn’t do your job as an AST because you have redundant information. And the last thing to know is that this is a little bit obvious, but it’s worth saying, any operator tokens in an AST basically become parent nodes in the tree rather than leaves. And so we saw that too where in the parse tree, the more concrete version, the multiplication sign node was a leaf, but then when we turned it into an AST, it suddenly became the root. It became a parent node. So it doesn’t necessarily have to be a root note, but it needs to be a parent because in order to have an operation, you need to operate on something, and those would be the children nodes.

[00:17:08] SY: Okay.

[00:17:08] VJ: Basically, it’s like a more condensed, slimmer, cleaned up version of an otherwise verbose data structure.

[00:17:16] SY: Wonderful. Well, now when I casually mentioned AST, I will actually know what I’m talking about.

[00:17:22] VJ: Yeah, it’s going to be great.

[00:17:23] SY: Thank you. And that’s the end of today’s show. If you liked what you heard, please leave us a review and make sure to check out Vaidehi’s blog post. Link to that is in your show notes. I also want to give a huge shout-out to DigitalOcean and Heroku. DigitalOcean is a simple developer-friendly cloud platform, which makes managing and scaling apps easy with an intuitive API, multiple storage options, integrated firewalls, load balancers, and more. There’s also a robust community that provides over 2,000 tutorials to help you stay up to date with the latest open source software, languages, and frameworks. Get started on DigitalOcean for free with a free $100 credit at DO.co/basecs. That’s DO.co/basecs. There’s a reason over nine million apps have been created and ran on Heroku’s cloud service. They not only manage over two million data stores, but they make over 175 add-on services available to you. So whether you’re creating a personal app, a free app, or an enterprise app, start building today with Heroku. This episode was edited and mixed by Levi Sharpe.

[00:18:33] VJ: Bye everyone.

[00:18:34] SY: Thanks for listening. See you next week.

Thank you for supporting the show!

Thank you for supporting the show!