Transcript for:
Ambiguity in Language and Code

[Music] Huh? [Music] [Music] [Applause] [Music] What you just saw is a quirk in every single language. This is actually a subset of issues called linguistic ambiguity. If I'd had more context, I might have understood the original intent. But you know we don't always have those kinds of luxuries. Interestingly enough, programming languages face very similar challenges. We as programmers instead of communicating personto person are now communicating person to compiler. Whether you're coding in C++, Java, Rust or any programming language, compilers being a human-made invention are also imperfect at interpretation. This is why since they can face very similar issues, programming can also fundamentally be considered a linguistics issue. First of all, I have a question for you. I've got a very small program over here. What do you think the output of this program is going to be? We've got a few options here. First would be it prints out partially true. Second would be it prints out totally false. Or I guess a third option could be it prints out absolutely nothing. So, pause for a second and give me your best guess. might not be exactly what you would expect, which you probably got because I'm sitting here asking that as a question. So, let's go over. I'm going to compile my program. We're going to ignore our warning. I'm going to do /hm. We've got our answer totally false. Now, you might be thinking, looking back at the code, this really doesn't make any sense. I would probably guess that the output would be absolutely nothing. But instead, we're getting this totally false statement. And that's because this is part of a language ambiguity in the syntax called a dangling else. And we're going to get into a real world example of this a little bit later. Nearly every distinct programming language issue can be traced back to a very similar human language issue. In all these examples, I'm going to talk about the English language quirk and then I'm going to talk about the very equivalent programming language quirk and I'm going to kind of sort them into different buckets of the three distinct compilation phases which I'll also talk about in a second. Now, English in all languages for that matter can have a lot of different ambiguities and one of my favorite examples of this is actually a book from 1967 called Beyond Language: Adventures in Word and Thought. And this explores recreational linguistics. And actually, I think that would be a really good name for coming up with insults, like an official name. I'm not insulting you. I'm just practicing my recreational linguistics. One of the best sentences from the book, I think, is buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo. Yes, that is a real sentence. Don't even question me. Just go look it up. But that's because the word buffalo can act as both a noun and a verb. But relating this back to programming languages with compilers, these kinds of ambiguities are even more serious and important to clarify. If you think about it, compilers need to be correct 100% of the time, but they also don't have the advantage of being able to ask for more context when they need it. Let's take a minute to think about how compilers actually parse the syntax of a program. Now, compilers have it a little bit worse off than humans because they have to do this in multiple different phases of parsing. And you can kind of break this up into three distinct stages. The first stage is going to be the lexical analysis. The second is going to be the parsing stage. And then the final stage is going to be the semantic analysis stage. So during the lexical analysis, it's going through and it's breaking up the tokens of your program. So like the variables, the operators, and it's parsing those out into distinct tokens. And then the next phase is going to take all of those tokens in your program and turn those into an abstract syntax tree or an a. And during that second parsing phase, it's this is going to be the part where it's finding all those syntax errors that are so irritating in your code, like if you forgot to add a semicolon or something in your C++ program. Side note, when I was in school, I remember I came up with the ten commandments of computing and the first one was remember the semicolon to keep it holy. Don't forget guys, during the final stage, the compiler uses that a and parses it to look for semantic correctness in your code. For example, it does type checking or maybe checks for scope resolution inside of your code. So if you have any kind of logical issues with your code, this is where the compiler discovers those. Like for example, if you're trying to use a variable that's out of scope. Now, if you're looking for a very understandable example of how a compiler works, I highly encourage you to take a look at the local C compiler, which is available open source on GitHub, and this is going to be a lot more readable than trying to understand a really big compiler like GCC or clang for example, that might be spread across like literally hundreds of different files. So, this is really well documented and easy to understand. I'm going to go over to the LCC GitHub repository. And just side note on here, this has to be the oldest commit I've ever seen in my life from 23 years ago. That's absolutely insane. That just blows my mind. I'm going to go over to the source code and you can see we have all of the different files that kind of make sense to put it into the three three different phases of the compiler analysis. You can see like the tree parsing it into an abstract syntax tree and you can see the process where it's doing both semantic and syntactic analysis. If you look at tree, I think it's very descriptive over here. You can see it actually going through and parsing the different op codes inside of this and then constructing that a so it can do the remaining analysis on it. Going back over to our linguistics challenges, let's think about how these ambiguities can come into existence. Now, older languages are going to face a lot of these challenges because if you think about it, you have an old existing language and you're constantly adding new additions on top of that. So, it kind of makes sense how these issues can happen. But this can also happen a lot inside of younger languages as well, which are also not perfect. I think a really good example of this is the word literally. Now, if you think about it in old original English, literally literally meant not figuratively. So it actually happened. But as English has progressed, literally has started to change meaning to actually mean figuratively. And a lot of people use it now as kind of a synonym for very, which drives me literally nuts. Now, now going back over to some C++ examples, let's take a look at them. And we're going to use C++ since that's literally the best programming language. Remember those three different compiler phrases I talked about previously? Let's take a look at each one of those individually. So for our first bucket, those would be syntactic quirks. And in linguistics terms, you could call this a grammar ambiguity, aka what happens if the same token fits into multiple different parse trees. And C++ has a really good example of this one, and it's very fittingly called the most vexing parse. Think about this English sentence for a second. Visiting relatives can be boring. Now, if you look at the word visiting, is visiting acting as a verb here, like the act of visiting your relatives, or is visiting acting as an adjective describing a noun, as in you're visiting relatives, can be boring. If you see, the word visiting can have multiple different meanings, completely changing the structure of the sentence. Now, keeping that English example in your mind, let's go over to a quick C++ example. So for this example, I have a very short program right here. I have a couple declarations up top. We have a simple function definition that's going to be declaring a function named m that's going to be returning a vector of integers. And then we have our definition for that function down here. Keeping that in mind for the next few seconds or or minutes pro probably seconds, I hope seconds. Let's go down to our main function and instead let's say I want to declare a variable. And we're going to name our variable n. And this is of type vector. And if I want to default initialize this with a few values, I can initialize this with let's say 10 values all initialized to the integer five. And if I go down here, adding a couple of extra values just because why not. And I'm going to be printing out the size of that vector as well as all of the numbers inside of it. So let's just run this for sanity and make sure this works. So we do slash vex. And you can see we've just got our vector printed out as well as the size just like we would expect. But now take a second and think, what if we didn't want to have default initialized values inside of our vector? What if we just wanted to have an empty n vector? You might think we can just get rid of those values. And let's do something like this for example. This looks right, right? But why am I getting some syntax errors over here? Let me save this and run this. We'll compile and you see doesn't compile. What's going on here? So if I go up here, I can see invalid range expression of type vector int. No viable begin function available. Well, that's very interesting. See what's happening. If you look at this closely, there is no difference in the declaration of this vector n as well as this vector m over here. So the compiler is basically saying, I don't know whether this is a function definition or a function declaration or if this is declaring a new variable for a vector of integers. This is an example of a syntax ambiguity. Now the way to fix this, let's get rid of this. It looks it looks really right. So if you're a beginner, you would look at this and you would think, huh, what's wrong? And it's probably why it got the name the most vexing parse. But the way to fix this in old C++ was like this where you have to set your new variable equal to a vector of integers. But this is just like the longest syntax in the world. But it does indeed get rid of our syntax errors. But if you go over to C++ 11, luckily in modern times we have this declaration option available. we can just use curly braces and it default initializes our vector to nothing. No default values, just calls the default constructor. And if we want to make sure that this does actually compile, got my dummy values down here. So we should have our vector size two. So I'm going to recompile. Let's run our vex. And here we go. Worked just fine. Now we can unambiguously declare a vector of integers as well as a function returning a vector of integers. Confused? Hopefully not. Our next syntactic quirk is going to be the dangling else, which we talked about briefly at the beginning of this video. So, let's think about this sentence for a second. If you see Sam, if he's busy, leave a note. Otherwise, call me. Now, this is kind of ambiguous, but it's mostly related to a punctuation problem here. Like what if I don't see Sam at all? Am I supposed to call you or not? So if we rearrange the punctuation and the structure a little bit and we get this, if you see Sam, leave a note if he's busy, otherwise call me. The meaning becomes a lot clearer. So this kind of ambiguity of association happens a lot in programming as well. Going over to our C++ example for our dangling else, I've got a really simple program here. Now, if you got the question wrong in the beginning of this video, now you have a second chance. If you can't remember what the question was, you should rewind and come back because you'll probably get this one right. So, we have in our simple program, this is kind of simulating a real world scenarioish of logging into a machine. So, we have a couple statements here basically saying like login was successful, login wasn't successful, and let's go to our main function. I've set this to is logged in is true. So, we're going to successfully log in and we're not an admin. So, we should not be able to log in as an admin. And then this is our nice little login check over here. And then at the end after we've logged in, we're going to get like a nice little goodbye message. So, let's take a closer look at this if statement. And again, tell me what you think is going to print here. I should have kind of a a boolean and over here. It should be if is logged in and is admin, then we're going to log in as an admin. Otherwise, we're just going to completely fail. So, if I'm logged in, but I'm not admin, I should really just be able to log in and then get the goodbye message. But let's go and execute this. So, I'm going to compile this. Ignore our warning. It's going to be the the phrase of the day, ignore all warnings. But what you see here, even though we should be logged in, we're getting a failed to log in. Get out. And then we get our thanks for logging in with us today. I didn't code that part. Great. So, what's happening? If we look at this if else, it looks like it should be working. Right now, what's happening is this else is ambiguous. It doesn't know whether to associate this else with this if statement or with this if statement. Kind of a side statement here. I honestly, personal pet peeve. I cannot stand when you don't use the curly braces. like always use the curly braces because what space are you trying to save? You're you're you're saving like a bite of space. It's just absolutely not worth it and it makes your code so much less reason readable. So, let's go over to the fix of this. So, I'm going to log in or get rid of this login right here. And if we fix this with some beautiful curly braces, we can make this so much less ambiguous. Now we know if logged in and is admin. We have our login admin right here. Otherwise, we're going to fail that login if our login was unsuccessful. So if I save this and recompile, you can see we've managed to fix all of our challenges with the ambiguous else association. All right, drum roll. It's time to look at our next bucket. And this is a whole different can of worms. It's not worms, guys. It's just monster. It's not even monster worms. So, for our next phase, remember the first phase of the compiler parsing was that lexical analysis phase. So, that's taking in all of your characters and parsing them into their respective tokens. But this parsing can be pretty tricky if you don't have the proper context to know what token some characters are going to map out to. Think about this sentence for a second. He said, she said hi. Now, we have the auditory advantage of being able to hear like my tone and inflection, so you can probably understand what meaning I'm trying to convey. But let's say you read this sentence instead inside of a book or something and you're looking at all of the different punctuation inside of it. In English, we have this concept of nested generics. So, this is really convenient for conveying meaning because you can have your single quotes inside of your double quotes to show what you're actually trying to say. Now, programming languages, especially earlier ones, faced these same kind of trickiness and challenges. Now, relating this back over to our C++ example, we can also have this kind of idea of nested generics inside of the C++ programming language as well. So if I took my previous vector of integers n and I instead decided to nest this with an additional type and make this a vector of vector of integers then I can put this one type inside of another type and nest this. Now this looks perfectly valid and it's going through I'm adding some different vectors to this printing this out once again. But let's see what happens when we compile this. And I'm going to go back in time for a second. And I'm going to go compile this with C++ 98. So let's do this. Traveling back in time. And something interesting is happening. So we're actually getting an error here. And let's read that error message cuz it's really interesting. So error. A space is required between consecutive right angle brackets. What does that mean? Basically that means that these two right angle brackets are being recognized as a shift operation. So this is going to be a shift write operation and the compiler during the syntax parsing process is basically saying I don't know whether this is a nested type or if this is a shift operation. So it basically can't tell what these tokens are supposed to be mapped to. And as our C++ 98 compiler is so kindly telling us, the old fix for this was instead of doing this, you had to actually add a space inside of here so that it could tell it wasn't supposed to be this shift write operation. And you can do this and recompile with C++ 98, but it won't accept any of my nice pretty syntax over here in my for loops. and you have to do this awful iterator syntax. So, I'm not going to put you through that. Instead, we're going to travel forward in time and go to C++ 20 and compile this. So, I'm going to have my original version cuz they figured out how to parse this properly eventually. And instead of doing C++ 98, jumping forward to 20. And here we go. We are we have successfully compiled this and we can execute this. perfectly fine. Wow, fascinating vectors. All right, let's move on to our next bucket. And this is probably going to be our worst bucket, and this is semantics. Now, context, context, context is so important here, but I think English is probably one of the worst languages for this since we just have so many overloaded terms. So, take a second and think about this sentence. Polish ambassadors are rare. Imagine you were reading this and you couldn't hear how I was pronouncing the word Polish. Well, since I have the capital P at the beginning, do I mean Polish or Polish? I would say Polish ambassadors are probably pretty rare. So rare they might even be non-existent. Now, through pronunciation and context, you can tell what I'm trying to say, but when it's written down, the only subtle differentiator that's not always available to us as a differentiator is the capital P. So consider this sentence instead. Meeting Polish ambassadors is rare. Now we've changed this so we're able to explicitly type cast the the capital P to the word Polish. So we can say for sure we do mean a Polish ambassador. I think I'd enjoy being a Polish ambassador. When I was a kid and I was picking chores, I'd always pick polishing as a chore. It's very, very satisfying. I don't think I'm qualified to be a Polish ambassador unless you guys will take me. Now, a great programmatic example of this is dependent type names. So, if I go over to my C++ program, you can see I have a simple function template right here, passing in a container, getting an iterator for that container, and then this allows me to print out the first element inside of that. If I go down to my main function, you can see I've got two different vectors. One is a vector of integers, and one is a vector of strings. And I'm able to call the same function and pass in those different types of containers. Now, if I go up here, let's focus on this line just a little bit. I want to try to compile this code and let's see what happens. So, I'm going to press enter. Oh, and we're getting an error. And the error message is missing type name prior to dependent type name. So, it's really complaining about this particular line. Now I want to copy this and let's go over to our CPP reference and look at our specific container that I'm using in this case which is a vector. And if I go down I can see part of our member types we have this const iterator available. Now if I go back over to my code you can see I'm expecting this to be a type const iterator. The problem is this right here. These two colons can also be referring to potentially let's say a static member variable. So the compiler is saying I don't know which one you're trying to do. And how do we fix that? We can be explicit about it just like we were inside of the Polish Polish example where we used the capital P in the sentence to distinguish that we meant Polish. All we need to do is simply specify explicitly that this is a type name that we're expecting. So save this and now if we compile our code comp compilation works perfectly and we can run this and see we can either print out the first element as 42 the integer or 42 the string. Sometimes the smallest tokens make the biggest difference about how a statement is parsed. Really good example of this is the humble colon operator. Consider this sentence. pack the following snacks, water, maps versus pack the following snacks, water, maps. Very big difference between the two interpretations. So the colon here is kind of like a whoop whoop to your brain of telling you what you're supposed to expect. Otherwise, in the second sentence, I kind of feel like I'm about to start eating water or maps. Not really not really the most tasty meal I've ever had. If you like C++, don't we all? then you'll know that the template feature is one of the most powerful parts of the C++ programming language. And much like the colon, the keyword template in C++ is an extremely important marker for the language to understand. If you use the word template, it completely changes the interpretation of the C++ compiler. Now, let's move on to our last C++ example. And yes, it includes templates. Going to go over to my program. I have a strct tool right here with a function template inside called perform action. And if I go down here, I have another function template. It's going to be our run processor that's going to trigger our perform action over here. And then I have a processor strct that contains my tool strct declared inside. And my main function is simply going to trigger this entire run processor process. How many times can I say the word process in a sentence? So, if I go back up, let's find the problematic line right here inside of our run processor function template when we're calling this perform action. Pay attention to this particular syntax right here. I'm going to compile this and see what happens. Now, interestingly, I'm getting an error here. It says use template keyword to treat perform action as a dependent template name. Basically, what's happening here is the compiler does not realize that this is actually a template. Now, it's looking at this particular symbol right here, our less than sign, and it's thinking that it is actually a less than sign, because it's not taking in the entire context of this, that we're simply performing an action on an integer. Now, how do we fix this? So, what we need to do is we need to be explicit with the compiler and tell this you need to treat this as a template. So simply enough, we'll just say that this is a template and then we can trigger our perform action template with our integer and run our processor on our tool ID 42. Makes a lot of sense, guys. Trust me. So I'm going to recompile this. Here we go. Worked perfectly fine. If I run this, perform action on tool 42 successfully. Being explicit is very helpful when you're programming. Now, I've been throwing a lot of shade at compilers in this video, but compilers are extremely complicated and super difficult to write. So, it makes sense why they would have so many ambiguities and syntax problems. I actually have an entire video that talks about like the history of compilers and how they can be self-hosted that you should definitely take a look at. But, if I'm going to be talking so much smack about compilers, I should probably try writing my own kind of compiler, which you can do inside of C++. So I have created my own custom parser for my extremely sophisticated language and this is called Remmbercript. So if I go over to my main.cp, this is going to run my Remmber program. So it's going to take in the input file that it's going to execute. And if I go over to my parser.y, you can see my extremely sophisticated parser. What this is going to do, this has a vector of strings that has really good song lyrics inside of it. And here are my tokens. I have my Rember token and my semicolon token. So, you remember that tokenizing portion of the compilation process. Here I have my own custom tokens inside of this. It's actually really cool that you can do this. I think it's really fun. Now, if I go down here, here's the logic for my interpreter. You can see every time I am encountering the Remmber semicolon tokens together I am going to print a random line of this song. Now let's go through and let's execute rememberers script. Now I have my repository here locally. I've already made my parser so I can run dot /erscript. It makes me laugh every time. And then I need to pass in my custom remember code. Well first of all let me show you. Let's cat one of these files. Let's do test remembermber. Every time you can see I've got my uh tokens here. Remember semicolon remember semicolon. So each one of these is going to print a random song lyric. So let's do dot / remembermber script and let's pass in test.ber. Here we go. We have our song lyrics that are printing randomly to the console. And if we want to get really crazy, let me show you. We can even do inline remember script. So we can do remember remember all in the same line. So let's execute this dot / remembermber and then we'll pass in our test inline remember. Here we go. Now we've got our three random song lyrics both in line and on separate lines. Tell me you haven't seen that kind of sophistication before. I really got to stop making these examples really late at night with no sleep and multiple monsters. I think this was only funny in my head when I did it. I don't I don't remember why. Anyway, all jokes aside, I highly encourage you to try to write your own custom parser in C++ because I think it's really cool to be able to do. It kind of shows you the challenges of it and it's just very powerful when you can write your own custom syntax. I think that's really nice to be able to declare your own tokens. Do you remember 21st day December always remember happy day? The real question is what can we do about all these different ambiguities and complexities in languages? And actually formal language theory is a really fascinating field to study. Now, although there aren't really many true examples of contextfree languages, we can get pretty close. And some languages are going to be better or worse than other languages at reducing these kinds of lexical complexities. A really neat example of this is ASD STE 100, also known as simplified technical English. And this was created by the European airline industry as a standard form of English for writing technical manuals in. And this was created to be a very easily understandable language for non-English speakers. You can actually find this language online. It's super simple and really neat. It's only limited to a total of about 900 words. As for programming, outside of toy and experimental languages, lisp and lisp like programming languages are kind of the clear winners here. Now, a lot of people really don't like the multitude of parentheses inside of lisp, but it makes it really clear and super unambiguous for the compiler. I remember my programming teacher in college talking about when she was learning to program and she was learning in lisp and she had to count. She was told by the teacher to count on her fingers opening and closing parentheses so that you could keep track of how many parentheses you needed to close at like the very end of a statement which is absolutely crazy. But other than lisp, ADA and hasll are pretty good about this. And while C++ might be my favorite programming language, historically it's not great at this. It's had a lot of lexical confusion in the past. So what do you think? In this video, I mostly stuck to C++ and English examples, but I'm sure there's a ton of other interesting lexical ambiguities in many different languages. So, feel free to leave yours in the comments section below. In any case, this is a super tricky field, but it's also very interesting. So, as always, thank you so much for watching everyone. I hope you enjoyed this video. to where he wired out.