Transcript for:
Compiler Design Phases and Tools

hello everyone welcome back to the course of compiler design as promised today we will observe the different phases of the compiler with the help of an illustration so without any further ado let's get to learning now if we talk about the expected outcome of this particular lecture first we are going to have a brief overview of the various phases of the compiler thereafter we will learn about the various tools using which the different phases can be implemented well in the last lecture we have already observed that in order to convert the human readable source code into the machine code we need a language translate and the language translator has got four different phases the preprocessor then the compiler the assembler and finally the linker and loader now after going through the preprocessor the high level language code gets converted into the pure high level language code basically the preprocessor will embed the required header files with the source code omitting all the preprocessor directives from that now this pure high level language code will be given to the compiler which in turn will produce the equivalent assembly language code now this much we have already observed in the previous session so today we will take this pure high level language code and observe how it goes through the various phases of the compiler now to be really honest converting this entire piece of code will be a little time consuming so what we will do instead of the entire code let's consider this statement that is the arithmetic expression we will observe how this expression goes through the compiler and finally how this particular assembly language code segment is produced basically we will observe how this expression is treated by all the six phases of the compiler now coming to the first one that is the lexical analysis phase here the entire process is taken care of by the lexical analyzer so here the expression is given to the lexical analysis phase and the lexical analyzer takes lexums as inputs and generates the tokens now lexums are similar to words with only one small difference that is words individually have their own meanings whereas group of legazims in their entirety convey the meaning for instance this word analysis means a detailed examination of anything complex in order to understand its nature or to determine its essential features on the contrary this x individually doesn't convey any meaning until we consider the entire arithmetic expression now coming to tokens these are actually the meanings of the lexus so if we traverse this statement from left to right particularly in this statement x is an identifier then the equal symbol is an operator to be precise it is the assignment operator thereafter a is again an identifier the plus symbol is an arithmetic operator and so on so this is the output of the lexical analysis phase that is a stream of tokens and the job of the lexical analyzer is to find out the meaning of every lexing it recognizes the tokens with the help of regexes or regular expressions exemply gratia this is the regular expression for identifiers where l stands for letter d for digit and this special character is the underscore now allow me to illustrate this using its equivalent finite automata observe there are two regexes for identifier let's consider the first one so from the initial state that is q0 seeing a later we will go to the next state q1 and from this one seeing any number of letters or digits we will end up at the final state that is q3 coming to the next form starting from the initial state if we see an underscore we will move towards the next state that is q2 thereafter saying any number of letter or digits we will end up at the final state q3 many of you may know this that we cannot have an identifier name starting with digits and this is the logic which rejects the identifier names beginning with digits so for examining the leg zooms the lexical analyzer makes use of the type 3 or regular grammars of the family of grammars for formal languages now let's move on to the next phase here the syntax analyzer also known as the parser is in control the stream of tokens is passed to the syntax analysis phase the syntax analyzer depends on the type 2 or context-free grammars for this particular expression these are the cfg production rules that the parser will use in order to form the parse tree let me illustrate the formation of the parse tree consider the first production rule the start symbol s can be rewritten as id equals e semicolon it means an expression can be assigned to an identifier and the expression has to be followed by the statement terminator that is the semicolon so from s we can derive id equals operator e and semicolon coming to the next production rule e can be rewritten as e plus t that is expression can be expression plus term so from this e we can derive e plus t using the production rule now e can also be rewritten as t that is expression can also be a single term so using this production rule from e we can derive t coming to the next one t can be rewritten as t into f that means term can be a term multiplied by a factor so from this t we will derive t into f that is t then the multiplication operator and then f additionally t can also be written as f that is term can also be a single factor so from t we can derive f in these two instances finally consider the last production rule f can be rewritten as id that is factor is an identifier and using this we can derive id from all these f's so this is the parse tree now before observing the yield of the parse tree let me tell you a few things about this particular grammar here id the equals operator semicolon the plus and the multiplication operators are the set of terminals and the ones in the upper case that is s e t f these are the set of variables or non-terminals now in order to find out the yield of the parse tree we will have to traverse it top to bottom left to right taking notes of only the terminals so let's begin during traversal this id is the first terminal that we encounter remember top to bottom left to right so the next one is this equals operator next this id thereafter this plus operator then this id then this multiplication operator after that this id and finally this semicolon therefore the yield of the parse tree would be id equals id plus id into id semicolon and since the yield of the parse tree and the expression are the same the syntax analyzer will not produce any errors so in short taking the stream of tokens the syntax analyzer analyzes them following specific set of production rules and produces the parse tree and if the yield of the parse tree and the provided stream of tokens are the same then there is no error otherwise there is some syntax error in the statement now let's move on to the next phase in this phase the semantic analyzer takes the control the parse tree produced by the syntax analyzer is given to semantic analysis phase and the semantic analyzer in turn produces the semantically verified parse tree semantic analyzer is responsible for type checking array bound checking and the correctness of scope resolution basically it does the logical analysis of the parse tree like in this parse tree these three identifiers can be constants whereas this particular identifier because it is followed by this assignment operator cannot be a constant here the semantic analyzer will examine whether the type of this identifier is variable semantic analyzer detects type mismatch errors undeclared variables misuse of reserved words multiple declaration of a variable within a single scope accessing an out of scope variable mismatch between actual and formal parameters etc in simpler terms semantic analyzer looks for the meaningfulness of the parse tree and verifies that now the next phase is handled by intermediate code generator the semantically verified parse tree is given to the intermediate code generation phase where the intermediate code generator in turn produces the intermediate code so this was our parse tree and this is the yield of the parse tree which is similar to the pure high level language expression after being semantically verified the intermediate code generator will produce the intermediate code here we are using the very popular intermediate code that is the three address code dac now if you observe this parse tree carefully the precedence of the operators are visible here if we look at the tree from bottom to upwards so at the lowest level the multiplication operator is there and therefore this would be performed at first for this the intermediate code generator will create this temporary variable say t0 and assign the result of b into c to it coming to the next level the result of this is being added with this identifier which in the expression is a therefore in the intermediate code the result of a plus d0 is being assigned to another temporary variable t1 finally in the last level the result of the expression is being assigned to this identifier which in the arithmetic expression is the variable x so in the intermediate code t1 is being assigned to x now this intermediate representation of the code is called three address code because if you observe all the statements in here at most they have three addresses in them and by saying address we are mentioning the addresses of these variables now till this phase it is called the front end because taking this intermediate code if we want to generate the target code which is specific to the platform all we have to do is modify the next two phases that is the backend according to the platform which we want to produce our target code for now let's move on to the next phase shall we so here the code optimizer is in control basically the intermediate code is provided to the code optimization phase there the code optimizer in turn generates the optimized code now code optimization can either be machine dependent or machine independent we will observe them in details in due time for now for the sake of understanding let me illustrate the optimization procedure so as you can observe our intermediate code that is the three address code had three statements in the first one t0 is being assigned with the result of b into c then in the second one t1 is being assigned with the result of a plus d0 and finally in the third one we're assigning t1 to x now if you observe carefully the second and the third statements here instead of t1 if we assign a plus t0 directly to x we can reduce the length of the code so the optimized code will have only two statements and in a nutshell this is exactly what the code optimizer does it optimizes the intermediate code next up is the target code generator so the optimized code is given to the target code generation phase which happens to be the last phase of the compiler now taking this optimized code the target code generator will finally produce this assembly code segment allow me to walk you through this code segment here in the first line we have mov that is the mnemonic for moving then we have eax now eax is actually the extended version of the ax register which is a combination of ah and al ah can store 16 higher order bits and al stores remaining 16 lower order bits so eax can actually store 32 bits it is actually the accumulator register next we have d word ptr that is the data word pointer which is pointing at the register base pointer rbp with a scaling factor of minus 8. basically this pointer is pointing to the variable b and using this entire command the value held by variable b is being moved to the eax register now this imul is the mnemonic for sign multiplication d word ptr rbp minus 12 is pointing to the value stored by the variable c and using this command the value stored in variable c is being multiplied with the value stored in the register eax and the result is also stored in the eax register basically by the end of these two commands we will have the result of b into c in the accumulator coming to the next command here the contents of the eax register are being moved to another register edx which also is a 32-bit register similar to eax edx also has two different 16 bit segments where the higher order 16 bits are to be stored in dh and the lower order 16 bits will be stored in dl now in the next command we are moving the data world pointed by register base pointer with the scaling factor minus 4 that is the content of the variable a into the accumulator thereafter in this command we are adding the contents of the register eax and edx and also storing the result in the eax register that is the accumulator finally we are moving the content of the eax register to the address pointed by the data with pointer rbp minus 16 which is actually the address of the variable x so after execution of these commands we will have the calculated value in the x so this is the meaning of this assembly language code segment so this is how that arithmetic expression of our peer high level language code finally gets translated into this assembly language code segment after going through all these phases now let's observe the tools using which we can practically implement various phases of the compiler so the compiler has six phases in order to implement the lexical analysis phase we can use the program called lex lex is the standard lexical analyzer generator on many unix-based systems it reads an input stream specifying the lexical analyzer and writes the source code which implements the lexical analyzer for the c programming language yacc which stands for yet another compiler compiler is a lalr parcel generator we will study about the llr parsers in the chapter 4. anyway using yacc we can implement the syntax analysis phase lex is commonly used with yacc now we already know that among these six phases the first four are collectively called the front end and the last two are known as the back end using the software platform lance c compiler we can implement the entire front end of a c language compiler for an embedded processor interested learners are advised to go through the research paper lance a c compiler platform for embedded processors by dr reiner leipers of university of dortmund germany the link of the paypal has been provided in the description of this lecture all right so in this lecture we have gone through the overview of the various phases of compilers also we have learned about the tools using which we can implement different phases all right people that will be all for this session in the next session we will discuss about the symbol table so i hope to see you in the next one thank you all for watching [Applause] [Music] you