I. Code Generation A. overview ------------------------------------------ OVERVIEW OF CODE GENERATION .. ASTs...-> [ Static Analysis ] | | IR v [ Code Generation ] | | Machine Code | v Virtual Machine Execution The IR (= Intermediate Representation) records ------------------------------------------ 1. IR (Intermediate Representation) What kind of information is needed from a name's use in order to generate code? Should the parser create the lexical address of a name's use during paring? Is the symbol table unchanging (immutable)? ------------------------------------------ IR TREES An IR is a tree structure, Helps in modularizing compilers and code generation WITHOUT IR WITH IR Java ------>x86 Java ->x86 \/ ||| \ / C ------>MIPS C \\ /-->MIPS \\/ || ->IR C++ ------>Sparc C++ // \-->Sparc \\\/ / \ C# ------>A1 C#/ \>A1 ------------------------------------------ ------------------------------------------ OUR CHOICES FOR AN IR To keep things simple, we will use a modified AST type as an IR Parser: - records - provides Static analysis: - records ------------------------------------------ 2. General strategy ------------------------------------------ GENERAL STRATEGY FOR CODE GENERATION Don't try to optimize! Follow the grammar of ------------------------------------------ ------------------------------------------ FOLLOWING THE GRAMMAR Code resembles the grammar that When ------------------------------------------ How does this relate to the parser? Why is this useful? B. Translation target: code sequences ------------------------------------------ TARGET: CODE SEQUENCES Need lists of machine code Why? ------------------------------------------ Why are code sequences needed? ------------------------------------------ REPRESENTING CODE SEQUENCES IN C #include "instruction.h" // code that can be in a sequence typedef struct code_s code; // code sequences typedef code *code_seq; // machine code instructions typedef struct code_s { code_seq next; bin_instr_t instr; } code; ------------------------------------------ C. Designing Code Sequences 1. Overall strategies ------------------------------------------ STRATEGIES FOR DESIGNING CODE SEQUENCES Work backwards ------------------------------------------ ------------------------------------------ EXPRESSION EVALUATION Example: (E1 + E2) - (E3 / E4). Constraints: - Expressions have a result value - Binary operations (+, -, *, /) in the SRM need 2 registers Where should the result be stored? Can it be a register? ------------------------------------------ 2. use of registers ------------------------------------------ USE OF REGISTERS What if the register is already in use? e.g., $v0 for expression's value consider x := y + z Strategies: - use a different register - save and restore ------------------------------------------ ------------------------------------------ GENERAL STRATEGY FOR EXPRESSIONS Each expression's value goes To operate on an expression's value in a register r: ------------------------------------------ 3. Background on SRM instructions ------------------------------------------ BACKGROUND: SRM INSTRUCTIONS ADD s,t,d "GPR[d] = GPR[s]+GPR[t]" SUB s,t,d "GPR[d] = GPR[s]-GPR[t]" MUL s,t "HI,LO = GPR[s]*GPR[t]" DIV s,t "HI = GPR[s] % GPR[t]" and "LO = GPR[s] / GPR[t]" LW b,t,o "GPR[t] = memory[GPR[b]+4*o]" SW b,t,o "memory[GPR[b]+4*o] = GPR[t]" ADDI s,t,i "GPR[t] = GPR[s]+sgnExt(i) How to move value from r1 to r2? What limitations on immediate operands? What if the literal doesn't fit? ------------------------------------------ If the numbers are small enough, where is the result of a multiplication as a 32 bit integer located: HI or LO? How would you move the value in register r1 to register r2? Are there limitations on the immediate operands for ADDI? What can you do if you want a constant value that doesn't fit? D. Literal Table ------------------------------------------ LITERAL TABLE IDEA - Store literal values in - Keep mapping from - Initialize ------------------------------------------ ------------------------------------------ LITERAL TABLE IN EXPRESSION EVALUATION Idea for code for numeric expression, N: 1. Look up N in global table, 2. Receive N's 3. generate a load instruction into 4. ------------------------------------------ What's our goal for expression code? ------------------------------------------ LITERAL TABLE AND BOF DATA SECTION How to get the literals into memory with the assumed offsets? ------------------------------------------ E. Activation Record (AR) Layout Where should constants and variables for a block be stored? ------------------------------------------ LAYOUT OF AN ACTIVATION RECORD Must save SP, FP, static link, RA and registers $s0-$s7 Can't have offset of static link at a varying offset from FP Layout 1: FP --> [ saved SP ] [ registers FP ] [ static link ] [ RA ] [ $s0 ] [ ... ] [ $s7 ] [ local constants ] [ ... ] [ local variables ] [ ... ] [ temporary storage ] SP -->[ ... ] Layout 2: [ ... ] [ local variables ] [ ... ] FP -->[ local constants ] [ saved SP ] [ registers FP ] [ static link ] [ RA ] [ $s0 ] [ ... ] [ $s7 ] [ temporary storage ] SP -->[ ... ] Advantages of layout 1: Advantages of layout 2: ------------------------------------------ What are the advantages of layout 1? What are the advantages of layout 2? Any disadvantages? Which layout should we use? F. Compiling Expressions 1. deciding where to start What are the simplest cases for expressions? ------------------------------------------ TRANSLATING EXPRESSIONS Abstract syntax of expressions in PL/0 E ::= E1 o E2 | x | n o ::= + | - | * | / Simplest cases are: ------------------------------------------ 2. example translations a. numeric literals ------------------------------------------ TRANSLATION SCHEME FOR NUMERIC LITERALS ------------------------------------------ ------------------------------------------ TRANSLATION SCHEME FOR VARIABLE NAMES (AND CONSTANTS) ------------------------------------------ b. binary operations ------------------------------------------ TRANSLATION SCHEME FOR BINARY OPER EXPRS Goal is to get to an instruction like Example: E1 - E2 ------------------------------------------ So, for E1 - E2 what needs to be done? G. Declarations Where are constants and variables stored? ------------------------------------------ TRANSLATION SCHEME FOR PL/0 DECLARATIONS const c = n; var x; When do blocks start executing? What should be done then? How do we know how much space to allocate? How to initialize constants? How to initialize variables? ------------------------------------------ When are blocks executed in PL/0? When starting to execute a block, what should be done? Which should be allocated first: constants or variables? How do we know how much space to allocate? How to initialize constants? How to compute the value of the constant? How to initialize variables? H. Statements 1. Basic Statements What are the base cases in the grammar for statements? ------------------------------------------ TRANSLATION SCHEME FOR BASIC STATEMENTS skip x := E read x write E ------------------------------------------ For testing, want to know: What are the simplest cases? In general, can the "levels outwards" part of the lexical address be determined when the variable is declared? Does the same thing work for constants? Should we write a character with code E or the digits of E? I. Conditions 1. Overall conditions ------------------------------------------ GRAMMAR FOR CONDITIONS ::= odd | ::= = | <> | < | <= | > | >= So the code recursion structure is? Code looks like: ------------------------------------------ What should these functions return? a. Relational operator conditions ------------------------------------------ RELATIONAL OPERATOR CONDITIONS ::= A design for conditions: Goal: put true of false on top of stack for the value of the condition Consider E1 <> E2 [Evaluate E1 to top of stack] [Evaluate E2 to top of stack] [pop top of stack (E2's value) into $at] [pop top of stack (E1's value) into $v0] # jump past 2 instrs, # if GPR[$v0]!=GPR[$at] BNE $v0, $at, 2 # put 0 (false) in $v0 ADD $0, $0, $v0 # jump over next instr BEQ $0, $0, 1 # pub 1 (true) in $v0 ADDI $0, $v0, 1 # now $v0 has the truth value [code to push $v0 on top of stack] Consider E1 >= E2 [Evaluate E1 to top of stack] [Evaluate E2 to top of stack] [pop top of stack (E2's value) into $at] [pop top of stack (E1's value) into $v0] SUB $v0, $at, $v0 # $v0 = E1 - E2 # jump past 2 instrs, # if GPR[$v0]>=GPR[$at] # if E1-E2 >= 0 BGEZ reg, $at, 2 # skip 2 instrs # put 0 (false) in reg ADD $0, $0, reg # jump over jext instr BEQ $0, $0, 1 # pub 1 (true) in reg ADDI $0, reg, 1 ------------------------------------------ What would work for =? What would you do for < ? ------------------------------------------ CODE FOR BINARY RELOP CONDITIONS // file ast.h typedef struct { file_location *file_loc; AST_type type_tag; expr_t expr1; token_t rel_op; expr_t expr2; } rel_op_condition_t; // file gen_code.c // Requires: reg != $at // Generate code for evaluating condAST into reg // Modifies when executed: reg, $at code_seq gen_code_relop_cond( rel_op_condition_t condAST, reg_num_type reg) { } ------------------------------------------ J. Control Flow Statements (Compound Statements) Why is it useful to write the base cases first? ------------------------------------------ ABSTRACT SYNTAX FOR COMPOUND STATEMENTS S ::= begin { S } | if C S1 S2 | while C S So what is the code structure? Code looks like: begin S1 S2 ... end if C S1 S2 while C S ------------------------------------------ II. Code Generation for Procedures ------------------------------------------ SUPPORTING PROCEDURES AND CALLS Main issues: - storing their code Why? - knowing exactly where each starts Why? Another issue: - sending the right static link ------------------------------------------ What static link does a called procedure need? A. Where to store code for procedures? Where do we put the code sequences for each procedure? ------------------------------------------ WHERE TO PUT PROCEDURE CODE? Possible layouts in VM's code array: ------------------------------------------ How would you implement each? Which layout makes the most sense? B. how to find each procedure's starting address? ------------------------------------------ NESTED PROCEDURES ARE A PROBLEM procedure A; procedure B; begin # B's body code... call A # ... # ... end begin # A's body code call B # ... # ... end If lay out the code as [ code for A ] [ code for B ] How do we know the address of B to compile the call to B? What about the other direction? ------------------------------------------ ------------------------------------------ RECURSIVE PROCEDURES, SIMILAR PROBLEM procedure R; begin # R's body code ... call R # ... end Before storing code for R, how do we know where it starts? ------------------------------------------ ------------------------------------------ MUTUAL RECURSION procedure O; begin # O's body code... call E # ... end procedure E; begin # E's body code ... call O # ... One of these must before the other in the code area of the VM... ------------------------------------------ No matter which of O or E is put first, how is the call to the second one to know where the second one starts? 1. solutions ------------------------------------------ SOLUTION STRATEGIES FOR CALLS [Multiple passes]: 1. Generate code for each procedure (+ store offsets in symbol table, + layout procedure code in memory) 2. Gather table of addresses (map from names to addresses, using offsets and beginning address) 3. Patch up code addresses for calls (+ output code) [Lazy evaluation, labels]: 1. Generate code for each procedure with calls to labels (+ store or update labels in symbol table) (+ output code) ------------------------------------------ C. Multiple Passes as a Solution ------------------------------------------ GENERAL SOLUTION: MULTIPLE PASSES Problem: where does each procedure start? Solution idea: 1. Compile all procedure code (now know how big each procedure is) 2. Lay out procedure code in memory (now know where each starts) 3. Change each call instruction ------------------------------------------ What would a progrm need to do to change all the call instructions? D. Labels as a Solution ------------------------------------------ GENERAL SOLUTION: LABELS Use "labels" to allow Term "label" is from assembly language ; ... jmp L ; ... L: ; ... ------------------------------------------ ------------------------------------------ APPROACHES TO FIXING LABELS Problem: convert labels to addresses (1) Use multiple passes a. Generate code with labels b. Lay out memory for procedures (determine starting addresses) c. Change labels to addresses advantages: disadvantages: (2) Use shared mutable data (lazy eval.) a. labels are unique placeholders, shared by all uses (calls) b. when address is determined, update the placeholder (and all uses are updated) advantages: disadvantages: ------------------------------------------ 1. label data structure for lazy evaluation ------------------------------------------ LABEL DATA STRUCTURE FOR LAZY EVAL // file label.h #include "machine_types.h" typedef struct { bool is_set; address_type byte_addr; } label; // Return a fresh label that is not set extern label *label_create(); // Set the address in the label extern void label_set(label *lab, address addr); // Is lab set? extern bool label_is_set(label *lab); // Requires: label_is_set(lab) // Return the address in lab. extern address_type label_read(label *lab); ------------------------------------------ E. exercise III. Code Generation Demo A. context ------------------------------------------ CONTEXT VM changes (from HW4's VM): // words are ints or floats typedef enum {int_type, float_type} word_type; // words for this machine typedef struct word_s { word_type type_tag; union word_u { int i; float f; } data; } word; No MOD instruction Changed to RND instruction (rounding) FLOAT Language changes ::= '{' { } '}' | ... where ::= { } ::= float | bool ------------------------------------------ If the FLOAT language only has floats and bools, does it make sense to have a round expression? B. Enhanced ASTs as an IR What changes to the ASTs are needed for new syntax and for code generation? ------------------------------------------ ENHANCED ASTS AS AN IR See ast.h Changes to ASTs: // S ::= begin { VD } { S } typedef struct { AST_list vds; AST_list stmts; } begin_t; ------------------------------------------ 1. identifier uses ------------------------------------------ IDENTIFIER USES IN ASTs // E ::= x typedef struct { // name of a constant or variable const char *name; // set during static analysis, // includes info for lexical addr id_use *idu; } ident_t; // S ::= read x typedef struct { AST *ident; } read_t; // S ::= assign x E typedef struct { AST *ident; AST *exp; } assign_t; ------------------------------------------ ------------------------------------------ ID_USE STRUCTURES typedef struct { id_attrs *attrs; unsigned int levelsOutward; } id_use; ------------------------------------------ Why track attributes in the AST? 2. id attributes (id_attrs) ------------------------------------------ ID ATTRIBUTES // attributes of idents typedef struct { file_location file_loc; var_type vt; // type // offset from beginning of scope unsigned int loc_offset; } id_attrs; // where: typedef enum {float_t, bool_t} var_type; ------------------------------------------ C. scopes and symbol tables Where do nested scopes happen in this language? 1. scope checking begin statements ------------------------------------------ void scope_check_beginStmt(AST *stmt) { symtab_enter_scope(); // <******* scope_check_varDecls( stmt->data.begin_stmt.vds); AST_list stmts = stmt->data.begin_stmt.stmts; while (!ast_list_is_empty(stmts)) { scope_check_stmt( ast_list_first(stmts)); stmts = ast_list_rest(stmts); } symtab_leave_scope(); // <******** } ------------------------------------------ D. Writing a Code Generator ------------------------------------------ GENERATING CODE Done in the file gen_code.c - Functions arranged to walk the ASTs - All return a code_seq Useful files: ast.h id_attrs.h id_use.h code.h ------------------------------------------ What does a function in gen_code get as an argument? ------------------------------------------ STEPS FOR CODE GENERATION 1. start with the base cases 2. Write simplest tests possible 3. Design code sequences for the nonterminals involved 4. Write code for each node of the AST 5. Test it: a. check the output machine code b. check the VM's execution ------------------------------------------ E. Examples of generating code 1. expressions a. design ------------------------------------------ EXPRESSIONS What are the base cases? A very simple test: What code sequence do we want? ------------------------------------------ And what are the inductive cases? What instruction can we use to allocate space for x? How should we get the fp for x on top of the stack? How should we get the value of 1 on top of the stack LIT 1 So, that is our code sequence: INC 1 PBP 0 LIT 1 STO 0 Q: Anything else we are missing in this code? Anything else we are missing in this code? b. coding it ------------------------------------------ GENERATING CODE Where does execution start? What is the AST for our program? Where do we generate those code sequences? Let's write it! ------------------------------------------ c. more expression cases d. binary expressions F. statements Any more base cases for statments in FLOAT? What are the inductive cases? i. begin-statements without var decls ii. read and write statements iii. if statements G. nested scopes ------------------------------------------ NESTED SCOPES Example in FLOAT: # $Id$ float x; { float y; { float z; z = 0; y = 1; x = 2; } } What kind of code sequence for this? ------------------------------------------ What should be done to enter a scope? What should be done to exit a scope?