Compiler Project
Compilers
COP-3402
Table of Contents
Compiler overview
Source to process
- Recall how source code becomes a process
- Our project will be on the compiler part
- Translate source to machine code
Compiler input and output
(Diagram)
Input language (SimpleIR) -> Compiler (CodeGen) -> Output language (x86 assembly)
Example run of the compiler
x := x + 2
# load x and 2
mov -16(%rbp), %rax
mov $2, %rbx
# perform addition, i.e., rax = rax + rbx
add %rbx, %rax
# store result in x
mov %rax, -16(%rbp)
SimpleIR
Our intermediate representation.
Example: C program
int main(int argc, char **argv) {
int x;
int retval;
x = read_int();
retval = print_int(x);
return 0;
}
Example: SimpleIR
function main localVariables argc argv x retval parameters argc argv x := call read_int x := x * 2 retval := call print_int x return 0 end function
Compilation units
- One file is one compilation
- One compilation unit defines one function
Minimum SimpleIR
function main return 0 end function
function,return, andend functionmust appear in each SimpleIR program- Function takes a single name
- Return takes an integer literal or a variable name
Defines a new function called main.
With parameters
function main localVariables argc argv x y parameters argc argv # instructions go here return 0 end function
- localVariables declares all variable, including parameters
- parameters are a subset of localVariables
For instance, in C, parameters to the function are local variables. In SimpleIR you need to declare all local variables first, include all parameters. Then specify separately which local variables are the parameters.
Statements
Assignment
x := 2 y := x
- Sets the value of a variable to
- Use the
:=symbol - Assignment to either
- an to integer literal or
- another variable name
Arithmetic
x := x + 2
- A single operation, no arithmetic expressions
- Always assigns the result to a variable name
Function calls
x := call print_int x
- Works like a C function call
- No parentheses
- List arguments after the function name
Branching
ifgoto x > 0 goto endlabel x := x * -1 endlabel:
- No structured code, more like assembly
- goto for unconditional branches, ifgoto for conditional
- Define labels as targets of branching
Pointers
t1 := &x t2 := *t1 *t1 := 11
- Ampersand gets address
- t2 := *t1 dereferences t1 and gets its value
- *t1 dereferences t1 and assigns its value
Example program: exponents
function exponent localVariables base exp result parameters base exp result := 1 top: if exp <= 0 goto endlabel result := result * base exp := exp - 1 goto top endlabel: return result end function
SimpleIR Grammar
grammar SimpleIR;
unit: function;
function: 'function' functionName=NAME localVariables? parameters? statement* returnStatement end;
localVariables: 'localVariables' variables=NAME+;
parameters: 'parameters' formals=NAME+;
returnStatement: 'return' operand=(NAME | NUM);
end: 'end' 'function';
statement: assign | dereference | reference | assignDereference | operation | call | label | gotoStatement | ifGoto;
operation: variable=NAME ':=' operand1=(NAME | NUM) operatorKind=('+' | '-' | '*' | '/' | '%') operand2=(NAME | NUM);
assign: variable=NAME ':=' operand=(NAME | NUM);
dereference: variable=NAME ':=' '*' operand=NAME;
reference: variable=NAME ':=' '&' operand=NAME;
assignDereference: '*' variable=NAME ':=' operand=(NAME | NUM);
call: variable=NAME ':=' 'call' functionName=NAME actuals=NAME*;
label: labelName=NAME ':';
gotoStatement: 'goto' labelName=NAME;
ifGoto: 'if' operand1=(NAME | NUM) operatorKind=('=' | '!=' | '<' | '<=' | '>' | '>=') operand2=(NAME | NUM) 'goto' labelName=NAME;
NAME: [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUM: [0-9]+ ;
PLUS: '+' ;
MINUS: '-' ;
STAR: '*' ;
SLASH: '/' ;
PERCENT: '%' ;
EQ: '=' ;
NEQ: '!=' ;
LT: '<' ;
LTE: '<=' ;
GT: '>' ;
GTE: '>=' ;
WS: [ \t\r\n]+ -> skip ;
COMMENT : '#' ~[\r\n]* -> skip ;
Implementation
Semantic actions
- Recall how we define compilers/interpreters
- Define the grammar
- Define semantic actions for each construct
ANTLR
- Parsing framework
- Define grammar
- Write semantic actions
Listener model
- Read the ANTLR book
- Examples
Use some SimpleIR, e.g., assignment
Show syntax tree
Show translation specification
Finally show actual C++ code
Intel x86-64 assembly
Memory layout
wiki.osdev.org/File:Elfdiagram.png
Note that the heap and stack and created at runtime by the OS
https://wiki.osdev.org/ELF#Loading_ELF_Binaries https://stackoverflow.com/questions/9226088/how-are-the-different-segments-like-heap-stack-text-related-to-the-physical-me
Assembly file layout
- data
- Fixed size, global data section (bss section is zeroed out)
- rodata
- Immutable data, e.g., for string constants
- text
- Executable part
- Assembly file converted to an exectuable
- Loader copies executable segments into memory
x86 registers
| Register name | Description |
|---|---|
| %rax | General-purpose (A for accumulator) |
| %rbx | General-purpose (B for base) |
| %rdx | General-purpose (D for data) |
| %rsp | Stack pointer (SP), top of stack |
| %rbp | Base pointer (BP), local variable access |
https://wiki.osdev.org/CPU_Registers_x86
These are the only registers we'll need to know about for this class.
%rax is AT&T syntax, which is the default syntax used by gcc.
x86 operations
Variables, constants, pointers
| Operation | Description |
|---|---|
| mov | Move data to/from registers and memory |
Arithmetic
| Operation | Description |
|---|---|
| add | Integer addition |
| sub | Integer subtraction |
| imul | Integer multiplication |
| idiv | Integer division |
| cdq | Convert double-word to quad-word (for division) |
Control-flow
| Operation | Description |
|---|---|
| jmp | Jump to a given instruction address |
| cmp | Compare (used for conditional jumps) |
| je | Jump if cmp was equal |
| jne | Jump if cmp was not equal |
| jl | Jump if cmp was less than |
| jle | Jump if cmp was less than or equal |
| jg | Jump if cmp was greater than |
| jge | Jump if cmp was greater than or equal |
Functions
| Operation | Description |
|---|---|
| push | Push onto the stck |
| pop | Pop from the stack |
| call | Push the return address and jump (for functions) |
| ret | Pop the return address and jump to it (for functions) |
https://www.felixcloutier.com/x86/
These are the only ops we'll use in this class.
We will go over them in more detail for the parts of the SimpleIR that can use them.
x86 operand ordering
Destination operand is last, even for arithmetic operations.
| Operation | C-like syntax |
|---|---|
mov %rbx, %rax |
rax = rbx |
add %rbx, %rax |
rax = rax + rbx |
This is AT&T syntax, used by gcc.
Example SimpleIR
function main localVariables x y x := 3 y := 5 return 0 end function
Local variable layout
(Diagram)
Diagram
- Illustrate offsets and register for x and y
- Point out that the local variables are basically in an array
- Show how the %rbp register points to the first element (%rbp is like the array's name in C)
- We start in the second slot, because other info used by the function will be in the first and second ones
eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/
Symbol table
Record the memory offset of the variable
localVariables x y
| Variable | Offset |
|---|---|
| x | -16 |
| y | -24 |
Variable assignment
x := 3
# assign 3 to x mov $3, %rax mov %rax, -16(%rbp)
Array accesses
-16(%rbp) # rbp[-16]
Reminder about the array access
Diagram
- Show each part of the operand
- Relate it to a C array index
Assembly
.file "stdin"
.section .note.GNU-stack,"",@progbits
.text
.globl main
.type main, @function
main:
# prologue, update stack pointer
pushq %rbp # save old base ponter
movq %rsp, %rbp # set new base pointer
push %rbx # %rbx is callee-saved
# allocate stack space for locals
sub $24, %rsp
# assign 3 to x
mov $3, %rax
mov %rax, -16(%rbp)
# assign 5 to y
mov $5, %rax
mov %rax, -24(%rbp)
# set return value
mov $0, %rax
# epilogue
add $24, %rsp
pop %rbx # restore %rbx
pop %rbp # restore old base pointer
ret