Code Generation 2
COP-3402
Overview
In this project you will compile function definitions and calls to x86 assembly. As with all programming projects, it will be submitted via git.
Setup the repo
1. ssh into eustis, replacing NID with your UCF NID.
ssh NID@eustis.eecs.ucf.edu
2. Clone the compiler template to codegen2.
git clone https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/repos/compiler-template.git/ codegen2
3. Enter the repo.
cd codegen2/
If this doesn't work, double-check step (2) and make sure you put codegen2 as the second argument to clone.
4. Add the URL of your personal remote repository, replacing NID with your UCF NID.
git remote add submission gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen2
5. Synchronize your local repo with the remote eustis3 repo.
git push --set-upstream submission master
You only need to do this once. Use git commit and git push regularly to keep the remote repo up to date.
Setting up your development environment
Be sure you are in the repo directory.
cd ~/codegen2
If you receive a warning about a managed python installation, then double-check that you are on eustis.
Then create the development environment. This creates an "editable" installation of your project, so that you can modify its source and rerun without having to reinstall the project.
pipenv install -e ./
If you can't run pipenv but you've already installed it, try logging out and back in again to eustis.
If you haven't installed pipenv yet, please review the calc project.
Enter your pipenv development environment. Do this every time you log in to eustis to work on your project.
pipenv shell
Double-check that you are in the environment. Your prompt should look something like this:
(compiler) NID@net1547:~/codegen2$
You can later exit the dev environment with exit. You do not need to enter the dev environment again if you have already entered it.
Get ANTLR and build the parser.
make -C grammar/
Compiler project structure
| File | Description |
|---|---|
| Pipfile | pipenv settings |
| compiler/CodeGen.py | The code generator that you will write. Not provided by the template repo. |
| compiler/Interpreter.py | A SimpleIR interpreter for comparing output |
| grammar/Makefile | A build file for the grammar |
| grammar/SimpleIR.g4 | The SimpleIR grammar |
| pyproject.toml | python project settings |
.gitignore files are for git and __init__.py are for python modules.
Implementation
Skeleton code
Here is the start to compiler/CodeGen.py
import os
import sys
import math
from textwrap import indent, dedent
from antlr4 import *
from grammar.SimpleIRLexer import SimpleIRLexer
from grammar.SimpleIRParser import SimpleIRParser
from grammar.SimpleIRListener import SimpleIRListener
import logging
logging.basicConfig(level=logging.DEBUG)

# This class defines a complete listener for a parse tree produced by SimpleIRParser.
class CodeGen(SimpleIRListener):
    def __init__(self, filename, outfile):
        self.filename = filename
        self.outfile = outfile
        self.symtab = {}
        self.bytewidth = 8

    def enterUnit(self, ctx:SimpleIRParser.UnitContext):
        """Creates the object file sections"""
        self.outfile.write(
f'''\t.file "{self.filename}"
\t.section .note.GNU-stack,"",@progbits
\t.text
''')

    def enterLocalvars(self, ctx:SimpleIRParser.LocalvarsContext):
        """Allocates space for local variables, including input parameters"""
        # get list of local variables
        locals = [ local.getText() for local in ctx.NAME() ]
        # create a list of offsets of the bytewidth
        offsets = map(lambda x: (x+1)*-self.bytewidth, range(len(locals)))
        # create a dictionary mapping locals to offsets
        self.symtab = dict(zip(locals, offsets))
        logging.debug(self.symtab)
        # compute the size of the stack space
        stackspace = len(self.symtab.keys()) * self.bytewidth
        logging.debug(stackspace)
        # ceiling to 8 bytes
        stackoffset = math.ceil(stackspace / 8) * 8
        # align to 16 bytes by adding 8 to account for rbx if already aligned to 16 bytes
        stackoffset += (stackoffset + 8) % 16
        logging.debug(stackoffset)
        # Emit an instruction to allocate stack space for locals
        self.outfile.write(indent(dedent(f'''\
            # allocate stack space for locals
            sub\t${stackoffset}, %rsp
        '''), '\t'))

    def enterAssign(self, ctx:SimpleIRParser.AssignContext):
        """Assign value to a variable"""
        if SimpleIRParser.NAME == ctx.operand.type:
            operand = f"{self.symtab[ctx.operand.text]}(%rbp)"
        elif SimpleIRParser.NUM == ctx.operand.type:
            operand = f"${ctx.operand.text}"
        self.outfile.write(indent(dedent(f'''\
            # assign {ctx.operand.text} to {ctx.NAME(0).getText()}
            mov\t{operand}, %rax
            mov\t%rax, {self.symtab[ctx.NAME(0).getText()]}(%rbp)
        '''), '\t'))

    def enterFunction(self, ctx:SimpleIRParser.FunctionContext):
        """Emits the label and prologue"""
        # TODO (same as codegen1)

    def exitFunction(self, ctx:SimpleIRParser.FunctionContext):
        """Emits the epilogue"""
        # TODO (same as codegen1)

    def enterParams(self, ctx:SimpleIRParser.ParamsContext):
        """Moves input parameters to their local variables"""
        # TODO

    def enterReturn(self, ctx:SimpleIRParser.ReturnContext):
        """Sets the return value"""
        # TODO

    def enterCall(self, ctx:SimpleIRParser.CallContext):
        """Function call"""
        # TODO

def main():
    import sys
    if len(sys.argv) > 1:
        filepath = sys.argv[1]
        input_stream = FileStream(filepath)
        filename = os.path.basename(filepath)
    else:
        input_stream = StdinStream()
        filename = "stdin"
    lexer = SimpleIRLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SimpleIRParser(stream)
    tree = parser.unit()
    if parser.getNumberOfSyntaxErrors() > 0:
        print("syntax errors")
        exit(1)
    # print(tree.toStringTree())
    walker = ParseTreeWalker()
    walker.walk(CodeGen(filename, sys.stdout), tree)

if __name__ == '__main__':
    main()
Emitting assembly code
The CodeGen class provides a self.outfile file to write to. In python, write a string using
self.outfile.write("The string to emit")
Alternatively, you can use a format string to make creating templates easier, where anything inside curly braces is evaluated, e.g., the following prints a string followed by the contents of a variable called name:
self.outfile.write(f"This is what is in the name variable: {name}")
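The skeleton code combines format strings with textwrap.dedent and textwrap.indent so that assembly templates can be indented naturally in the Python source. The following is a minimal sketch of that pattern. The emit_mov helper and the io.StringIO buffer (standing in for self.outfile) are assumptions for illustration and are not part of the template repo.

```python
import io
from textwrap import dedent, indent

def emit_mov(outfile, source, dest):
    """Write a commented mov instruction, tab-indented like the skeleton's output.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # dedent strips the common leading spaces of the template,
    # then indent prefixes every non-blank line with a tab
    outfile.write(indent(dedent(f'''\
        # move {source} to {dest}
        mov\t{source}, {dest}
        '''), '\t'))

buffer = io.StringIO()  # stands in for self.outfile
emit_mov(buffer, "$1", "%rax")
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer like this can also be a convenient way to inspect what a listener method emits without running the whole compiler.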
To retrieve ANTLR parse tree contents, use the ctx context parameter provided to each listener together with the name of the token, e.g., the following will get the NAME token from the syntax tree for a function production and store it in the name python variable.
name = ctx.NAME()
Laying out the assembly file
enterUnit
This function is given to you. It creates assembly code boilerplate for you.
Variables
Code to manage variables for this project is given to you.
enterLocalvars (given to you)
This function generates assembly instructions that allocate space on the stack for all local variables. In SimpleIR, all variables are 64-bit integers. enterLocalvars is given to you. It works by first collecting the set of variables to allocate space for from the syntax tree of the input program:
locals = [ local.getText() for local in ctx.NAME() ]
Then it creates space on the stack for each local variable. The memory location of each local variable can be determined by an offset from the base pointer %rbp per the System V AMD64 ABI. The compiler records each variable's offset in a table that maps the name of the variable to its offset from %rbp, which the compiler will use to translate variable accesses and assignments whenever the variable is used in SimpleIR instructions. The given enterLocalvars function creates this dictionary with a little python cleverness, i.e.,
# create a list of offsets of the bytewidth
offsets = map(lambda x: (x+1)*-self.bytewidth, range(len(locals)))
# create a dictionary mapping locals to offsets
self.symtab = dict(zip(locals, offsets))
self.symtab now contains a mapping from each local variable to its own offset (which are multiples of the 8-byte width), e.g., -8, -16, -24, etc. (Do you remember why the offsets are negative in this ABI?)
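To see the offset table concretely, here is a small standalone sketch of the same computation outside the listener class. The build_symtab function name is an assumption for illustration; the arithmetic mirrors the given enterLocalvars code.

```python
def build_symtab(local_names, bytewidth=8):
    """Map each local variable name to its offset from %rbp.

    Hypothetical standalone version of the dictionary built in enterLocalvars.
    """
    # the first local sits at -8, the second at -16, and so on
    offsets = [-(index + 1) * bytewidth for index in range(len(local_names))]
    return dict(zip(local_names, offsets))

symtab = build_symtab(["x", "a", "b"])
print(symtab)  # {'x': -8, 'a': -16, 'b': -24}
```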
Finally, the function computes the total amount of space required for the local variables (number of variables times 8 bytes each), aligns the stack to 16 bytes (required by the ABI), and emits an assembly instruction that updates the stack pointer to allocate the space, i.e.,
# compute the size of the stack space
stackspace = len(self.symtab.keys()) * self.bytewidth
logging.debug(stackspace)
# ceiling to 8 bytes
stackoffset = math.ceil(stackspace / 8) * 8
# align to 16 bytes by adding 8 to account for rbx if already aligned to 16 bytes
stackoffset += (stackoffset + 8) % 16
logging.debug(stackoffset)
# Emit an instruction to allocate stack space for locals
self.outfile.write(indent(dedent(f'''\
# allocate stack space for locals
sub\t${stackoffset}, %rsp
'''), '\t'))
Because of this function, self.symtab will tell you how to translate variable uses into offsets from the base pointer %rbp for generating equivalent assembly instructions.
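The alignment arithmetic can be checked by hand with a short sketch. The stack_allocation function below is a hypothetical standalone copy of the computation in enterLocalvars. The idea is that the allocation plus the 8 bytes used by the pushed %rbx should come out as a multiple of 16.

```python
import math

def stack_allocation(num_locals, bytewidth=8):
    """Return the number of bytes subtracted from %rsp for the locals.

    Hypothetical helper mirroring the arithmetic in enterLocalvars.
    """
    stackspace = num_locals * bytewidth
    # ceiling to 8 bytes
    stackoffset = math.ceil(stackspace / 8) * 8
    # add 8 when the locals plus the saved %rbx would not be a multiple of 16
    stackoffset += (stackoffset + 8) % 16
    return stackoffset

for count in (1, 2, 9, 11):
    allocation = stack_allocation(count)
    # allocation + 8 (for %rbx) is always a multiple of 16
    print(count, allocation, (allocation + 8) % 16)
```

For 9 and 11 locals this gives 72 and 88 bytes, which are the sub amounts that appear in the paramtest.s and main.s listings in the full example.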
enterAssign (given to you)
The syntax of the assignment statement is
assign: NAME ':=' operand=(NAME | NUM);
Notice that operand may either be a NAME (which is the syntax of an identifier) or a NUM (which is the syntax for an integer literal). An assignment takes either a constant integer literal or a variable identifier and copies that data to the left-hand side variable. This translates to either a mov immediate to a register (integer literal) or a load from memory to a register (variable name on right-hand side), followed by a store of that value to the memory location associated with the variable name.
If the right-hand side is a variable name, the compiler will generate a load from the memory location of the variable to a temporary register (we'll use %rax for the temporary register). To determine the memory location, recall that our compiler assigns each local variable an offset from %rbp that we stored in an offset table created by enterLocalvars. For example, if variable b has offset -24 from %rbp, to load that memory location's value into %rax, we use the following instruction:
mov -24(%rbp), %rax
-24(%rbp) means: take the value of %rbp (which is a pointer to the current stack frame), add -24 to it, then dereference the address and load that value into %rax.
To implement this in our compiler, enterAssign first determines whether the operand is a NAME or a NUM, i.e.,
if SimpleIRParser.NAME == ctx.operand.type:
When the operand is a NAME, the compiler creates the %rbp-indirect offset argument, e.g., -24(%rbp), i.e.,
operand = f"{self.symtab[ctx.operand.text]}(%rbp)"
When the operand is a NUM, we only need to use an immediate value argument, i.e.,
elif SimpleIRParser.NUM == ctx.operand.type:
operand = f"${ctx.operand.text}"
Either way, the operand python variable now contains the corresponding assembly instruction argument for the right-hand side of the SimpleIR statement.
Finally, enterAssign generates the copy of the right-hand side to the temporary register %rax, then stores the value into the memory location of the left-hand side variable:
self.outfile.write(indent(dedent(f'''\
# assign {ctx.operand.text} to {ctx.NAME(0).getText()}
mov\t{operand}, %rax
mov\t%rax, {self.symtab[ctx.NAME(0).getText()]}(%rbp)
'''), '\t'))
The left-hand side variable's memory location can also be found using the offset table (self.symtab). The name of the variable is retrieved from the syntax tree with ctx.NAME(0).getText(), since the left-hand side variable is the first NAME in the grammar for the assignment.
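The operand logic can be tried out without ANTLR by passing plain values instead of a parse tree context. In the sketch below, operand_to_asm and assign_to_asm are hypothetical helpers, and the is_name flag stands in for the ctx.operand.type check. They are assumptions for illustration, not part of the template.

```python
def operand_to_asm(is_name, text, symtab):
    """Translate a SimpleIR operand into an assembly argument (illustrative)."""
    if is_name:
        # variable: %rbp-relative memory location from the offset table
        return f"{symtab[text]}(%rbp)"
    # integer literal: immediate value
    return f"${text}"

def assign_to_asm(target, is_name, text, symtab):
    """Return the assembly lines for 'target := operand' (illustrative)."""
    operand = operand_to_asm(is_name, text, symtab)
    return [
        f"# assign {text} to {target}",
        f"mov\t{operand}, %rax",
        f"mov\t%rax, {symtab[target]}(%rbp)",
    ]

symtab = {"x": -8, "c": -32}
from_variable = assign_to_asm("x", True, "c", symtab)
from_literal = assign_to_asm("x", False, "5", symtab)
print("\n".join(from_variable + from_literal))
```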
Defining functions
enterFunction (same as codegen1)
To create the function, emit the assembly pseudo-ops .globl and .type with the name of the function (ctx.NAME()), as well as a label for the function. Then emit the prologue. Use whatever the name of the function is for all three, e.g., for the function factorial the function creation and prologue would look like this:
.globl factorial
.type factorial, @function
factorial:
# prologue
pushq %rbp # save old base pointer
movq %rsp, %rbp # set new base pointer
push %rbx # %rbx is callee-saved
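One possible way to build this text is a format string template parameterized by the function name. The function_prologue helper below is a hypothetical sketch that returns the text instead of writing it to self.outfile. In the listener, the name would come from the parse tree rather than a plain string argument.

```python
def function_prologue(name):
    """Return the pseudo-ops, label, and prologue for a function (illustrative)."""
    lines = [
        f"\t.globl {name}",
        f"\t.type {name}, @function",
        f"{name}:",                      # the label is not indented
        "\t# prologue",
        "\tpushq\t%rbp\t# save old base pointer",
        "\tmovq\t%rsp, %rbp\t# set new base pointer",
        "\tpush\t%rbx\t# %rbx is callee-saved",
    ]
    return "\n".join(lines) + "\n"

prologue = function_prologue("factorial")
print(prologue, end="")
```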
exitFunction (same as codegen1)
Emit the assembly function epilogue and return instruction, i.e.,
# epilogue
mov %rbp, %rsp # restore old stack pointer
pop %rbp # restore old base pointer
ret
This is the same for all functions.
enterParams (new for codegen2)
In SimpleIR, the input program first declares all local variables, e.g.,
localvars x a b c d e f g h
then it declares which of those local variables is a parameter, e.g.,
params a b c d e f g h
In the above example, all local variables except for x are also parameters. The set of parameters is a subset of the local variables (or the same set).
The enterLocalvars function emits code that creates space for all local variables, including parameters. enterParams then copies the parameter values set by the caller to the corresponding parameter local variables. For instance, for the above set of params, the following assembly code copies the caller's values to the callee's local variables:
# move register parameter a to local variable
mov %rdi, -16(%rbp)
# move register parameter b to local variable
mov %rsi, -24(%rbp)
# move register parameter c to local variable
mov %rdx, -32(%rbp)
# move register parameter d to local variable
mov %rcx, -40(%rbp)
# move register parameter e to local variable
mov %r8, -48(%rbp)
# move register parameter f to local variable
mov %r9, -56(%rbp)
# move stack parameter g to local variable
mov 16(%rbp), %rax
mov %rax, -64(%rbp)
# move stack parameter h to local variable
mov 24(%rbp), %rax
mov %rax, -72(%rbp)
The caller passes parameters both via registers and the stack: the first six via registers and the rest via the stack. To copy these parameters to the memory locations for the callee's local variables, enterParams generates mov instructions for the first six parameters from the ABI-defined set of parameter registers, i.e.,
registers = [ "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" ]
enterParams can retrieve the list of parameters names for the input program from the ANTLR syntax tree, i.e.,
params = [ param.getText() for param in ctx.NAME() ]
The lists of register and stack parameters can be sliced out with
register_params = params[0:6]
stack_params = params[6:]
For each of the register parameters, generate a mov from the corresponding register to the local variable's memory location. The local variables' memory locations are offsets from the base pointer, as described in enterLocalvars, i.e., by passing the name of the variable to self.symtab to get the offset. Then generate the mov instruction from the register, e.g., from the example above
# move register parameter c to local variable
mov %rdx, -32(%rbp)
This instruction stores the value in %rdx, i.e., the third parameter, into -32(%rbp), which is the memory location of the local variable c for the above program. See enterAssign for more details on getting the memory location of local variables.
For the stack-allocated parameters, mov them from the stack to the local variable. Since this is a move between two memory locations, the instruction set requires first copying to a temporary register (we'll use %rax), resulting in two mov instructions. The first instruction copies from the parameter's memory location on the stack to %rax, e.g.,
mov 16(%rbp), %rax
Then the next instruction copies from %rax to the local variable's memory location, given by self.symtab, e.g., for the 7th parameter g,
mov %rax, -64(%rbp)
The offsets are positive from %rbp (why is this the case?) and they start from 16. This is because %rbp plus 0 holds the old base pointer, %rbp plus 8 holds the return address, then %rbp plus 16 is the 7th parameter (the first after the register parameters), %rbp plus 24 is the 8th, etc.
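Putting the register and stack cases together, the following is one possible sketch of the logic, written as a standalone function over plain lists and dictionaries so that it can be run without ANTLR. The params_to_asm name, its return value (a list of lines), and the REGISTERS constant name are assumptions for illustration. In the listener, the parameter names would come from ctx.NAME() and the lines would be written to self.outfile.

```python
REGISTERS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]

def params_to_asm(params, symtab, bytewidth=8):
    """Return assembly lines copying incoming parameters to locals (illustrative)."""
    lines = []
    register_params = params[0:6]
    stack_params = params[6:]
    # the first six parameters arrive in registers
    for register, name in zip(REGISTERS, register_params):
        lines.append(f"# move register parameter {name} to local variable")
        lines.append(f"mov\t{register}, {symtab[name]}(%rbp)")
    # the rest arrive on the stack, starting at 16(%rbp) for the 7th parameter
    for index, name in enumerate(stack_params):
        caller_offset = 16 + index * bytewidth
        lines.append(f"# move stack parameter {name} to local variable")
        lines.append(f"mov\t{caller_offset}(%rbp), %rax")
        lines.append(f"mov\t%rax, {symtab[name]}(%rbp)")
    return lines

# offset table for: localvars x a b c d e f g h
names = ["x", "a", "b", "c", "d", "e", "f", "g", "h"]
symtab = {name: -8 * (i + 1) for i, name in enumerate(names)}
param_lines = params_to_asm(names[1:], symtab)
print("\n".join(param_lines))
```

With this offset table, g is read from 16(%rbp) and h from 24(%rbp), which agrees with the paramtest.s listing in the full example.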
Calling functions
For this project, you will need to support an arbitrary number of parameters to function calls and support storing return values. There are several steps to calling a function in assembly:
- Passing parameters
- Making the call
- Restoring the stack pointer
- Saving the return value
The following is an example of a SimpleIR function call. Note that this assumes that the variable names have already been declared (enterLocalvars) and defined (enterAssign), both of which are already given to you.
x := call paramtest a b c d e f g h
The first step, parameter passing, uses both registers and the stack to pass the parameters (per the System V AMD64 ABI).
# pass the first six parameters via the pre-defined set of registers
mov -32(%rbp), %rdi
mov -40(%rbp), %rsi
mov -48(%rbp), %rdx
mov -56(%rbp), %rcx
mov -64(%rbp), %r8
mov -72(%rbp), %r9
# pass the remaining arguments via the stack, in reverse order
push -88(%rbp)
push -80(%rbp)
# make the call
call paramtest
# restore the stack pointer
add $16, %rsp
# save the return value
mov %rax, -24(%rbp)
There is one mov instruction for each of the first six parameters (recall that the System V AMD64 ABI passes the first six via registers instead of the stack). These instructions copy the local variable's value to one of the predefined registers. The local variables' memory locations are offsets from the base pointer, as described in enterLocalvars. The predefined set of registers for the first six parameters, in python, is
registers = [ "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" ]
You can retrieve the left-hand side variable (to assign the return value to), the function name to call, and the parameters with the following python code:
call = [ name.getText() for name in ctx.NAME() ]
varname = call[0]
funname = call[1]
params = call[2:]
params is a list containing all the parameters in order. Take the first six with
register_params = params[0:6]
and the remaining stack parameters with
stack_params = params[6:]
For the register parameters, generate a mov instruction that copies the local variable's value (from the stack frame, e.g., -32(%rbp)) to the corresponding parameter register (listed in registers above). Then, for each stack parameter (if there are any), generate a push instruction that pushes the local variable's value. Push the stack parameters in reverse order, as the main.s output in the full example does, so that the 7th parameter ends up closest to the return address.
After processing the parameters, generate the call to the function (given in funname), e.g., call paramtest above. After the call, generate an instruction to remove any parameters from the stack (what instruction(s) could be used to do this? How do we know how much space was used on the stack?) This is done in the example above with
add $16, %rsp
Finally, the return value is stored in %rax according to the System V AMD64 ABI convention (which your functions will also follow). The left-hand side variable name is given in varname. Recall how to access the memory location of a variable, i.e., using the offset table (self.symtab). Use a mov to store the return value (%rax) into the memory location of the left-hand side variable, e.g.,
mov %rax, -24(%rbp)
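The four steps can be sketched as a standalone function over plain lists and dictionaries, so that it can be run without ANTLR. The call_to_asm name, its list-of-lines return value, and the choice to skip the add when there are no stack parameters are assumptions for illustration. The stack parameters are pushed last-to-first, matching the order in the main.s listing of the full example, where h is pushed before g.

```python
REGISTERS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]

def call_to_asm(varname, funname, params, symtab, bytewidth=8):
    """Return assembly lines for 'varname := call funname params' (illustrative)."""
    lines = []
    register_params = params[0:6]
    stack_params = params[6:]
    # 1. pass the first six parameters via registers
    for register, name in zip(REGISTERS, register_params):
        lines.append(f"mov\t{symtab[name]}(%rbp), {register}")
    # ... and the rest via the stack, last parameter first
    for name in reversed(stack_params):
        lines.append(f"push\t{symtab[name]}(%rbp)")
    # 2. make the call
    lines.append(f"call\t{funname}")
    # 3. restore the stack pointer: one bytewidth per pushed parameter
    if stack_params:
        lines.append(f"add\t${len(stack_params) * bytewidth}, %rsp")
    # 4. save the return value
    lines.append(f"mov\t%rax, {symtab[varname]}(%rbp)")
    return lines

# offset table for: localvars argc argv x a b c d e f g h
names = ["argc", "argv", "x", "a", "b", "c", "d", "e", "f", "g", "h"]
symtab = {name: -8 * (i + 1) for i, name in enumerate(names)}
call_lines = call_to_asm("x", "paramtest", names[3:], symtab)
print("\n".join(call_lines))
```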
Returning
SimpleIR allows for either a return of an integer constant or a variable. Similarly to enterAssign you can check for this by checking ctx.operand.type and use the same if-then-else statement to generate the assembly instruction argument (either a load from an offset from %rbp or an immediate constant). Then store this value into %rax, per the ABI, e.g., mov -8(%rbp), %rax.
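As a rough sketch of this step, the hypothetical return_to_asm function below takes plain values in place of the parse tree context. The is_name flag stands in for the ctx.operand.type check, and the function name and list-of-lines return value are assumptions for illustration.

```python
def return_to_asm(is_name, text, symtab):
    """Return assembly lines that place the return value in %rax (illustrative)."""
    if is_name:
        # variable: load from its %rbp-relative memory location
        operand = f"{symtab[text]}(%rbp)"
    else:
        # integer literal: immediate value
        operand = f"${text}"
    return ["# set return value", f"mov\t{operand}, %rax"]

return_variable = return_to_asm(True, "x", {"x": -8})
return_constant = return_to_asm(False, "10", {})
print("\n".join(return_variable + return_constant))
```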
Testing your compiler
This example compiles two functions, main and paramtest. If successful, you will have two assembly files, main.s and paramtest.s. main calls paramtest, which returns the value of its third parameter c, i.e., 3. main stores that value in x and returns it, so the exit code of the program is 3.
codegen << EOT | tee main.s
function main
localvars argc argv x a b c d e f g h
params argc argv
a := 1
b := 2
c := 3
d := 4
e := 5
f := 6
g := 7
h := 8
x := call paramtest a b c d e f g h
return x
EOT
codegen << EOT | tee paramtest.s
function paramtest
localvars x a b c d e f g h
params a b c d e f g h
x := c
return x
EOT
gcc -o main main.s paramtest.s
./main
echo $? # you should see 3 as the exit code
Debugging with GDB
One way to help trace the function call is to use gdb. The following will rebuild the main program with debugging symbols on, run gdb, then step through each assembly instruction.
gcc -g -o main main.s paramtest.s # compile with debugging symbols (-g)
gdb main
b main # set up breakpoint at main
r # start running, breaks at main
si # step instruction to see next instruction
# hitting enter will repeat last command, e.g., si
c # once done, use c to continue running without stopping
Debugging tutorials
See these resources for more information on gdb.
- Basic Assembler Debugging with GDB
- Chapter 2 of 21st Century C
Full example
main.ir
function main
localvars argc argv x a b c d e f g h
params argc argv
a := 1
b := 2
c := 3
d := 4
e := 5
f := 6
g := 7
h := 8
x := call paramtest a b c d e f g h
return x
main.s
Running codegen main.ir > main.s should produce similar assembly output. Note that # denotes a comment.
.file "main.ir"
.section .note.GNU-stack,"",@progbits
.text
.globl main
.type main, @function
main:
# prologue, update stack pointer
pushq %rbp # save old base pointer
movq %rsp, %rbp # set new base pointer
push %rbx # %rbx is callee-saved
# allocate stack space for locals
sub $88, %rsp
# move register parameter argc to local variable
mov %rdi, -8(%rbp)
# move register parameter argv to local variable
mov %rsi, -16(%rbp)
# assign 1 to a
mov $1, %rax
mov %rax, -32(%rbp)
# assign 2 to b
mov $2, %rax
mov %rax, -40(%rbp)
# assign 3 to c
mov $3, %rax
mov %rax, -48(%rbp)
# assign 4 to d
mov $4, %rax
mov %rax, -56(%rbp)
# assign 5 to e
mov $5, %rax
mov %rax, -64(%rbp)
# assign 6 to f
mov $6, %rax
mov %rax, -72(%rbp)
# assign 7 to g
mov $7, %rax
mov %rax, -80(%rbp)
# assign 8 to h
mov $8, %rax
mov %rax, -88(%rbp)
# call paramtest
mov -32(%rbp), %rdi
mov -40(%rbp), %rsi
mov -48(%rbp), %rdx
mov -56(%rbp), %rcx
mov -64(%rbp), %r8
mov -72(%rbp), %r9
push -88(%rbp)
push -80(%rbp)
call paramtest
add $16, %rsp
mov %rax, -24(%rbp)
# set return value
mov -24(%rbp), %rax
# epilogue
mov %rbp, %rsp # restore old stack pointer
pop %rbp # restore old base pointer
ret
paramtest.ir
Try assigning x to different input parameters to test that parameters are being passed correctly both through registers and through the stack.
function paramtest
localvars x a b c d e f g h
params a b c d e f g h
x := c
return x
paramtest.s
Running codegen paramtest.ir > paramtest.s should produce similar assembly output. Note that # denotes a comment.
.file "paramtest.ir"
.section .note.GNU-stack,"",@progbits
.text
.globl paramtest
.type paramtest, @function
paramtest:
# prologue, update stack pointer
pushq %rbp # save old base pointer
movq %rsp, %rbp # set new base pointer
push %rbx # %rbx is callee-saved
# allocate stack space for locals
sub $72, %rsp
# move register parameter a to local variable
mov %rdi, -16(%rbp)
# move register parameter b to local variable
mov %rsi, -24(%rbp)
# move register parameter c to local variable
mov %rdx, -32(%rbp)
# move register parameter d to local variable
mov %rcx, -40(%rbp)
# move register parameter e to local variable
mov %r8, -48(%rbp)
# move register parameter f to local variable
mov %r9, -56(%rbp)
# move stack parameter g to local variable
mov 16(%rbp), %rax
mov %rax, -64(%rbp)
# move stack parameter h to local variable
mov 24(%rbp), %rax
mov %rax, -72(%rbp)
# assign c to x
mov -32(%rbp), %rax
mov %rax, -8(%rbp)
# set return value
mov -8(%rbp), %rax
# epilogue
mov %rbp, %rsp # restore old stack pointer
pop %rbp # restore old base pointer
ret
More full examples
cd ~/codegen2
wget https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/files/compiler-examples.tar
tar -xvf compiler-examples.tar
Submitting your project
Stage, commit, and push to the grading server
The only file you need to submit is compiler/CodeGen.py.
Once you have set up the repo, all you need to do is use git add, git commit, and git push to stage, commit, and sync your repo to the grading git server.
Self-check
You can check that you've submitted correctly by cloning, building, and testing your repo.
cd ~
git clone gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen2 codegen2_new
cd codegen2_new
pipenv install -e ./
pipenv shell
make -C grammar/
codegen << EOT > main.s
function main
return 10
EOT
gcc -o main main.s
./main
echo $? # you should see 10 as the exit code
What should be the correct output, i.e., assignments of the variables, for these arithmetic operations?
(Only if instructed) Updating from the start repo
If the original repo gets updated after you have already started implementing your project, you can get those updates by pulling. Otherwise, you will never need to do this step. Be sure to commit any changes you have made before proceeding.
git pull origin master --rebase
git push -f
If you encounter a conflict, it may be that you modified some files from the original repo that didn't need to be modified. Come to office hours if you need help resolving the conflict.
Troubleshooting
If you make a mistake in typing the URL, you can remove the submission remote and try the add step again:
git remote rm submission
git remote add submission gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen2 # replace NID with yours
- Do not try creating a new repo if you make a mistake. You will not be able to push the new repo to gitolite3, since there already is one there. You can always make new changes and commit them to fix mistakes.
- If in self-check codegen2_new already exists, just use a fresh directory name.
The program must be run inside of the pipenv environment. You can see that you have successfully entered the environment because your prompt is prefixed with (compiler), e.g.,
(compiler) NID@net1547:~/codegen2$
You can exit the environment with exit.
Self-grading
cd ~/
git clone https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/repos/codegen2-grading.git
cd codegen2-grading
make
Look at README.md for usage instructions.
Grading schema
| Criterion | Points |
|---|---|
| The git repo exists | 1 |
| compiler/CodeGen.py exists | 1 |
| codegen runs the given example correctly | 2 |
| codegen runs new example inputs correctly | 2 |
| TOTAL | 6 |