Compiler Basics
Compilers
COP-3402
Table of Contents
Programming language implementation
Can your processor run C code?
gcc -o hello hello.c ./hello
Why are there two commands?
gcc
takes source code and converts it to a runnable program.
Keep in mind that gcc
itself is also a program and bash creates a process to exec the gcc
program
Two main implementation techniques
Compilers and Interpreters
Compiler or interpreter, either way, we have another program that takes our C code and makes it run on our machine.
Two general techniques, though these are actually blended for real-world compiled or interpreted languages.
Roughly two ways to design your programming language implementation:
- Compile or translate to the instructions that the machine does run, so that the machine can run an equivalent of the program
- Interpret or simulate the actions of the instructions to run the program and produce the output
Compiler
The compiler translates your source instructions to machine code.
Diagram
- .c -> gcc -> .exe, input -> cpu -> output
Compiler
The machine code should have the same outputs that the source program would.
Functional equivalence is possible when programs are deterministic. How can programs, such as random number generators, create apparent non-determinism with a deterministic program? They can by using random (or hard to guess) inputs.
Probabilistic and quantum programming languages have probability distributions as outputs.
Interpreter
The interpreter simulates the actions of the source instructions.
Diagram
- .py, input -> python -> (cpu ->) output
Emulators are a special kind of interpreter interpreters of machine code, which can be used to run code written for another processor.
Virtual machines are also interpreters of machine code, though sometimes virtualization is useful even for the same processor architecture.
Interpreter
The interpreted program should have the same outputs that the source program would.
Illustrating language implementation
Getting "1+2" to run on our machine.
We assume a syntax and semantics for arithmetic operations that is similar to what C-like languages have.
Compiler approach
Translate "1+2" to machine code.
// read the source code char left = getchar(); char operator = getchar(); char right = getchar(); // generate assembly code that performs the addition if ('+' == operator) { printf(" mov $%c, %%rax\n", left); printf(" mov $%c, %%rbx\n", right); printf(" add %%rbx, %%rax\n"); // print the addition instruction }
Interpreter approach
Interpret the result of "1+2".
// read the source code char left = getchar(); char operator = getchar(); char right = getchar(); // compute the result of the addition if ('+' == operator) { int leftnum = left - '0'; int rightnum = right - '0'; int result = leftnum + rightnum; // perform addition printf("%d\n", result); }
Observe that it's pretty straightforward to interpet addition, because we have the same representation for it in C.
Key difference
- Compiler program: prints out the machine code for addition
- Interpreter program: perform the addition operation
Run and show the output of language.c.
Interpreting exponents, e.g., "2^8"?
No machine instruction or C instruction for exponentiation.
Interpreter pseudo-code
read left operand read right operand read operator if operator == "^": result = 1 for i = 1 to right operand: result = result * left operand print result
Full example for + and ^
language.c
/** gcc -o language language.c echo "1+2" | ./language echo "2^8" | ./language */ #include <stdio.h> #include <stdlib.h> #define CHAR2INT(c) ((c) - '0') int main() { char left = getchar(); char operator = getchar(); char right = getchar(); printf("input: %c%c%c\n", left, operator, right); printf("compiler\n"); switch (operator) { case '+': printf(" mov $%c, %%rax\n", left); printf(" mov $%c, %%rbx\n", right); printf(" add %%rbx, %%rax\n"); break; case '^': printf(" mov $%c, %%rbx\n", left); printf(" mov $%c, %%rcx\n", right); printf(" mov $1, %%rax\n"); printf("loop:\n"); printf(" cmp $0, %%rcx\n"); printf(" jle end\n"); printf(" imul %%rbx, %%rax\n"); printf(" sub $1, %%rcx\n"); printf(" jmp loop\n"); printf("end:\n"); break; } printf("\n"); printf("interpreter\n"); int leftnum = CHAR2INT(left); int rightnum = CHAR2INT(right); int result; switch (operator) { case '+': result = leftnum + rightnum; printf("%d\n", result); break; case '^': result = 1; for (; rightnum > 0; rightnum--) { result *= leftnum; } printf("%d\n", result); break; } }
COP-5621 Compiler Construction
If you want to learn more about how programming languages are defined and implemented.
Compiler Internals
- Front-end: process input language
- Back-end: generate output language
- front-end
- input language processing
- produces a syntax tree (will go into this later intuitively and formally with a grammar)
- back-end
- output language generation
- produces assembly/machine code
(Diagram)
- overall: input language -> compiler -> output
- front/back ends:
- input -> front-end -> syntax tree -> back-end -> output
The "middle"-end
Many compilers use an intermediate language
Diagram
- input -> front -> syntax tree -> middle-end
- IR generator -> IR -> optimizer -> IR
- middle-end -> IR -> back-end -> output
- IR often like a machine code, but without machine-like specifics (e.g., register-based, but infinite registers)
Why an intermediate representation?
- Easier front-end development
- Easier back-end development
- Portability
- Machine-independent optimization
- Modularization
Multiple front-ends and back-ends
Classic phases of a compiler
Dragon book
Symbols vs. meaning
1 + 2 * 3
6 / 2(1 + 2)
一加二乘三
What gives symbols meaning?
- Moon vs. finger pointing at the moon
- "The map is not the territory"
- symbols have no instrinsic meaning
- the interpreter/compiler/human defines meaning
- compiler defines input language in terms of output language
- why does output language have meaning?
ASCII vs. machine code
od
gcc -S
- ASCII table
Recall interpreter last time
- Input language numbering?
- Machine code numbering?
Syntax trees
Human language
"The boy went to the store"
Diagram this sentence
Syntax: groups of words
- One sentence type:
- subject phrase, verb phrase, object phrase
Syntactic structure captures relationships between words
"The boy leaving behind his friend went to the store."
- who/what went? what word goes with "went"
- the boy, because nested structures allow us to go off on another clause and return to the parent clause
What does "store" mean?
"The store is closed"
"Squirrels store acorns for the winter".
Meaning depends on how it is used with other words:
- Store as a noun: a shop that sells products
- Store as a verb: to place for later use
Meaning depends on structure
Arithmetic expressions
9 + 3 - 2
- Same idea as spoken language
- Diagram this "sentence"
- Operations are nested inside of each other
- Do one operation at a time (even in your head)
What is the assembly code version of "9 + 3 - 2"?
- Operations are one at a time
- Order of operations depend on the syntax tree
- Inuitively, you can walk the tree and compute the meaning (or generate equivalent assembly!)
Evaluating the tree
- Post order traversal
- Leaf nodes: value of the number symbol
- Inner nodes: value is result of operation on child values
Ambiguity
1 + 2 * 3
- What is the meaning, e.g., resulting value, of this expression?
- What are the two typical ways this can be interpreted?
- Draw the trees and perform the interpretation on each
- Perform the compilation on one of the trees
Parsing
Problem: we don't say "noun phrase starts here", "subject is here"
How do we figure out syntactic structure?
- Syntactic structure (word groupings) defines meaning
- Spoken language and source code have no intrinsic structure
How is source code represented?
- Just a sequence of ascii codes
- No instrinsic structure
- No instrinsic meaning (ascii numbers aren't machine integers)
Parsing infers structure
Parsing infers structure using word order and groupings
Define what we need by syntax
Formal grammars
Terminals |
Nonterminals |
Productions |
Starting symbol |
- formal grammar
- two kinds of symbols
- terminals (words)
- nonterminals (structure names)
- productions (rules defining contents/children of structures)
- starting symbol (the top of the syntax tree)
- two kinds of symbols
Terminals
- Words or tokens
- What you see or hear, e.g., C source code
Nonterminals
- Names of syntactic structures, e.g., verb phrase, expression, function definition
- Not uttered explicitly
Productions
- Rules mapping nonterminals to their contents
- Defines legal groupings of symbols in the language
- Written using an arrow
nonterminal -> symbol1 symbol2 ...
Strictly speaking, context-free grammars are restricted to having a single nonterminal on the left-hand side of the arrow, but context-sensitive grammars can have more symbols on the left-hand side.
Starting symbol
- The first nonterminal in the syntax tree
Constructs can be nested
- For example,
E -> E + num
- Expressions may contains other expressions
- E.g., 9 + 3 - 2 nests 9 + 3 inside of the subtraction operation
Example: arithmetic expressions
E -> E + E E -> E - E E -> E * E E -> E / E E -> N N -> 0 N -> 1 N -> 2 N -> 3 N -> 4 N -> 5 N -> 6 N -> 7 N -> 8 N -> 9
The terminals, i.e., uttered words in the language, are the operators (+
, -
, *
, /
) and the digits (0
, 1
, 2
, 3
, 4
, 5
, 6
, 7
, 8
, 9
).
The nonterminals are the names of the grammar constructs, specifically E for expression, num for number, and op for operator.
The pipe symbol here means alternative, i.e., num -> 0 | 1
is equivalent to two rules, num -> 0
and num -> 1
.
The starting symbol, by convetion, is the nonterminal on the left-hand side of the first rule.
Deriving a tree
- Find a sequence of rule applications that
- Begins with the start symbol
- Ends with the string
Show derivation of the tree for 9 + 3 - 2
Defining an interpreter
- Walk the tree
- Evaluate each node of the tree
- Define computation for each grammar production
Expression interpreter rules
- Assume we have the following interpreter functions implemented
add
,sub
,mult
,div
to perform addition, subtraction, multiplication, and division computationsascii2int
to convert the ASCII character value to an integer represetation
Specifying the interpreter
Define computation for each grammar construct
E -> E + E { E.value = add(E1.value, E2.value) } E -> E - E { E.value = sub(E1.value, E2.value) } E -> E * E { E.value = mult(E1.value, E2.value) } E -> E / E { E.value = div(E1.value, E2.value) } E -> N { E.value = ascii2int(N.value) } N -> 0 { N.value = "0" } N -> 1 { N.value = "1" } N -> 2 { N.value = "2" } ... N -> 9 { N.value = "9" }
Defining a compiler
- Similar tree-walking approach
- Generate code instead of compute result
Compile infix to postfix
Assume we have string concat
function to join any number of strings.
E -> E + E { E.value = concat(E1.value, E2.value, "+") } E -> E - E { E.value = concat(E1.value, E2.value, "-") } E -> E * E { E.value = concat(E1.value, E2.value, "*") } E -> E / E { E.value = concat(E1.value, E2.value, "/") } E -> N { E.value = N.value } N -> 0 { N.value = "0" } N -> 1 { N.value = "1" } N -> 2 { N.value = "2" } ... N -> 9 { N.value = "9" }
Note that we no longer need to convert ascii to a machine integer, since we can just print the output program in ascii as well.