I. Overview A. theory 1. languages ------------------------------------------ LANGUAGES def: A *language* is ------------------------------------------ For a written natural language, what are the characters? For a spoken natural language, what are the characters? 2. hierarchy of language classes ------------------------------------------ LANGUAGE CLASSES Languages can be classified by Venn Diagram: |--------------------------------------| | Regular Languages | | | | | | |--------------------------------| | | | Contex-free Languages | | | | | | | | | | | | |--------------------------| | | | | | Context-sensitive | | | | | | Languages | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | Type 0 Languages | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ a. relation of language classes to parts of a compiler ------------------------------------------ PHASES OF A COMPILER Programs allowed by a compiler's: |--------------------------------------| | Lexical Analysis (Lexer) | | | | | | |--------------------------------| | | | Parser | | | | | | | | | | | | |--------------------------| | | | | | Static Analysis | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | Runtime checks | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ Is there a relationship to the previous diagram (about langauge classes)? 3. automation of syntax analysis ------------------------------------------ PROBLEM WRITING SYNTAX ANALYSIS Naive way to write a compiler, etc. Language Def.docx -- coding --> parser.c (v. 1) Def2.docx -- coding --> parser.c (v. 2) -- coding --> parser.c (v. 3) Def3.docx -- coding --> parser.c (v. 4) Def4.docx -- coding --> parser.c (v. 5) ... ... DefN.docx -- coding --> parser.c (v. N) Disadvantages: ------------------------------------------ ------------------------------------------ COMPUTER SCIENCE SOLUTION: AUTOMATION high-level description tool generated code lang.y -- bison --> lang.tab.c + lang.tab.h lang2.y -- bison --> lang.tab.c + lang.tab.h ... langN.y -- bison --> lang.tab.c + lang.tab.h Advantages: ------------------------------------------ 4. grammars a. definitions ------------------------------------------ GRAMMARS DESCRIBE LANGUAGES Grammars are high-level descriptions of languages/parsers def: a *grammar* consists of a finite set of rules (called "productions") and a start symbol (a nonterminal). Let V = nonterminals + terminals The rules have the form V+ -> V* where there is no symbol in both nonterminals and terminals def: The language generated by a grammar G with set of productions P is: {w | w is in terminals* and S =>* w} where S is the start symbol of G gAd => gBd iff g in V* and d in V* and A -> B is a rule in P and g =>* h iff either h = g or g -> i and i =>* h ------------------------------------------ b. BNF notation ------------------------------------------ BNF NOTATION FOR GRAMMARS ::= means | means is a Example ::= | ::= 0 | 1 ------------------------------------------ c. 
grammars as games ------------------------------------------ GRAMMARS AS RULES OF GAMES A grammar can be seen as describing two games: - A production game (Can you produce this string?) - A recognition/parsing game (Is this string in the language?) ------------------------------------------ i. production game ------------------------------------------ PRODUCTION GAME Goal: produce a string in the language from the start symbol Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Can we produce "Johnny is good"? ------------------------------------------ Could you win the production game and produce the string "Johnny is good"? Could you produce the sentence "Charlie can be difficult"? ii. recognition or parsing game ------------------------------------------ RECOGNITION OR PARSING GAME Goal: determine if a string is in the language of the grammar Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Is "Johnny is good" in this grammar? ------------------------------------------ Could you win the recognition game on the string "Johnny is good"? What does this have to do with parsing in a compiler? d. derivation trees ------------------------------------------ DERIVATION (OR PARSE) TREES def: a *tree* is a finite set of nodes connected by directed edges that is connected and has no cycles def: a *derivation tree* for grammar G is a tree such that: - Every node has a label that is a symbol of G - The root is labeled by the start symbol of G - Every node with a direct descendent, is labeled by a nonterminal - If the descendents of a node labeled by N have the following labels (in order): A, B, C, ..., K then G has a production of form N -> A B C ... K ------------------------------------------ ------------------------------------------ EXAMPLE DERIVATION TREE Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult String to parse: "Johnny is good" / | \ / | \ v v v Johnny is good ------------------------------------------ Why does the order of nodes matter? e. extensions to BNF (EBNF) ------------------------------------------ EXTENSIONS TO BNF (EBNF) Arbitrary number of repeats: { x } means 0 or more repeats of x :: = { } is equivalent to: ::= ::= | ::= {} is also written as: * or [ ] ... One-or-more repeats: x+ means 1 or more repeats of x :: = + is equivalent to: ::= ::= | + is sometimes written as ... or [ ] ... Optional element: [ x ] means 0 or 1 occurences of x ::= [ ] is equivalent to: ::= ::= | ------------------------------------------ What is EBNF notation like that you may have seen before? f. examples ------------------------------------------ READING A BNF GRAMMAR Example rules: ::= ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= 0 | ::= | ------------------------------------------ can you give an example of a ? ------------------------------------------ EBNF GRAMMAR FOR (SUBSET OF) PL/0 ::= . ::= ::= {} ::= const {} ; ::= = ::= , ::= {} ::= var ; ::= {} ::= , ::= {} ::= procedure ; ; ::= := | call | begin {} end | if then | while do | read | write | skip ::= ; ::= | else ::= ::= odd | ------------------------------------------ ------------------------------------------ EXAMPLES IN PL/0 Shortest program: skip. Factorial program: var n, res; # input and result procedure fact; begin read n; res := 1; while (n <> 0) begin res := res * n; n := n-1 end; write res end; call fact. ------------------------------------------ Does the factorial program parse correctly? 
Does PL/0 use semicolons to end statements or to separate them? How does C use semicolons? II. Lexical Analysis ------------------------------------------ MOTIVATION # $Id$\n .text start\nstart:\tADDI ... Want to: Approach: ------------------------------------------ ------------------------------------------ LEXICAL ANALYSIS Lexical means relating to the words of a language ------------------------------------------ A. Goals of Lexical Analysis ------------------------------------------ GOALS OF LEXICAL ANALYSIS - Simplify the parser, so it need not handle: - Recognize the longest match Why? - Handle every possible character of input Why? ------------------------------------------ Why recognize the longest match? Why handle all possible characters? ------------------------------------------ CONFLICT BETWEEN RULES Suppose that both "if" and numbers are tokens: What tokens should "if8" match? Fixing such situations: ------------------------------------------ What tokens should "<=" match? ------------------------------------------ WHICH TOKEN TO RETURN? If the input is "<=", what token(s)? If the input is "<8", what token(s)? If the input is "if", what token(s)? If the input is "//", what token(s)? Summary: ------------------------------------------ How would you program the longest match idea? How would you ensure that reserved words are favored over identifiers? (e.g., "if" is not an identifier) After finding an identifier, check to see if it's a reserved word and if so, then return the reserved word token ... In sum, favor is longest match, but give priority to reserved words over identifiers 1. Overview ------------------------------------------ THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / abstract / syntax v trees [ static analysis ] / / v [ code generator ] For the Lexer we want to: - specify the tokens using regular expressions (REs) - convert REs to DFAs to execute them but easy conversions are: - REs to NFAs - NFAs to DFAs ------------------------------------------ Explain what an the abbreviations mean: - NFA = nondeterministic finite state automaton - DFA = deterministic finite state automaton ------------------------------------------ HOW PARSER WORKS WITH LEXER Couroutine structure: Parser calls lexer: tok = yylex(); // call lexer /* ... use yylavl ... */ Lexer function remembers pointer to input stream returns next token (int code) Parser works... Parser calls lexer again: tok = yylex(); // call lexer /* ... use yylavl ... */ ------------------------------------------ It's a coroutine, because the lexer picks up where it left off when called again a. Use of Automated Tools ------------------------------------------ BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) | | bison | -----> g.tab.c | / ^ yyparse function v / bison | g.y file -----> g.tab.h | tokens flex v defs. g.l file -----> g.c yytoken function ------------------------------------------ ... explain all of this: The context-free grammar (the .y file) is central to this, The .y file records: - grammar and - names/types of the tokens (that the parser needs) Bison is a parser generator it generated the unction yyparse in files *.tab.h and *.tab.c The ASTs are how the parse is recorded Flex is a lexical analyzer generator the .l file records - the lexical grammar (as REs) - how tokens produce ASTs (in yylval) B. 
Theory of Regular Languages, Regular Expressions There is a strong connection between theory and practice in syntax analysis, with some of the first and most useful results in CS theory! 1. Definitions ------------------------------------------ DEFINITIONS The lexical grammar of a language is regular, because def: a grammar is *regular* iff all its productions have one of these forms: ::= c or ::= c where c is a terminal symbol, and and are nonterminals def: a language is *regular* iff it can be defined using a regular grammar. Thm: Every regular language can be recognized by a finite automaton. Thm: Every regular language can be specified by a regular expression. ------------------------------------------ ... regular grammars can be parsed/recognized quickly using finite automata Regular languages are also called "type 3" languages (a name from the Chomsky hierarchy) A regular language is also called a *regular set*. Q: Does it follow that every regular expression can be recognized using a finite automaton? 2. Regular Expressions ------------------------------------------ REGULAR EXPRESSIONS The language of regular expressions: ::= | '|' | | emp | * | ( ) where is a character Examples: RE meaning ==================================== emp the empty string (0|1)*0 even binary numerals b*(abb*)*(a|emp) a's and b's without consecutive a's ------------------------------------------ a. extensions to regular expressions ------------------------------------------ EXTENSIONS TO REGULAR EXPRESSIONS [abcd] means a|b|c|d [h-m] means h|i|j|k|l|m x? means x|emp y+ means y(y*) ------------------------------------------ b. examples of regular expressions ------------------------------------------ FOR YOU TO DO Write a regular expression that describes: 1. The keyword "if" 2. The set of all (positive) decimal numbers 3. The set of all possible identifiers with underbars (_) (If you have time, write these as regular grammars also.) ------------------------------------------ What would be the regular expression for identifiers? c. Examples using regular expressions ------------------------------------------ EXAMPLE REGULAR EXPRESSIONS FOR PL/0 PATTERN RE ========================= DECDIGIT [0-9] LETTER [_a-zA-Z] ... ------------------------------------------ What would be the regular expression for identifiers? 3. finite automata a. example ------------------------------------------ EXAMPLE State diagram: < = -->[q0] --->[[q1]]---> [[q2]] | | > v [[q3]] ------------------------------------------ What token should be returned at state q2? What token should be returned at state q3? What should be done at state q1? ------------------------------------------ PSEUDO CODE FOR THIS LEXER CASE char c; ------------------------------------------ ------------------------------------------ DEALING WITH COMMENTS Suppose / is used for division // starts a comment to end of line (unlike PL/0!) What state diagram? How does whitespace fit in? ------------------------------------------ How does whitespace fit in? ------------------------------------------ TACTICS FOR IGNORING WHITESPACE, COMMENTS Goal: do not send ignored tokens to parser Can always get a non-ignored token: Return "tokens" that include ignored stuff to a loop that ignores them Giant DFA that goes back to start state on seeing something to ignore ------------------------------------------ b. 
definitions ------------------------------------------ NONDETERMINISTIC FINITE AUTOMATA def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma is a system (K, Sigma, delta, q0, F) where K is a finite set (of states), Sigma is a finite set set (the input alphabet), delta: is a map of type (K, Sigma) -> Sets(K) q0 in K is the initial state, & F is a subset of K (the final/accepting states). ------------------------------------------ ------------------------------------------ TRANSITION FUNCTION AND ACCEPTANCE p in delta(q,x) means that in state q, on input x the next state can be p p in delta*(q,s) where s in Sigma* is defined by: delta*(q,emp) = {q} (i.e., q in delta*(q,emp)) p in delta*(q,xa) = delta*(q2, a) where x in Sigma, a in Sigma*, q2 in delta(q,x) Lemma: for all c in Sigma, p in delta*(q, c) = delta(q, c) def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff there is some q in delta*(q0,s) such that q in F ------------------------------------------ If each final state means to return a token, what token should be returned if there are many final states? c. example ------------------------------------------ EXAMPLE NFA 0,1 0,1 /---\ /---\ \ / \ / | v 0 0 | v -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]] | | 1 v [ q1 ] | | 1 v [[ q2 ]] | ^ / \ \---/ 0,1 K = {q0,q1,q2,q3,q4} Sigma = {0,1} q0 is start state F = {q2,q4} delta(q0,0) = {q0,q3} delta(q1,0) = {} delta(q2,0) = {q2} delta(q3,0) = {q4} delta(q4,0) = {q4} delta(q0,1) = {q0,q1} delta(q1,1) = {q2} delta(q2,1) = {q2} delta(q3,1) = {} delta(q4,1) = {q4} ------------------------------------------ ------------------------------------------ EXTENDING FUNCTIONS TO SETS OF STATES Notation extending delta and delta* to sets of states d(Q,x) = union of d(q,x) for all q in Q so d({},x) = {} d{{q},x) = d(q,x) d({q1,q2},x) = d(q1,x)+d(q2,x) d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x) etc. Note that delta*({q}, x) = delta*(q, x) = delta(q, x) ------------------------------------------ ------------------------------------------ EXAMPLE delta*(q0,010011) = delta*(delta(q0,0),10011) = delta*({q0,q3},10011) = delta*(q0,10011) + delta*(q3,10011) = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011) = delta*({q0,q1},0011) + delta*({},0011) = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {} = delta*({q0,q3},011) + delta*({},011) + delta*({q2},011) = delta*({q0,q3},011) + {} + delta*({q2},011) = delta*({q0,q3},011) + delta*({q2},011) = delta*(delta(q0,0),11) + delta(q3,0),11) + delta*(delta(q2,0),11) = delta*({q0,q3},11) + delta*({q4},11) + delta*({q2},11) = delta*({q0,q3,q4},1) + delta*({q4},1) + delta*({q2},1) = delta({q0,q1,q4},1) + delta(q4,1) + delta(q2,1) = {q0,q1,q2,q4} + {q4} + {q2} = {q0,q1,q2,q4} ------------------------------------------ So does this machine accept 010011? What strings does this machine accept? d. deterministic finite automata ------------------------------------------ DETERMINISTIC FINITE AUTOMATA def: a *deterministic finite automaton* (DFA) is an NFA in which delta(q,c) is a singleton or empty for all q in K and c in Sigma. ------------------------------------------ ------------------------------------------ IMPLEMENTING DFAS How would you represent states? How would you implement a DFA? ------------------------------------------ 4. converting NFA to DFA ------------------------------------------ PROBLEM We want to specify lexical grammar using So we need to convert regular expressions into DFAs ------------------------------------------ Why DFAs? a. 
Converting Regular Expressions to NFAs ------------------------------------------ CONVERTING REs TO NFAs Definition based on grammar of Regular Expressions: Result of Convert(M) looks like this: -->(M q) where the "tail", -->, goes to the start state and q is the "head state" assume also Convert(N) is --->(N q') c Convert(c) = --->[ q ] Convert(M | N) = /--->(M q)--\ emp / \ ---->[ q ] -> [ q2 ] \ / \--->(N q')-/ Convert(M N) = --> (M q)-->(N q') emp Convert(emp) = -----> [ q ] emp /--------------\ / emp v Convert(M*) = -/ /->(M q)---->[ q2 ] / / \ emp / \-----------/ Convert((M)) = Convert(M) After conversion, make the "head state" be a final state ------------------------------------------ ------------------------------------------ EXAMPLE OF CONVERSION TO NFA Regular expression: (i|j)* i Convert(i) = --->[ qi ] j Convert(j) = --->[ qj ] Convert(i|j) = i /--->(qi)---\ emp emp / \ ---->[ q ] -> [ q2 ] \ j / \--->(qj)---/ emp Convert((i|j)*) = ------------------------------------------ In programming terms, do we need to use an NFA to recognize "if" and other reserved words? b. Converting NFAs to DFAs ------------------------------------------ CONVERTING AN NFA INTO A DFA Idea: Convert each reachable set of NFA states into How? Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in T} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) ------------------------------------------ Why do we need to take the closure wrt emp? ------------------------------------------ EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [3,6,7,8] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[5,6,8] 0-9 v /-| \----->[6,7,8] |a-z ^ |0-9 \ | \-| ------------------------------------------ Which states are final in the DFA? Which tokens would be returned in each state? C. Using Flex 1. Files and coordination with parser ------------------------------------------ USING THE FLEX TOOL TO GENERATE LEXERS Example: SRM assembler High-level description in asm_lexer.l Generated lexer asm_lexer.c + asm_lexer.h Wrapper for lexer: lexer.h declares functions lexer.c does nothing asm_lexer.l defines functions e.g., lexer_print_token ASTs defined in ast.h asm.y is Bison description file grammar == bison ==> - Declarations in asm.tab.h includes ast.h machine_types.h parser_types.h declares YYSTYPE lexer.h declares yytokentype eolsym = ... minussym = ... dottextsym = ... ... - Definitions in asm.tab.c defines yyparser() YYSTYPE yylval; ------------------------------------------ 2. structure of flex input ------------------------------------------ STRUCTURE OF FLEX INPUT FILE /* ... definitions section ... */ %% /* ... rules section ... */ %% /* ... user subroutines ... */ ------------------------------------------ ------------------------------------------ SECTIONS IN FLEX INPUT (.l file) Definitions section: Rules section: User subroutine section: ------------------------------------------ III. Context-free Parsing A. Why context-free grammars and parsing? ------------------------------------------ WHY CONTEXT-FREE PARSING? 
Can we define a regular expression to make sure expressions have balanced parentheses? e.g., recognize: (34) and ((12)+(789)) but not: (567))+82)))) ------------------------------------------ ------------------------------------------ WHAT IS NEEDED Needed for checking balanced parentheses: ------------------------------------------ How does this grammar ::= | ( ) | + ensure that parentheses are balanced? B. Parsing with Context-free Grammars ------------------------------------------ PARSING Want to recognize language with Goal: ------------------------------------------ 1. context-free grammar ------------------------------------------ CONTEXT-FREE GRAMMARS def: a *context-free* grammar (CFG) (N, T, P, S) has start symbol S in N and each production (in P) has the form: -> g where is a Nonterminal symbol, and g in (N+T)* Example: def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*, written g =P=> g', iff g is e f, g' is e h f, is a nonterminal (in N), h in (N+T)*, and the rule -> h is in P Example: ( ) =P=> ------------------------------------------ 2. derivation ------------------------------------------ DERIVATION def: a *derivation* of a terminal strings t from the rules P of G=(N,T,P,S), is a sequence (S, g1, g2, ..., gm), where gm = t, and for all i: gi in (N+T)*, S =P=> g1, and for all 1 <= j < m: gj =P=> g(j+1) Example: ------------------------------------------ ------------------------------------------ LEFTMOST DERIVATION def: a *leftmost derivation* of a string t in T* from a CFG (N,T,P,S) is a derivation of t from (N,T,P,S) (g0, g1, ..., gm) such that g0 = S, gm = t, and for all 0 <= j < m: when gj =P=> g(j+1) and the nonterminal is replaced in gj, then there are no nonterminals to the left of in gj. Example: ------------------------------------------ Would a rightmost derivation be? 3. parse trees ------------------------------------------ PARSE TREES AND DERIVATIONS def: A *parse tree*, Tr, for a CFG (N,T,P,S) represents a derivation, D, iff: - Each node in Tr is labeled by a nonterminal (in N) - the root of Tr is the start symbol S - an arc from to h in (N+T) iff -> ... h ... in P - the order of children of a node labeled is the order in a production -> ... in P ------------------------------------------ ------------------------------------------ EXAMPLES GRAMMAR: ::= | + | * Derivations of 3*4+2 ------------------------------------------ ------------------------------------------ EXAMPLE PARSE TREES Corresponding to leftmost derivation: Corresponding to rightmost derivation: ------------------------------------------ What does each parse tree mean? 4. ambiguity ------------------------------------------ AMBIGUITY def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T* such that there are two different parse tress for t ------------------------------------------ What's an example of an ambiguous grammar? Why do we want to avoid ambiguous grammars? Is ambiguity a property of a language or its grammar? a. Fixing ambiguous grammars ------------------------------------------ FIXING AMBIGUOUS GRAMMARS Idea: Rewrite grammar to Example: ------------------------------------------ C. Parsing techniques 1. recursive-descent parsing (LL parsing) ------------------------------------------ RECURSIVE DESCENT PARSING ALGORITHM For each production rule, of form: ::= g1 | g2 | ... | gm 1. Write a 2. This function ------------------------------------------ a. 
example ------------------------------------------ EXAMPLE RECURSIVE-DESCENT PARSER ::= if then else | begin S end | write ::= ; | ::= ::= = #include "pl0.tab.h" yytoken_kind_t tok = yylex(); void advance() { tok = yylex(); } void eat(yytoken_kind_t tk) { if (tok == tk) { advance(); } else { /* ... report error */ } } void parseCond() { eat(numbersym); eat(eqsym); eat(numbersym); } void parseStmt() { switch (tok.typ) { case ifsym: eat(ifsym); parseExp(); eat(thensym); parseStmt(); eat(elsesym); parsetStmt(); break; case beginsym: eat(beginsym); parseStmt(); parseList(); break; case writesym: eat(writesym); eat(numbersym); break; default: // report error } } ------------------------------------------ What kind of errors can eat() report? What kind of error message can the parser produce (in the default case)? How do errors get reported by parseExp()? b. terminology: LL(1) ------------------------------------------ LL(1) GRAMMARS A recursive-descent parser must: - choose between alternatives (e.g., ::= | ) def: A grammar is *LL(1)* iff ------------------------------------------ c. terminology: LR(1) ------------------------------------------ LR(1) GRAMMARS An LR(1) parser needs to decide when to: shift (push token on stack) or reduce uses a DFA based on stack + lookhead def: A grammar is *LR(1)* iff ------------------------------------------ i. LALR(1) parsers ------------------------------------------ LALR(1) Parsing Smaller tables than LR(1) - merges states of the DFA if only differ in lookahead ------------------------------------------ d. ambiguity problems ------------------------------------------ PROBLEM: AMBIGUITY Consider: ::= := | if then ::= | else and the statement: if b1 then if b2 then x := 2 else x := 3 Is this parsed as: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 or as: / / \-----|--------\ if then | |\ / \ b1 /| \ else / | \ / | \ | | | ... | | | x := 3 | | | if then | /|\ \ b2 ... x := 2 ------------------------------------------ ------------------------------------------ FIXES FOR AMBIGUITY Change the language: a. Always have an else clause: ::= if then else (use skip if don't want to do anything) b. Use an end marker ::= if then else fi | if then fi Give precedence to one production: ::= if then ::= else // priority! | So we only get the parse tree: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 ------------------------------------------ IV. Using the bison parser generator ------------------------------------------ BISON: A LALR(1) PARSER GENERATOR Input: pl0.y | | bison pl0.y v Output: pl0.tab.c and pl0.tab.h yyparse (def.) token decls. parse tables extern decls. yylval (user code) ------------------------------------------ A. big picture ------------------------------------------ THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / ASTs / | v symbol <---- [ static table ----> analysis ] / / v [ code generator ] ------------------------------------------ B. using bison and flex ------------------------------------------ BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) | bison v -----> pl0.tab.c / ^ yyparse (def.) / bison | pl0.y ----------> pl0.tab.h | token enum. | (decl.) flex v pl0_lexer.l-----> pl0_lexer.c yylex (def.) ------------------------------------------ 1. 
file connections for hw3 ------------------------------------------ HOW IT ALL FITS TOGETHER IN HW3 machine_types.h: // ... typedef unsigned int address_type; typedef unsigned char byte_type; typedef int word_type; #define BYTES_PER_WORD 4 file_location.h: // location in a source file typedef struct { const char *filename; unsigned int line; // of first token } file_location; ast.h: #include "machine_types.h" #include "file_location.h" // types of ASTs (type tags) typedef enum { /* ... */ } AST_type; // typedefs for types N_t, // where N is a nonterminal // ... typedef struct ident_s { file_location *file_loc; AST_type type_tag; struct ident_s *next; // for lists const char *name; } ident_t; typedef struct { file_location *file_loc; AST_type type_tag; const char *text; word_type value; } number_t; // ... typedef struct block_s { file_location *file_loc; AST_type type_tag; const_decls_t const_decls; var_decls_t var_decls; proc_decls_t proc_decls; stmt_t stmt; } block_t; // ... typedef union AST_u { generic_t generic; block_t block; const_decls_t const_decls; const_decl_t const_decl; const_defs_t const_defs; const_def_t const_def; var_decls_t var_decls; var_decl_t var_decl; idents_t idents; // ... expr_t expr; binary_op_expr_t binary_op_expr; token_t token; number_t number; ident_t ident; empty_t empty; } AST; // ... extern block_t ast_block( const_decls_t const_decls, var_decls_t var_decls, proc_decls_t proc_decls, stmt_t stmt); // ... extern ident_t ast_ident( file_location *file_loc, const char *name); extern number_t ast_number( token_t sgn, word_type value); extern empty_t ast_empty( file_location *file_loc); // ... parser_types.h: #include "ast.h" typedef AST YYSTYPE; pl0.y (also pl0.tab.h): #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" // more below ------------------------------------------ 2. abstract syntax trees and their use in parsing a. background: how the parser works ------------------------------------------ CONNECTING THE PARSER AND THE ASTs Parser model A stack of (terminals + nonterminals) A parallel stack of ASTs 1 token of lookahead Steps in parsing: - shift: 1. push lookahead on parse stack 2. push yylval on AST stack - reduce using a rule nt : a b c { $$ = f($1,$2,$3); }; 1. take a,b,c off parse stack 2. take aval,bval,cval off AST stack ntval = f(a,b,c) 3. push nt on parse stack 4. push result (ntval) on AST stack ------------------------------------------ b. use of ASTs in the grammar (.y) file ------------------------------------------ CONNECTION WITH ASTs IN GRAMMAR FILE /* $Id: pl0.y ... */ %code requires { #include "ast.h" #include "machine_types.h" #include "parser_types.h" /* ... */ } /* ... */ %token identsym %token numbersym %token plussym "+" %token minussym "-" %token multsym "*" %token divsym "/" %token periodsym "." %token semisym ";" %token eqsym "=" %token commasym "," %token becomessym ":=" %token constsym "const" %token varsym "var" /* ... */ %token lparensym "(" %token rparensym ")" %type program %type block %type constDecls %type constDef %type varDecls %type varDecl %type idents %type procDecls %type empty /* ... 
*/ %type expr %type relOp %type term %type factor %type posSign %start program ------------------------------------------ ------------------------------------------ PUTTING A TOKEN VALUE ON THE AST STACK To put yylval on the AST Stack when the field name is "token" in the .y file have: %token somesym "some" the generated parser has: #include "ast.h" #include "parser_types.h" // typedef AST YYSTYPE; YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].token = yylval; ------------------------------------------ ------------------------------------------ PUSHING AST FOR A NONTERMINAL ON AST STACK To put yylval on the AST Stack when the field name is "const_def" in the .y file have: %type constDef the generated parser has: #include "ast.h" #include "parser_types.h" // typedef AST YYSTYPE; YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].const_def = yylval; ------------------------------------------ c. Example, the const language ------------------------------------------ THE CONST LANGUAGE programs all look like: const ident = 3402 ------------------------------------------ ------------------------------------------ ASTs FOR THE CONST LANGUAGE /* $Id: ast.h ... */ /* ... */ // types of ASTs (type tags) typedef enum { const_def_ast, token_ast, number_ast, ident_ast } AST_type; typedef struct ident_s { file_location *file_loc; const char *name; } ident_t; typedef struct { file_location *file_loc; const char *text; word_type value; } number_t; typedef struct { file_location *file_loc; const char *text; int code; } token_t; typedef struct const_def_s { file_location *file_loc; ident_t ident; number_t number; } const_def_t; typedef union AST_u { generic_t generic; const_def_t const_def; token_t token; number_t number; ident_t ident; } AST; extern const_def_t ast_const_def( ident_t ident, number_t number); // ... ------------------------------------------ ------------------------------------------ THE CONST.Y FILE FOR CONST LANGUAGE /* ... */ %code requires { #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" /* ...*/ } /* ...*/ %token constsym "const" %token identsym %token eqsym "=" %token numbersym %type program %type constDef %start program %% program : constDef { setProgAST($1); } ; constDef : "const" identsym "=" numbersym { $$ = ast_const_def($2,$4); }; %% // Set the program's ast to be t void setProgAST(const_def_t t) { progast = t; } ------------------------------------------ Do we really need 2? When are line numbers recorded? d. example of changing the grammar How would we make the grammar be ::= { } ?
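One possible answer, sketched under some assumptions: bison has no EBNF { } braces, so the repetition is usually expressed with a new nonterminal and a left-recursive rule. In the fragment below, constDefs, the const_defs field of the AST union, and the helpers ast_const_defs_empty and ast_const_defs_add are all hypothetical; they are not in the const-language ast.h shown above and would have to be added (and setProgAST would then take that list type instead of const_def_t).

  /* sketch of the changed parts of const.y (helper names are hypothetical) */
  %type <const_defs> constDefs   /* assumes a const_defs field added to the AST union */

  %%

  program : constDefs { setProgAST($1); } ;

  /* { <constDef> } becomes a left-recursive list rule */
  constDefs : /* empty */
                { $$ = ast_const_defs_empty(); }        /* hypothetical helper */
            | constDefs constDef
                { $$ = ast_const_defs_add($1, $2); }    /* hypothetical helper */
            ;

  constDef : "const" identsym "=" numbersym
               { $$ = ast_const_def($2,$4); } ;

Left recursion (constDefs constDef) is generally preferred over right recursion in bison, because each constDef can be reduced as soon as it is complete, which keeps the parse stack small.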