COP 5021 Lecture -*- Outline -*-

* Abstract Syntax Trees (ASTs) in XText

------------------------------------------
   ABSTRACT SYNTAX TREES = OBJECT STRUCTURES

input:  x := x+1

AST (each * is a pointer to the object after -->):

  [AssignS]
     v:     * --> [EString | * ] --> [String | 1 | * ] --> [CharBuf | 'x' ]
     aexp:  * --> [AExpression]
                     left:  * --> [VarRefExpr]
                                     vname: * --> the same [String | 1 | * ] as above
                     op:    * --> "+"
                     right: * --> [NumLitExpr]
                                     num: * --> 1
     label: *
------------------------------------------

** Importance of ASTs

------------------------------------------
           IMPORTANCE OF ASTS

ASTs are:

    the key

------------------------------------------
        ... interface between parser and rest of compiler
            (or static analysis)

** Overview in XText

------------------------------------------
      ABSTRACT SYNTAX TREES IN XTEXT

Concrete syntax is described in .xtext file
     (e.g., WhileLang.xtext)

ASTs are described in

XText tool generates Java classes
     for all AST nodes
     (in edu.ucf.cs.whilelang.whileLang,
      which in Eclipse is found under the
      edu.ucf.cs.whilelang.WhileLang project
      > src-gen
        > edu.ucf.cs.whilelang.whileLang)

At runtime, each AST is a Java object.
     (constructed by the parser)
------------------------------------------
        ... the same .xtext file (indirectly)

** WhileLang.xtext File (an example)

------------------------------------------
           WhileLang.xtext FILE

grammar edu.ucf.cs.whilelang.WhileLang
        with org.eclipse.xtext.common.Terminals

generate whileLang
   "http://www.ucf.cs.edu/~leavens/WhileLang"

Program:
   'proc' name=ID '(' (args=Formals)? ')'
   'is' body=Stmt;

Formals:
   names+=ID (',' names+=ID)*;

/* ...
 */
------------------------------------------
        The grammar, import and generate lines are boilerplate
        The first rule name is Program
        : is the "produces" symbol (-> or ::= in BNF)
        'proc', ID, '(', ')', 'is' are lexical tokens
        name, args, body, names are attributes (fields)
        Each of Program, Formals is the name of:
           - a rule (a nonterminal name), and
           - a generated (Java) class

        Q: What's the syntax of the parsing rules?
             ':' * ';'
        Q: What does a program in the While Language look like?
             proc demo(x) is { x := x+1 }
           is an example

------------------------------------------
   STATEMENT GRAMMAR (in WhileLang.xtext)

Stmt returns S:
   Assignment | Skip | Block | While | If;

Block returns CompoundS:
   '{' stmts+=Stmt (';' stmts+=Stmt)* '}';

Assignment returns AssignS:
     (v=ID) ':=' (aexp=Expression)
   | '[' (v=ID) ':=' (aexp=Expression) ']'
     '^' (label=INT);

Skip returns SkipS:
     'skip' {SkipS}
   | '[' 'skip' ']' '^' (label=INT);

While returns WhileS:
   'while' (bexp=LabeledExp)
   'do' (block=Block);

If returns IfS:
   'if' (bexp=LabeledExp)
   'then' (s1=Block)
   'else' (s2=Block);

LabeledExp returns LabeledExp:
     (be=Expression)
   | '[' (be=Expression) ']' '^' (label=INT);
------------------------------------------
        The 'returns' clause specifies the AST node type,
        which can be named differently from the rule name; in this case it is S

        S is generated as an interface in Java,
        since it's the supertype of the (classes)
        CompoundS, AssignS, SkipS, WhileS, IfS

        The '|' means "or" and separates alternatives in the grammar

        Q: What does an Assignment statement look like?
             either   x := E   or   [ x := E ] ^ 3
           the latter syntax gives the label explicitly
        Q: Why don't all statements have labels?
             because they aren't all elementary blocks
        Q: What is a LabeledExp for?
             to optionally attach a label to an expression (in a test)

        Look at some of the generated output .java files
        in src-gen > edu.ucf.cs.whilelang.whileLang
        esp. Program, S, AssignS, IfS, etc.
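
        A minimal sketch of how client code might walk the statement ASTs.
        The nested interfaces below are hand-written stand-ins for the
        generated S, AssignS, SkipS, and CompoundS, not the generated code
        itself. The getter names (getV, getLabel, getStmts) are assumptions
        based on the attribute names in the grammar, and java.util.List
        stands in for EMF's EList. The classes Assign, Skip, and Block are
        hypothetical helpers used only to build a test tree by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LabelCollector {
    // Stand-ins (assumed shapes) for the generated AST interfaces.
    interface S { }
    interface AssignS extends S { String getV(); int getLabel(); }
    interface SkipS extends S { int getLabel(); }
    interface CompoundS extends S { List<S> getStmts(); }

    // Hypothetical hand-built nodes; the real ones come from the parser.
    static final class Assign implements AssignS {
        private final String v;
        private final int label;
        Assign(String v, int label) { this.v = v; this.label = label; }
        public String getV() { return v; }
        public int getLabel() { return label; }
    }

    static final class Skip implements SkipS {
        private final int label;
        Skip(int label) { this.label = label; }
        public int getLabel() { return label; }
    }

    static final class Block implements CompoundS {
        private final List<S> stmts;
        Block(S... stmts) { this.stmts = Arrays.asList(stmts); }
        public List<S> getStmts() { return stmts; }
    }

    /** Returns the labels of the elementary blocks in s, in source order. */
    static List<Integer> labels(S s) {
        List<Integer> out = new ArrayList<>();
        collect(s, out);
        return out;
    }

    // Dispatch on the node's interface, recursing into compound statements.
    private static void collect(S s, List<Integer> out) {
        if (s instanceof AssignS) {
            out.add(((AssignS) s).getLabel());
        } else if (s instanceof SkipS) {
            out.add(((SkipS) s).getLabel());
        } else if (s instanceof CompoundS) {
            for (S child : ((CompoundS) s).getStmts()) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // Models { [x := x+1]^1 ; [skip]^2 } with the expression left out.
        S body = new Block(new Assign("x", 1), new Skip(2));
        System.out.println(labels(body)); // prints [1, 2]
    }
}
```

        Because S is an interface, one method can accept any statement and
        pick the case by testing which subinterface the node implements.
        WhileS and IfS are left out here to keep the sketch short.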
------------------------------------------
   EXPRESSION GRAMMAR in WhileLang.xtext

Expression returns Expr:
   BDisj;
BDisj returns Expr:
   BConj ({BDisj.left=current}
          op=OR right=BConj)*;
BConj returns Expr:
   BRelExp ({BConj.left=current}
            op=AND right=BRelExp)*;
BRelExp returns Expr:
   AExpression ({BRelExp.left=current}
                op=OP_R right=AExpression)?;
AExpression returns Expr:
   Factor ({AExpression.left=current}
           op=OPPLUS right=Factor)*;
Factor returns Expr:
   Primary ({Factor.left=current}
            op=OPMUL right=Primary)*;
Primary returns Expr:
     VarRefExpr | NumLitExpr | SignedNum
   | BoolLitExpr | NotExpr
   | '(' Expression ')';
VarRefExpr: vname=ID;
SignedNum: (sign=OPPLUS) (nval=Primary);
NumLitExpr: num=INT;
BoolLitExpr: bval='true' | bval='false';
NotExpr: 'not' (bexp=Primary);
------------------------------------------
        Q: What operators bind more tightly than others?
             highest is not, next tightest is the OPMUL, then OPPLUS, etc.
        Q: Do the operators associate to the left or to the right?
             To the left for OPMUL and OPPLUS, but not at all for OP_R
             To the right for unary + and - (OPPLUS) in SignedNum
           This is due to the pattern used

        Show how the associativity works in the running Eclipse editor
        by looking at the output of something like
             x := x - x - x
        also try
             if (0 < b < 3) then ...

        Q: What does the rule for Expression do?
             it makes Expr a supertype of all the other expression types
           this is convenient for polymorphism.

------------------------------------------
     NON-EXECUTABLE (NOT CALLED) RULE

ElementaryBlock:
   Assignment | Skip | LabeledExp;

generates an interface

   package edu.ucf.cs.whilelang.whileLang;
   import org.eclipse.emf.ecore.EObject;
   public interface ElementaryBlock
          extends EObject { }

which is extended by the interfaces
AssignS, SkipS, and LabeledExp

   package edu.ucf.cs.whilelang.whileLang;
   public interface AssignS
          extends S, ElementaryBlock
   { /* ... */ }
------------------------------------------
        Q: Is S also an interface?
             yes
        Q: Why is S an interface?
             due to the rule for Stmt

*** picture of runtime AST

------------------------------------------
           PARSED AST EXAMPLE

Parsing the program

   proc astEx(x,y) is {
      if x < 5 then y := 2 else skip
   }

gives the AST:

  [Program]
     args: * --> [Formals]
                    names: * --> [EList | "x", "y" ]
     body: * --> [IfS]
                    bexp: * --> [LabeledExp]
                                   label: 1
                                   be: * --> [BRelExp]
                                                left:  * --> [VarRefExpr | vname: "x" ]
                                                op:    * --> "<"
                                                right: * --> [NumLitExpr | num: 5 ]
                    s1:   * --> [AssignS]
                                   label: 2
                                   v: "y"
                                   aexp: * --> [NumLitExpr | num: 2 ]
                    s2:   * --> [SkipS | label: 3 ]
------------------------------------------
        I'm not showing all the types of records for the strings or lists

*** lexical syntax

------------------------------------------
     TOKENS (LEXICAL SYNTAX) IN XTEXT

standard token rules from grammar directive
in WhileLang.xtext

   grammar edu.ucf.cs.whilelang.WhileLang
           with org.eclipse.xtext.common.Terminals
------------------------------------------
        This specifies a lot of standard Java-like conventions

------------------------------------------
grammar org.eclipse.xtext.common.Terminals
        hidden(WS, ML_COMMENT, SL_COMMENT)

import "http://www.eclipse.org/emf/2002/Ecore"
       as ecore

terminal ID:
   '^'?('a'..'z'|'A'..'Z'|'_')
       ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal INT returns ecore::EInt:
   ('0'..'9')+;
terminal STRING:
     '"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */
         | !('\\'|'"') )* '"'
   | "'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */
         | !('\\'|"'") )* "'" ;
terminal ML_COMMENT : '/*' -> '*/';
terminal SL_COMMENT :
   '//' !('\n'|'\r')* ('\r'? '\n')?;
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal ANY_OTHER: .;
------------------------------------------
        Remember, the lexical grammar is describing the characters (each one!)
        in tokens

        This shows various conventions in common.Terminals
        hidden says which tokens are not given to the parser
        Note the returns on the INT rule

------------------------------------------
   CUSTOMIZED LEXICAL RULES FOR WHILELANG

@Override
terminal SL_COMMENT:
   '%' !('\n'|'\r')* ('\r'? '\n')?;

terminal OPPLUS: '+' | '-';
terminal OPMUL: '*' | '/';
terminal OR: 'or';
terminal AND: 'and';
terminal OP_R:
   '==' | '!=' | '<=' | '>=' | '<' | '>' ;
------------------------------------------
        The @Override marks a rule that differs from the imported one
        The last rule takes precedence

        Q: How does one write a negative number down?
             Using a unary operator and an INT
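
        A minimal sketch of why a negative number needs no token of its own.
        INT matches digits only, so -5 reaches the parser as an OPPLUS token
        followed by an INT, and the SignedNum rule in the expression grammar
        wraps the literal. The classes below are hand-written stand-ins for
        the generated Expr, NumLitExpr, and SignedNum. Their field names
        follow the grammar's attribute names, and the eval method is a
        hypothetical helper, not part of the generated code.

```java
public class SignedNumDemo {
    // Stand-in (assumed shape) for the generated Expr supertype.
    interface Expr { }

    // Stand-in for NumLitExpr: num=INT
    static final class NumLit implements Expr {
        final int num;
        NumLit(int num) { this.num = num; }
    }

    // Stand-in for SignedNum: (sign=OPPLUS) (nval=Primary)
    static final class Signed implements Expr {
        final String sign; // "+" or "-", the text of the OPPLUS token
        final Expr nval;
        Signed(String sign, Expr nval) { this.sign = sign; this.nval = nval; }
    }

    /** Evaluates a numeric literal with any number of unary signs. */
    static int eval(Expr e) {
        if (e instanceof NumLit) {
            return ((NumLit) e).num;
        }
        if (e instanceof Signed) {
            Signed s = (Signed) e;
            int v = eval(s.nval); // the innermost sign is applied first
            return s.sign.equals("-") ? -v : v;
        }
        throw new IllegalArgumentException("unsupported node: " + e);
    }

    public static void main(String[] args) {
        // The tree a parser would build for the input -5
        Expr minusFive = new Signed("-", new NumLit(5));
        System.out.println(eval(minusFive)); // prints -5
    }
}
```

        Since nval is a Primary and SignedNum is itself a Primary, signs
        nest to the right, which matches the associativity noted for unary
        + and - in the expression grammar.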