CS 541 Lecture                                   -*- Outline -*-

* Introduction to Grammars and Syntax

** Why study syntax?

*** syntax matters (e.g., compare FORTRAN to LISP to Pascal)
*** tools important for design (as in Algol 60, regularity, etc.)
*** well-understood theory (leading to implementation techniques)
*** need description/documentation (of any user interface)

** Basics

*** Language: a set of strings over some alphabet.
    E.g., English (Roman alphabet, phonemes), Ada (ASCII),
    Common Practice Notation (for music), button pushes...
*** Sentence: a string in a language

** Syntactic descriptions (Grammars, parse trees)

*** History: the Algol effort
    Backus presented a paper describing Algol 58, using a formal notation.
    Naur used a modified form of Backus's notation for the Algol 60 report;
    that notation is now called BNF.

*** Example parse tree
---------------------
                A PARSE TREE

                  sentence
                      |
           ___________|___________
          NP                     VP
           |                     |
   ________|________             |
article           NP           verb
   |               |             |
   |         ______|______       |
   |     adjective      NP       |
   |         |           |       |
  The     little      program  works
---------------------

*** Grammar: rules for parsing (or generating) the sentences in a language

**** Example grammar (BNF)
---------------------
sentence ::= NP VP
NP ::= article NP
NP ::= adjective NP
NP ::= noun
article ::= The
adjective ::= little
noun ::= program
VP ::= verb
verb ::= works
---------------------

**** Parts of a grammar:
    nonterminals (or variables): syntactic categories from which strings
        can be derived
        - sentence, NP
    terminals: strings from the alphabet
        - program, works
        written in typewriter font in the book,
        sometimes surrounded by single quotes
        (note: a symbol cannot be both a terminal and a nonterminal)
    productions: relationships between strings of nonterminals and
        terminals
        - NP ::= article NP
    start symbol (sentence symbol): the nonterminal that generates the
        strings in the language
        - sentence
        conventionally written at the top of the grammar

*** Can be used for generation of sentences too.
    - "program works."
    - "little program works."
    - "The little little program works."
    etc.

** Kinds of grammars:

*** phrase structure grammar: l.h.s. contains at least 1 nonterminal
        S ::= A B C
        A ::= a
        a B ::= b
        C ::= c
    generates as follows:
        S = ABC = aBC = bC = bc
    - can describe any language whose sentences can be generated
      mechanically (Church's thesis), i.e., the recursively-enumerable
      languages

*** context-sensitive: rules have the forms
    (1) \alpha A \gamma ::= \alpha \beta \gamma
        where \alpha, \beta, and \gamma are strings of symbols
        and \beta is nonempty
        (\alpha and \gamma are the context)
    (2) sentence ::= \epsilon
        and if this second rule occurs, sentence does not appear on any
        right-hand side

    Example:
        S ::= A B C
        A B ::= A b C
        b C ::= b c c
        A b ::= a a b
        A ::= a
        c C ::= c c
    generates as follows:
        ABC = AbCC = AbccC = Abccc = abccc
        (other order: ABC = AbCC = abCC = abccC = abccc)
        ABC = aBC (a dead end: B can no longer be rewritten)
        ABC = AbCC = aabCC = aabccC = aabccc
    - can describe conditions such as declarations matching uses

*** context-free: rules have only one nonterminal on the left-hand side
    - important because we can draw parse (derivation) trees
    - can describe matching constructs (e.g., formulas, parentheses)
    - can express nesting through recursion
---------------------
        A CONTEXT-FREE GRAMMAR

statement ::= assign
statement ::= compound
assign ::= x <- y
compound ::= begin statement-list end
statement-list ::= statement
statement-list ::= statement ; statement-list
---------------------

**** BNF (Backus-Naur form, or Backus normal form)
    (a particular convention for writing context-free grammars)
    four meta characters: ::=  <  >  |
    nonterminals historically written <some-string>, but we use some-string
    replacement operator: ::=
    "or" symbol: |

    Example of BNF
---------------------
        A BNF GRAMMAR

A ::= () | ( A )

This generates "()", "(())", ...
---------------------
    - customary to use | for alternatives rather than separate listings
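
    To make the connection between grammar recursion and code concrete,
    here is a small recursive-descent recognizer for the grammar
    A ::= () | ( A ) above.  This is an added sketch, not part of the
    original notes: the names parse_A and recognize are invented for
    illustration, and Python is used only as a convenient notation.
---------------------
# Recursive-descent recognizer for the BNF grammar  A ::= () | ( A ).
# Each alternative of the production becomes a case in the function,
# and the recursive use of A becomes a recursive call.

def parse_A(s, i):
    """Try to match nonterminal A starting at position i of string s.
    Return the index just past the match, or None if A cannot be matched."""
    if i < len(s) and s[i] == '(':
        # Alternative: A ::= ( A )   (try the nested form first)
        j = parse_A(s, i + 1)
        if j is not None and j < len(s) and s[j] == ')':
            return j + 1
        # Alternative: A ::= ( )
        if i + 1 < len(s) and s[i + 1] == ')':
            return i + 2
    return None

def recognize(s):
    """True iff the whole string s is a sentence generated by A."""
    return parse_A(s, 0) == len(s)

# recognize("()") and recognize("((()))") are True;
# recognize("(()") and recognize("()()") are False.
---------------------
    Note how the recursion in parse_A mirrors the nesting that a
    context-free description can express.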
*** linear: productions have at most one nonterminal, surrounded by
    strings of terminals
    (1) A ::= u B v
    or (2) A ::= u
    where u and v are strings of terminals (possibly empty)
    - not important for us except as an introduction to...

*** right-linear: one terminal, optionally followed by a nonterminal
    E.g., A ::= a | a A (generates "a", "aa", ...)

*** left-linear: rules of the form A ::= B v | u

    - these two forms are equivalent in power; they can describe finite
      languages and simple repetitive patterns (the regular sets); see
      the sketch at the end of these notes

** Kinds of languages:

    Def: A language is context-free if there is a context-free grammar
         that generates it.
    Def: A language is regular if there is a right-linear (or
         left-linear) grammar that generates it.

-------------------------
    Containment of language types:

     ________________________________________
    | recursively-enumerable                 |
    |  ____________________________________  |
    | | context-sensitive                  | |
    | |  ________________________________  | |
    | | | context-free                   | | |
    | | |  ____________________________  | | |
    | | | | regular                    | | | |
    | | | |____________________________| | | |
    | | |________________________________| | |
    | |____________________________________| |
    |________________________________________|

    Different approximations to a programming language:

     ________________________________________
    | strings of legal tokens                |
    |  ____________________________________  |
    | | syntactically correct              | |
    | |  ________________________________  | |
    | | | statically correct             | | |
    | | |  ____________________________  | | |
    | | | | dynamically ok             | | | |
    | | | |____________________________| | | |
    | | |________________________________| | |
    | |____________________________________| |
    |________________________________________|
-------------------------

    Typically the syntax of a programming language is described using two
    grammars (for compilation, though not necessarily for the user's
    manual):

    lexical syntax: the syntax of tokens such as identifiers, numbers, etc.

    context-free syntax: the syntax of programs, not taking into account
        consistency checking (e.g., that variables are declared)

Quiz on the above (leads into regular expressions):

    1. Describe the languages generated by the following grammars.
       a. S ::= 1 | 1 S
       b. S ::= 0 | 0 S | 1 S
       c. S ::= 1 S 1 | 0

    2. Consider the following grammar:

            re ::= char | \epsilon | re re | re '|' re
                 | re * | re + | ( re )
            char ::= x | y | z

       For each of the following strings, say whether it is a sentence
       in the language generated by the above grammar.
       a. xyz
       b. (x|y)*
       c. x||y*
       d. x+y
       e. ((x+)y|\epsilon)
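
    A closing illustration (an added sketch, not part of the original
    notes) of why right-linear grammars describe only regular sets:
    because each production has at most one nonterminal, and that
    nonterminal comes last, a recognizer never needs a stack; it only
    has to remember which nonterminals it could currently be expanding,
    which is exactly the state set of a finite automaton.  The grammar
    encoding and the function name recognize_right_linear are invented
    for this sketch; Python is again used only as notation.
---------------------
# Recognize a string against a right-linear grammar using only a finite
# amount of memory (a set of "live" nonterminals).
#
# Grammar encoding (invented for this sketch): a dict mapping each
# nonterminal to a list of alternatives, where each alternative is a pair
# (terminal, next_nonterminal), with next_nonterminal = None for rules of
# the form  A ::= a  (no nonterminal follows the terminal).

def recognize_right_linear(grammar, start, s):
    """True iff s is derivable from start in the given right-linear grammar."""
    live = {start}      # nonterminals we might currently be expanding
    accept = False      # did some derivation end exactly at the last symbol?
    for ch in s:
        accept = False
        next_live = set()
        for nt in live:
            for terminal, rest in grammar.get(nt, []):
                if terminal == ch:
                    if rest is None:
                        accept = True          # A ::= a   (derivation ends here)
                    else:
                        next_live.add(rest)    # A ::= a B (continue from B)
        live = next_live
    return accept

# The right-linear grammar  A ::= a | a A  from the notes:
a_plus = {"A": [("a", None), ("a", "A")]}
# recognize_right_linear(a_plus, "A", "aaa") is True;
# recognize_right_linear(a_plus, "A", "") is False.
---------------------
    The parenthesis recognizer earlier needed recursion (a stack); this
    one does not, which reflects the containment of the regular sets
    inside the context-free languages shown in the diagram above.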