CS 342 Lecture -*- Outline -*-

* Introduction to Grammars and Syntax

** Motivation
   Why study syntax?
*** syntax matters
    (e.g., compare new pseudo-code syntax to old)
*** tools important for design
    (as in Algol 60, regularity, etc.)
*** well understood theory
    (leading to implementation techniques)
*** Need description/documentation
    (of any user interface)

** Basics
*** Language: set of strings in some alphabet.
    E.g. English (Roman alphabet, phonemes), Ada (ASCII),
    Common Practice Notation (for music), button pushes...
*** Sentence: a string in a language

** Syntactic descriptions (Grammars, parse trees)
*** History: Algol effort
    Backus presented a paper describing Algol 58, using formal notation
    Naur used a modified form of Backus's notation for the Algol 60 report,
    (now) called BNF
*** Example parse tree
        ------------------
        Example: "The little program works."

        [parse tree; interior node labels not recoverable.
         Leaves, left to right: The  little  program  works.]
        ---------------------
*** Grammar: rules for parsing (or generating) sentences in a language
**** Example grammar (BNF)
        ---------------------
        ::=
        ::=
        ::=
        ::=
        ::= The
        ::= little
        ::= program
        ::=
        ::= works
        ---------------------
**** Parts of grammar:
     nonterminals (or variables): syntactic categories from which
        strings can be derived
        -,
     terminals: strings from the alphabet.
        -program, works
        (note: a symbol cannot be both a terminal and a nonterminal.)
     productions: relationships between strings of nonterminals
        and terminals.
        - ::=
     start symbol (sentence symbol): name of the nonterminal that
        generates the strings in the language.
        - conventionally at top of grammar

*** Can be used for generation of sentences too.
        -"program works."
        -"little program works."
        -"The little little program works."
        etc.

** Kinds of grammars:

*** phrase structure grammar: l.h.s. contains at least 1 nonterminal
        ::=
        ::= a
        a ::= b
        ::= c
    generates as follows
        => => a => b => bc
    -can describe every language whose sentences can be enumerated by
     some algorithm (the recursively enumerable languages; Church's thesis)

*** context sensitive
    rules have forms:
     (1) \alpha <A> \gamma ::= \alpha \beta \gamma
         where <A> is a nonterminal, \alpha, \beta and \gamma are strings
         of symbols, and \beta is nonempty.
         (\alpha and \gamma are the context)
     (2) <S> ::= \epsilon
         where <S> is the start symbol, and if this second rule occurs,
         <S> does not appear on the right hand side of any rule.

        ::=
        ::= b
        b ::= b c c
        b ::= a a b
        ::= a
        c ::= c c
    generates as follows:
        => b => bcc => bccc => abccc
        (other order: => b => ab => abcc => abccc)
        => a => b => aab => aabcc => aabccc
    -can describe conditions such as declarations matching uses.

*** context-free: the left-hand side of each rule is a single nonterminal.
    -important because can draw parse (derivation) trees
    -can describe matching constructs (e.g., formulas, parentheses)
    -can express nesting through recursion
        ---------------------
        ::=
        ::=
        ::= x <- y
        ::= begin end
        ::=
        ::= ;
        ---------------------
**** BNF (Backus Naur form or Backus Normal form)
     (particular convention of context-free)
     four meta characters: < > ::= |
        nonterminals denoted: <name>
        replacement operator: ::=
        or symbol: |
     Examples of BNF
        -------------------
        ::= () | ( )
        generates "()", "(())", ...
        -------------------
     -customary to only use | for or, not separate listings

*** linear: at most one nonterminal on the r.h.s., with terminals
    on either side
     (1) <A> ::= u <B> v   or   (2) <A> ::= u
     where <A> and <B> are nonterminals and u and v are strings of
     terminals (possibly empty)
    -not important for us except as intro to...

*** right linear: one terminal, may be followed by a nonterminal
    E.g., ::= a | a    (generates "a", "aa", ...)
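    Generation from a grammar, as in the examples above, can be sketched
    in Python. This is only an illustration under stated assumptions: the
    nonterminal names A and P are hypothetical, and the two grammars are
    assumed readings of the right-linear example (A ::= a | a A) and of
    the parentheses example (P ::= () | ( P )). Each step replaces the
    leftmost nonterminal, and sentential forms are explored breadth-first.

```python
from collections import deque

def generate(grammar, start, limit):
    """Return the first `limit` sentences found by breadth-first,
    leftmost derivation from the nonterminal `start`.

    `grammar` maps each nonterminal to a list of right-hand sides;
    any symbol that is not a key of `grammar` is treated as a terminal.
    """
    sentences = []
    queue = deque([(start,)])          # sentential forms still to expand
    while queue and len(sentences) < limit:
        form = queue.popleft()
        for i, symbol in enumerate(form):
            if symbol in grammar:      # leftmost nonterminal found
                for rhs in grammar[symbol]:
                    queue.append(form[:i] + tuple(rhs) + form[i + 1:])
                break
        else:
            # No nonterminal left: the form is a sentence.
            sentences.append("".join(form))
    return sentences

# Assumed reading of the right-linear example (name A is hypothetical).
RIGHT_LINEAR = {"A": [["a"], ["a", "A"]]}

# Assumed reading of the parentheses example (name P is hypothetical).
PARENS = {"P": [["(", ")"], ["(", "P", ")"]]}

print(generate(RIGHT_LINEAR, "A", 3))  # ['a', 'aa', 'aaa']
print(generate(PARENS, "P", 3))        # ['()', '(())', '((()))']
```

    Keeping nonterminals as dictionary keys and everything else as
    terminals mirrors the note that a symbol cannot be both a terminal
    and a nonterminal.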
*** left linear: rules of form
        <A> ::= <B> v | u
    where <A> and <B> are nonterminals and u and v are strings of terminals
    -right linear and left linear grammars are equivalent in power;
     they can describe any finite language and simple patterns
     (regular sets)

** Kinds of languages:
   A language is context-free if there is a context-free grammar
   that generates it.
   A language is regular if there is a right-linear (or left-linear)
   grammar that generates it.

        -------------------------
        Containment of language types:

         ________________________________________
        | recursively-enumerable                 |
        | (phrase structure gram.)               |
        |  _________________________________     |
        | | context-sensitive               |    |
        | |  __________________________     |    |
        | | | context-free             |    |    |
        | | |  __________________      |    |    |
        | | | | regular          |     |    |    |
        | | | | (left or right   |     |    |    |
        | | | |  linear)         |     |    |    |
        | | | |__________________|     |    |    |
        | | |__________________________|    |    |
        | |_________________________________|    |
        |________________________________________|

        Different approximations to a programming language:

         ________________________________________
        | strings of legal tokens                |
        | (regular)                              |
        |  _________________________________     |
        | | syntactically correct           |    |
        | | (context-free syntax)           |    |
        | |  __________________________     |    |
        | | | statically correct       |    |    |
        | | | (context-sensitive)      |    |    |
        | | |  __________________      |    |    |
        | | | | dynamically ok   |     |    |    |
        | | | |__________________|     |    |    |
        | | |__________________________|    |    |
        | |_________________________________|    |
        |________________________________________|
        ------------------------

   Typically describe the syntax of a programming language using two grammars
   (for compilation, but not necessarily for the user's manual)
        lexical syntax: syntax of tokens such as identifiers, numbers, etc.
        context-free syntax: syntax not taking into account
                consistency checking (e.g., variables declared)

   Quiz on the above (leads into regular expressions):
   1. Describe the languages generated by the following grammars
      a. ::= 1 | 1
      b. ::= 0 | 0 | 1
      c. ::= 1 1 | 0
   2. Consider the following grammar:
        ::= | \epsilon | | '|' | * | + | ( )
        ::= x | y | z
      For each of the following strings, answer whether it is a sentence
      in the language generated by the above grammar
      a. xyz
      b. (x|y)*
      c. x||y*
      d. x+y
      e. ((x+)y|\epsilon)
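
   The two-grammar description mentioned before the quiz (lexical syntax
   first, then context-free syntax) can be sketched in Python. Everything
   specific here is an assumption made for illustration: the token set
   (identifiers, "<-", ";", and the keywords begin/end), the statement
   grammar (a statement is either `identifier <- identifier` or
   `begin` statements separated by ";" `end`), and all function names.
   The lexical level uses a regular expression; the nesting of begin/end
   needs recursion, which is what the context-free level adds.

```python
import re

# Lexical syntax (regular): optional whitespace, then exactly one token.
TOKEN = re.compile(r"\s*(<-|;|[A-Za-z_][A-Za-z0-9_]*)")
KEYWORDS = {"begin", "end"}

def tokenize(text):
    """Split text into tokens; raise ValueError on an illegal character."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"illegal character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_identifier(token):
    return token.isidentifier() and token not in KEYWORDS

def parse_stmt(tokens, i):
    """Context-free syntax: parse one statement starting at index i and
    return the index just past it; raise SyntaxError if none is found."""
    if i < len(tokens) and tokens[i] == "begin":
        i = parse_stmt(tokens, i + 1)            # first nested statement
        while i < len(tokens) and tokens[i] == ";":
            i = parse_stmt(tokens, i + 1)        # further statements
        if i >= len(tokens) or tokens[i] != "end":
            raise SyntaxError("expected 'end'")
        return i + 1
    if (i + 2 < len(tokens) and is_identifier(tokens[i])
            and tokens[i + 1] == "<-" and is_identifier(tokens[i + 2])):
        return i + 3                             # identifier <- identifier
    raise SyntaxError("expected a statement")

def is_statement(text):
    """True if text is a string of legal tokens forming one statement."""
    try:
        tokens = tokenize(text)
        return parse_stmt(tokens, 0) == len(tokens)
    except (ValueError, SyntaxError):
        return False

print(is_statement("begin x <- y; begin a <- b end end"))  # True
print(is_statement("x <- 3"))    # False: '3' is not a legal token here
print(is_statement("x y <-"))    # False: legal tokens, wrong syntax
```

   Checks such as "variables declared" are deliberately left out; they
   belong to the context-sensitive layer in the diagram above.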