CS 342 Lecture -*- Outline -*-

* Introduction to Grammars and Syntax

** Motivation
        Why study syntax?
***             syntax matters (e.g., compare new pseudo-code syntax to old)
***             tools important for design (as in Algol 60, regularity, etc.)
***             well understood theory (leading to implementation techniques)
***             Need description/documentation (of any user interface)

** Basics
***        Language: set of strings in some alphabet.
                E.g. English (Roman alphabet, phonemes), Ada (ASCII)
                        Common Practice Notation (for music), button pushes...
***        Sentence: a string in a language

** Syntactic descriptions (Grammars, parse trees)

***     History: Algol effort
                Backus presented paper describing Algol 58,
                        using formal notation
                Naur used modified form of Backus's notation
                        for Algol 60 report,
                called BNF (now)

***	Example parse tree
------------------
        Example: "The little program works."

                              <sentence>
                                  |
                       ___________|__________
                      <NP>                  <VP>
                       |                     |
             __________|____                 |
         <article>       <NP>              <verb>
             |            |                  |
             |        ____|________          |
             |     <adjective>   <NP>        |
             |          |         |          |
            The      little     program    works.
---------------------

***	   Grammar: rules for parsing (or generating) sentences in a language

****        Example grammar (BNF)
---------------------
<sentence> ::= <NP> <VP>
<NP> ::= <article> <NP>
<NP> ::= <adjective> <NP>
<NP> ::= <noun>
<article> ::= The
<adjective> ::= little
<noun> ::= program
<VP> ::= <verb>
<verb> ::= works
---------------------
****	    Parts of grammar:
                nonterminals (or variables): syntactic categories from
                                which strings can be derived
                        -<sentence>, <NP>
                terminals: strings from the alphabet.
                        -program, works
                (note: a symbol cannot be both a terminal and a nonterminal.)
                productions: relationships between strings of
                                nonterminals and terminals.
                        -<NP> ::= <article> <NP>
                start symbol (sentence symbol): name of nonterminal
                                that generates the strings in the language.
                        -<sentence>
                        conventionally at top of grammar
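
                The parts listed above can be made concrete by writing the
                example grammar down as data. This is only a sketch: the
                dictionary layout and the names GRAMMAR, START, nonterminals
                and terminals are assumptions made for illustration, not
                part of BNF itself.

```python
# The example grammar as plain data: each nonterminal maps to a list of
# alternative right-hand sides, and each right-hand side is a list of symbols.
GRAMMAR = {
    "<sentence>": [["<NP>", "<VP>"]],
    "<NP>": [["<article>", "<NP>"], ["<adjective>", "<NP>"], ["<noun>"]],
    "<article>": [["The"]],
    "<adjective>": [["little"]],
    "<noun>": [["program"]],
    "<VP>": [["<verb>"]],
    "<verb>": [["works"]],
}
START = "<sentence>"  # start symbol, conventionally the first rule


def nonterminals(grammar):
    """Nonterminals are exactly the symbols that have productions."""
    return set(grammar)


def terminals(grammar):
    """Terminals are the right-hand-side symbols that have no productions."""
    used = {sym for alts in grammar.values() for rhs in alts for sym in rhs}
    return used - set(grammar)


print(sorted(nonterminals(GRAMMAR)))
print(sorted(terminals(GRAMMAR)))
```

                Because terminals are computed as "everything that is not a
                nonterminal", the two sets are disjoint by construction,
                matching the note that a symbol cannot be both.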

***        Can be used for generation of sentences too.
                -"program works."
                -"little program works."
                -"The little little program works." etc.
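
                Generation can be sketched as repeatedly replacing the
                leftmost nonterminal by one of its right-hand sides. The
                function name sentences and the word-count bound max_words
                are assumptions of this sketch; the bound is needed because
                the grammar generates infinitely many sentences.

```python
from collections import deque

GRAMMAR = {
    "<sentence>": [["<NP>", "<VP>"]],
    "<NP>": [["<article>", "<NP>"], ["<adjective>", "<NP>"], ["<noun>"]],
    "<article>": [["The"]],
    "<adjective>": [["little"]],
    "<noun>": [["program"]],
    "<VP>": [["<verb>"]],
    "<verb>": [["works"]],
}


def sentences(grammar, start, max_words):
    """Yield every sentence of at most max_words words.

    Works breadth-first over sentential forms, always expanding the
    leftmost nonterminal. The length bound is safe here because no rule
    of this grammar has an empty right-hand side, so forms never shrink.
    """
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any remains.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            yield " ".join(form)  # only terminals left: a sentence
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            if len(new_form) <= max_words:
                queue.append(new_form)


for s in sentences(GRAMMAR, "<sentence>", 4):
    print(s)
```

                With a bound of 4 words this produces 7 sentences. Among
                them is "The The program works", which the grammar allows
                even though English does not.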

** Kinds of grammars:

***      phrase structure grammar: l.h.s. contains at least 1 nonterminal

                        <S> ::= <A> <B> <C>
                        <A> ::= a
                      a <B> ::= b
                        <C> ::= c
                generates as follows
                        <S> => <A><B><C> => a<B><C> => b<C> => bc

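                -can describe every recursively enumerable language, i.e.,
                        (by Church's thesis) every language whose sentences
                        can be generated by some algorithm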
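
                The derivation above can be replayed mechanically. In this
                sketch a rule is a pair of symbol tuples and apply_rule
                rewrites the leftmost occurrence of the left-hand side; the
                names RULES and apply_rule are hypothetical.

```python
# Each rule is (left-hand side, right-hand side), both tuples of symbols.
# Note the third rule: its left-hand side "a <B>" is more than a single
# nonterminal, which is what phrase structure grammars permit.
RULES = {
    "S": (("<S>",), ("<A>", "<B>", "<C>")),
    "A": (("<A>",), ("a",)),
    "B": (("a", "<B>"), ("b",)),
    "C": (("<C>",), ("c",)),
}


def apply_rule(form, rule):
    """Rewrite the leftmost occurrence of the rule's left-hand side."""
    lhs, rhs = rule
    n = len(lhs)
    for i in range(len(form) - n + 1):
        if form[i:i + n] == lhs:
            return form[:i] + rhs + form[i + n:]
    raise ValueError("rule does not apply to " + " ".join(form))


form = ("<S>",)
steps = [form]
for name in ["S", "A", "B", "C"]:
    form = apply_rule(form, RULES[name])
    steps.append(form)

print(" => ".join("".join(step) for step in steps))
```

                Applying the B rule before the A rule raises ValueError,
                since <B> can only be rewritten once an "a" stands to its
                left.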
***      context sensitive
	  rules have forms:
          (1)           \alpha <A> \gamma ::= \alpha \beta \gamma
                where \alpha, \beta and \gamma are strings of symbols
                        and \beta is nonempty.
                        (\alpha and \gamma are the context)
          (2)           <sentence> ::=
                (the right-hand side is empty); if this second rule occurs,
                        <sentence> must not appear on any right-hand side.
                Example:
                        <S> ::= <A> <B> <C>
                    <A> <B> ::= <A> b <C>
                      b <C> ::= b c c
                      <A> b ::= a a b
                        <A> ::= a
                      c <C> ::= c c
                 generates as follows (each derivation begins <S> => <A><B><C>):
                    <A><B><C> => <A>b<C><C> => <A>bcc<C> => <A>bccc => abccc
                    (other order:
                    <A><B><C> => <A>b<C><C> => ab<C><C> => abcc<C> => abccc)
                    <A><B><C> => a<B><C>    (dead end: no rule applies)
                    <A><B><C> => <A>b<C><C> => aab<C><C> => aabcc<C> => aabccc
                -can describe conditions such as declarations matching uses.
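
                To see every sentence this example grammar generates, one
                can search all derivations. The sketch below assumes the
                rules are stored as pairs of symbol tuples; successors and
                language are hypothetical names, and max_len is an
                arbitrary bound on the length of a sentential form.

```python
from collections import deque

# (left-hand side, right-hand side) pairs for the example grammar.
RULES = [
    (("<S>",), ("<A>", "<B>", "<C>")),
    (("<A>", "<B>"), ("<A>", "b", "<C>")),
    (("b", "<C>"), ("b", "c", "c")),
    (("<A>", "b"), ("a", "a", "b")),
    (("<A>",), ("a",)),
    (("c", "<C>"), ("c", "c")),
]


def successors(form):
    """Yield every form reachable from `form` by one rule application."""
    for lhs, rhs in RULES:
        n = len(lhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == lhs:
                yield form[:i] + rhs + form[i + n:]


def language(start, max_len):
    """Collect all terminal strings derivable with forms of <= max_len symbols.

    No rule here shortens a form, so a form longer than max_len can never
    lead back to a sentence of at most max_len symbols.
    """
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        if not any(sym.startswith("<") for sym in form):
            found.add("".join(form))
        for nxt in successors(form):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return found


print(sorted(language(("<S>",), 10)))  # ['aabccc', 'abccc']
```

                The search also visits the dead end a<B><C>, which has no
                successors and so contributes no sentence.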

***    context-free: the left-hand side of every rule is exactly one nonterminal.
                -important because can draw parse (derivation) trees
                -can describe matching constructs (e.g., formulas, parentheses)
                -can express nesting through recursion
---------------------
<statement> ::= <assign>
<statement> ::= <compound>
<assign> ::= x <- y
<compound> ::= begin <statement list> end
<statement list> ::= <statement>
<statement list> ::= <statement> ; <statement list>
---------------------
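                Nesting through recursion shows up directly in a recognizer
                that has one function per nonterminal. This is a sketch of
                recursive descent for the statement grammar above; it
                assumes tokens are separated by blanks, and the function
                names are hypothetical.

```python
def expect(tokens, i, tok):
    """Return the index after tokens[i] if it equals tok, else raise."""
    if i < len(tokens) and tokens[i] == tok:
        return i + 1
    raise SyntaxError(f"expected {tok!r} at token {i}")


def parse_statement(tokens, i):
    """<statement> ::= <assign> | <compound>; returns index past the statement."""
    if i < len(tokens) and tokens[i] == "begin":
        # <compound> ::= begin <statement list> end
        i = parse_statement_list(tokens, i + 1)
        return expect(tokens, i, "end")
    # <assign> ::= x <- y
    for tok in ("x", "<-", "y"):
        i = expect(tokens, i, tok)
    return i


def parse_statement_list(tokens, i):
    """<statement list> ::= <statement> | <statement> ; <statement list>"""
    i = parse_statement(tokens, i)
    if i < len(tokens) and tokens[i] == ";":
        return parse_statement_list(tokens, i + 1)
    return i


def is_statement(text):
    """True if the whole text is one <statement>."""
    tokens = text.split()
    try:
        return parse_statement(tokens, 0) == len(tokens)
    except SyntaxError:
        return False


print(is_statement("begin x <- y ; begin x <- y end end"))  # True
print(is_statement("begin x <- y ; end"))                   # False
```

                parse_statement calls parse_statement_list, which calls
                parse_statement again: the mutual recursion mirrors the
                recursion in the grammar, so compound statements may nest
                to any depth.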
****        BNF (Backus Naur form or Backus Normal form)
                (particular convention of context-free)
                        four meta characters < > ::= |.
                        nonterminals denoted: <some string>
                        replacement operator: ::=
                        or symbol: |
	    Examples of BNF
-------------------
<A> ::= () | ( <A> )
       generates "()", "(())", ...
-------------------
            -customary to only use | for or, not separate listings
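
            The two alternatives of <A> translate almost literally into a
            recursive check. A minimal sketch, assuming the sentence is
            given as a string without blanks; the name is_A is hypothetical.

```python
def is_A(s):
    """Recognize the language of <A> ::= () | ( <A> )."""
    if s == "()":                      # first alternative: ()
        return True
    # second alternative: ( <A> ), so strip one pair and recurse
    return len(s) >= 4 and s[0] == "(" and s[-1] == ")" and is_A(s[1:-1])


for candidate in ["()", "((()))", "()()", "(()"]:
    print(candidate, is_A(candidate))
```

            Only "()" and "((()))" are accepted; "()()" is balanced but is
            not generated by this grammar.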

***	linear: productions have at most one nonterminal on the right-hand side
           (1)          <A> ::= u <B> v
           or (2)       <A> ::= u
                where u and v are strings of terminals (possibly empty)
                -not important for us except as intro to...
***      right linear: terminals, optionally followed by one nonterminal
                E.g., <A> ::= a | a <A>
                (generates "a", "aa", ...)
***      left linear: rules of form <A> ::= <B> v | u
                -these two are equivalent in power; they can describe any
                        finite language and simple repeating patterns
                        (the regular sets)
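
                A right-linear grammar can be run like a finite automaton:
                the nonterminal still to be expanded acts as the current
                state. The sketch below assumes every rule has exactly one
                terminal character; RULES and accepts are hypothetical
                names. It compares the grammar <A> ::= a | a <A> with the
                regular expression a+ using Python's re module.

```python
import re

# Each nonterminal maps to (terminal, following nonterminal or None) pairs.
# <A> ::= a | a <A>
RULES = {"<A>": [("a", None), ("a", "<A>")]}


def accepts(rules, start, s):
    """True if the right-linear grammar generates s."""
    states = {start}      # nonterminals that might be expanded next
    finished = False      # did the last character complete a derivation?
    for ch in s:
        next_states = set()
        finished = False
        for state in states:
            for terminal, follow in rules[state]:
                if terminal != ch:
                    continue
                if follow is None:
                    finished = True          # rule of the form <A> ::= a
                else:
                    next_states.add(follow)  # rule of the form <A> ::= a <B>
        states = next_states
    return finished


for s in ["", "a", "aaa", "ab"]:
    print(repr(s), accepts(RULES, "<A>", s), bool(re.fullmatch(r"a+", s)))
```

                On these inputs the grammar and the regular expression
                agree, as expected for a regular language.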

** Kinds of languages:
        A language is context-free if there is a context free grammar that
                generates it.
        A language is regular if there is a right-linear (or left-linear)
                grammar that generates it.

-------------------------
        Containment of language types:
        ________________________________________
        |        recursively-enumerable        |      (phrase structure gram.)
        |   ________________________________   |
        |   |      context-sensitive       |   |
        |   |   ________________________   |   |
        |   |   |     context-free     |   |   |
        |   |   |   ________________   |   |   |
        |   |   |   |   regular    |   |   |   |      (left or right linear)
        |   |   |   |______________|   |   |   |
        |   |   |______________________|   |   |
        |   |______________________________|   |
        |______________________________________|


	Different approximations to a programming language:

        ________________________________________
        |       strings of legal tokens        |      (regular)
        |   ________________________________   |
        |   |    syntactically correct     |   |      (context-free syntax)
        |   |   ________________________   |   |
        |   |   |  statically correct  |   |   |      (context-sensitive)
        |   |   |   ________________   |   |   |
        |   |   |   |dynamically ok|   |   |   |
        |   |   |   |______________|   |   |   |
        |   |   |______________________|   |   |
        |   |______________________________|   |
        |______________________________________|

------------------------
        Typically describe syntax of programming language using two grammars
                (for compilation, but not necessarily for user's manual)
        lexical syntax: syntax of tokens such as identifiers, numbers, etc.
        context-free syntax: syntax not taking into account
                consistency checking (e.g., variables declared)


Quiz on the above (leads into regular expressions):
	1. Describe the languages generated by the following grammars
	    a.	<S> ::= 1 |  1 <S>
	    b.  <S> ::= 0 | 0 <S> | 1 <S>
	    c.  <S> ::= 1 <S> 1 | 0

	2. Consider the following grammar:
 	<re> ::= <char>
	  | \epsilon
	  | <re> <re>
	  | <re> '|' <re>
	  | <re> *
	  | <re> +
	  | ( <re> )
	<char> ::= x | y | z

	For each of the following strings, answer whether it is a sentence
		in the language generated by the above grammar
	    a. xyz
	    b. (x|y)*
	    c. x||y*
	    d. x+y
	    e. ((x+)y|\epsilon)