CS 541 Lecture                                   -*- Outline -*-

* Introduction to Grammars and Syntax

** Why study syntax?

*** syntax matters (e.g., compare FORTRAN to LISP to Pascal)
*** tools important for design (as in Algol 60, regularity, etc.)
*** well-understood theory (leading to implementation techniques)
*** need description/documentation (of any user interface)

** Basics

*** Language: a set of strings over some alphabet.
    E.g., English (Roman alphabet, phonemes), Ada (ASCII),
    Common Practice Notation (for music), button pushes...
*** Sentence: a string in a language

** Syntactic descriptions (Grammars, parse trees)

*** History: the Algol effort
    Backus presented a paper describing Algol 58, using a formal notation.
    Naur used a modified form of Backus's notation for the Algol 60 report;
    that notation is now called BNF.

*** Example parse tree
---------------------
                A PARSE TREE

                  sentence
                      |
           ___________|___________
          NP                     VP
           |                     |
   ________|________             |
article           NP           verb
   |               |             |
   |         ______|______       |
   |     adjective      NP       |
   |         |           |       |
  The     little      program  works
---------------------

*** Grammar: rules for parsing (or generating) the sentences in a language

**** Example grammar (BNF)
---------------------
sentence ::= NP VP
NP ::= article NP
NP ::= adjective NP
NP ::= noun
article ::= The
adjective ::= little
noun ::= program
VP ::= verb
verb ::= works
---------------------

**** Parts of a grammar:
    nonterminals (or variables): syntactic categories from which strings
        can be derived
        - sentence, NP
    terminals: strings from the alphabet
        - program, works
        written in typewriter font in the book,
        sometimes surrounded by single quotes
        (note: a symbol cannot be both a terminal and a nonterminal)
    productions: relationships between strings of nonterminals and
        terminals
        - NP ::= article NP
    start symbol (sentence symbol): the nonterminal that generates the
        strings in the language
        - sentence
        conventionally written at the top of the grammar

*** Can be used for generation of sentences too.
    - "program works."
    - "little program works."
    - "The little little program works."
    etc.

** Kinds of grammars:

*** phrase structure grammar: l.h.s. contains at least 1 nonterminal
        S ::= A B C
        A ::= a
        a B ::= b
        C ::= c
    generates as follows:
        S = ABC = aBC = bC = bc
    - can describe any language whose sentences can be generated
      mechanically (Church's thesis), i.e., the recursively-enumerable
      languages

*** context-sensitive: rules have the forms
    (1) \alpha A \gamma ::= \alpha \beta \gamma
        where \alpha, \beta, and \gamma are strings of symbols
        and \beta is nonempty
        (\alpha and \gamma are the context)
    (2) sentence ::= \epsilon
        and if this second rule occurs, sentence does not appear on any
        right-hand side

    Example:
        S ::= A B C
        A B ::= A b C
        b C ::= b c c
        A b ::= a a b
        A ::= a
        c C ::= c c
    generates as follows:
        ABC = AbCC = AbccC = Abccc = abccc
        (other order: ABC = AbCC = abCC = abccC = abccc)
        ABC = aBC (a dead end: B can no longer be rewritten)
        ABC = AbCC = aabCC = aabccC = aabccc
    - can describe conditions such as declarations matching uses

*** context-free: rules have only one nonterminal on the left-hand side
    - important because we can draw parse (derivation) trees
    - can describe matching constructs (e.g., formulas, parentheses)
    - can express nesting through recursion
---------------------
        A CONTEXT-FREE GRAMMAR

statement ::= assign
statement ::= compound
assign ::= x <- y
compound ::= begin statement-list end
statement-list ::= statement
statement-list ::= statement ; statement-list
---------------------

**** BNF (Backus-Naur form, or Backus normal form)
    (a particular convention for writing context-free grammars)
    four meta characters: ::=  <  >  |
    nonterminals historically written <some-string>, but we use some-string
    replacement operator: ::=
    "or" symbol: |

    Example of BNF
---------------------
        A BNF GRAMMAR

A ::= () | ( A )

This generates "()", "(())", ...
---------------------
    - customary to use | for alternatives rather than separate listings
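
    To make the connection between grammar recursion and code concrete,
    here is a small recursive-descent recognizer for the grammar
    A ::= () | ( A ) above.  This is an added sketch, not part of the
    original notes: the names parse_A and recognize are invented for
    illustration, and Python is used only as a convenient notation.
---------------------
# Recursive-descent recognizer for the BNF grammar  A ::= () | ( A ).
# Each alternative of the production becomes a case in the function,
# and the recursive use of A becomes a recursive call.

def parse_A(s, i):
    """Try to match nonterminal A starting at position i of string s.
    Return the index just past the match, or None if A cannot be matched."""
    if i < len(s) and s[i] == '(':
        # Alternative: A ::= ( A )   (try the nested form first)
        j = parse_A(s, i + 1)
        if j is not None and j < len(s) and s[j] == ')':
            return j + 1
        # Alternative: A ::= ( )
        if i + 1 < len(s) and s[i + 1] == ')':
            return i + 2
    return None

def recognize(s):
    """True iff the whole string s is a sentence generated by A."""
    return parse_A(s, 0) == len(s)

# recognize("()") and recognize("((()))") are True;
# recognize("(()") and recognize("()()") are False.
---------------------
    Note how the recursion in parse_A mirrors the nesting that a
    context-free description can express.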
*** linear: productions have at most one nonterminal, surrounded by
    strings of terminals
    (1) A ::= u B v
    or (2) A ::= u
    where u and v are strings of terminals (possibly empty)
    - not important for us except as an introduction to...

*** right-linear: one terminal, optionally followed by a nonterminal
    E.g., A ::= a | a A (generates "a", "aa", ...)

*** left-linear: rules of the form A ::= B v | u

    - these two forms are equivalent in power; they can describe finite
      languages and simple repetitive patterns (the regular sets); see
      the sketch at the end of these notes

** Kinds of languages:

    Def: A language is context-free if there is a context-free grammar
         that generates it.
    Def: A language is regular if there is a right-linear (or
         left-linear) grammar that generates it.

-------------------------
    Containment of language types:

     ________________________________________
    | recursively-enumerable                 |
    |  ____________________________________  |
    | | context-sensitive                  | |
    | |  ________________________________  | |
    | | | context-free                   | | |
    | | |  ____________________________  | | |
    | | | | regular                    | | | |
    | | | |____________________________| | | |
    | | |________________________________| | |
    | |____________________________________| |
    |________________________________________|

    Different approximations to a programming language:

     ________________________________________
    | strings of legal tokens                |
    |  ____________________________________  |
    | | syntactically correct              | |
    | |  ________________________________  | |
    | | | statically correct             | | |
    | | |  ____________________________  | | |
    | | | | dynamically ok             | | | |
    | | | |____________________________| | | |
    | | |________________________________| | |
    | |____________________________________| |
    |________________________________________|
-------------------------

    Typically the syntax of a programming language is described using two
    grammars (for compilation, though not necessarily for the user's
    manual):

    lexical syntax: the syntax of tokens such as identifiers, numbers, etc.

    context-free syntax: the syntax of programs, not taking into account
        consistency checking (e.g., that variables are declared)

Quiz on the above (leads into regular expressions):

    1. Describe the languages generated by the following grammars.
       a. S ::= 1 | 1 S
       b. S ::= 0 | 0 S | 1 S
       c. S ::= 1 S 1 | 0

    2. Consider the following grammar:

            re ::= char | \epsilon | re re | re '|' re
                 | re * | re + | ( re )
            char ::= x | y | z

       For each of the following strings, say whether it is a sentence
       in the language generated by the above grammar.
       a. xyz
       b. (x|y)*
       c. x||y*
       d. x+y
       e. ((x+)y|\epsilon)
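
    A closing illustration (an added sketch, not part of the original
    notes) of why right-linear grammars describe only regular sets:
    because each production has at most one nonterminal, and that
    nonterminal comes last, a recognizer never needs a stack; it only
    has to remember which nonterminals it could currently be expanding,
    which is exactly the state set of a finite automaton.  The grammar
    encoding and the function name recognize_right_linear are invented
    for this sketch; Python is again used only as notation.
---------------------
# Recognize a string against a right-linear grammar using only a finite
# amount of memory (a set of "live" nonterminals).
#
# Grammar encoding (invented for this sketch): a dict mapping each
# nonterminal to a list of alternatives, where each alternative is a pair
# (terminal, next_nonterminal), with next_nonterminal = None for rules of
# the form  A ::= a  (no nonterminal follows the terminal).

def recognize_right_linear(grammar, start, s):
    """True iff s is derivable from start in the given right-linear grammar."""
    live = {start}      # nonterminals we might currently be expanding
    accept = False      # did some derivation end exactly at the last symbol?
    for ch in s:
        accept = False
        next_live = set()
        for nt in live:
            for terminal, rest in grammar.get(nt, []):
                if terminal == ch:
                    if rest is None:
                        accept = True          # A ::= a   (derivation ends here)
                    else:
                        next_live.add(rest)    # A ::= a B (continue from B)
        live = next_live
    return accept

# The right-linear grammar  A ::= a | a A  from the notes:
a_plus = {"A": [("a", None), ("a", "A")]}
# recognize_right_linear(a_plus, "A", "aaa") is True;
# recognize_right_linear(a_plus, "A", "") is False.
---------------------
    The parenthesis recognizer earlier needed recursion (a stack); this
    one does not, which reflects the containment of the regular sets
    inside the context-free languages shown in the diagram above.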