Lecture 6


  • Let's be more precise about syntax, e.g.,
    • order of operations
    • specifying numbers, identifiers, etc.

Formal grammars specify syntax


Regular expressions

String pattern matching

  • phone numbers
  • email addresses
  • programming language tokens

How would you write this program?

  • Search for an email address in a text file?


  • Area code is three digits


  • A digit is either a 0, 1, 2, ..., 9

Parentheses are just used to make order of operations explicit, just like in arithmetic


  • Any number of characters before the @ sign

Optional elements

  • E.g., country code

Wild cards

  • Allow any character (or some subset of characters)
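
The pieces above (digits, a three-digit area code, an optional country code, and a wild card before the @ sign) can be sketched with Python's re module. The exact phone and email formats below are assumptions chosen for illustration, not a standard; real formats vary much more.

```python
import re

# Hypothetical patterns for illustration; real phone and email formats vary widely.
DIGIT = r"[0-9]"                          # a digit is 0, 1, 2, ..., or 9
AREA_CODE = DIGIT + "{3}"                 # area code is three digits
COUNTRY = r"(\+" + DIGIT + r"{1,3} )?"    # optional element: country code such as "+1 "
PHONE = re.compile(
    COUNTRY + r"\(" + AREA_CODE + r"\) " + DIGIT + "{3}-" + DIGIT + "{4}"
)

# Wild card: "[^@\s]+" allows one or more characters that are not "@" or whitespace.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[a-zA-Z]+")


def is_phone(s):
    """Return True if the whole string is a phone number in the assumed format."""
    return PHONE.fullmatch(s) is not None


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)
```

Here is_phone checks a whole string, while find_emails searches inside a longer text, which is the "search for an email address in a text file" task once the file's contents are read into a string.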

Regular expression language

  1. concatenation, e.g., ab
  2. alternation, e.g., a|b
  3. Kleene closure, e.g., a*

The order of operations, from highest to lowest, is Kleene closure, concatenation, and alternation.

One way to remember the order of operations is that alternation is like addition or logical or, concatenation is like multiplication or logical and, and Kleene closure is like exponentiation.
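
A small check of the order of operations, using Python's re module as a stand-in for the formal notation (an assumption; Python's regex syntax has many extras beyond the three operations). The pattern names are made up for this sketch.

```python
import re

# "ab|c*" groups as "(ab)|(c*)": closure binds tightest, then concatenation,
# then alternation.
implicit = re.compile(r"ab|c*")
explicit = re.compile(r"(ab)|(c*)")

# Parentheses override the default order: "a" followed by any number of "b" or "c".
regrouped = re.compile(r"a(b|c)*")


def same_verdict(s):
    """Return True if the implicit and explicit groupings agree on string s."""
    return (implicit.fullmatch(s) is None) == (explicit.fullmatch(s) is None)
```

The implicit pattern accepts "ab" and "ccc" but rejects "abc", while the regrouped pattern accepts "abcb" and rejects "ccc".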

Regular expressions in practice

Finite state automata

  • AKA
    • Finite state machines
    • Finite automaton
    • State machine

Pattern matching equivalent to many automation tasks

  • String pattern matching problem
    • Capture each possible string prefix in a state

First abstract machine model

  • Formal language: potentially infinite set of strings
  • Each string drawn from a finite alphabet
  • Each string element itself is finite

Here's our first abstract machine model

You'll see more in discrete 2

Other state machine applications

  • Traffic lights
  • Turnstiles
  • Vending machines

Any machine with some predefined set of states and events that transition between states
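
As a rough illustration, a turnstile can be written as a table of states and events. The state names ("locked", "unlocked") and event names ("coin", "push") are assumptions picked for this sketch.

```python
# Transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",    # paying unlocks the arm
    ("locked", "push"): "locked",      # pushing a locked arm does nothing
    ("unlocked", "coin"): "unlocked",  # an extra coin changes nothing
    ("unlocked", "push"): "locked",    # walking through locks it again
}


def run_turnstile(events, state="locked"):
    """Feed a sequence of events to the machine and return the final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state
```

For example, run_turnstile(["coin", "push"]) ends in "locked" and run_turnstile(["coin"]) ends in "unlocked".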

Implementing finite state automata

  • Graph
  • If and while
  • Table-based (diagram)
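
Two of these styles can be compared on one small automaton. The sketch below assumes a toy token, an identifier made of a letter followed by letters or digits; the function and table names are invented for illustration.

```python
def classify(ch):
    """Map a character to the input class used by the automaton."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


# Table-based: rows are states, columns are input classes.
# A missing entry means there is no transition, so the string is rejected.
TABLE = {
    "start": {"letter": "ident"},
    "ident": {"letter": "ident", "digit": "ident"},
}
ACCEPTING = {"ident"}


def matches_table(s):
    """Run the automaton by looking up each transition in TABLE."""
    state = "start"
    for ch in s:
        state = TABLE[state].get(classify(ch))
        if state is None:
            return False
    return state in ACCEPTING


def matches_if_while(s):
    """The same automaton written directly with if and while."""
    i = 0
    # State "start": exactly one letter is required.
    if i < len(s) and classify(s[i]) == "letter":
        i += 1
    else:
        return False
    # State "ident": loop on letters and digits.
    while i < len(s) and classify(s[i]) in ("letter", "digit"):
        i += 1
    return i == len(s)
```

The table version keeps the automaton as data, so changing the language means editing the table; the if-and-while version hard-codes each state as a position in the program.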

Automatically generating automata from regular expressions

This is how the flex tool works under the hood.

Limitations of regular expressions

  • Regexes match patterns in strings
  • Can match infinite set of strings
  • Don't support certain patterns

Regexes match an infinite set of strings with a finite expression

Matching curly braces

  • Curly braces make nested scopes (in C-like languages)
  • Is there a regex to ensure matched braces?
{   {   {    }   }   }  

We can write a regex that accepts every program with matched curly braces (along with some programs whose braces do not match), but no regex accepts exactly the strings whose braces are matched to an arbitrary nesting depth.

Finite state automata "can't count"

  • Has a finite number of states
  • But nesting is arbitrary
  • Need a new state for each level of nesting depth
  • (Diagram)

Show how you need to keep adding states for each level of nesting you want to match. Need an unbounded number of states for an unbounded language.
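
A sketch of the contrast, using Python's re module: a recognizer with an unbounded counter handles any depth, while a regex can only be built for a fixed maximum depth and must grow with each extra level. The function names are made up, and the inputs are assumed to contain only brace characters.

```python
import re


def braces_matched(text):
    """Recognize matched braces with an unbounded counter.

    A finite automaton has no such counter, only a fixed set of states.
    """
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with nothing open
                return False
    return depth == 0


def bounded_brace_regex(max_depth):
    """Build a regex that accepts matched braces up to max_depth levels only."""
    pattern = ""
    for _ in range(max_depth):
        # Each extra level wraps the previous pattern, so the regex
        # (and its automaton) keeps growing with the depth.
        pattern = r"(\{" + pattern + r"\})*"
    return re.compile(pattern)
```

With max_depth set to 2 the regex accepts "{{}}" but rejects "{{{}}}"; raising it to 3 accepts that string but rejects the next deeper one, and so on without end.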

Hierarchical syntax

  • Natural language
    • The person walks home
    • The person I went to school with walks home
  • Syntax: the valid arrangements of words (tokens)

The nesting structure we see in programming languages is just like that of natural language. We use such structure so automatically that we may not even be aware of it unless we hear a particularly complicated or ambiguous sentence.

In these examples, a listener has no confusion about whether "with" or "school" is doing the walking.

Nesting, stacks, recursion

  • The person { I went to school with } walks home
  • Maintain state of sentence before entering the "I went to school with" clause.
  • Tree walking: record state on call stack while processing children

We can think of this computationally as holding onto some state.

Where have we seen this kind of state saving in computer science?

Function calls, recursion, stacks.

Parsers infer the hierarchical syntax from a list of words

  • Nested structure is implicit in list of words
  • Parser infers structure by knowing syntax rules

The nested structure is implicit in each utterance of the language.

The parser can infer this structure even though the input does not explicitly express it, because the parser has the syntactic rules.

Technically a recognizer checks whether a string matches the syntax (like a finite state machine checks whether a string matches a regular expression), while a parser is a recognizer that also produces a syntax tree.

Grammars describe syntax

  • Grammars describe all possible sentences (strings) in a language
    • with a finite set of rules
  • Grammars make implicit structure explicit
    • Language constructs have their own symbols

Even if the language has an unbounded number of strings, the grammar can describe them with a finite number of rules.

Examples of grammar

  • sentence → subject verb object
  • subject → nounphrase
  • nounphrase → "the" noun
  • noun → "person"
  • noun → "store"
  • verb → "walks to"
  • object → nounphrase

"the", "person", "store", "walks to" are all that we see explicitly in the language

sentence, subject, nounphrase, noun, verb, object are the language constructs that are unspoken, but implied.
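
One possible way to write this grammar down as data, assuming a Python dictionary from each construct to its list of alternatives (the representation is a choice made for this sketch, not a standard format):

```python
# Each language construct maps to a list of alternative right-hand sides.
GRAMMAR = {
    "sentence": [["subject", "verb", "object"]],
    "subject": [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun": [["person"], ["store"]],  # noun has two rules
    "verb": [["walks to"]],
    "object": [["nounphrase"]],
}

# The unspoken constructs are the symbols with rules of their own.
CONSTRUCTS = set(GRAMMAR)

# The spoken words appear on a right-hand side but never have their own rule.
WORDS = {
    symbol
    for alternatives in GRAMMAR.values()
    for rhs in alternatives
    for symbol in rhs
    if symbol not in GRAMMAR
}
```

Computing WORDS this way recovers exactly the four pieces we see explicitly in the language.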

Special symbols represent language constructs

  • Unspoken, but implied by the structure of a language
  • Project 1 makes these symbols explicit in a tree representation
  • the person { some other clause } walks to the store



  • noun had multiple rules
  • Language constructs can have many variations
    • e.g., if statements with and without an else branch
  • Just like regular expressions
    • bison even uses | for syntax alternatives

Context-free grammars

  • Terminal symbols are the words, the spoken parts, e.g., "person", "the"
  • Nonterminal symbols are the unspoken representations of structures, e.g., sentence, nounphrase
  • Productions are rules
    • They map a nonterminal to a sequence of other symbols (terminal or nonterminal)
    • E.g., nounphrase → "the" noun
  • Starting symbol is the top of the hierarchy

Derivations: generating a string from the grammar

  1. Start from the starting nonterminal
  2. Pick a production for the nonterminal and substitute the symbol with the right-hand-side symbols
  3. Repeatedly replace any new nonterminals according to production rules until only terminals remain


While parsers infer structure from a string, a generator produces a string from the grammar.

Notice the recursive nature of this process?

This notion of a derivation is where the terms nonterminal and terminal come from, as far as I know. Nonterminals keep the derivation going; terminals end it.
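
The three derivation steps can be sketched as a loop over the example grammar. The dictionary representation and the choose parameter are assumptions of this sketch; by default it always picks the first production.

```python
# The example grammar: each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "sentence": [["subject", "verb", "object"]],
    "subject": [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun": [["person"], ["store"]],
    "verb": [["walks to"]],
    "object": [["nounphrase"]],
}


def derive(start="sentence", choose=lambda alternatives: alternatives[0]):
    """Return every intermediate form of a leftmost derivation from start."""
    form = [start]            # step 1: start from the starting nonterminal
    steps = [list(form)]
    while any(symbol in GRAMMAR for symbol in form):
        # Steps 2 and 3: replace the leftmost nonterminal with the
        # right-hand side of one of its productions, until none remain.
        i = next(k for k, symbol in enumerate(form) if symbol in GRAMMAR)
        form = form[:i] + choose(GRAMMAR[form[i]]) + form[i + 1:]
        steps.append(list(form))
    return steps
```

Calling derive() and joining the last form with spaces gives "the person walks to the person". Passing a different choose function, such as random.choice from the standard library, produces the other sentences of the language.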

Parsing: finding a derivation for the given string

  • Recall: the string has no explicit syntax information
  • Parser knows grammar rules
  • Parser discovers derivation that produces the given string
    • Proof that string is in language
    • Recovery of explicit syntax with nonterminal symbols

If derivation is generating a string from the grammar, parsing is finding a derivation for some string. If there is a derivation, the string is in the language; otherwise, it's not.
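
A minimal sketch of a parser for the example grammar, written as one recursive function. It assumes the dictionary representation below and simply takes the first production that works, which is enough for this small grammar but not a general parsing algorithm.

```python
GRAMMAR = {
    "sentence": [["subject", "verb", "object"]],
    "subject": [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun": [["person"], ["store"]],
    "verb": [["walks to"]],
    "object": [["nounphrase"]],
}


def parse(symbol, tokens, pos=0):
    """Try to derive a prefix of tokens[pos:] from symbol.

    Returns (tree, next_pos) on success and None on failure.
    """
    if symbol not in GRAMMAR:
        # Terminal: its words must appear next in the input.
        words = symbol.split()  # "walks to" spans two input words
        if tokens[pos:pos + len(words)] == words:
            return symbol, pos + len(words)
        return None
    for rhs in GRAMMAR[symbol]:
        children, cur = [], pos
        for part in rhs:
            result = parse(part, tokens, cur)
            if result is None:
                break  # this production fails; try the next one
            subtree, cur = result
            children.append(subtree)
        else:
            # Every symbol on the right-hand side matched.
            return (symbol, children), cur
    return None


def parse_sentence(text):
    """Return a syntax tree if text is in the language, otherwise None."""
    tokens = text.split()
    result = parse("sentence", tokens)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]
```

A tree coming back is the proof that the string is in the language, and the tree itself is the recovered structure: each tuple names the nonterminal that the input never states explicitly.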

Correspondence between language and computation


ANTLR setup

ANTLR is a parser generator

  • Takes formal grammar, produces parser
  • Parser takes any valid input and creates parse tree

Hand-writing parsers can be tricky

ANTLR grammar format

  • Mix of context-free grammar with regular expressions
    • Regex permitted in productions

ANTLR run-time

  • Support for inspecting and visualizing parses
  • Lots of infrastructure for language processing

We'll see more when we start type-checking and generating code from our ANTLR parse tree

Visitor pattern is common for abstracting away language tree traversals
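
To give a feel for the pattern, here is a generic sketch in plain Python. It does not use ANTLR's run-time or its generated visitor classes; the node classes (Num, Add, Mul) and the visit_ naming scheme are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Mul:
    left: object
    right: object


class Visitor:
    """Dispatch on the node's class name so traversal code lives in one place."""

    def visit(self, node):
        method = getattr(self, "visit_" + type(node).__name__)
        return method(node)


class Evaluator(Visitor):
    """Compute the value of an expression tree."""

    def visit_Num(self, node):
        return node.value

    def visit_Add(self, node):
        return self.visit(node.left) + self.visit(node.right)

    def visit_Mul(self, node):
        return self.visit(node.left) * self.visit(node.right)


class Printer(Visitor):
    """Render the same tree as a fully parenthesized string."""

    def visit_Num(self, node):
        return str(node.value)

    def visit_Add(self, node):
        return "(" + self.visit(node.left) + " + " + self.visit(node.right) + ")"

    def visit_Mul(self, node):
        return "(" + self.visit(node.left) + " * " + self.visit(node.right) + ")"
```

The same tree supports both evaluation and printing without the node classes changing, which is the point of separating the traversal from the tree.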

LabeledExpr example

Homework review if needed

Compiler project

Hand-write your ANTLR parser using the screenshot of the SimpleC grammar and push SimpleC.g4 to your compiler repo. Be sure to test it on example programs that you write.

Author: Paul Gazzillo

Created: 2022-02-14 Mon 11:49