$Id: LexerTools.lhs,v 1.10 1999/02/08 08:19:12 leavens Exp leavens $

A lexical analyzer based on Paulson's book "ML for the Working Programmer"
(Cambridge, 1991), Figure 9.2.  Translated into Haskell by Gary T. Leavens.

A few differences from Paulson's version:
 - Alphanumeric tokens must start with an alphabetic character.
 - _ is considered to be an alphanumeric character, not a special character.
 - I added numeric tokens and string literals.
 - This module isn't as reusable and adaptable as Paulson's,
   because Haskell doesn't have ML's functors.
 - Keywords that consist entirely of symbols are tokenized so that the
   longest one in the list of symbolic keywords is found
   (instead of the shortest one, which is found in Paulson's version).
 - Comments starting with the two characters -- and ending in a newline
   are recognized and discarded (see comment_start below).

> module LexerTools where
> import ParserFunctions

This lexical analyzer recognizes four kinds of tokens:
keywords, identifiers (alphanumerics), numbers (integer literals),
and string literals.
Keywords can be either alphanumeric, such as "if", or symbolic, such as ":=".

> data Token = Key String | Id String | Number Integer | StrTok String
>              deriving (Eq, Show)

The following auxiliary functions are useful for recognizing various
classes of tokens.

> isKey, isId, isNumber, isStrTok :: Token -> Bool
> isKey (Key _)       = True
> isKey _             = False
> isId (Id _)         = True
> isId _              = False
> isNumber (Number _) = True
> isNumber _          = False
> isStrTok (StrTok _) = True
> isStrTok _          = False
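
These predicates can be exercised directly.  The following is a minimal
standalone sketch: the Token type and predicates are restated from the
module above so the example can be loaded on its own (e.g., in GHCi).

```haskell
-- Standalone sketch: Token and the predicates are restated from the
-- module above so this example runs by itself.
data Token = Key String | Id String | Number Integer | StrTok String
             deriving (Eq, Show)

isKey, isId, isNumber, isStrTok :: Token -> Bool
isKey (Key _)       = True
isKey _             = False
isId (Id _)         = True
isId _              = False
isNumber (Number _) = True
isNumber _          = False
isStrTok (StrTok _) = True
isStrTok _          = False

-- For example:
--   isKey (Key "if")     is True
--   isId  (Key "if")     is False
--   isNumber (Number 3)  is True
```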

The following are some helpful parser combinators.
(See the ParserFunctions module for the general idea of parser combinators.)

By far the most useful is p_check, which checks for the occurrence
of a particular keyword.  For example, one would write p_check "if" to parse
the keyword `if'.

> p_check :: String -> Parser Token String
> p_check k ((Key s):ts)
>       | k == s    = [(s,ts)]
>       | otherwise = []
> p_check _ _ = []
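
A small worked example may help.  This standalone sketch restates the
Token type and assumes that the Parser type from ParserFunctions is the
usual list-of-successes type (an assumption, since that module is not
shown here):

```haskell
-- Standalone sketch; Parser is assumed to be the list-of-successes type.
data Token = Key String | Id String | Number Integer | StrTok String
             deriving (Eq, Show)

type Parser a b = [a] -> [(b, [a])]

p_check :: String -> Parser Token String
p_check k (Key s : ts)
      | k == s    = [(s, ts)]
      | otherwise = []
p_check _ _ = []

-- p_check "if" [Key "if", Id "x"]  yields [("if", [Id "x"])]
-- p_check "if" [Id "if"]           yields []   (an Id is not a Key)
```

Note that p_check fails (returns []) on an Id token even when the spelling
matches, since only Key tokens count as keywords.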

The following can be used to parse an identifier token (one that is not
a keyword).

> p_Id      :: Parser Token String
> p_Id ((Id s):ts) = [(s,ts)]
> p_Id _           = []

The following can be used to parse a number.  Note that the number has
already been converted into an Integer, and that Integer (not its string
representation) is returned.

> p_Number  :: Parser Token Integer
> p_Number ((Number i):ts) = [(i,ts)]
> p_Number _ = []
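
Both parsers succeed only when the expected kind of token is first in the
stream.  A standalone sketch (again restating Token, and assuming Parser
is the list-of-successes type from ParserFunctions):

```haskell
-- Standalone sketch; Parser is assumed to be the list-of-successes type.
data Token = Key String | Id String | Number Integer | StrTok String
             deriving (Eq, Show)

type Parser a b = [a] -> [(b, [a])]

p_Id :: Parser Token String
p_Id (Id s : ts) = [(s, ts)]
p_Id _           = []

p_Number :: Parser Token Integer
p_Number (Number i : ts) = [(i, ts)]
p_Number _               = []

-- p_Id [Id "x", Key ";"]  yields [("x", [Key ";"])]
-- p_Number [Number 42]    yields [(42, [])]
-- p_Number [Id "42"]      yields []
```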


The main function that performs lexical analysis is called `scanning' below.

It uses the following set of special characters to determine what
characters can potentially compose symbolic keywords.
Note that _ is not in this set.

> specials = "!@#$%^&*()+-=[]:\"|;`\\,./?'~<>{}"

The scanning function itself is a curried function that takes:
 a list of alphanumeric keywords (like "if"),
 a list of symbolic keywords (like ":=" or "+"),
 an accumulated list of tokens (usually [] to begin with,
                                accumulated in reverse), and
 a String representing the input to be tokenized.
It returns a list of tokens, in order.
The implementation is a simple finite state machine.

> scanning :: [String] -> [String] -> [Token] -> String -> [Token]
> scanning keys syms toks cs = scan_loop toks cs
>  where 
>    scan_loop :: [Token] -> String -> [Token]
>    scan_loop toks [] = reverse toks
>    scan_loop toks (c:cs) =
>     if comment_start (c:cs)
>     then scan_loop toks (eat_comment (c:cs))
>     else if letter c
>     then -- identifier or keyword
>          let (id, cs2) = (alphanumeric [c] cs)
>          in  scan_loop (tokenof id : toks) cs2
>     else if c == '"'
>     then -- string literal
>          let (str, cs2) = (strlit [] cs)
>          in scan_loop (StrTok str : toks) cs2
>     else if c `elem` specials
>     then -- symbolic keyword
>          let (sy, cs2) = specials_run [c] cs
>              (prefix_sy, cs3) = longest_prefix_symbol sy cs2
>          in if prefix_sy == ""
>             then -- not a keyword, so skip the first character and try again
>                  scan_loop toks cs
>             else scan_loop (Key prefix_sy : toks) cs3
>     else if digit c
>     then let (num, cs2) = numeric [c] cs
>          in scan_loop (Number (read num) : toks) cs2
>     else -- skip this character, for example a blank
>          scan_loop toks cs
>
>    comment_start :: String -> Bool
>    comment_start ('-':'-':_) = True
>    comment_start _ = False
>
>    eat_comment :: String -> String
>    eat_comment [] = []
>    eat_comment ('\n':cs) = cs
>    eat_comment (_:cs) = eat_comment cs
>  
>    ignore :: Char -> [Token] -> String -> [Token]
>    ignore c toks cs = scan_loop toks cs
> 
>    alphanumeric :: String -> String -> (String, String)
>    alphanumeric id [] = (id, [])
>    alphanumeric id (c:cs) = if letter c || digit c || c == '_'
>                             then alphanumeric (id++[c]) cs
>                             else (id, c:cs)
>
>    tokenof id = if id `elem` keys then Key id else Id id
>
>    specials_run :: String -> String -> (String, String)
>    specials_run sy [] = (sy, [])
>    specials_run sy (c:cs) = if not (c `elem` specials)
>                             then (sy, c:cs)
>                             else specials_run (sy++[c]) cs
>
>    longest_prefix_symbol :: String -> String -> (String, String)
>    longest_prefix_symbol [] unread = ([], unread)
>    longest_prefix_symbol sy unread =
>      if sy `elem` syms
>      then (sy, unread)
>      else longest_prefix_symbol (init sy) (last sy:unread)
>    
>    numeric :: String -> String -> (String, String)
>    numeric num [] = (num, [])
>    numeric num (c:cs) = if digit c
>                         then numeric (num++[c]) cs
>                         else (num, c:cs)
>
>    strlit :: String -> String -> (String, String)
>    strlit str [] = error "unterminated string literal"
>    strlit str ('"':cs) = (str, cs)
>    strlit str (c:cs) = strlit (str ++ [c]) cs
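
To illustrate a complete run, here is a standalone sketch of scanning on a
small input.  Token and scanning are restated (in condensed, guard style)
so the example runs on its own; letter and digit, which the module above
imports from ParserFunctions, are assumed to mean isAlpha and isDigit.

```haskell
import Data.Char (isAlpha, isDigit)

data Token = Key String | Id String | Number Integer | StrTok String
             deriving (Eq, Show)

-- Assumed meanings of ParserFunctions' character classifiers:
letter, digit :: Char -> Bool
letter = isAlpha
digit  = isDigit

specials :: String
specials = "!@#$%^&*()+-=[]:\"|;`\\,./?'~<>{}"

scanning :: [String] -> [String] -> [Token] -> String -> [Token]
scanning keys syms toks0 cs0 = scan_loop toks0 cs0
 where
   scan_loop :: [Token] -> String -> [Token]
   scan_loop toks [] = reverse toks
   scan_loop toks (c:cs)
     | comment_start (c:cs) = scan_loop toks (eat_comment (c:cs))
     | letter c = let (ident, cs2) = alphanumeric [c] cs
                  in  scan_loop (tokenof ident : toks) cs2
     | c == '"' = let (str, cs2) = strlit [] cs
                  in  scan_loop (StrTok str : toks) cs2
     | c `elem` specials =
         let (sy, cs2)        = specials_run [c] cs
             (prefix_sy, cs3) = longest_prefix_symbol sy cs2
         in  if prefix_sy == ""
             then scan_loop toks cs             -- not a keyword; skip c
             else scan_loop (Key prefix_sy : toks) cs3
     | digit c = let (num, cs2) = numeric [c] cs
                 in  scan_loop (Number (read num) : toks) cs2
     | otherwise = scan_loop toks cs            -- e.g., a blank

   comment_start ('-':'-':_) = True
   comment_start _           = False

   eat_comment []        = []
   eat_comment ('\n':cs) = cs
   eat_comment (_:cs)    = eat_comment cs

   alphanumeric ident [] = (ident, [])
   alphanumeric ident (c:cs)
     | letter c || digit c || c == '_' = alphanumeric (ident ++ [c]) cs
     | otherwise                       = (ident, c:cs)

   tokenof ident = if ident `elem` keys then Key ident else Id ident

   specials_run sy [] = (sy, [])
   specials_run sy (c:cs)
     | c `elem` specials = specials_run (sy ++ [c]) cs
     | otherwise         = (sy, c:cs)

   longest_prefix_symbol [] unread = ([], unread)
   longest_prefix_symbol sy unread
     | sy `elem` syms = (sy, unread)
     | otherwise      = longest_prefix_symbol (init sy) (last sy : unread)

   numeric num [] = (num, [])
   numeric num (c:cs)
     | digit c   = numeric (num ++ [c]) cs
     | otherwise = (num, c:cs)

   strlit _   []       = error "unterminated string literal"
   strlit str ('"':cs) = (str, cs)
   strlit str (c:cs)   = strlit (str ++ [c]) cs

sample :: [Token]
sample = scanning ["if", "then", "else"] [":=", "+", ";"] []
                  "if x1 then y := y + 10 -- reset\n"
-- sample is [Key "if", Id "x1", Key "then", Id "y",
--            Key ":=", Id "y", Key "+", Number 10]
```

Note that the comment "-- reset" is discarded, and "x1" is a single
identifier even though it mixes a letter and a digit.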
