Notes on Yapps 2 Python Parser Generator

2007.08
Michael Breen

Annotated Example
Further Tips
Bugs

Yapps is a lightweight LL(1) parser generator that produces human-readable parsers written in Python. It's pretty neat and it generally does what you would expect (and want). Amit Patel has made it available under a free licence but seems to have stopped maintaining it. This is a quick reference; it also includes some details not in the manual.

Version info and URL from yapps2.py in yapps2.zip:

  # Yapps 2.0 – yet another python parser system
  # Amit J Patel, January 1999
  # See http://theory.stanford.edu/~amitp/Yapps/ for documentation and updates
   ...
  # v.2.0.4 changes (July 2003)
   ...

Use yapps2.py to generate a parser from a grammar file:

  [michael yapps2 9]$ python yapps2.py examples/expr.g
  Input Grammar: examples/expr.g
  Output File: examples/expr.py
  [michael yapps2 10]$

The generated file includes scanner and parser classes derived from the base classes Scanner and Parser in yappsrt.py – the “Yapps 2.0 Runtime” which must be available in the same directory; it also defines SyntaxError and NoMoreTokens exceptions and functions for printing error messages.

Annotated Example

An example based on calc.g included in the distribution is a calculator that supports interaction like:

  >>> set x 2
  x = 2
  >>> x * 4
  = 8
  >>> let x = 1 in x * 4
  = 4
  >>> x * 4
  = 8
  >>> 3 * (6 + 4)
  = 30
  >>>

The grammar file for this is:

  #!/usr/bin/env python
  
  # ... any other code to be copied straight over –
  # typically variables and functions invoked by code
  # attached to the rules of the parser:
  
  globalvars = {}       # We will store the calculator's variables here
  
  def lookup(map, name):
      "get variable value. map:local variables; name: variable id"
      for x,v in map:
          if x==name: return v
      if name not in globalvars.keys():
          print 'Undefined:', name
      return globalvars.get(name, 0)
  
  %%
  # Parser section after the '%%' separator
  # (comments in this section are not copied to the .py file)
  
  parser Calculator:
      # Without this option, Yapps produces a context-sensitive
      # scanner: the parser tells the scanner what tokens it
      # expects – so, e.g., a keyword could be read in as an
      # identifier where the keyword token wasn't expected.
      # However, if a context-sensitive scanner is not needed
      # then it's probably better for debugging to have the
      # simpler context-insensitive scanner.
      option:  "context-insensitive-scanner"
      # 'ignore' really means 'treat as token separators'
      # Note all these strings are regular expressions.
      ignore:    '[ \r\t\n]+'
      ignore:    '#.*?\r?\n'    # line comment
      token NUM: '[0-9]+'
      token VAR: '[a-zA-Z_]+'
      # Even if it doesn't appear in the rules,
      # an END token is usually needed: otherwise, with most
      # grammars, the scanner will keep trying to read beyond
      # the end of the string.
      token END: '$'
  
      # The goal production is specified when the parser is
      # invoked (i.e., it doesn't have to be named 'goal'
      # or be the first one listed).
      # The END token usually needs to be specified in the
      # goal rule. (In fact, for reasons to do with the
      # recursive nature of this grammar, it's sufficient
      # for END to be defined as a token – but it does no
      # harm to include it in the goal rule too.)
      rule goal: goal2 END
  
      # Rules of the form  NonTerminal<<Parameters>>: ...
      # allow one or more attributes to be passed in.
      # In this case, the attribute is the list of calculator's
      # local variables defined using the 'let' alternative of
      # the 'term' production below; there are no locals to
      # begin with so we pass an empty list to expr.
      rule goal2: expr<<[]>>
                    # Only a single statement can be included in each
                    # {{ code fragment }} attached to the grammar.
                    # The return value of rule 'expr' is in 'expr'.
                    {{ print '=', expr }}
                    # This could be omitted – 'goal' doesn't use
                    # the return value.
                    {{ return expr }}
                # 'set' becomes an anonymous token for the scanner;
                # it is added at the beginning of the list of tokens
                # and so takes precedence over VAR above
                | "set" VAR expr<<[]>>
                    # The text of the terminal symbol VAR is in VAR
                    {{ globalvars[VAR] = expr }}
                    {{ print VAR, '=', expr }}
                    {{ return expr }}
  
      # V holds the calculator's local variables (see comment above).
      rule expr<<V>>:   factor<<V>>         {{ n = factor }}
                       ( "[+]" factor<<V>>  {{ n = n+factor }}
                       |  "-"  factor<<V>>  {{ n = n-factor }}
                       )*                   {{ return n }}
  
      rule factor<<V>>: term<<V>>           {{ v = term }}
                       ( "[*]" term<<V>>    {{ v = v*term }}
                       |  "/"  term<<V>>    {{ v = v/term }}
                       )*                   {{ return v }}
  
      rule term<<V>>:
                   NUM                      {{ return atoi(NUM) }}
                 | VAR                      {{ return lookup(V, VAR) }}
                 | r"\(" expr<<V>> r"\)"    {{ return expr }}
                 | "let" VAR "=" expr<<V>>  {{ V = [(VAR, expr)] + V }}
                   "in" expr<<V>>           {{ return expr }}
  %%
  # If the second '%%' separator is present then the first one
  # must be too, even if there's no code before the parser.
  # Anything here is copied straight to the .py file after
  # the generated code.
  # If this section (and the '%%') is omitted, Yapps inserts
  # test code.
  
  if __name__=='__main__':
      print 'Welcome to the calculator sample for Yapps 2.0.'
      print '  Enter either "<expression>" or "set <var> <expression>",'
      print '  or just press return to exit.  An expression can have'
      print '  local variables:  let x = expr in expr'
      # We could have put this loop into the parser, by making the
      # `goal' rule use (expr | set var expr)*, but by putting the
      # loop into Python code, we can make it interactive (i.e., enter
      # one expression, get the result, enter another expression, etc.)
      while 1:
          try: s = raw_input('>>> ')
          except EOFError: break
          if not strip(s): break
          parse('goal', s)
      print 'Bye.'
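
The way 'let' bindings shadow the calculator's global variables can be tried outside the parser. The following is a minimal sketch, not part of the distribution: a Python 3 port of the lookup function from the grammar file above (the original is Python 2), with the parameter names changed for readability. The starting contents of globalvars assume that 'set x 2' has already been entered.

```python
# Assumption: the state of the globals after the user has typed 'set x 2'.
globalvars = {"x": 2}

def lookup(local_vars, name):
    """Search the local (name, value) pairs first, then the globals."""
    for var, value in local_vars:
        if var == name:
            return value
    if name not in globalvars:
        print("Undefined:", name)
    return globalvars.get(name, 0)

# The 'let' alternative of 'term' prepends a pair to the list it was given,
# so the most recent binding is the first one found.
V = []                      # no locals at the top level
inner = [("x", 1)] + V      # inside: let x = 1 in ...

print(lookup(inner, "x") * 4)   # the local binding is used
print(lookup(V, "x") * 4)       # outside the 'let', the global is used
```

Because each 'let' builds a new list rather than modifying the old one, the binding disappears as soon as the enclosing expression has been parsed, which is why the later 'x * 4' in the sample session uses the global value again.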

Further Tips

The usual EBNF constructions are allowed in the grammar (not all are shown in the example above):
```
     [ optional ]
     ( oneOrMore )+
     ( zeroOrMore )*
```
A null production is just a blank (normally after the last | in a set of alternatives).
The scanner tries to return the longest token it can. There is no problem with a list of regular expressions like
```
      token LT: '<'
      token EQ: '='
      token LTEQ: '<='
```
For tokens of equal length, the first one matched takes precedence.
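The longest-match rule with a first-listed tie-break can be sketched in a few lines of Python 3. This is only an illustration of the rule as described here, not the code in yappsrt.py; the function names and the CALC_TOKENS list (modelled on the anonymous "set" token and VAR from the calculator example) are assumptions made for the example.

```python
import re

# Token definitions in declaration order (taken from the list above).
COMPARE_TOKENS = [
    ("LT", re.compile(r"<")),
    ("EQ", re.compile(r"=")),
    ("LTEQ", re.compile(r"<=")),
]

# Hypothetical list modelled on the calculator: the keyword comes before VAR.
CALC_TOKENS = [
    ("set", re.compile(r"set")),
    ("VAR", re.compile(r"[a-zA-Z_]+")),
]

def next_token(tokens, text, pos=0):
    """Return (name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, regex in tokens:
        m = regex.match(text, pos)
        # Strict '>' keeps the earlier definition when the lengths tie.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

def tokenize(tokens, text):
    """Split text into tokens, raising ValueError on unmatched input."""
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(tokens, text, pos)
        if tok is None:
            raise ValueError("no token matches at position %d" % pos)
        out.append(tok)
        pos += len(tok[1])
    return out

print(tokenize(COMPARE_TOKENS, "<=="))
print(next_token(CALC_TOKENS, "set"), next_token(CALC_TOKENS, "settle"))
```

In this sketch "<=" is read as LTEQ even though LT is listed first, while "set" is read as the keyword only because it ties with VAR on length and is listed earlier; a longer word such as "settle" is read as VAR.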
The usual LL(1) restrictions apply: the grammar must be constructed so that one token of lookahead is enough for the parser to choose the correct production without backtracking.
All the alternatives for the same non-terminal must be given in one rule, e.g.,
```
      rule stmt: expr | "if" stmt
```
is fine but the following won't work (and Yapps issues no warning):
```
      rule stmt: expr
      rule stmt: "if" stmt
```
With ambiguous productions, the first alternative is preferred; also the generated parser tries to match as many tokens as it can for the current rule. This has two consequences: a null production, if present, should normally be the last alternative in a rule; and a convenient but ambiguous grammar (e.g., one with a “dangling else” problem) will still be parsed deterministically. Note that Yapps issues no warnings for ambiguous grammars.
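The effect on a “dangling else” can be illustrated with a small hand-written recursive-descent parser in Python 3. This is a sketch of the behaviour described above, not output from Yapps; the toy grammar (stmt: "if" stmt [ "else" stmt ] | "x") and the function name are assumptions made for the example.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement from tokens[pos:]; return (tree, next_pos)."""
    if pos < len(tokens) and tokens[pos] == "if":
        body, pos = parse_stmt(tokens, pos + 1)
        # Greedy choice: take the optional 'else' whenever it is the next
        # token, so it always attaches to the nearest unmatched 'if'.
        if pos < len(tokens) and tokens[pos] == "else":
            alt, pos = parse_stmt(tokens, pos + 1)
            return ("if", body, alt), pos
        return ("if", body, None), pos
    if pos < len(tokens) and tokens[pos] == "x":
        return "x", pos + 1
    raise SyntaxError("unexpected input at token %d" % pos)

tree, end = parse_stmt("if if x else x".split())
print(tree)
```

The input could be read with the 'else' belonging to either 'if'; by matching as much as possible for the current rule, the sketch always picks the inner one.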
The parser can be customized by subclassing; a custom scanner could also be used (though if you reach that point you might want to consider using something else).
The text scanned in for a rule (or part of a rule) can be retrieved conveniently from the scanner – see "5.3 Internal Variables" in the Yapps manual.

Bugs

A bug in yappsrt.py: function wrap_error_reporter tries to return return_value even if it hasn't been assigned (i.e., when an exception has been caught). Fix:
```
  163c163
  <         return_value = getattr(parser, rule)()
  ---
  >         return getattr(parser, rule)()
  173d172
  <     return return_value
```
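
The shape of this bug, and of the fix, can be reproduced in isolation. The following Python 3 sketch does not use the code from yappsrt.py: the function names are hypothetical, and ValueError stands in for the exceptions that the runtime catches.

```python
def report_errors_buggy(action):
    """Faulty shape: the result is bound only if action() succeeds."""
    try:
        return_value = action()
    except ValueError as exc:
        print("error:", exc)
    return return_value  # unbound if action() raised

def report_errors_fixed(action):
    """Return directly from the try block; fall through to None on error."""
    try:
        return action()
    except ValueError as exc:
        print("error:", exc)

def failing():
    """Stand-in for a parse that hits a syntax error."""
    raise ValueError("bad input")

print(report_errors_fixed(failing))
try:
    report_errors_buggy(failing)
except UnboundLocalError as exc:
    print("second failure:", type(exc).__name__)
```

In the faulty version the original error is reported and then a second, unrelated exception is raised from the return statement; in the fixed version the caller simply gets None when the parse fails.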

A bug in calc.g, one of the examples included in the package, causes an exception when parentheses are used. The fix is:

  <                | "\\(" expr "\\)"         {{ return expr }}
  ---
  >                | "\\(" expr<<V>> "\\)"    {{ return expr }}

A bug in yapps2.py means that a line comment immediately before the '%%' separator at the end of the parser section of a grammar file will not be recognized as a comment (because the newline is stripped from the end of s by the code that searches for the DIVIDER).
A minor annoyance is that the source files in the distribution contain a mix of tabs and spaces; running the Unix 'expand' utility on them converts the tabs to spaces.