The lr1 module

Design of the module

The parser is structured as a list of productions, each of which describes a consecutive series of symbols that can be “reduced” to a new symbol.

Traditionally an LR1 parser is defined as a collection of head symbols, each of which has a collection of alternatives, each a string of symbols, e.g.:

expr <- number | binary_expr | '(' expr ')'
binary_expr <- unary_expr | binary_expr binary_op unary_expr
unary_expr <- expr | unary_op unary_expr
binary_op <- '+' | '-'
unary_op <- '-'

Here we take a different approach by defining the LR1 parser as a collection of productions, each of which reduces to a unique head symbol. Then instead of referring to a head symbol which has alternatives, at each reference we list out the alternative head symbols explicitly. For example, based on the above, we would first have to split each head symbol into several indepedent ones:

expr0 <- num
expr1 <- bin_expr
expr2 <- '(' expr ')'
bin_expr0 <- un_expr
bin_expr1 <- bin_expr bin_op un_expr
un_expr0 <- expr
un_expr1 <- un_op un_expr
bin_op0 <- '+'
bin_op1 <- '-'
un_op0 <- '-'

Then, each symbol would have to be changed into a set of acceptable symbols:

expr0 <- {num}
expr1 <- {bin_expr0, bin_expr1}
expr2 <- {'('} {expr0, expr1, expr2} {')'}
bin_expr0 <- {un_expr0, un_expr1}
bin_expr1 <- {bin_expr0, bin_expr1} {bin_op0, bin_op1} {un_expr0, un_expr1}
un_expr0 <- {expr0, expr1, expr2}
un_expr1 <- {un_op0} {un_expr0, un_expr1}
bin_op0 <- {'+'}
bin_op1 <- {'-'}
un_op0 <- {'-'}

Whilst our representation is equivalent, it is more efficient in several ways:

  • It can handle simple character input streams itself without needing a lexical analysis front end. With the lexical analysis front-end, the parser receives a stream of lexical tokens. A typical lex definition for how to recognize a token NUM and return it to the parser might look like:

    [0-9]+ return NUM;

    Whereas, without the front-end, the parser receives a stream of characters, these might be ASCII characters in the range 0..0xff. An equivalent rule to the above could be written in the traditional parser language as follows:

    digit <- '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    num <- digit | num digit

    Or in our parser language (using symbol sets rather than alternatives) by:

    num0 <- {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
    num1 <- {num0, num1} {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}

    We find the second way to be more efficient, since the parser now naturally supports character classes, and repurposes the same mechanism to implement alternatives. This principle is carried through to the generated automaton, so that the LR1DFA recognizes character classes the same way as a regular DFA automaton (such as would be generated for the lex-based example).

  • It is common to see (non-recursive) head symbols where all alternatives have length one, and these can be expanded in-place and eliminated, since the framework has natural support for sets. Furthermore, after splitting each head symbol into separate productions, some productions might be eliminated:

    expr1 <- {bin_expr0, bin_expr1}
    expr2 <- {'('} {num, expr1, expr2} {')'}
    bin_expr0 <- {un_expr0, un_expr1}
    bin_expr1 <- {bin_expr0, bin_expr1} {'+', '-'} {un_expr0, un_expr1}
    un_expr0 <- {num, expr1, expr2}
    un_expr1 <- {'-'} {un_expr0, un_expr1}

Symbols are mapped to integers. The usual mapping has the following ranges:

  • 0..N_CHARACTERS - 1 represent single characters, or possibly if a lexical analyzer front-end is being used, shorthand for particular lexical symbols;

  • N_CHARACTERS..n_terminals - 1 represent terminal symbols which are not single characters, usually meaning a lexical analyzer front-end is used;

  • n_terminals..n_terminals + len(productions) - 1 represent nonterminal symbols, i.e. the head symbols to which the productions can be reduced.

A special symbol called “eof_terminal” is defined, which must be in the range N_CHARACTERS <= eof_terminal < n_terminals.

If a lexical analyzer front-end is not used, eof_terminal will be the only terminal other than the characters, so assuming 8-bit ASCII, the characters will be 0..0xff, EOF will be 0x100, and nonterminals will be 0x101 and above.

In our scheme, nonterminal symbols do not have alternatives, meaning that each production reduces to a unique nonterminal. Hence there is a 1-1 mapping between productions and nonterminal symbols, and the symbol number is obtained by taking the position of the production in the productions list and adding the symbol number of the first nonterminal, which we call n_terminals above. Thus it is not necessary to store what symbol a production reduces to.

Since symbols are integers, symbol sets are sets of integers. We represent a set of integers using a list of breaks – see the bisect_set module.

Then the internal representation of a grammar is quite simple, as it is just a list of productions, each of which is a list of symbol sets, each of which is a list of breaks (i.e. starting symbol numbers of included/excluded symbols). Production number 0 is always the start production, and it must end with EOF.

We keep symbol sets separated into terminals and non-terminals, so each symbol set is actually two lists, not one. This is convenient for computing lookahead sets and generating the automaton, since terminals must be treated differently, so it would be annoying to have to split the list into two at each step. Note that in this format the non-terminals are not offset by n_terminals, so the numbers in the non-terminals set are simply the indices into productions.

We allow the user to store a reference data item with each production, and we also keep a workspace for computing the production’s lookahead sets and whether the production can be empty. This is calculated by scanning the production in a forward direction, checking for each element of each symbol set the element’s lookahead set and can-be-empty status. We save the intermediate results of the scan, i.e. the lookahead set and nullability of each prefix of each production.

Additionally, in a separate data structure we keep a map from nonterminals to associativity and precedence, for resolving shift/reduce conflicts. For compatibility, the algorithm is deliberately identical to YACC. The map is kept in a similar way to the sets discussed earlier – see the bisect_set module for how we keep sets of integers. For maps from integers to arbitary objects we keep two lists: a list of breaks, and a list of objects. Then symbol symbol maps to objects[i] for the smallest i such that symbol < breaks[i], or if none of these holds, it maps to objects[len(breaks)].