In the previous installment I gave some context about the existence of Mercurium as a tool. In this chapter we will start digging into the parsing technology used.
The solved problem that isn't
Some people claim that parsing is a solved problem that actually is not. The reason is that we have good formalisms for parsing and theoretically they are good but in practice they are not. This has led the compiler industry to ignore most of the parsing developments and handmade written parsers are used instead. In fact, those handmade parsers are the fastest in performance and can give great diagnostics but often they are a pain to maintain. Eventually, given that most languages evolve relatively at a slow pace, the maintenance effort gets amortized.
Given a context-free grammar (which is a generative tool, it encodes how to generate the language) there are two big families of algorithms for parsing: LL and LR. LL, standing for Left to right Leftmost derivation, is very similar to a handmade parser and for many people are the most intuitive ones. Many variations of LL exist, LL(k), predicated LL(k) and Generalized LL. The latter exist to solve a problem of LL parsers: they do not allow rules of the form
X → X a. GLL is a relatively recent development and there is not much experience in the field implementing them. A representative tool of LL parsing is ANTLR.
The other family, LR, is probably most known due to the UNIX tool
yacc. LR, standing for Left to right Rightmost derivation, is a different way of parsing. The idea is that the input is shifted into a stack and when one or more symbols from the top of the stack can be generated by the
Y of a rule like
X → Y, then that top of the stack is replaced by
X. There are many algorithms as well of this family: LR(k), LALR(1) implemented by
yacc (and its clone GNU Bison), SLR and GLR. GLR is relatively newer and attempts to solve a problem that the other algorithms cannot cope: conflicts caused by ambiguous grammars.
Why is it context-free parsing more interesting than context-sensitive parsing? There are two reasons. First, LL and LR algorithms work on context-free grammar thus the context-freedom is more or less mandatory. Second, context-sensitive parsing more or less involves a relatively complicated step where the input has a different meaning. It is much easier describing a language using a context-free grammar even if this actually generates a larger language than the real programming language. Context-free grammars are very useful to encode structure, less so for semantics.
No matter how simple a programming language is, a context-free grammar will unlikely retain interesting properties like «the variable is not declared». This means that some amount of semantics is lost in the description of the grammar. Sometimes this is not a problem, because given a sequence of symbol of the programming language, there is only one syntactic way to build it. But it may well be that this is not the case. When this happens the language is ambiguous and no LR parser can cope with it. We mentioned above that GLR can, but at the cost of forcing you to disambiguate eventually at some point. The upside is that at least you recognize some structure. The downside is that sometimes the amount of possible interpretations can be overwhelming.
Consider the following C++ snippet.
It looks like a variable declaration of
c with type
A<B>. But what if this snippet is found inside a function and the lines above it are
now the meaning is completely different! It is an expression statement that (uselessly) computes
(A<B)>c. If we ignore all the semantic knowledge (i.e. that
c are variables of type
int), which we do when using context-free parsing, the sequence of symbols
A < B > c; is ambiguous and, if turns out to be valid, may have at least two meanings! Conversely,
has only a single meaning the declaration of a variable
x with type
In Mercurium a GLR parser is used to tackle this problem but it comes with its own warts.
Ambiguous grammars, like the one of C++, do not have a LR parser because the algorithm will run into a conflict. Fortunately an extension of LR, called GLR (standing for Generalized LR) can be used instead. This algorithm, instead of choking when facing a conflict, considers both. There are three possible outcomes: no syntactic interpretation leads to a valid parsing, only one option leads to a valid parsing, more than one leads to a valid parsing. The first case happens when, the parser encounters a conflict but eventually no interpretation is possible. Retaking the example above.
In this case, two options will considered after the first
<: declaration or expression, so the parser will try to parse both options. But the last
< (the one right before the
;) will make both interpretations fail. This is a legitimate syntax error at this point.
The second case happens when several interpretations fail, leading to a single valid one. This case is the ideal, because it means there was some dubious syntax but the doubt has eventually gone. Consider this case.
Again, two options will be considered after the first
< but this time when we encounter
+ the declaration interpretation will fail and only the expression interpretation will succeed.
The third case is the complex one: both interpretations are feasible. The example above
A<B> c; showcases this. In fact, due to the C++ nature, there are lots of cases that are surprisingly feasible. For instance:
Looks like this can only be declaring
c to be a variable with pointer type to
A<B>. Right? What if before that piece of C++ whe had this?
Now this can only be an expression! If you do not understand what is going on, read the diagnostic of
This is calling a function due to operator overloading!
When multiple interpretations are syntactically valid, the parser just annotates all of them. A later phase, which will do the semantic checks, will verify both cases and disambiguate as needed. This approach seems sensible except that it has an unpleasant consequence: these multiple interpretations may happen inside other multiple interpretations, which may mean that sometimes we may end with lots of valid interpretations. Some other cases it means that those interpretations are only potential and eventually just a few are feasible. Consider the following case, which is not as contrived as it may seem. It actually is based on a real example in a Boost header.
The first < (in
L1<E2) is relatively simple: declaration or expression. But the second one (in
L3<E4) causes havoc for the cause of the declaration because this declaration now might be intepreted in two different ways: it might be a nontype template argument (similar to an expression) or it might be a type template argument (in case L3 were a template name) thus we have to take into account this. This problem repeats for
L7<E8. Due to the number of
M, the expression case is not valid anymore (but it would be if it were only two
> as in
>>). All the potential different interpretations for declarations will eventually fail as we encounter more
>, so eventually only one interpretation is valid. But when we encountered
L7<E8 we already have 8 potential declaration interpretations. The parser must remember all of them, even if eventually all of them will fail. Consider a similar case:
This is potentially also valid if
L1 is a template with 5 template parameters of which the 3 in the middle must be nontype template arguments. And again all the possible interpretations will fail when we encounter
; but until then the parser will have to consider them.
This means two things. First the parser overhead increases in a proportional way to the number of feasible interpretations. This is overhead in terms of time, more interpretations more time devoted to all of them. And second, and less obvious but necessary for parsing, we cannot assume anything has been fully recognized until the parser reaches back to a state where only one interpretation is feasible (which for highly ambiguous programs would be at the end of the program, not the case of C++ though). This means that the parser needs a notion of tentative/non-deterministic parsing mode where it just advances through the input remembering what it saw but not committing anything to a real parse. This is overhead in terms of space (i.e. memory).
If you are thinking, well but this does not happen often, except in those pesky Boost headers. Well, the following expression also exposes this problem, due to the function call/cast ambiguity in C.
So this is a tough problem if we want to parse in a context-free fashion. Note that most of these problems go away when writing a handmade parser which can use arbitrary context, though C++ still has some legitimate ambiguities and those parsers will have to account for them. For instance
In both cases a C++ parser must account for an unclear possibility of a declaration/expression ambiguity here. For the first case, the standard says that if both are valid, then it resorts to a declaration. The second case can only be an expression once
++ is encountered. (As a curiosity
g++ handles this case incorrectly).
Using context-free parsing, the above case does not bear special treatment. It is just an ambiguous interpretation that will have to be disambiguated later.
GNU Bison GLR
Bison has a rather good GLR implementation that I'd like to call stack-based GLR. This technique, minimally departs from a traditional LALR and in general the algorithm is comparatively fast as the existing LALR implementation in GNU Bison under regular operation. When faced with a conflict, Bison GLR algorithm duplicates the stack of symbols it has seen and then it advances them. Each stack may have to be duplicated if further conflicts are encountered. For the example above, this can explode exponentially, leading to an exponential number of stacks and causing the parser to advance very, very slowly through the input.
Mercurium currently uses the GLR implementation of Bison so it suffers of this problem. Some Boost headers expose this problem: while the number of valid parsings are not exponential, the intermrediate parsing process explodes exponentially. In some causes several thousands of parsing stacks were created for the Boost headers, leading to a 1 second to parse a single declaration. Unfortunately that form was repeated in the Boost headers 5 times. This meant that parsing alone takes 5 seconds for that file. This is clearly suboptimal.
In contrast to the GNU Bison GLR algorithm, the Elkhound algorithm by Scott McPeak uses a graph (sometimes called a Graph Structured Stack, GSS) so it can share as much items from the stack as possible. This has the advantage to avoid the exponential behaviour of the GNU Bison GLR algorithm but at expense of a higher overhead of manipulating a graph.
As an experiment, I took the Bison GLR algorithm skeleton and modified so it implements the Elkhound algorithm. I did not implement a few of the suggested improvements by McPeak. I verified that, indeed, this algorithm does not incur on exponential behaviour. The problem is that for non problematic inputs, this algorithm has to pay the penalty of handling a graph. After some effort minimizing memory allocations as much as possible, I ended with a parser that was 1.5X slower in average that GNU Bison GLR (except for the inputs that show exponential behaviour which was orders of magnitud faster). This is a bit disappointing, considering how large some C++ headers are. Part of the overhead might be caused by the Bison expectations in the parser behaviour (we're mimicking a bit what the original Bison GLR does which may also contribute to the overhead).
I don't think GNU Bison upstream would accept an alternate GLR implementation like that, given that in general their GLR implementation is very competitive. If you are curious of what I did you can check the code here. Be warned this is very alpha and obviously does not implement all what a Bison GLR skeleton does. It should give you a good idea, though.
Parsing C++ is hard and no matter which approach we take is going to be painful. Applying a context-free approach using the LR algorithm turns out to be very complicated. We can use GLR but the GNU Bison implementation may sometimes expose exponential behaviour. An algorithm that avoids this happens to add some unpleasant overhead.
Production-grade C++ parsers nowadays use hand made, descent recursive parsers, which are a pain to maintain but can cope with parsing C++. This is the case of clang and GCC today. GCC used to have a Bison based C++ parser but it had lots of limitations. They switched to a handmade parser in GCC 3.4.