How (not) to write a C++ front end – Part 3
In the previous installment we talked about the parsing technology we used, which is close to the canonical academic approach to parsing. In this chapter we will see some difficulties we encountered over the years.
A tiny preprocessor
Mercurium supports C, C++ and Fortran. C and C++ share many similarities, so it is handy to share as much as possible between the two front ends. Many of the tools we use work on text-file inputs; unfortunately, they do not provide any sane parameterization or modularization mechanism. So we wrote a very small tool that we call the tiny preprocessor, tpp. It acts like a C++ preprocessor but does not handle C macros or anything of that sort: its goal is basically to merge an input split across several files into a single file. It also allows conditional expansion.
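For illustration, suppose test.in contains something like the following. The directive syntax shown here is C-preprocessor-style and purely hypothetical (tpp's actual syntax may differ), and we assume a.file and another.file contain the single lines "contents of a.file" and "contents of another.file" respectively.

#ifdef SOMETHING
a b c
#include "a.file"
#endif
k l m
#include "another.file"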
Example outputs
$ tpp test.in -o out
$ cat out
k l m
contents of another.file
$ tpp test.in -DSOMETHING -o out
$ cat out
a b c
contents of a.file
k l m
contents of another.file
This tool is used to generate the lexers of C and C++ from a single unified lexer file, and to generate the grammars of C, C++ and Fortran. The core of each language is kept in the main grammar file, while subparts of it are stored in separate files, so the grammar is easy to modularize. More on grammars later.
Lexing should be easy, right?
The right angle
As a first step, the input is tokenized. Fortunately, C++ is relatively easy to tokenize. That said, there are a few special things we want to support correctly. One of them is the > token. Historically, in C++03, the following did not compile:
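std::vector<std::vector<int>> v;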
The reason is that the >> was interpreted as the shift-right operator (as in a >> 1), so the programmer was forced to add a blank, as in std::vector<std::vector<int> >. C++11 lifted this restriction, so it had to be possible to allow >> in both usages. This is very easy to support in a handwritten parser, but it is not so easy in our case.
The solution to this problem is (ab)using a feature of the Flex lexer: by doing some lookahead we can distinguish the different kinds of token. A single > is just >, but if a > comes right before another >, we consume only the first one and call it AB1 (angular bracket 1). There is one case, though, where we do not want to do this: >>=. There we again consume only the first >, but we leave it as a plain > token. All other possible cases are handled as usual.
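A minimal sketch of how such Flex rules could look, following the description above (rules section only; the token names are illustrative, not necessarily Mercurium's):

">"/">="   { return '>'; }             /* '>>=': consume only the first '>' */
">"/">"    { return AB1; }             /* '>>': the first '>' becomes AB1 */
">="       { return GREATER_EQUAL; }
">"        { return '>'; }             /* a lone '>' is just '>' */

The / operator is Flex's trailing context: the lookahead part is used to select the rule but is left in the input for the next match.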
Now, when we define the C++ syntax, we make template-argument-lists start with < and end with either > or AB1. As a side effect, an expression of the form a >> b will be formed by the token AB1 followed by >, so we have to allow this case as well.
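In grammar terms this looks roughly as follows (rule and token names here are illustrative, not Mercurium's actual grammar):

template_id
    : TEMPLATE_NAME '<' template_argument_list '>'
    | TEMPLATE_NAME '<' template_argument_list AB1
    ;
/* Since the lexer always splits '>>', a shift is AB1 followed by '>'. */
shift_expression
    : additive_expression
    | shift_expression AB1 '>' additive_expression
    ;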
Raw strings
Other challenges introduced by C++11 are raw strings and standard attributes. Raw strings are a bit hard to tokenize because they have arbitrary delimiters, so the tokenizer must remember the delimiter it has seen and, when it sees that delimiter again, finish the special tokenization of the raw string.
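For instance, consider a literal like this one (the contents are just for illustration):

const char *s = R"FOO(a "raw" string\n)FOO";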
Here the tokenizer must identify FOO as the delimiter. The string above will be printed as
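a "raw" string\n

Note that the \n is printed as two literal characters, since escape sequences are not processed inside raw strings.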
To do this, when the tokenizer finds the start of a raw string literal, it determines the delimiter. Then it moves to a raw-string-body mode where all characters but ) are consumed. When a ) is found, it goes to a potential-suffix mode where we may find the delimiter. It may happen that the delimiter is not found there; in that case the characters are appended to the raw literal as if nothing had happened, and we go back to raw-string-body mode. If the delimiter is found, the lexer returns to regular operation and the raw literal ends there.
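In case it helps, here is the same scanning logic as a small standalone function rather than lexer states (an illustrative sketch, not Mercurium's actual code):

#include <cassert>
#include <cstring>
#include <string>

// 'p' points just past the R" of a raw string literal. On success,
// 'body' receives everything between ( and )delimiter".
static bool scan_raw_string(const char *p, std::string &body)
{
    const char *open = std::strchr(p, '(');
    if (open == nullptr)
        return false;                        // malformed: no '(' after R"
    std::string delim(p, open);              // the delimiter (may be empty)
    std::string close = ")" + delim + "\"";  // what ends the literal: )delim"
    for (const char *q = open + 1; *q != '\0'; ++q) {
        // Only a ')' can start the end of the literal; check whether the
        // delimiter and the closing quote follow it.
        if (*q == ')' && std::strncmp(q, close.c_str(), close.size()) == 0) {
            body.assign(open + 1, q);
            return true;
        }
        // Otherwise this character is part of the body and we keep
        // scanning, as if nothing had happened.
    }
    return false;                            // unterminated raw string
}

int main()
{
    std::string body;
    assert(scan_raw_string("FOO(a \"raw\" string\\n)FOO\"", body));
    assert(body == "a \"raw\" string\\n");
}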
Standard attributes
Standard attributes also require special handling in the lexing phase. The reason is that a standard attribute has the form [[ identifier anything-but-two-right-square-brackets-that-is-correctly-parenthesized_opt ]]. This means that anything that is not (, ), [, ], { or } has to be handled as opaque text, while the brackets themselves have to be handled as syntax (so the parser will be able to enforce the correctly-parenthesized part). This again requires special states in the lexer itself that keep track of the brackets, to tell whether a ]] marks the end of the attribute.
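Two hypothetical declarations illustrate this. In both, the attribute's argument clause is opaque text to the grammar, but the lexer still has to balance brackets so that the ]] closing each attribute is identified correctly:

[[deprecated("use new_api instead")]] void old_api();
[[gnu::aligned(sizeof(int[2]))]] char buffer[16];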
Except for these annoying issues, lexing C/C++ is not particularly complicated. Recall that we do context-free parsing, so the most complicated thing the lexer has to do is preserve a minimal amount of syntax (as happens with raw string literals and standard attributes).
Parsing simple declarations is not simple
We already saw some of the problems that arise when parsing C++ using a context-free approach, and what happens when using Bison's GLR algorithm. In this post I want to show some other infelicities that arise due to the lax definition of the C++ syntax.
One of the most complicated parts here is the syntax of simple-declaration. Do not let its name confuse you: a simple-declaration can be incredibly complex.
simple-declaration: decl-specifier-seq_opt init-declarator-list_opt ;
A first interesting thing is that a lone ; is a valid simple-declaration (though most compilers will warn you that it does not declare anything). Some examples of simple-declarations:
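int x;
const unsigned long *p = 0;
struct A { int y; } a, *pa;
typedef int (*fptr)(float);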
A first problem appears in a trivial declaration like this one.
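A b;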
Given that one of the possible decl-specifiers in a decl-specifier-seq is an identifier that describes a type-name (e.g. a class-name or a typedef-name), a parser that follows the grammar as described in the C++ standard will give two possible interpretations: either A and b are both type-specifiers and the init-declarator-list is empty, or A is a type-specifier and b is the init-declarator-list.
. The parser becomes much more creative if we dare to write
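A ::B c;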
Now the interpretations are: three type-specifiers (A, ::B and c); two type-specifiers (A and ::B) plus an init-declarator-list (c); or one type-specifier (A::B) and an init-declarator-list (c). As you can see, again, the number of interpretations starts to grow unnecessarily.
At this point we can only enforce some specific syntax. Given that a type-specifier must appear only once, we use it as an anchor, so our simple-declaration looks roughly like this (a sketch; the actual rule is more detailed):
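simple_declaration
    : nontype_decl_specifier_seq_opt type_specifier nontype_decl_specifier_seq_opt init_declarator_list_opt ';'
    ;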
That is, we force the presence of a type-specifier and allow other nontype-specifiers around it. This solves the problem, except for two issues. First, signed, unsigned, short and long can appear more than once, and they can also act as type-specifiers when they appear alone. This can be fixed with an extra rule for type-specifier that allows only signed, unsigned, short and long, together with not letting them appear in nontype-decl-specifier-seq. Later on, internally, we will add an int type-specifier.
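A sketch of what that extra rule could look like (the nonterminal names are illustrative, not Mercurium's):

type_specifier
    : named_type_specifier
    | sign_size_specifier_seq
    ;
sign_size_specifier_seq
    : sign_size_specifier
    | sign_size_specifier_seq sign_size_specifier
    ;
sign_size_specifier
    : SIGNED | UNSIGNED | SHORT | LONG
    ;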
The second problem is caused by the associativity of ::. C++ says that :: has to be parsed so as to build the longest possible qualified name; this means that A ::b; will always be parsed as A::b. But our parser is not doing this, because Bison's GLR mode does not allow specifying the associativity of operators (annoying!). So we need to make sure that, if our decl-specifier-seq part ends with something that may be part of a qualified-name, the init-declarator-list does not start with ::. For instance,
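int ::A;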
has to be parsed as int in the decl-specifier-seq and then ::A in the init-declarator-list, while
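A<int>::D;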
has to be parsed as A<int>::D in the decl-specifier-seq and an empty init-declarator-list. In order to consider ::D a declarator, it would have to be written as A<int> (::D);.
This again means more cases, so what started as a simple-declaration ends up in the Bison grammar as something ghastly like this (a simplified sketch; the real rule has more alternatives):
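simple_declaration
    : decl_specifier_seq_0 init_declarator_list ';'
    | decl_specifier_seq_ended_with_named_type_spec nonglobal_init_declarator_list ';'
    | decl_specifier_alone_seq ';'
    | init_declarator_list ';'
    | ';'
    ;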
Here decl_specifier_seq_0 means that the sequence does not end with something that could be pasted to a following :: to form a qualified name. Conversely, decl_specifier_seq_ended_with_named_type_spec means exactly that case, and then we force it to be followed by a nonglobal_init_declarator_list, that is, an init-declarator-list that cannot start with ::. Also, decl_specifier_alone_seq makes sure that there is a type-specifier: we must accept struct A { int x; }; and struct A; here, but not const;. Note that int; would be accepted, but a warning would be emitted, as nothing is declared there either.
At this point one may start to wonder whether using a grammar was a good idea in the first place, and whether a handwritten recursive-descent parser would have been better.