How (not) to write a C++ front end – Part 3
In the previous installment we talked about the parsing technology we used, which is close to the canonical academic approach to parsing. In this chapter we will see some difficulties we encountered over the years.
A tiny preprocessor
Mercurium supports C, C++ and Fortran. C and C++ share many similarities, so it is handy to share as much as possible between the two front ends. Many of the tools we use work on text-file inputs; unfortunately, they do not provide any sane parameterization or modularization mechanism. So we wrote a very small tool that we call the tiny preprocessor, tpp. It acts like a C++ preprocessor but does not handle C macros or anything of that sort: its goal is basically to merge an input split across several files into a single file. It also allows conditional expansion.
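For illustration, suppose test.in contains something like the following. The directive syntax shown here is C-preprocessor-style and purely hypothetical (tpp's actual syntax may differ), and we assume a.file and another.file contain the single lines "contents of a.file" and "contents of another.file" respectively.

#ifdef SOMETHING
a b c
#include "a.file"
#endif
k l m
#include "another.file"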
Example outputs
$ tpp test.in -o out
$ cat out
k l m
contents of another.file
$ tpp test.in -DSOMETHING -o out
$ cat out
a b c
contents of a.file
k l m
contents of another.file
This tool is used to generate the lexers of C and C++ from a single unified lexer file, and to generate the grammars of C, C++ and Fortran. The core of each language is kept in the main grammar file, while subparts of it are stored in separate files, so the grammar is easy to modularize. More on grammars later.
Lexing should be easy, right?
The right angle
As a first step, the input is tokenized. Fortunately, C++ is relatively easy to tokenize. That said, there are a few special things we want to support correctly. One of them is the > token. Historically, in C++03, the following did not compile:
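std::vector<std::vector<int>> v;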
The reason is that the >> was interpreted as the shift-right operator (as in a >> 1), so the programmer was forced to add a blank, as in std::vector<std::vector<int> >. C++11 lifted this restriction, so it had to be possible to allow >> in both usages. This is very easy to support in a handwritten parser, but it is not so easy in our case.
The solution to this problem is (ab)using a feature of the Flex lexer: by doing some lookahead we can distinguish the different kinds of token. A single > is just >, but if a > comes right before another >, we consume only the first one and call it AB1 (angular bracket 1). There is one case, though, where we do not want to do this: >>=. There we again consume only the first >, but we leave it as a plain > token. All other possible cases are handled as usual.
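A minimal sketch of how such Flex rules could look, following the description above (rules section only; the token names are illustrative, not necessarily Mercurium's):

">"/">="   { return '>'; }             /* '>>=': consume only the first '>' */
">"/">"    { return AB1; }             /* '>>': the first '>' becomes AB1 */
">="       { return GREATER_EQUAL; }
">"        { return '>'; }             /* a lone '>' is just '>' */

The / operator is Flex's trailing context: the lookahead part is used to select the rule but is left in the input for the next match.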
Now, when we define the C++ syntax, we make template-argument-lists start with < and end with either > or AB1. As a side effect, an expression of the form a >> b will be formed by the token AB1 followed by >, so we have to allow this case as well.
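In grammar terms this looks roughly as follows (rule and token names here are illustrative, not Mercurium's actual grammar):

template_id
    : TEMPLATE_NAME '<' template_argument_list '>'
    | TEMPLATE_NAME '<' template_argument_list AB1
    ;
/* Since the lexer always splits '>>', a shift is AB1 followed by '>'. */
shift_expression
    : additive_expression
    | shift_expression AB1 '>' additive_expression
    ;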
Raw strings
Other challenges introduced by C++11 are raw strings and standard attributes. Raw strings are a bit hard to tokenize because they have arbitrary delimiters, so the tokenizer must remember the delimiter it has seen and, when it sees that delimiter again, finish the special tokenization of the raw string.
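For instance, consider a literal like this one (the contents are just for illustration):

const char *s = R"FOO(a "raw" string\n)FOO";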
Here the tokenizer must identify FOO as the delimiter. The string above will be printed as
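a "raw" string\n

Note that the \n is printed as two literal characters, since escape sequences are not processed inside raw strings.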
To do this, when the tokenizer finds the start of a raw string literal, it determines the delimiter. Then it moves to a raw-string-body mode where all characters but ) are consumed. When a ) is found, it goes to a potential-suffix mode where we may find the delimiter. It may happen that the delimiter is not found there; in that case the characters are appended to the raw literal as if nothing had happened, and we go back to raw-string-body mode. If the delimiter is found, the lexer returns to regular operation and the raw literal ends there.
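In case it helps, here is the same scanning logic as a small standalone function rather than lexer states (an illustrative sketch, not Mercurium's actual code):

#include <cassert>
#include <cstring>
#include <string>

// 'p' points just past the R" of a raw string literal. On success,
// 'body' receives everything between ( and )delimiter".
static bool scan_raw_string(const char *p, std::string &body)
{
    const char *open = std::strchr(p, '(');
    if (open == nullptr)
        return false;                        // malformed: no '(' after R"
    std::string delim(p, open);              // the delimiter (may be empty)
    std::string close = ")" + delim + "\"";  // what ends the literal: )delim"
    for (const char *q = open + 1; *q != '\0'; ++q) {
        // Only a ')' can start the end of the literal; check whether the
        // delimiter and the closing quote follow it.
        if (*q == ')' && std::strncmp(q, close.c_str(), close.size()) == 0) {
            body.assign(open + 1, q);
            return true;
        }
        // Otherwise this character is part of the body and we keep
        // scanning, as if nothing had happened.
    }
    return false;                            // unterminated raw string
}

int main()
{
    std::string body;
    assert(scan_raw_string("FOO(a \"raw\" string\\n)FOO\"", body));
    assert(body == "a \"raw\" string\\n");
}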
Standard attributes
Standard attributes also require special handling in the lexing phase. The reason is that a standard attribute has the form [[ identifier anything-but-two-right-square-brackets-that-is-correctly-parenthesized_opt ]]. This means that anything that is not (, ), [, ], { or } has to be handled as opaque text, while the brackets themselves have to be handled as syntax (so the parser will be able to enforce the correctly-parenthesized part). This again requires special states in the lexer itself that keep track of the brackets, to tell whether a ]] marks the end of the attribute.
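Two hypothetical declarations illustrate this. In both, the attribute's argument clause is opaque text to the grammar, but the lexer still has to balance brackets so that the ]] closing each attribute is identified correctly:

[[deprecated("use new_api instead")]] void old_api();
[[gnu::aligned(sizeof(int[2]))]] char buffer[16];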
Except for these annoying issues, lexing C/C++ is not particularly complicated. Recall that we do context-free parsing, so the most complicated thing the lexer has to do is preserve a minimal amount of syntax (as happens with raw string literals and standard attributes).
Parsing simple declarations is not simple
We already saw some of the problems that arise when parsing C++ using a context-free approach, and what happens when using Bison's GLR algorithm. In this post I want to show some other infelicities that arise due to the lax definition of the C++ syntax.
One of the most complicated parts here is the syntax of simple-declaration. Do not let its name confuse you: a simple-declaration can be incredibly complex.
simple-declaration: decl-specifier-seq_opt init-declarator-list_opt ;
A first interesting thing is that a lone ; is a valid simple-declaration (though most compilers will warn you that it does not declare anything). Some examples of simple-declarations:
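int x;
const unsigned long *p = 0;
struct A { int y; } a, *pa;
typedef int (*fptr)(float);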
A first problem appears in a trivial declaration like this one.
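A b;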
Given that one of the possible decl-specifiers in a decl-specifier-seq is an identifier that describes a type-name (e.g. a class-name or a typedef-name), a parser that follows the grammar as described in the C++ standard will give two possible interpretations: either A and b are both type-specifiers and the init-declarator-list is empty, or A is a type-specifier and b is the init-declarator-list.
. The parser becomes much more creative if we dare to write
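A ::B c;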
Now the interpretations are: three type-specifiers (A, ::B and c); two type-specifiers (A and ::B) plus an init-declarator-list (c); or one type-specifier (A::B) and an init-declarator-list (c). As you can see, again, the number of interpretations starts to grow unnecessarily.
At this point we can only enforce some specific syntax. Given that a type-specifier must appear only once, we use it as an anchor, so our simple-declaration looks roughly like this (a sketch; the actual rule is more detailed):
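simple_declaration
    : nontype_decl_specifier_seq_opt type_specifier nontype_decl_specifier_seq_opt init_declarator_list_opt ';'
    ;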
That is, we force the presence of a type-specifier and allow other nontype-specifiers around it. This solves the problem, except for two issues. First, signed, unsigned, short and long can appear more than once, and they can also act as type-specifiers when they appear alone. This can be fixed with an extra rule for type-specifier that allows only signed, unsigned, short and long, together with not letting them appear in nontype-decl-specifier-seq. Later on, internally, we will add an int type-specifier.
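A sketch of what that extra rule could look like (the nonterminal names are illustrative, not Mercurium's):

type_specifier
    : named_type_specifier
    | sign_size_specifier_seq
    ;
sign_size_specifier_seq
    : sign_size_specifier
    | sign_size_specifier_seq sign_size_specifier
    ;
sign_size_specifier
    : SIGNED | UNSIGNED | SHORT | LONG
    ;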
The second problem is caused by the associativity of ::. C++ says that :: has to be parsed so as to build the longest possible qualified name; this means that A ::b; will always be parsed as A::b. But our parser is not doing this, because Bison's GLR mode does not allow specifying the associativity of operators (annoying!). So we need to make sure that, if our decl-specifier-seq part ends with something that may be part of a qualified-name, the init-declarator-list does not start with ::. For instance,
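int ::A;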
has to be parsed as int in the decl-specifier-seq and then ::A in the init-declarator-list, while
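A<int>::D;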
has to be parsed as A<int>::D in the decl-specifier-seq and an empty init-declarator-list. In order to consider ::D a declarator, it would have to be written as A<int> (::D);.
This again means more cases, so what started as a simple-declaration ends up in the Bison grammar as something ghastly like this (a simplified sketch; the real rule has more alternatives):
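simple_declaration
    : decl_specifier_seq_0 init_declarator_list ';'
    | decl_specifier_seq_ended_with_named_type_spec nonglobal_init_declarator_list ';'
    | decl_specifier_alone_seq ';'
    | init_declarator_list ';'
    | ';'
    ;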
Here decl_specifier_seq_0 means that the sequence does not end with something that could be pasted to a following :: to form a qualified name. Conversely, decl_specifier_seq_ended_with_named_type_spec means exactly that case, and then we force it to be followed by a nonglobal_init_declarator_list, that is, an init-declarator-list that cannot start with ::. Also, decl_specifier_alone_seq makes sure that there is a type-specifier: we must accept struct A { int x; }; and struct A; here, but not const;. Note that int; would be accepted, but a warning would be emitted, as nothing is declared there either.
At this point one may start to wonder whether using a grammar was a good idea in the first place, and whether a handwritten recursive-descent parser would have been better.