How (not) to write a C++ front end

As part of the work I did for my previous employer, we had to develop a C++ front end. This is never an easy task, so I will use this series to share some of the experience gained while developing it.

Context

My previous employer was the Barcelona Supercomputing Center. I worked in the Programming Models group of the Computer Science department. The main goal of that group is to research and develop new programming models for parallelism in the context of high performance computing. Two tools were developed there for this goal: a runtime called Nanos++ and a compiler called Mercurium. The research was focused on OpenMP and BSC's own programming model, OmpSs.

I used to be the main developer and maintainer of Mercurium. Mercurium is a source-to-source compiler for C/C++/Fortran that we used to implement OpenMP and OmpSs so we could apply our proposals and ideas to existing codes, including those developed in-house at the BSC.

A bit of history

Historically, systems featuring multiprocessors with shared memory (i.e. SMP or NUMA) were scarce and very expensive (in the 90s). This changed when the power wall was hit and it became clear that cranking up the frequency of a processor would not bring any further benefits (around the early 2000s). At this point, hardware vendors turned to parallelism inside the CPUs as a way to increase performance. Eventually multicores would become more and more affordable and thus popular. Nowadays it is very easy to find multicore processors even in our smartphones.

The problem with parallelism is mainly programming it. In the 90s, High Performance Fortran was a failed attempt to provide a model. It failed not because the model was wrong per se, but because the compilation technology that was expected to support it never materialized. From that failed project, OpenMP was born. OpenMP set a lower bar for compilers, so the programming model was much more straightforward and less demanding of the compilation technology. This meant that vendors could implement it without heroic efforts. For years OpenMP kept a low profile. But then multicore machines appeared and, being shared memory machines, they were a good match for the expectations of OpenMP. This is the reason why OpenMP has seen a boost in the last 15 years. It is still not a mainstream programming model but it is well known in HPC environments. This renewed interest in OpenMP spawned research in both academia and industry. OpenMP (and OmpSs) were the main focus of the academic research of the Programming Models group. But research needs some amount of development, and this is where I enter.

Early in 2005, Alex Duran, a colleague (and later manager) of mine at the BSC for about 7 years, had hacked together a simple C front end that, using some templates (actually more like mini scripts in a small language), was able to implement many parts of OpenMP. The transformation was done in a source-to-source fashion. Alex named that small tool Mercurium (admittedly I am not sure why; maybe because the ancient god Mercurius/Hermes is the messenger of the gods, but I fail to see how this relates to a source-to-source tool :). The compiler was called mcc. But OpenMP supports C, C++ and Fortran, so in 2005 I was hired by the BSC, after earning my degree, to continue working on the source-to-source Fortran front end I had written for my final engineering degree project. Fortran may sound like an old language but it is still widely used in HPC. It is probably only used there, as a niche. This led to the development of mf95 (the Mercurium Fortran 95 compiler).

Around 2007 we realized that the existing infrastructure was a bit lacking. Both mcc and mf95 were extremely simple tools (you could barely call them compilers). So we eventually decided to invest more in them. At this point, we abandoned the template approach in favor of a more generic and common pass mechanism. And then I got maybe the craziest idea ever: implement the C++ compiler we did not have yet.

Oh man. This took like 7 years to reach a state that I'd consider acceptable. In between, our managers told us that we had to support Fortran again. This was a bit disruptive and required rewriting the compiler again, so now Mercurium has three compilers in one: C, C++ and Fortran 95 (mcc, mcxx and mfc). Despite the diversion of effort that this caused, I think that the overall infrastructure became more robust after this change.

Mercurium

Mercurium is a compilation infrastructure for fast prototyping of source-to-source transformations. As such, it works in the following way.

A front end parses the code and generates an ambiguous AST. This tree is very high level and detailed. On its own it is practically useless, and it is very tailored to the input language.
The AST is semantically analyzed and, in the process, disambiguated. The outcome is another AST which, to distinguish it from the first one, was called nodecl. This second tree is also very high level, but it was designed to be able to express the common elements of C, C++ and Fortran. Of course each language has its own features, so there are specific trees for each, but I'd say that maybe 80% of the nodes are shared.
This nodecl, along with the semantic information (symbols, types) determined in the previous step, is passed to a compilation pipeline that is free to do whatever it wants. Usually it modifies the tree (and updates the symbolic information along the way).
Finally the nodecl tree is pretty printed, giving source code again.
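
To make the flow above more concrete, here is a minimal sketch of such a pipeline in C++. Everything in it is made up for illustration: Node, Phase, fake_front_end, lower_barrier and runtime_barrier are hypothetical names and not Mercurium's actual API, and the parsing and semantic analysis steps are faked by turning each input statement into one node.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for a nodecl tree node.
struct Node {
    std::string text;                             // text emitted for this node
    std::vector<std::unique_ptr<Node>> children;  // nested nodes, in order
};

// A phase receives the whole tree and is free to rewrite it in place.
using Phase = std::function<void(Node&)>;

// Parsing and semantic analysis are faked here: each input statement
// simply becomes one child of a root node.
std::unique_ptr<Node> fake_front_end(const std::vector<std::string>& statements) {
    auto root = std::make_unique<Node>();
    for (const auto& s : statements) {
        auto child = std::make_unique<Node>();
        child->text = s;
        root->children.push_back(std::move(child));
    }
    return root;
}

// Last step: walk the tree and print it back as source code.
std::string prettyprint(const Node& n) {
    std::string out = n.text;
    for (const auto& child : n.children) out += prettyprint(*child);
    return out;
}

// Example phase: replace a directive with a call into a runtime.
// The function name runtime_barrier is invented for this sketch.
void lower_barrier(Node& root) {
    for (auto& child : root.children) {
        if (child->text == "#pragma omp barrier\n") {
            child->text = "runtime_barrier();\n";
        }
    }
}

// Front end, then every phase in order, then the pretty printer.
std::string source_to_source(const std::vector<std::string>& statements,
                             const std::vector<Phase>& phases) {
    auto tree = fake_front_end(statements);
    for (const auto& phase : phases) phase(*tree);
    return prettyprint(*tree);
}
```

Calling source_to_source with the three statements "a = 1;", "#pragma omp barrier" and "b = a;" (each followed by a newline) and the single phase lower_barrier yields the same code with the directive replaced by the call. A real phase would of course work on typed nodes and symbols instead of comparing strings.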

The driver (mcc, mcxx or mfc) then invokes an underlying compiler like gcc, gfortran, icc, ifort, etc. to compile the transformed code.

A cool feature of Mercurium is that phases need not create nodecl trees manually: they can parse chunks of code in C, C++ or Fortran and get a subtree out of it (along with the side effect of declarations in the semantic information). This is possible because the semantic phase keeps enough scoping information to allow new parsings.
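
The idea of parsing a chunk against existing scoping information can be sketched with a toy example. This is not Mercurium's interface: Scope, Subtree and parse_chunk are hypothetical names, and the "language" understood here is limited to two statement forms with space-separated tokens. What it does show is the two effects described above: a subtree comes back, and a declaration in the chunk updates the scope so that later chunks can refer to it.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; none of these names come from Mercurium.
struct Scope {
    std::set<std::string> declared;  // names visible at this point
};

struct Subtree {
    std::string kind;  // "declaration" or "assignment"
    std::string name;  // the variable declared or assigned
};

// Parse one tiny chunk against an existing scope. Only two forms are
// understood: "int NAME ;" and "NAME = VALUE ;" (tokens separated by spaces).
Subtree parse_chunk(const std::string& code, Scope& scope) {
    std::istringstream in(code);
    std::string first, second;
    in >> first >> second;
    if (first == "int") {
        scope.declared.insert(second);  // side effect on the semantic information
        return {"declaration", second};
    }
    if (second == "=" && scope.declared.count(first) != 0) {
        return {"assignment", first};   // only valid because NAME is in scope
    }
    throw std::runtime_error("cannot parse chunk: " + code);
}
```

With an empty Scope, parsing "int tmp ;" returns a declaration subtree and registers tmp, after which "tmp = 0 ;" parses as an assignment. A chunk that assigns to an undeclared name is rejected, which is the toy equivalent of needing the scoping information to parse new code.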

Many people raised an eyebrow when we said that we could handle C++. While we never claimed full compliance with any particular C++ Standard, we got a decent degree of compatibility with the g++ headers. Our C++ front end does most of the stuff a C++ front end has to do.

For instance, the following code:

#include <vector>
#include <numeric>
#include <iostream>

int main(int argc, char *argv[])
{
    std::vector<int> a(20, 0);
    std::iota(a.begin(), a.end(), 1);

    for (auto &it : a)
    {
        std::cout << it << std::endl;
    }
}

is prettyprinted by the compiler as shown here (you will probably want to go to the end of the file). This is a non-trivial amount of front end processing. Note that Mercurium instantiates templates internally but keeps the instantiations to itself, leaving that job to the underlying compiler (g++ or icpc). Even so, you can see that Mercurium has made some properties of the code explicit.
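
As an illustration of what making a property explicit can mean, consider the range-based for loop in the example. The C++ standard defines it in terms of iterators, so a front end has to resolve the begin and end calls and the type behind auto. The sketch below is hand-written and is not Mercurium's actual output; the function names are mine, and the loop writes into a string instead of std::cout so that both forms can be compared.

```cpp
#include <numeric>
#include <sstream>
#include <string>
#include <vector>

// Same input as in the post: 20 elements holding 1, 2, ..., 20.
std::vector<int> make_input() {
    std::vector<int> a(20, 0);
    std::iota(a.begin(), a.end(), 1);
    return a;
}

// The loop as written in the post.
std::string with_range_for(std::vector<int>& a) {
    std::ostringstream out;
    for (auto &it : a) {
        out << it << '\n';
    }
    return out.str();
}

// The same loop with the iterators and the deduced type spelled out,
// following the equivalence given by the standard for range-based for.
std::string with_explicit_iterators(std::vector<int>& a) {
    std::ostringstream out;
    {
        std::vector<int>& range = a;
        for (std::vector<int>::iterator first = range.begin(), last = range.end();
             first != last; ++first) {
            int &it = *first;
            out << it << '\n';
        }
    }
    return out.str();
}
```

Both functions produce the same text for the same vector. The second form is closer to what a compiler works with after semantic analysis: no auto, no implicit calls.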

As of this year, Mercurium is available on GitHub.

What do we need to write a C++ front end?

There are a few ingredients that you will need to write a C++ front end.

A mechanism to preprocess the input. Mercurium does not have a preprocessor; instead it uses that of the underlying compiler. This simplifies a few things, since compilers look for some specific headers in particular directories that would otherwise have to be known in advance. The downside of this approach is that it requires a separate call to the preprocessor. Modern C and C++ front ends do this on the fly while parsing.
A way to tokenize the preprocessed input. This is not very complicated in C and C++, so Mercurium uses a Flex scanner. (In contrast, Fortran lexical analysis is incredibly complicated. Mercurium used a prescanner for fixed-form Fortran and a normal Flex scanner for free-form Fortran. But this had a bunch of problems, so eventually a handmade lexer that could handle both free and fixed form was written.)
A parser. We need to take the sequence of tokens and verify that they form a syntactically valid C++ program. This is incredibly hard because of the C++ syntax itself.
A semantic analysis phase. C++ has a rich type system and rich semantics, so in practice the formalism of the C++ standard must be followed when implementing them. This phase is incredibly hard to write as well, because it involves inferring lots of things through overloading and template instantiation.
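
To give a taste of why the syntax is hard, here is one classic case, chosen by me as an illustration: the tokens "x * y;" are either an expression or a declaration depending on what x names. A parser that does not know about symbols cannot decide between the two, which is one reason a front end might first build an ambiguous tree and disambiguate it during semantic analysis. The names below (Logger and the two functions) are hypothetical.

```cpp
#include <string>

// A type with an overloaded operator*, so that multiplying has a visible effect.
struct Logger {
    std::string log;
    void operator*(int value) { log += "times " + std::to_string(value); }
};

// Case 1: x names a variable, so "x * y;" is an expression statement
// that calls Logger::operator*.
std::string statement_as_expression() {
    Logger x;
    int y = 7;
    x * y;
    return x.log;
}

// Case 2: x names a type, so the very same tokens declare y as a
// pointer to x (here, a pointer to int).
int statement_as_declaration() {
    typedef int x;
    x value = 7;
    x * y;
    y = &value;
    return *y;
}
```

The statement in the middle of each function is spelled identically, yet the first is a call and the second a declaration. Only the symbol information tells them apart.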

Things I'd do differently today

When we started on this crazy endeavour around 2007, clang was in its very early stages. Today we would probably rely on a clang library, which would lift a big part of the burden. I'm not sure whether the current ability to implement transformations using source code would still be feasible, but at least a lot of pain would have been avoided.

In the next post we will see what parsing technology is used in Mercurium and all the problems that it has to solve.