Toying with GCC JIT – Part 3
In the last two parts of this series we saw how to use GCC JIT and how to apply it to a simple regular expression matcher. But an open question remains: has it been worth the effort? This part tries to find out.
Generated recognizer
If you want to see the generated code (out of curiosity or for debugging), you can dump it into a Graphviz file. For instance, for the regular expression `ab*c` we generate a `match` function that looks like this. Note: to understand these graphs, just recall that `(char)97` is `a`, `(char)98` is `b` and `(char)99` is `c`.
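As a side note, libgccjit provides `gcc_jit_function_dump_to_dot` for exactly this purpose. A minimal sketch (the wrapper function and the output file name are illustrative):

```c
#include <libgccjit.h>

/* Dump the control flow graph of a JIT-built function to a Graphviz
   file.  "match.dot" is just an illustrative file name. */
static void dump_function(gcc_jit_function *fn)
{
    gcc_jit_function_dump_to_dot(fn, "match.dot");
    /* Render it afterwards with: dot -Tpng match.dot -o match.png */
}
```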
The function `match` calls `matchhere` in `block-10`. Its code looks like this.
The function `matchhere` calls `matchhere_0` in `block-08`. Its code looks like this.
Initial evaluation
Compilation is an expensive process, so the speed gains of a compiled regular expression matcher have to compensate for the extra effort put into JIT compilation.
Let's first run the interpreted matcher to get a baseline. We will use the `perf` tool.
As you can see, the original matcher is really fast and takes almost no time to run. Now let's try the JIT matcher.
Hey! It's slower! More than 15 times slower! Well, the truth is that Alice in Wonderland is a small text file. If we repeat the test on a big text file of 38 MiB (obtained from this page benchmarking regex libraries) then the results are slightly different.
Now the results show that our JIT version is about 1.55 times faster. Not exactly brilliant, but on big files that extra speed is always welcome.
Out of the critical path
The problem with the existing approach is that the JIT compilation may or may not bring enough benefit to be worthwhile. What if we could have our cake and eat it too? The idea is to JIT compile the matcher concurrently and, if we have not yet finished matching lines, switch from the interpreted matcher to the compiled one. If compilation finishes too late, it is simply discarded.
For the sake of simplicity, we will pass the same arguments as the interpreted code (`regexp` and `text`) to the JIT code although, as we know, the `regexp` parameter will not be used. Also, to avoid complicating the code too much, there will be two global variables, `match_fun` and `regexp`, initialized at the beginning of the `main` function. These two variables hold the matcher function currently in use and the regular expression. The matcher function variable is initialized with the interpreted matcher.
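A minimal sketch of this setup, assuming C11 atomics (the `match_fun_t` typedef and the `match` prototype are illustrative, not necessarily the exact code of the series):

```c
#include <stdatomic.h>

/* Signature shared by the interpreted and the compiled matcher. */
typedef int (*match_fun_t)(const char *regexp, const char *text);

/* Matcher currently in use; swapped atomically once the JIT finishes. */
static _Atomic match_fun_t match_fun;
static const char *regexp;

int match(const char *regexp, const char *text); /* interpreted matcher */

int main(int argc, char *argv[])
{
    (void)argc;
    regexp = argv[1];
    atomic_store(&match_fun, match); /* start with the interpreted matcher */
    /* ... spawn the JIT thread and match lines, see below ... */
}
```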
Then we proceed to match lines. Since the matcher function may eventually change to the compiled version, we load it at each iteration, before calling it. We need an `atomic_load` to avoid data races, because this variable will be written concurrently when the JIT compilation finishes.
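A sketch of that loop, building on the globals above (reading from a `FILE *` with `getline` is an assumption here, not necessarily how the series reads its input):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>

static void match_lines(FILE *input)
{
    char *line = NULL;
    size_t len = 0;
    while (getline(&line, &len, input) != -1) {
        /* Reload the matcher: the JIT thread may have swapped it in. */
        match_fun_t fun = atomic_load(&match_fun);
        if (fun(regexp, line))
            fputs(line, stdout);
    }
    free(line);
}
```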
To compile the code concurrently, we just spawn a detached pthread before starting to match lines. A detached pthread allows the program to end even if the JIT compilation has not finished yet. This way we can benefit from the fastest execution technique, assuming we can run two threads in parallel, which almost all processors in current computers can.
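Spawning it could look like this (a sketch using the standard pthreads API; error handling omitted):

```c
#include <pthread.h>

void *concurrent_jit(void *arg); /* defined below */

static void spawn_jit_thread(void)
{
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Detached: nobody will join this thread, the process may exit first. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, concurrent_jit, NULL);
    pthread_attr_destroy(&attr);
}
```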
The function `concurrent_jit` simply compiles the code and updates the `match_fun` pointer. In this case we have to use an `atomic_store` to update the value of `match_fun` without incurring data races.
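It could be sketched as follows, where `jit_compile_matcher` is a hypothetical helper standing in for the libgccjit code from part 2 that builds and compiles the matcher:

```c
/* jit_compile_matcher is a stand-in (hypothetical name) for the libgccjit
   code that builds the context and returns the compiled match function. */
match_fun_t jit_compile_matcher(const char *regexp);

void *concurrent_jit(void *arg)
{
    (void)arg;
    match_fun_t compiled = jit_compile_matcher(regexp);
    if (compiled != NULL)
        atomic_store(&match_fun, compiled); /* publish the compiled matcher */
    return NULL;
}
```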
Evaluation
Let's see if compiling concurrently improves things in the short case (where compiling is not worthwhile).
Now the timings are much better. The performance is just about 1.15 times worse than the original algorithm. Compared to the 15 times slowdown of the first approach this is quite an improvement. Let's see what happens with the big file.
Great. We do not lose performance; we even improve a bit, about 1.15 times faster. Not much of an improvement, but it shows that we really do want to take the compilation out of the critical path.
Viewing the concurrent compilation
Using the Extrae library we can instrument a pthreads program to get a trace that can be visualized with Paraver. Here we see what happens in the case of `Alice.*Rabbit`.
Thread 1.1.1 is the main thread (the one running the `main` function) and thread 1.1.2 is the one that compiles our code. As you can see, JIT compilation takes much longer than actually matching the lines. The blue part in the main thread is where all the calls to `match` happen. Every call to `match` is rendered as a green flag, but they are drawn overlapping because each call is so short.
For the long file, matching `Linus.*Linux`, the result is quite the opposite: the overall matching process lasts much longer than the JIT compilation. Right after the code has been compiled, a couple of matching calls become much slower. This is probably caused by the communication between the two threads through the atomic instructions.
Within Paraver we can also measure the average duration of the calls to `match`.

| Kind of code | Average (nanoseconds) |
|---|---|
| Interpreted | 197.29 |
| Compiled | 126.88 |
This is consistent with the 1.55 times speed gain we observed with `jgrep-jit` on the big file.
Wrap-up
In this short three-part series about GCC JIT we have seen how to apply it to a regular expression matcher to gain some speed. The speed gains obtained are moderate (about 1.55 times). Since compilation can be a slow process, taking it out of the critical path is beneficial in the cases where it would otherwise undermine the overall execution time of the program.
I have uploaded the code used in this series to GitHub.
That's all for today.