One of the quintessential UNIX tools is the grep tool. The global regular expression print is a tool that prints lines of text file that match a given regular expression. In this post we will apply JIT compilation to a very simple regular expression matcher by Rob Pike.

Regular expressions

Regular expressions are a theoretical construct to generate regular languages. Without going into much detail, a regular language is one that just requires a finite (i.e. bounded) amount of memory to recognize it. This means that we will not be able to recognize some languages that may have unbounded contexts. For instance, recognizing the language of the basic arithmetic requires keeping track parentheses and/or priorities among operators. Since we can nest an unbounded number of terms (forcing us to remember an unbounded number of priorities and parentheses), such language cannot be regular. Regular expressions, despite the limitations of the regular languages that they generate, are very useful in the context of searching text.

There exist several regular expression syntaxes. The most famous ones are the POSIX syntax and the Perl syntax. Their syntax differ but the basic concepts are the same. Note also, that the Perl regular expressions can be used to recognize a superset of regular languages but this is a consequence of their implementation.

A mini regular expression syntax

In the article A Regular Expression Matcher, Brian Kernighan describes a very simple yet still useful syntax of regular expressions.

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

It is not obvious from the description above but ^ will only be honoured at the beginning of the regular expression (otherwise it will be handled like c).

The matcher

In the same article, Kernighan presents an implementation, by Rob Pike, of the matcher algorithm. The implementation is strikingly simple and neat.

6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Used below
static int matchstar(int c, const char *regexp, const char *text);

/* matchhere: search for regexp at beginning of text */
static int matchhere(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c* regexp at beginning of text */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* match: search for regexp anywhere in text */
static int match(const char *regexp, const char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

The article is missing main (a driver routine in the article terminology) function that takes the regular expression and tries to match it against each line. It is not hard to make one, though.

43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
int main(int argc, char *argv[])
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s regex filename\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    const char* regexp = argv[1];

    FILE *f = fopen(argv[2], "r");
    if (f == NULL)
    {
        fprintf(stderr, "error opening file '%s': %s\n",
                argv[2],
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    char* line = NULL;
    size_t length = 0;

    while (getline(&line, &length, f) != -1)
    {
        if (match(regexp, line))
            fprintf(stdout, "%s", line);
    }

    free(line);
    fclose(f);

    return 0;
}

With all in place, now we can use this small program as a simpler version of grep. Let's search lines in the plain text version of Alice's Adventures In Wonderland from the Gutenberg project.

For instance lines that contain first Alice and later Rabbit.

$ ./jgrep-basic "Alice.*Rabbit" pg11.txt 
Alice heard the Rabbit say, 'A barrowful will do, to begin with.'
them Alice recognised the White Rabbit: it was talking in a hurried
Alice watched the White Rabbit as he fumbled over the list, feeling very

Or the opposite, first Rabbit and then Alice.

$ ./jgrep-basic "Rabbit.*Alice" pg11.txt 
Very soon the Rabbit noticed Alice, as she went hunting about, and
'We must burn the house down!' said the Rabbit's voice; and Alice called
'She boxed the Queen's ears--' the Rabbit began. Alice gave a little

Our goal using a JIT compiler

The shown algorithm of the matcher is generic meaning that it works for any valid regular expression. But note that the regular expression will never change during the program. This means that our matcher could be using a specialized version of the algorithm tailored exclusively for the given regular expression. This is just the opposite of generic, we want a specialized version of the matching algorithm. Since we do not know the exact regular expression ahead of the execution of the program, we will have to create it during the program execution. This is why we will need a JIT compiler.

You may be now wondering whether the specialized version will run faster than the generic one. Likely not, but it is worth trying anyway.

Compiling the regular expression

Let's get back at the matchhere function above.

1
2
3
4
5
6
7
8
9
10
11
12
static int matchhere(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

Now take into account that regexp will be known by the JIT compiler as if it were a constant. So we do not need to pass it as a parameter. The parameter text, will not be known, because there will be one text per line of the searched file (and no, we do not want to make a specialized matcher per regular expression times lines, only one specialized matcher per regex). Let's analyze the algorithm case by case.

In lines 3 to 4, we check the finalization of the regular expression. In our case, since the regular expression is fixed, it means that if we somehow process it all, our specialized match must simply return 1;, meaning that the current text matches the regular expression.

In lines 7 to 8, our matcher must verify that we are at the end of the text, otherwise there is no match. This means that if we find a $ at the end of the regular expression, the code must simply check if text is the NULL ending of text.

In lines 9 to 10, our matcher simply checks that . or c match a single letter. If there is a match, the original implementation simply calls itself skipping one character from the regular expression and the text. This effectively moves on to match the remainder of the regular expression with the remainder of the text. Our implementation will skip this recursive call and instead just simply advance the text. We will then, match the remainder of the regular expression.

Finally, lines 5 to 6 implement the Kleene star. We have to match 0 or more times the character before the star. The original implementation just calls matchstar (repeated below for clarity). The call to matchstar passes the current character (that we have to match 0 or more times) and the remainder of the regular expression (the code skips the * as well, this is why it says regexp+2).

1
2
3
4
5
6
7
8
9
/* matchstar: search for c* regexp at beginning of text */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

In line 5 of matchstar, the original algorithm attempts to match the remainder of the regular expression (now found in the parameter regexp). If it matches directly it means that c* or .* has matched 0 characters, and we are done. If it does not match it attempts to match a single letter (either c or .). If it matches, it advances the text (the hard-to-grasph expression *text++ == c is obfuscated C that means: do first tmp1 = *text; then text++; and then tmp1 == c) and just proceeds to match again the remainder of the regular expression with the now remainder of the text. Finally if the remainder of the expression never matches or the current text does not match any of c or . then the whole Kleene star does not match. Note that this is not the most efficient way to implement a regular expression matcher but it easy to understand. For our JIT compiled code, what we will do, is to create another specialized matcher for the remaining of the regular expression and call it in the loop.

This is all we need for the matchhere function. Let's encode it so it can be JIT compiled.

Code generation

The matchhere and matchstar functions

Ok, let's start with matchhere. Each matcher will return int and receive a const char*, that we will call text. So let's start creating the function.

109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
static gcc_jit_function *generate_code_matchhere(gcc_jit_context *ctx, const char* regexp, const char* function_name)
{
  gcc_jit_type *int_type = gcc_jit_context_get_type(ctx, GCC_JIT_TYPE_INT);
  gcc_jit_type *char_type = gcc_jit_context_get_type(ctx, GCC_JIT_TYPE_CHAR);
  gcc_jit_type *const_char_ptr_type = gcc_jit_context_get_type(ctx, GCC_JIT_TYPE_CONST_CHAR_PTR);

  gcc_jit_param *param_text = gcc_jit_context_new_param(ctx, /* loc */ NULL, const_char_ptr_type, "text");
  gcc_jit_rvalue *rval_text = gcc_jit_param_as_rvalue(param_text);

  // matchhere
  gcc_jit_param* params[] = { param_text };
  gcc_jit_function *matchhere = gcc_jit_context_new_function(ctx, /* loc */ NULL,
      GCC_JIT_FUNCTION_INTERNAL, int_type, function_name,
      1, params, /* is_variadic */ 0);
  gcc_jit_block* current_block = gcc_jit_function_new_block(matchhere, new_block_name());
  ...

The exact name of the function will be passed as a parameter, you will see later why. The function itself is internal, because we will only call from another JIT compiled function, and as said above it only has a single parameter (text) of type const char* and returns int. We also create a rvalue of the parameter, that will be used later in the generated code. We will also create a first block and put it in current_block. We will maintain the variable current_block to always be the block where we have to append code in order to match the different parts of the regular expression.

We will also prepare a couple of variables at this point. We will always need a return 1, either when our matcher reaches the end of the regular expression or when we generate code for the Kleene star. So let's have a return_one ready.

136
137
138
139
140
141
  gcc_jit_block* return_zero = NULL;

  gcc_jit_block* return_one = gcc_jit_function_new_block(matchhere, new_block_name());
  gcc_jit_block_end_with_return(
      return_one, /* loc */ NULL,
      gcc_jit_context_one(ctx, int_type));

We also prepare a return_zero, but it is not clear whether we will need it not, so at the moment let's leave it empty (as NULL) and we will create if needed later.

Now let's go processing the given regular expression and create code to match it.

144
145
146
  // Now go creating
  for (;;)
  {

A simple first case is when we reach the end of the regular expression.

146
147
148
149
150
151
152
    if (regexp[0] == '\0')
    {
      gcc_jit_block_end_with_jump(
          current_block, /* loc */ NULL,
          return_one);
      break; // We are done
    }

When we reach the end of the regular expression, it means that all the previous bits of the regular expression have matched as well. So lets end the current block with just return 1. And we leave the loop because there is none left to do with the regular expression.

Let's put the Kleene star aside for a moment so we see the simpler cases first. The next one is a $ at the end of the regular expression.

212
213
214
215
216
217
218
219
220
221
222
223
224
    else if (regexp[0] == '$' && regexp[1] == '\0')
    {
      gcc_jit_block_end_with_return(
          current_block, /* loc */ NULL,
          gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
            GCC_JIT_COMPARISON_EQ, 
            gcc_jit_lvalue_as_rvalue(
              gcc_jit_rvalue_dereference(rval_text, /* loc */ NULL)
              ),
            gcc_jit_context_zero(ctx, char_type)));

      regexp++;
    }

If the regular expression contains a $ as the last character, just return the result of the equal comparison (lines 216 to 221) of *text (lines 218 to 220) and (char)0 (line 221). Note that if the compared operands are equal (i.e. *text == (char)0), then we will return 1, otherwise we will return 0. After we have generated the code, we just move on to the next letter of the regular expression (that in this case will be '\0', so we will just emit a return 1 because this is the first case above).

Another two simple examples are . and c. The case for . is as follows.

225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
    else if (regexp[0] == '.')
    {
      generate_return_zero(ctx, matchhere, &return_zero);

      gcc_jit_block* next_block = gcc_jit_function_new_block(matchhere, new_block_name());

      // if (*text == '\0')
      //    return 0;
      gcc_jit_block_end_with_conditional(
          current_block, /* loc */ NULL,
          gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
            GCC_JIT_COMPARISON_EQ, 
            gcc_jit_lvalue_as_rvalue(
              gcc_jit_rvalue_dereference(rval_text, /* loc */ NULL)
              ),
            gcc_jit_context_zero(ctx, char_type)),
          return_zero,
          next_block);

      // text = &text[1]; // pointer arithmetic
      gcc_jit_block_add_assignment(next_block, /* loc */ NULL,
          gcc_jit_param_as_lvalue(param_text),
          text_plus_one);

      // Chain the code
      current_block = next_block;

      // Done with the current letter
      regexp++;
    }

The code is a bit longer but conceptually simple. We first make sure that we have a return 0 (line 227). We have a function generate_return_zero that we pass (by reference) that return_zero variable that we declared above (but there we left it initialized to NULL) and it creates a return 0; if it is still NULL, otherwise it does nothing. Then (line 229) we create a new block next_block that we will use soon. Next we end the current block with a conditional jump (line 233). If the checked condition succeeds, we jump to the return_zero block otherwise we will jump to next_block. What is the condition of this conditional jump? Well, we just compare (line 235) the current character of the text, this is *text (lines 237 to 239), with (char)0 (line 240).

If the comparison fails, meaning that the current letter is not (char)0 and that . has matched, we have to advance to the next letter of the text. Here we have to compute text+1, since this is a pointer GCC JIT does this by doing the equivalent C code &text[1]. Unfortunately, probably due to a (still unconfirmed) bug GCC JIT requires us to compute the also-equivalent-but-including-an-unnecessary-cast (const char*)&text[1]. Since we do this last operation several times in our matcher, we already have a text_plus_one variable representing this computation. We created it before entering the loop that processes the regular expression.

125
126
127
128
129
130
131
132
133
134
  gcc_jit_rvalue* text_plus_one = 
    gcc_jit_context_new_cast(                     // (const char*)&text[1]
        ctx, /* loc */ NULL,
        gcc_jit_lvalue_get_address(               // &text[1]
          gcc_jit_context_new_array_access(       // text[1]
            ctx, /* loc */ NULL,
            rval_text,                            // text
            gcc_jit_context_one(ctx, int_type)),  // 1
          /* loc */ NULL),
        const_char_ptr_type);

So in next_block above (line 245) we add an assignment to text with the expression in text_plus_one, effectively computing text = &text[1] which is the same (albeit a bit more convoluted) as text++, or text += 1 or text = text + 1. And now next_block becomes current_block (line 250). Finally we move on to the next letter of the regular expression (line 253).

The case for c, shown below, is very similar but now we just return zero if the current letter of the text, i.e *text, is not the same as the current letter (line 270) of the regular expression.

255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
    else
    {
      generate_return_zero(ctx, matchhere, &return_zero);

      gcc_jit_block* next_block = gcc_jit_function_new_block(matchhere, new_block_name());

      // if (*text != regexp[0])
      //    return 0;
      gcc_jit_block_end_with_conditional(
          current_block, /* loc */ NULL,
          gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
            GCC_JIT_COMPARISON_NE, 
            gcc_jit_lvalue_as_rvalue(
              gcc_jit_rvalue_dereference(rval_text, /* loc */ NULL)
              ),
            gcc_jit_context_new_rvalue_from_int(ctx, char_type, regexp[0])),
          return_zero,
          next_block);

      // text = &text[1]; // pointer arithmetic
      gcc_jit_block_add_assignment(next_block, /* loc */ NULL,
          gcc_jit_param_as_lvalue(param_text),
          text_plus_one);

      // Chain the code
      current_block = next_block;

      // Done with the current letter
      regexp++;
    }

You may be wondering what is this new_block_name (lines 229 and 259) that we use when creating a new block. It is just a function that returns a new block name, this may be useful for debugging the JIT compiled code. Along with that function we have new_function_name and new_local_name that do something similar and we will use them when implementing the Kleene star. The three functions are shown below (although they are not particularly interesting).

55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
static const char* new_block_name(void)
{
  static int n = 0;
  enum { SIZE = 16 };
  static char c[SIZE];

  snprintf(c, SIZE, "block-%02d", n);
  c[SIZE-1] = '\0';

  n++;

  return c;
}

static const char* new_function_name(void)
{
  static int n = 0;
  enum { SIZE = 16 };
  static char c[SIZE];

  snprintf(c, SIZE, "matchhere_%d", n);
  c[SIZE-1] = '\0';

  n++;

  return c;
}

static const char* new_local_name(void)
{
  static int n = 0;
  enum { SIZE = 16 };
  static char c[SIZE];

  snprintf(c, SIZE, "tmp_%d", n);
  c[SIZE-1] = '\0';

  n++;

  return c;
}

Now we can proceed to implement the Kleene star. I left if for the last because its implementation is slightly longer. As we mentioned above, the original algorithm matches the current text with the remaining regular expression (after the *). So what we need is specialized matcher for the remainder of the regexp. We can do that by recursively calling the current function to generate such matcher. Cool, right? :)

153
154
155
156
    else if (regexp[1] == '*')
    {
      // Generate code for the remaining regular expression
      gcc_jit_function *remaining_regexp_match = generate_code_matchhere(ctx, regexp + 2, new_function_name());

Now we need two more blocks, loop_body and loop_check, and we will end the current block with an unconditional jump to loop_body. This is implementing the initial semantics of the C do-while statement: we first enter the loop body prior checking the condition.

158
159
160
161
      gcc_jit_block* loop_body = gcc_jit_function_new_block(matchhere, new_block_name());
      gcc_jit_block* loop_check = gcc_jit_function_new_block(matchhere, new_block_name());

      gcc_jit_block_end_with_jump(current_block, /* loc */ NULL, loop_body);

If you check the code of matchstar above you will see that inside the do-while we first call matchhere with the remainder of the regular expression. So this is what we do now.

163
164
165
166
167
168
169
170
171
172
173
174
175
      gcc_jit_rvalue* args[] = { rval_text };
      gcc_jit_rvalue* match_remainder = 
        gcc_jit_context_new_comparison(
            ctx, /* loc */ NULL,
            GCC_JIT_COMPARISON_NE,
            gcc_jit_context_new_call(ctx, /* loc */ NULL,
              remaining_regexp_match, 1, args),
            gcc_jit_context_zero(ctx, int_type));

      gcc_jit_block_end_with_conditional(loop_body, /* loc */ NULL,
          match_remainder,
          return_one,
          loop_check);

We will call the remainig regexp matcher (lines 168-169) passing as the argument the current text (line 163). We will compare the result of the matcher with zero (lines 165 to 170) to end with a conditional jump the loop_body block. If the result is not zero, it means that the matcher of the remaining regepx did match so we return one. Otherwise we go to the loop_check block, this is the while part of the do-while in matchstar.

Now there is a bit of complexity caused by the *text++ == c syntax. As I said above, this is equivalent to do the following.

tmp = *text;
text++;
tmp == c;

So we have to create a temporary variable to store *text, then increment text and then use the temporary variable for the comparison itself. So this is what we do.

177
178
179
180
181
182
      gcc_jit_lvalue* tmp = gcc_jit_function_new_local(matchhere, /* loc */ NULL, char_type, new_local_name());
      gcc_jit_block_add_assignment(
          loop_check, /* loc */ NULL,
          tmp,
          gcc_jit_lvalue_as_rvalue(
            gcc_jit_rvalue_dereference(rval_text, /* loc */ NULL)));

We create a new temporary variable (line 177) and then we assign to it the value of *text.

Now our code will be different depending on whether the Kleene star was c* or .*. For the latter case we just compute if the temporary is not equal (lines 188 to 191) to (char)0 (line 191).

184
185
186
187
188
189
190
191
192
      gcc_jit_rvalue* check_expr;
      if (regexp[0] == '.')
      {
        check_expr =
          gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
              GCC_JIT_COMPARISON_NE,
              gcc_jit_lvalue_as_rvalue(tmp),
              gcc_jit_context_zero(ctx, char_type));
      }

For the former case (i.e. c*), we just compute if the temporary is equal (lines 196 to 199) to c (line 199).

193
194
195
196
197
198
199
200
      else
      {
        check_expr =
          gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
              GCC_JIT_COMPARISON_EQ,
              gcc_jit_lvalue_as_rvalue(tmp),
              gcc_jit_context_new_rvalue_from_int(ctx, char_type, regexp[0]));
      }

Next, we perform text++.

202
203
204
      gcc_jit_block_add_assignment(loop_check, /* loc */ NULL,
          gcc_jit_param_as_lvalue(param_text),
          text_plus_one);

And finally we end the loop_check block with a conditional jump, depending on whether the comparison was true or not (line 205). If the comparison was false, it means that neither the remainder of the regular expression matched neither the current letter of the regular expression matches the Kleene star, so we return zero (line 208). If the comparison was true (line 207) we just go back to the do part of the do-while.

205
206
207
208
209
210
      gcc_jit_block_end_with_conditional(loop_check, /* loc */ NULL,
          check_expr,
          loop_body,
          return_zero);

      break; // We are done

After this, we are done with the regular expression (line 210). We stop processing the regular expression at this point because, the remainder of it is already being processed in a matcher that we created for this Kleene star case. It may be surprising that for a regular expression like ab*c, the matcher only generates code for ab*, but take into account that we created a matcher for c, that is being used when matching b*.

Finally we return the function we have created. This is important because we need to call this function from the JIT compiled code (from the match function or from the Kleene star code).

293
294
  return matchhere;
}

match

Phew, that was long, but we are not done yet. We have to implement the match function. The match function handles ^ (if any).

296
297
298
299
300
301
302
303
void generate_code_regexp(gcc_jit_context *ctx, const char* regexp)
{
  const char* matchhere_regexp = regexp;
  if (regexp[0] == '^')
  {
    matchhere_regexp++;
  }
  gcc_jit_function* matchhere = generate_code_matchhere(ctx, matchhere_regexp, "matchhere");

We first generate a matcher for the regular expression (line 303) skipping an initial ^ if any (lines 299 to 302). To do this we use the code generator shown in the previous section.

Now we create a match function, that like matchhere, will just receive a text as a parameter.

310
311
312
313
314
315
316
  gcc_jit_param *param_text = gcc_jit_context_new_param(ctx, /* loc */ NULL, const_char_ptr_type, "text");
  gcc_jit_rvalue *rval_text = gcc_jit_param_as_rvalue(param_text);

  gcc_jit_param* params[] = { param_text };
  gcc_jit_function *match = gcc_jit_context_new_function(ctx, /* loc */ NULL,
      GCC_JIT_FUNCTION_EXPORTED, int_type, "match",
      1, params, /* is_variadic */ 0);

Note that this time the function is exported, because we will call it outside of the JIT compiler context.

Since we will have to call matchhere, let's prepare a call now that we will use later.

318
319
320
321
  gcc_jit_rvalue* args[] = { rval_text };
  gcc_jit_rvalue* call_to_matchhere = gcc_jit_context_new_call(ctx, /* loc */ NULL,
      matchhere,
      1, args);

Now let's conditionally generate code depending on whether the regular expression started with ^. If it does, the code is rather simple.

322
323
324
325
326
327
328
329
330
331
332
  if (regexp[0] == '^')
  {
    gcc_jit_block* block = gcc_jit_function_new_block(match, new_block_name());

    gcc_jit_block_end_with_return(
        block, /* loc */ NULL,
        gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
          GCC_JIT_COMPARISON_NE, 
          call_to_matchhere,
          gcc_jit_context_zero(ctx, int_type)));
  }

We only have to check (lines 328 to 331) if the result of the call to matchhere (line 330) is not equal to zero (line 331) and use that value as the return of the match function.

If the regular expression does not start with ^, then we have to succesively match suffixes of the text. We need a loop for that, so the required code is a bit longer.

333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
  else
  {
    gcc_jit_block* loop_body = gcc_jit_function_new_block(match, new_block_name());
    gcc_jit_block* return_one = gcc_jit_function_new_block(match, new_block_name());
    gcc_jit_block* condition_check = gcc_jit_function_new_block(match, new_block_name());
    gcc_jit_block* return_zero = gcc_jit_function_new_block(match, new_block_name());

    gcc_jit_block_end_with_conditional(
        loop_body, /* loc */ NULL,
        gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
          GCC_JIT_COMPARISON_NE, 
          call_to_matchhere,
          gcc_jit_context_zero(ctx, int_type)),
        return_one,
        condition_check);

    gcc_jit_block_end_with_return(
        return_one, /* loc */ NULL,
        gcc_jit_context_one(ctx, int_type));

    gcc_jit_lvalue* tmp = gcc_jit_function_new_local(match, /* loc */ NULL, char_type, new_local_name());
    gcc_jit_block_add_assignment(
        condition_check, /* loc */ NULL,
        tmp,
        gcc_jit_lvalue_as_rvalue(
          gcc_jit_rvalue_dereference(rval_text, /* loc */ NULL)));
    gcc_jit_block_add_assignment(
        condition_check, /* loc */ NULL,
        gcc_jit_param_as_lvalue(param_text),
        gcc_jit_context_new_cast(
          ctx, /* loc */ NULL,
          gcc_jit_lvalue_get_address(
            gcc_jit_context_new_array_access(
              ctx, /* loc */ NULL,
              rval_text,
              gcc_jit_context_one(ctx, int_type)),
            /* loc */ NULL),
          const_char_ptr_type));
    gcc_jit_block_end_with_conditional(
        condition_check, /* loc */ NULL,
        gcc_jit_context_new_comparison(ctx, /* loc */ NULL,
          GCC_JIT_COMPARISON_NE, 
          gcc_jit_lvalue_as_rvalue(tmp),
          gcc_jit_context_zero(ctx, char_type)),
        loop_body,
        return_zero);

    gcc_jit_block_end_with_return(
        return_zero, /* loc */ NULL,
        gcc_jit_context_zero(ctx, int_type));
  }

We will need four blocks. A loop_body block for the do-while (note that since loop_body is the first block created for the function it will be the entry block of the function), a condition_check where we will check if we have reached the end of the text, and two return_zero and return_one blocks that simply return 0 (lines 380 to 382) and return 1 (lines 349 to 351) respectively. We end the loop_body block conditionally (line 340) depending on whether the call to matchhere (line 344) returns a value different to zero (line 342 to 345). If it does it means there was a match and then we go to return_one, otherwise we go to the condition_check block.

Again in the condition of the loop we find *text++ == '\0' which again we have to decompose in tmp = *text; text++; tmp != '\0' (lines 353 to 370). We check if tmp is different to zero (lines 373 to 376). If it is we just go back to the loop body, otherwise we return zero (lines 371 to 378). This is similar to the code we generated for check in the loop of the Kleene star matcher.

Driver

We are close to the end. We still need to change our main function to generate and compile the code and then run the called code.

int main(int argc, char *argv[])
{
  if (argc != 3)
  {
    fprintf(stderr, "usage: %s regex filename\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  const char* regexp = argv[1];

  gcc_jit_context *ctx;
  ctx = gcc_jit_context_acquire ();
  if (ctx == NULL)
    die("acquired context is NULL");

  gcc_jit_context_set_int_option(ctx, GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 2);

  generate_code_regexp(ctx, regexp);

  gcc_jit_result *result = gcc_jit_context_compile(ctx);
  if (result == NULL)
    die("compilation failed");

  void *function_addr = gcc_jit_result_get_code(result, "match");
  if (function_addr == NULL)
    die("error getting 'match'");

  typedef int (*match_fun_t)(const char*);
  match_fun_t match = (match_fun_t)function_addr;

  FILE *f = fopen(argv[2], "r");
  if (f == NULL)
  {
    fprintf(stderr, "error opening file '%s': %s\n",
        argv[2],
        strerror(errno));
    exit(EXIT_FAILURE);
  }

  char* line = NULL;
  size_t length = 0;

  while (getline(&line, &length, f) != -1)
  {
    if (match(line))
      fprintf(stdout, "%s", line);
  }

  free(line);
  fclose(f);

  return 0;
}

Most of the code is similar to the driver at the beginning of the post. The only relevant parts are that now we first generate the code calling generate_code_regexp (line 407). This will populate the JIT compiler context so we can compile it (line 409). Once compiled it is just a matter of getting the address of the match function. See how this time our match function only has one parameter. The regexp itself has been embedded in its code because we have generated a specialized matcher. Optionally we can enable -O2 optimization level so the generated code is faster (at expense of a slower JIT compilation).

Cliffhanger 8-)

In the next chapter we will get some performance numbers in order to see if it was worth or not to use JIT compilation to speed up or regular expression matcher.

That's all for today