Think In Geek

In geek we trust

A tiny GCC front end – Part 3

Now that the minimal infrastructure is already set, we can start with the implementation of our tiny front end. Today we will talk about the lexer.

Typical compiler architecture

Since compilation is a lowering process, it makes sense to split the whole process in several steps. Each of these steps usually performs a single process. This is done because compilers quickly become untameable beasts. By splitting the compiler in several steps, it is easier to assign each step a specific and narrow task. This helps reasoning about the code and thus can minimize introducing bugs (or helps fixing them) and eases extending the compiler (this is actually commonplace software engineering good practice, but obviously applies to compilers as well).

Almost every compilation infrastructure like GCC, LLVM, Open64, … is split in two big parts. The front end (what we are implementing) and the back end. The front end is responsible of recognizing the source language and diagnosing syntactic and semantic errors. The back end is responsible for generating code for the specific target (i.e. the environment and architecture). Nowadays compilers are in general optimizing compilers (a misnomer since they should be called enhancing compilers) this means thay they try to generate code that makes the most of the properties of the program (machine-independent optimizations) and the target where the program is going to run (machine-dependent optimizations). While machine-independent optimizations can be done in the front end this would make them language-specific. This is undesirable, so most compilation infrastructures have an extra part between the front end and the back end that sometimes is called the middle end.

Since the process sequentially flows from the front end to the middle end and from that to the back end, there must be a way for each two to communicate. Compilers use what is commonly called intermediate representations. An intermediate representation is a more or less abstract representation of the program. At the early stages of the compilation process, the representation is very abstract and high level, pretty close to the programming language. As the compilation process goes on, the intermediate representation becomes more and more low level, until it is almost assembler or machine code. It is not unusual that compilers have several intermediate representations. Sometimes it is the same representation simply restricted to a lower level subset. Sometimes is an entirely different representation. It may even happen that some optimizations require additional intermediate representations that are built only for them.

In GCC, a front end can use whatever representation it wants but at the end, before handing out the code to the rest of the compilation process, it has to be lowered into something called GENERIC. GENERIC is a tree-like representation and it is used in the front end and some early steps in the middle end of GCC. As a front end we only have to care to be able to deliver sensible GENERIC to the rest of the compiler.

But before we can create any GENERIC we first have to deal with the source code.

Open the file

Before we do anything let’s modify our parse hook in gcc-src/gcc/tiny/tiny1.cc so we open the input file and so we can later analyze it.

static void
tiny_parse_file (const char *filename)
{
  FILE *file = fopen (filename, "r");
  if (file == NULL)
    {
      fatal_error (UNKNOWN_LOCATION, "cannot open filename %s: %m", filename);
    }
 
  // Here we will analyze our file
 
  fclose (file);
}
 
static void
tiny_parse_files (int num_files, const char **files)
{
  for (int i = 0; i < num_files; i++)
    {
      tiny_parse_file (files[i]);
    }
}
 
static void
tiny_langhook_parse_file (void)
{
  tiny_parse_files (num_in_fnames, in_fnames);
}

This still does nothing useful but at least defines a placeholder where the analysis starts. Note the usage of globals num_in_fnames and in_fnames in tiny_langhook_parse_file that contain the number of files we are going to compile. Although we do not mind the number of parameters in tiny_parse_files our examples will have just a single file. In function tiny_parse_file now we just open the input file but the analysis will happen there.

Typical front end design

Nowadays front ends are commonly designed with three components: the lexical analyzer, the syntactic analyzer and the semantic analyzer. This is consistent with the strategy of splitting the compiler in smaller parts so they are manageable.

The lexical analyzer takes the input and splits into the constitutive parts of the language (i.e. the lexical elements) called tokens. The lexical analyzer is commonly called the lexer, the scanner or the tokenizer. Conceptually constructs a sequence of tokens of the program and nothing else.

This sequence of tokens is consumed by the syntactic analyzer. The syntactic analyzer verifies that the order of the tokens is such that it represents a (syntactically) valid program. The semantic analyzer still has to verify that a (syntactically) valid sequence of tokens adheres to the rules of the language. The term parser is commonly used to encompass the three processes of lexical, syntactic and semantic analysis.

Lexical analysis

Back to tiny, the tokens of the language more or less correspond to the elements in black bold face in the rules of part 1, this is, all what appears in the right hand side of a rule and is not like 〈this〉. The following elements of the program will be analyzed by the lexer as tokens.

var
write
;
+

Some other tokens will not directly correspond to these elements, but a sequence of them. Identifiers, integer literals, string literals, can be understood as tokens on their own. The reason is that there is little value in keeping them split in their constituents.

123
123.456
“hello”
foo

For instance, the first three lines of the following tiny program (assume it is in a file sum.tiny)

1
2
3
4
5
6
7
var i : int;
var s : int;
s := 0;
for i := 1 to 10 do
  s := s + i;
end
write s;

will be tokenized like this.

VAR IDENTIFIER COLON INT SEMICOLON
VAR IDENTIFIER COLON INT SEMICOLON
IDENTIFIER ASSIG INTEGER_LITERAL SEMICOLON

For each token we will want a bit more of information. In particular, IDENTIFIER tokens and INTEGER_LITERAL, among others, should have at least their associated text. For diagnostic purposes, we will also want to keep track of the location of each token inside the input. So we will also associate a file, line and column (called a location or locus). The above sequence of tokens would actually be more like

id=VAR, file=sum.tiny, line=1, col=1 id=IDENTIFIER, file=sum.tiny, line=1, col=5, text=i id=COLON, file=sum.tiny, line=1, col=7 id=INT, file=sum.tiny, line=1, col=9 id=SEMICOLON, file=sum.tiny, line=1, col=12
id=VAR, file=sum.tiny, line=2, col=1 id=IDENTIFIER, file=sum.tiny, line=2, col=5, text=s id=COLON, file=sum.tiny, line=2, col=7 id=INT, file=sum.tiny, line=2, col=9 id=SEMICOLON, file=sum.tiny, line=2, col=12
id=IDENTIFIER, file=sum.tiny, line=3, col=1, text=s id=ASSIG, file=sum.tiny, line=3, col=3 id=INTEGER_LITERAL, file=sum.tiny, line=2, col=6, text=0 id=SEMICOLON, file=sum.tiny, line=3, col=7

Tokens

What is a token? It is conceptually a tuple of what a token can have: the kind of the token, its location and a text (if any).

struct Token
{
private:
  TokenId token_id;
  location_t locus;
  std::string *str;
...
};

Field token_id will store the kind of token, locus will keep the location (more on this below) and str will keep an associated text, if any, of the token.

Field token_id has type TokenId that is nothing but an enum of the kinds of tokens we have.

enum TokenId
{
  ...
};

The enum would contain all the token kinds as enumerators. We can write them manually but this quickly becomes tedious. Instead we will use X-Macros. This way we can describe our tokens in one place and the data structures will be updated automatically. We will use this technique several times in our front end to ease maintaining. Of course other code-generating approaches (like using small DSLs like GNU M4) can be used instead, this one is enough for most of our needs.

// TINY_TOKEN(name, description)
// TINY_TOKEN_KEYWORD(name, identifier)
//
// Keep TINY_TOKEN_KEYWORD sorted
 
#define TINY_TOKEN_LIST                                                        \
  TINY_TOKEN (FIRST_TOKEN, "<first-token-marker>")                             \
  TINY_TOKEN (END_OF_FILE, "end of file")                                      \
  TINY_TOKEN (ASSIG, ":=")                                                     \
  TINY_TOKEN (ASTERISK, "*")                                                   \
  TINY_TOKEN (COLON, ":")                                                      \
  TINY_TOKEN (DIFFERENT, "!=")                                                 \
  TINY_TOKEN (EQUAL, "=")                                                      \
  TINY_TOKEN (LEFT_PAREN, "(")                                                 \
  TINY_TOKEN (MINUS, "-")                                                      \
  TINY_TOKEN (PLUS, "+")                                                       \
  TINY_TOKEN (RIGHT_PAREN, ")")                                                \
  TINY_TOKEN (SEMICOLON, ";")                                                  \
  TINY_TOKEN (SLASH, "/")                                                      \
  TINY_TOKEN (PERCENT, "%")                                                    \
  TINY_TOKEN (GREATER, ">")                                                    \
  TINY_TOKEN (GREATER_OR_EQUAL, ">=")                                          \
  TINY_TOKEN (LOWER, "<")                                                      \
  TINY_TOKEN (LOWER_OR_EQUAL, "<=")                                            \
  TINY_TOKEN (IDENTIFIER, "identifier")                                        \
  TINY_TOKEN (INTEGER_LITERAL, "integer literal")                              \
  TINY_TOKEN (REAL_LITERAL, "real literal")                                    \
  TINY_TOKEN (STRING_LITERAL, "string literal")                                \
                                                                               \
  TINY_TOKEN_KEYWORD (AND, "and")                                              \
  TINY_TOKEN_KEYWORD (DO, "do")                                                \
  TINY_TOKEN_KEYWORD (ELSE, "else")                                            \
  TINY_TOKEN_KEYWORD (END, "end")                                              \
  TINY_TOKEN_KEYWORD (FLOAT, "float")                                          \
  TINY_TOKEN_KEYWORD (FOR, "for")                                              \
  TINY_TOKEN_KEYWORD (IF, "if")                                                \
  TINY_TOKEN_KEYWORD (INT, "int")                                              \
  TINY_TOKEN_KEYWORD (NOT, "not")                                              \
  TINY_TOKEN_KEYWORD (OR, "or")                                                \
  TINY_TOKEN_KEYWORD (READ, "read")                                            \
  TINY_TOKEN_KEYWORD (THEN, "then")                                            \
  TINY_TOKEN_KEYWORD (TO, "to")                                                \
  TINY_TOKEN_KEYWORD (VAR, "var")                                              \
  TINY_TOKEN_KEYWORD (WHILE, "while")                                          \
  TINY_TOKEN_KEYWORD (WRITE, "write")                                          \
                                                                               \
  TINY_TOKEN (LAST_TOKEN, "<last-token-marker>")
 
enum TokenId
{
#define TINY_TOKEN(name, _) name,
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
  TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};

What we do is we define a macro TINY_TOKEN_LIST using undefined macros inside. Right before using the list we define these, in this case TINY_TOKEN and TINY_TOKEN_KEYWORD, to what we need. For this specific case, this will fill our enum TokenId with the first parameter of the macro, that we chose to use as the enumerator name. Note that we distinguish plain tokens from keyword tokens using TINY_TOKEN_KEYWORD instead of TINY_TOKEN. Later on we will see why.

The expansion of the above X-Macro will generate something like the following. This is what we wanted but we do not have to write it manually.

enum TokenId
{
  FIRST_TOKEN,
  END_OF_FILE,
  ASSIG,
  // ... etcetera ...
  WRITE,
  LAST_TOKEN
};

We will also use this technique to create a function that returns a descriptive text for a given token kind.

const char *
get_token_description (TokenId tid)
{
  switch (tid)
    {
#define TINY_TOKEN(name, descr)                                                \
  case name:                                                                   \
    return descr;
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
      TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
    default:
      gcc_unreachable ();
    }
}

And we will also create a debugging function that given a token_id will return its token kind as a string (i.e. given the token kind ASSIG this function will return the string "ASSIG").

const char *
token_id_to_str (TokenId tid)
{
  switch (tid)
    {
#define TINY_TOKEN(name, _)                                                    \
  case name:                                                                   \
    return #name;
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
      TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
    default:
      gcc_unreachable ();
    }
}

Since we do not want to create tokens directly using the constructors we will create them using factory functions. Most tokens will simply be created with a token id and a location but some of them have will have an associated text. The factories are make, make_identifier, make_integer, make_real and make_string.

struct Token
{
 private:
  // ...
 
  Token (TokenId token_id_, location_t locus_)
    : token_id (token_id_), locus (locus_), str (NULL)
  {
  }
  Token (TokenId token_id_, location_t locus_, const std::string& str_)
    : token_id (token_id_), locus (locus_), str (new std::string (str_))
  {
  }
 
  // No default initializer
  Token ();
  // Do not copy/assign tokens
  Token (const Token &);
  Token &operator=(const Token &);
 
public:
  static TokenPtr
  make (TokenId token_id, location_t locus)
  {
    return TokenPtr(new Token (token_id, locus));
  }
 
  static TokenPtr
  make_identifier (location_t locus, const std::string& str)
  {
    return TokenPtr(new Token (IDENTIFIER, locus, str));
  }
 
  static TokenPtr
  make_integer (location_t locus, const std::string& str)
  {
    return TokenPtr(new Token (INTEGER_LITERAL, locus, str));
  }
 
  static TokenPtr
  make_real (location_t locus, const std::string& str)
  {
    return TokenPtr(new Token (REAL_LITERAL, locus, str));
  }
 
  // ...
};

These factories return smart pointers to Token because the lifetime of a token is not obvious and we want to clean up them anyway when they are not used anymore.

#include <tr1/memory>
 
struct Token;
typedef std::tr1::shared_ptr<Token> TokenPtr;
typedef std::tr1::shared_ptr<const Token> const_TokenPtr;

Type const_TokenPtr will be used later in the lexer. We have to use C++03 TR1 because gcc is written in C++03 not C++11 (in C++11 we would use the standard memory header and std::shared_ptr template instead)

The complete implementation of the class Token is in files gcc-src/gcc/tiny/tiny-token.h and gcc-src/gcc/tiny/tiny-token.cc

Lexer operation

Conceptually a lexer returns the sequence of tokens of our input. Building such sequence as a whole would force us to temporarily store it in memory. This would work but it is not particularly efficient because input files may easily have thousands of tokens and we are going to analyze them more or less sequentially. A lexer that returns a stream of tokens will have lower memory requirements.

This means that the lexer will always return a single token. It will return the current one unless we tell the lexer to advance to the next token (going backwards is out of question). This suggests that our lexer must support a get operation that returns the current token and a skip operation that advances the current token to the next one (skip does not return anything). Sometimes we can even mix the two operations in a single get operation that, at the same time, returns the current token and skips it.

This stream-like approach saves memory but now we have made our life a bit harder. Sometimes (this will be more evident during syntactic analysis) we may need to peek a few tokens ahead. With the get/skip interface above it will be responsibility of the user of the lexer to keep track of tokens peeked ahead. While this is doable it may be a bit unwieldy. So our lexer should support peeking n tokens ahead. If n is zero this is the same as the current token, so we can make our lexer to have two operations peek(n) and skip. Note that there is no real need of skip(n) since this can easily be achieved by calling n times skip (although it may be handy having it).

For theoretical reasons out of the scope of this post, our front end should not do unbounded peeks. This means that, ideally, all peek(n) operations should receive an n value known at compile time. For the sake of simplicity, though, our implementation will allow unbounded peeks.

The two operations that the lexer can do (peek and skip) remind us of a queue. When we peek a token, if it is not in the queue it will be added to the back of the queue, otherwise the token in the queue will be used. Skip simply removes an item from the front of the queue. Recall that peek(0) is the current token and skip will advance the token stream. This means that the token returned by peek(n+1) before skip, will be returned by peek(n) after the skip.

lexer-input

Lexer interface

Our lexer interface will look like this.

struct Lexer
{
public:
  Lexer (const char *filename, FILE *input);
  ~Lexer ();
 
  const_TokenPtr peek_token () { return peek_token(0); }
  const_TokenPtr peek_token (int);
 
  void skip_token () { return skip_token(0); }
  void skip_token (int);
private:
 ...
};

Basically we will pass to the lexer the input file (and the filename, for location tracking purposes) and then we have the two operations described above now renamed as peek_token and skip_token.

Where do we get the tokens from? Well, before we can form a token we have to somehow read the input file. We can view the input file similar to a stream of tokens but this time it will be a stream of characters. We will group them into tokens. This suggests that the idea of peek(n) and skip can be applied to the FILE*. So it seems a good idea to abstract this away in a template class buffered_queue.

template <typename T, typename Source> struct buffered_queue
{

The template parameter T will be the type of items stored in our queue. For the input file it will be char (although we will use int for a reason explained below). For the tokens themselves it will be TokenPtr. Source is a class type that implements the function call operator. This operator must return a T value. It will be invoked during peek(n) n refers to an element that is not yet in the queue.

The function call operator for the input file will basically invoke fgetc. This function returns an unsigned char casted into a int because of the EOF marker used to mark end of file. This is the reason why our buffered_queue for the input file will store int rather than char. For the stream of tokens the function call operator will just build the next token using the input, we will see later how we build the token.

struct Lexer
{
  // ...
 private: 
  struct InputSource
  {
    FILE *input;
    InputSource (FILE *input_) : input (input_) {}
    int operator() () { return fgetc (input); }
  };
  InputSource input_source;
  buffered_queue<int, InputSource> input_queue;
 
  struct TokenSource
  {
    Lexer *lexer;
    TokenSource (Lexer *lexer_) : lexer (lexer_) {}
    TokenPtr operator() () { return lexer->build_token (); }
  };
 
  TokenSource token_source;
  buffered_queue<std::tr1::shared_ptr<Token>, TokenSource> token_queue;
};

Data members input_queue and token_queue implement respectively the buffered queues of the input file and the stream of tokens.

The implementation of buffered_queue is a bit long to paste it here. It is in gcc-src/gcc/tiny/tiny-buffered-queue.h. It is implemented using a std::vector and two position markers: start and end. When start == end it means that the queue is empty. Member function skip just advances start, calling peek if we are actually skiping more than what was peeked so far. If it becomes the same as end, it just moves both to the beginning of the vector, this way the vector does not have to grow indefinitely. Member function peek checks if the requested n is already in the queue, if it is, it just returns it. If it is not it will call Source::operator(), but before that it checks if there is enough room in the vector. If there is not, then a larger vector is allocated, data is copied and the new vector is now used as the buffer. There is room for improvement for this class. For instance we may try to compact the vector before reallocating because the space may already be there just we are too near the end of the vector, etc. But I think it is good enough for us now.

Now we can see the implementation of peek_token and skip_token.

const_TokenPtr
Lexer::peek_token (int n)
{
  return token_queue.peek (n);
}
 
void
Lexer::skip_token (int n)
{
  token_queue.skip (n);
}

This way the token will be formed during the call to token_queue.peek . It will invoke SourceToken::operator() which ends calling Lexer::build_token. So now it is time to see how a token is formed.

TokenPtr
Lexer::build_token ()
{
  for (;;)
    {
      location_t loc = get_current_location ();
      int current_char = peek_input ();
      skip_input ();
 
      // ... rest of the code ...
    }
}

Before we discuss the main loop, note that the body of the loop calls peek_input and skip_input. These two functions use the input_queue.

int
Lexer::peek_input (int n)
{
  return input_queue.peek (n);
}
 
void
Lexer::skip_input (int n)
{
  input_queue.skip (n);
}

Similar to what happened with token_queue, input_queue will invoke InputSource::operator() which simply calls fgetc, effectively returning us the current character of the file (until we skip it, of course).

So, why is there a loop in Lexer::build_token? Because we may have to advance several characters of the input before we can form a token. When a token is formed, we will simply return from the function. While we cannot form it (for instance when we encounter whitespace or newlines) we will just keep requesting characters. Of course there must be a way of finishing the loop: when we find the end of file we will just build the END_OF_FILE token and stop processing the input.

107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
TokenPtr
Lexer::build_token ()
{
  for (;;)
    {
      location_t loc = get_current_location ();
      int current_char = peek_input ();
      skip_input ();
 
      if (current_char == EOF)
	{
	  return Token::make (END_OF_FILE, loc);
	}
      // ...
}

If the character read from the input is not the END_OF_FILE we can start tokenizing it. We have to ignore whitespace by not forming a token but we still have to update location information.

121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
      switch (current_char)
	{
	// **************
	// * Whitespace *
	// **************
	case '\n':
	  current_line++;
	  current_column = 1;
	  linemap_line_start (::line_table, current_line, max_column_hint);
	  continue;
	case ' ':
	  current_column++;
	  continue;
	case '\t':
	  // Width of a tab is not well defined, let's assume 8 for now
	  current_column += 8;
	  continue;

As you can see we have two data members current_line and current_column that we have to update for proper location tracking. Let’s ignore for now the line 129, more on this later.

Now we can start matching punctuation. Some tokens are straightforward.

171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
	case '=':
	  current_column++;
	  return Token::make (EQUAL, loc);
	case '(':
	  current_column++;
	  return Token::make (LEFT_PAREN, loc);
	case '-':
	  current_column++;
	  return Token::make (MINUS, loc);
	case '+':
	  current_column++;
	  return Token::make (PLUS, loc);
	case ')':
	  current_column++;
	  return Token::make (RIGHT_PAREN, loc);
	case ';':
	  current_column++;
	  return Token::make (SEMICOLON, loc);

Some others may require a bit of peeking, but that’s all.

189
190
191
192
193
194
195
196
197
198
199
200
201
202
	case '<':
	  if (peek_input () == '=')
	    {
	      skip_input ();
	      current_column += 2;
 
	      return Token::make (LOWER_OR_EQUAL, loc);
	    }
	  else
	    {
	      current_column++;
	      return Token::make (LOWER, loc);
	    }
	  break;

If you wonder how comments are implemented: the lexer just skips over it.

223
224
225
226
227
228
229
230
231
232
233
	case '#': /* comment */
	  current_column++;
	  current_char = peek_input ();
	  while (current_char != '\n')
	    {
	      skip_input ();
	      current_column++; // won't be used
	      current_char = peek_input ();
	    }
	  continue;
	  break;

If we reach the end of the loop and the character has not been handled, it means that the character is invalid. Function error_at is a diagnostic utility from GCC that we will see again during syntactic analysis, so let’s ignore it from now.

335
336
337
      // Martians
      error_at (loc, "unexpected character '%x'", current_char);
      current_column++;

And so on. See the full listing here. It may seem tedious but it only has to be written once and after that it is relatively easy to extend.

Identifiers and keywords

Identifiers and keywords are interesting because they share some lexical form. As we specified in part 1, if an identifier may be a keyword it is tokenized as a keyword. This means that whil is an IDENTIFIER but while is the token WHILE. So what we do is we just
gather the text of the token and check if it is an keyword. If it is, we form the corresponding keyword token, otherwise we form an identifier token.

237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
      // ***************************
      // * Identifiers or keywords *
      // ***************************
      if (ISALPHA (current_char) || current_char == '_')
	{
	  std::string str;
	  str.reserve (16); // some sensible default
	  str += current_char;
 
	  int length = 1;
	  current_char = peek_input ();
	  while (ISALPHA (current_char) || ISDIGIT (current_char)
		 || current_char == '_')
	    {
	      length++;
 
	      str += current_char;
	      skip_input ();
	      current_char = peek_input ();
	    }
 
	  current_column += length;
 
	  TokenId keyword = classify_keyword (str);
	  if (keyword == IDENTIFIER)
	    {
	      return Token::make_identifier (loc, str);
	    }
	  else
	    {
	      return Token::make (keyword, loc);
	    }
	}

Macros ISALPHA and ISDIGIT are provided by the gcc header safe-ctype.h and check if a character belongs to the set of alphanumeric letters or decimal digits, respectively. The function classify_keyword is implemented by doing a binary search in a sorted array of keywords. This sorted array is defined using X-Macros, here we use only TINY_TOKEN_KEYWORD and we ignore the remaining tokens.

64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
namespace
{
 
const std::string keyword_index[] = {
#define TINY_TOKEN(x, y)
#define TINY_TOKEN_KEYWORD(name, keyword) keyword,
  TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};
 
TokenId keyword_keys[] = {
#define TINY_TOKEN(x, y)
#define TINY_TOKEN_KEYWORD(name, keyword) name,
  TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};
 
const int num_keywords = sizeof (keyword_index) / sizeof (*keyword_index);
}

What we do here is declaring two parallel arrays. The first one, keyword_index, is just an array of std::string with one element per keyword. Since TINY_TOKEN_KEYWORDs are sorted by keyword inside TINY_TOKEN_LIST this array will be sorted too. The second array, keyword_keys, contains the token ids for the corresponding tokens in keyword_index. Then our function classify_keyword just looks up a string in keyword_index. If it finds it uses the position of the keyword to index keyword_keys.

86
87
88
89
90
91
92
93
94
95
96
97
98
TokenId
Lexer::classify_keyword (const std::string &str)
{
  const std::string *last = keyword_index + num_keywords;
  const std::string *idx = std::lower_bound (keyword_index, last, str);
 
  if (idx == last || str != *idx)
    return IDENTIFIER;
  else
    {
      return keyword_keys[idx - keyword_index];
    }
}

Tracking location

Tracking location can be implemented manually but GCC has good support in this area so it would be a pity not to use it.

GCC has a global variable called line_table responsible for tracking locations. We have to tell line_table when we enter a file and when we leave it (this is useful for include-like constructions since line_table will keep track of this). We do this in the constructor of Lexer.

Lexer::Lexer (const char *filename, FILE *input_)
  : input (input_), current_line (1), current_column (1), line_map (0),
    input_source (input), input_queue (input_source), token_source (this),
    token_queue (token_source)
{
  line_map = ::linemap_add (::line_table, ::LC_ENTER,
			    /* sysp */ 0, filename,
			    /* current_line */ 1);
}

Function linemap_add informs line_table that we are entering a file (LC_ENTER) and that we are in line 1. When forming a token, we have to get the location. We did this calling Lexer::get_current_location (see above). It simply requests a new location_t for the current column.

location_t
Lexer::get_current_location ()
{
  return ::linemap_position_for_column (::line_table, current_column);
}

When a newline is encountered, we have to tell line_table that a new line starts. We already did this in build_token.

123
124
125
126
127
128
129
130
	// **************
	// * Whitespace *
	// **************
	case '\n':
	  current_line++;
	  current_column = 1;
	  linemap_line_start (::line_table, current_line, max_column_hint);
	  continue;

Current layout

Our gcc-src/gcc/tiny now looks like this

gcc-src/gcc/tiny
├── config-lang.in
├── lang-specs.h
├── Make-lang.in
├── tiny1.cc
├── tiny-buffered-queue.h
├── tiny-lexer.cc
├── tiny-lexer.h
├── tinyspec.cc
├── tiny-token.cc
└── tiny-token.h

Trying our lexer

Ok, now we have a lexer, let’s try it. First let’s build the new files. Let’s update gcc-src/gcc/tiny/Make-lang.in.

tiny_OBJS = \
    tiny/tiny1.o \
    tiny/tiny-token.o \
    tiny/tiny-lexer.o \
    $(END)

Now let’s change tiny_parse_file in gcc-src/gcc/tiny/tiny1.cc.

static void
tiny_parse_file (const char *filename)
{
  FILE *file = fopen (filename, "r");
  if (file == NULL)
    {
      fatal_error (UNKNOWN_LOCATION, "cannot open filename %s: %m", filename);
    }
 
  // Here we would parse our file
  Tiny::Lexer lex (filename, file);
 
  Tiny::const_TokenPtr tok = lex.peek_token ();
  for (;;)
    {
      bool has_text = tok->get_id () == Tiny::IDENTIFIER
		      || tok->get_id () == Tiny::INTEGER_LITERAL
		      || tok->get_id () == Tiny::REAL_LITERAL
		      || tok->get_id () == Tiny::STRING_LITERAL;
 
      location_t loc = tok->get_locus ();
 
      fprintf (stderr, "<id=%s%s, %s, line=%d, col=%d>\n", tok->token_id_to_str (),
	       has_text ? (std::string(", text=") + tok->get_str ()).c_str () : "",
	       LOCATION_FILE (loc), LOCATION_LINE (loc), LOCATION_COLUMN (loc));
 
      if (tok->get_id() == Tiny::END_OF_FILE)
          break;
 
      lex.skip_token ();
      tok = lex.peek_token ();
    }
 
  fclose (file);
}

After the customary make and make install we can test with the example of sum.tiny.

$ ./gcc-install/bin/gcctiny -c sum.tiny
<id=VAR, sum.tiny, line=1, col=1>
<id=IDENTIFIER, text=i, sum.tiny, line=1, col=5>
<id=COLON, sum.tiny, line=1, col=7>
<id=INT, sum.tiny, line=1, col=9>
<id=SEMICOLON, sum.tiny, line=1, col=12>
<id=VAR, sum.tiny, line=2, col=1>
<id=IDENTIFIER, text=s, sum.tiny, line=2, col=5>
<id=COLON, sum.tiny, line=2, col=7>
<id=INT, sum.tiny, line=2, col=9>
<id=SEMICOLON, sum.tiny, line=2, col=12>
<id=IDENTIFIER, text=s, sum.tiny, line=3, col=1>
<id=ASSIG, sum.tiny, line=3, col=3>
<id=INTEGER_LITERAL, text=0, sum.tiny, line=3, col=6>
<id=SEMICOLON, sum.tiny, line=3, col=7>
<id=FOR, sum.tiny, line=4, col=1>
<id=IDENTIFIER, text=i, sum.tiny, line=4, col=5>
<id=ASSIG, sum.tiny, line=4, col=7>
<id=INTEGER_LITERAL, text=1, sum.tiny, line=4, col=10>
<id=TO, sum.tiny, line=4, col=12>
<id=INTEGER_LITERAL, text=10, sum.tiny, line=4, col=15>
<id=DO, sum.tiny, line=4, col=18>
<id=IDENTIFIER, text=s, sum.tiny, line=5, col=3>
<id=ASSIG, sum.tiny, line=5, col=5>
<id=IDENTIFIER, text=s, sum.tiny, line=5, col=8>
<id=PLUS, sum.tiny, line=5, col=10>
<id=IDENTIFIER, text=i, sum.tiny, line=5, col=12>
<id=SEMICOLON, sum.tiny, line=5, col=13>
<id=END, sum.tiny, line=6, col=1>
<id=WRITE, sum.tiny, line=7, col=1>
<id=IDENTIFIER, text=s, sum.tiny, line=7, col=7>
<id=SEMICOLON, sum.tiny, line=7, col=8>
<id=END_OF_FILE, sum.tiny, line=8, col=1>

Yay!

Wrap-up

Now that we are able to create the sequence of tokens of our input program we still have to verify that the sequence forms a syntactically valid program. But this will be in the next chapter. That’s all for today.

Share on FacebookShare on Google+Tweet about this on TwitterShare on LinkedIn

,

10 thoughts on “A tiny GCC front end – Part 3

  • Eugene Wissner says:

    I had again a weired problem with a gcc7 snapshot (20160918) 🙂
    I got:


    In file included from ../../gcc-7-20160918/gcc/tiny/tiny-lexer.h:6:0,
    from ../../gcc-7-20160918/gcc/tiny/tiny-lexer.cc:1:
    ../../gcc-7-20160918/gcc/input.h:122:33: Error: expected template-name before »<« token
    struct location_hash : int_hash { };
    ^
    ../../gcc-7-20160918/gcc/input.h:122:33: Error: expected »{« before »<« token
    ../../gcc-7-20160918/gcc/input.h:122:33: Error: expected unqualified-id before »<« token
    ../../gcc-7-20160918/gcc/input.h:143:11: Error: »gt_pointer_operator« was not declared
    gt_pointer_operator op,
    ^~~~~~~~~~~~~~~~~~~
    ../../gcc-7-20160918/gcc/input.h:146:3: Error: »hash_map« does not name a type; did you mean »hash_map_h«?
    hash_map *m_table;
    ^~~~~~~~
    hash_map_h

    Just if someone experiences the same behavior, I had to #include “coretypes.h” before input.h in tiny-lexer.h and tiny-token.h

  • asmwarrior says:

    Quote: “The function call operator for the input file will basically invoke fgetc. This function returns an unsigned char casted into a int because of the EOF marker used to mark end of file. ”
    My question is: do you mean the EOF marker is not a single byte? So, we need 2 or 4 bytes to store the EOF marker?

    • Roger Ferrer Ibáñez says:

      Hi,

      the problem here is that fgetc returns a byte. But fgetc not only returns a byte, it also needs to tell us if it has reached the end of the file. Unfortunately it is not possible to encode such information in a single byte, you’d need that byte (in case there is real information from the file) and one (extra) bit to encode whether you reached the end of the file.

      So the solution taken in fgetc is returning an int. This is clearly overkill as an int can represent so many more values than a single byte. We only need all bytes + 1 more value for EOF and instead we get a whole int (of 32-bit usually, but in some environments it could be smaller). I guess that this approach is justified in practical terms: a more economical (in terms of memory usage) interface would necessarily be more convoluted.

      This does not mean either that the EOF must be represented inside the file. In fact the EOF marker is usually a logical element in the file. The file ends, your function fgetc must return “something”. This something is represented using a special value called EOF, which should not be confused as any other byte of the file! Of course this is all very general, it could be that physically is represented in the file, but again this would pose a problem if the underlying model of your file is “a sequence of bytes”, you’d again need that extra bit to really know you’ve reached the end.

      Hope this makes it clear.

      Regards.

  • asmwarrior says:

    Hi, Roger Ferrer, for a lexer, why do you think a shared_ptr is a good idea better than some pointer based methods. I have a plan to implement a parser which need a preprocessor and a lexer, so I would see some suggestions from you, in-fact, I have a question in Stackoverflow several days ago, but no one response here, see here: [c++ – How to pass token kind with its associated information from lexer to preprocessor, then to parser – Stack Overflow](http://stackoverflow.com/questions/42041579/how-to-pass-token-kind-with-its-associated-information-from-lexer-to-preprocesso?noredirect=1#comment71259076_42041579), thanks.

    • Roger Ferrer Ibáñez says:

      Hi,

      this is a very good question and unfortunately there is no good answer for it. As usual it depends on your constraints. I will try to give a bit more of context to the examples you mention in your StackOverflow question.

      First we need to model the abstract process of the preprocessor/lexer and parser. A way to model this is thinking that the lexer and the parser are coroutines. This means that when the parser needs a token, it does a yield to the lexer, the lexer synthesizes the token and yields to the parser and then the parser consumes it, starting the process again. This all very nice but rarely programming languages support coroutines so we need something equivalent that can be implemented. We can imagine two ways to do this, the lexer sends tokens to the parser (called a push parser) or the parser ask them to the lexer (called a pull parser). Most parsers today are pull parsers, so we need a way to get a token from the parser. I understand your question is how to do this.

      UNIX lex, and its derivative Flex, uses a very simple mechanism through global variables. This was acceptable in 1975, when memory was scarce and computers were slow, and also when good practices in programming where not fully defined yet. Basically using a global variable means that the parser, in the case of Flex, calls yylex() and your token will be in yyval. This is not very handy if you plan to use this in a library where more than one parser may be used at the same time.

      The approach used by clang is that you pass a local token by reference and then the lexer updates it. This solution is efficient but puts the burden of managing the token as a resource to the user of the lexer. The user of the Lexer in clang is the Preprocessor class which handles the tokens. This means that if you plan any kind of lookahead, you will need to make sure the tokens are around in the parser (or the user of the lexer). For example, clang Preprocessor class needs to cache them (see http://clang.llvm.org/doxygen/Preprocessor_8h_source.html#l01167)

      The approach used by GCC cpp by returning a pointer is also efficient, but this time the burden of preserving the token as a resource is up to the lexer. This impacts the interface with the parser as now it has to make sure the parser is using tokens that are still valid. You may want to read https://gcc.gnu.org/onlinedocs/cppinternals/Lexer.html where it explains that earlier tokens are valid only if they are in the same line.

      Finally the approach used in GCC Tiny is a bit of the both worlds. In our case the lexer has a queue for the tokens that have already been lexed (because of lookaheads). And the resource itself ends being shared between the lexer and the parser using a C++ shared_ptr. This means that when the token is not used any longer by the lexer (it is not in the lexer queue) and the parser (the code of the parser that uses the token does not need it anymore) it will be freed. This is convenient but it may not be the most efficient way to do this, as shared_ptr have some performance impact, due to their internal reference counter, compared to raw pointers.

      If you were concerned about performance (e.g. for production), either a solution like that of GCC or clang would be the best. Every solution has a downside and it is up to you to decide on which one suits the best for you or the user.

      Hope this helps.

      Kind regards,
      Roger

  • asmwarrior says:

    Hi, Roger, thanks for the time for helping me. In-fact, I’m one of the maintainer of the free and opensource CodeBlocks IDE http://www.codeblocks.org/ and we have a c/c++ parser to catch all the tags need for the IDE(such as functions, classes and others), and the Parser can give some code completion to some degree. But now, with the complex of the C++ grammar(especially the template grammar), our build-in parser(which you can think like the ctags parser) always failed handling this. One way is to use a third part library such as clang, there is already one plugin which support code completion by the clang. The other way is to improve the build-in parser. Since I maintained the build-in parser for so many years, I still want to improve its capability. Ahha.

    I have looked at the way used in Clang, in it’s Processor level, it have a member variable called “CachedTokens”, see here: https://code.woboq.org/llvm/clang/include/clang/Lex/Preprocessor.h.html#clang::Preprocessor::CachedTokens. I just see that this variable is a vector, and will grow bigger and bigger if the parsing parsing a file. By reading the source code of Clang, I see the from the lexer to preprocessor, the preprocessor just maitaining a cached vector to store all the tokens. From the processor to Parser level, it just return a token reference to the cached token, see this function. https://code.woboq.org/llvm/clang/include/clang/Lex/Preprocessor.h.html#1171

    const Token &LookAhead(unsigned N) {
    if (CachedLexPos + N preprocessor->parser. Currently, only a string value is passed from each layers, I mean lexer just return a string(wxString, since CodeBlocks is based on wxWidgets library), and preprocessor also return a string when called by the Parser class. So, the bad thing is in the Parser, I have still need to do a lot of string compare, such as

    wxstring peek_string = proprocessor.peek_token()
    if (peek_string== “class”)
    handle_class();

    So, I plain to change the string to some kinds of Token class, so that I can directly compare the TokenType in the Parser level.

    BTW: I see your project mentioned here How (not) to write a C++ front end – Part 1 | Think In Geek – http://thinkingeek.com/2016/10/08/not-write-c-parser-part-1/ use some kinds of bison generated GLR parser to parser the source file, I’m not sure if it could used for the IDE to show the file structure or codecompletion. For now, I only see Clang have such feature, but clang’s speed maybe another concern. Our build-in parser in CodeBlocks is recursive decent and handwritten, but it is hard to maintain and add new features due to the so much complex C++ grammar.

    Thanks.

    • Roger Ferrer Ibáñez says:

      Hi,

      I’m not expert in clang but reading a bit the code suggests that CachedToken is used when the lexer is in cached mode. Apparently the lexer can be in several modes (look Preprocessor::Lex(Token&)). At some point, as suggested by the code PPCaching.cpp, the cache is emptied once all the tokens have been consumed. So I don’t think it grows indefinitely.

      It makes sense to me use something a bit more abstract than just a string, so passing a Token looks good to me. You have to decide who is going to own the token: the lexer or the client or if it will behave like a value class.

      Regarding the usage of Mercurium, it was never designed with your scenario in mind: its resource management is poor and the parsing process itself is too slow for an editor/IDE.

      If I were you I’d abandon the idea of providing a custom C++ parser and I’d use the clang one instead but I can understand that doing so adds an extra dependency with LLVM.

      Kind regards,
      Roger

      • asmwarrior says:

        Thanks for the quick response. Indeed by reading the code in your mentioned cpp file, especially the function:

        void Preprocessor::CachingLex(Token &Result) {
        if (!InCachingLexMode())
        return;
        if (CachedLexPos < CachedTokens.size()) {
        Result = CachedTokens[CachedLexPos++];
        return;
        }
        ExitCachingLexMode();
        Lex(Result);
        if (isBacktrackEnabled()) {
        // Cache the lexed token.
        EnterCachingLexMode();
        CachedTokens.push_back(Result);
        ++CachedLexPos;
        return;
        }
        if (CachedLexPos < CachedTokens.size()) {
        EnterCachingLexMode();
        } else {
        // All cached tokens were consumed.
        CachedTokens.clear();
        CachedLexPos = 0;
        }
        }

        I see the if the CachedLexPos reaches the end of the CachedTokens, then the whole tokens were cleared. Oh, this is the exact method used in your tiny GCC front end.
        Also, I notice that the preprocessor has some modes, in one mode, the lexer just read a token from the source file, in other mode, it read a token from the cached tokens(the cached tokens maybe come from expanding a macro usage or some peek token function already called before).

        QUOTE:If I were you I’d abandon the idea of providing a custom C++ parser and I’d use the clang one instead but I can understand that doing so adds an extra dependency with LLVM.

        Indeed, some IDE in the world such as Kdevelop or QTcreator are planed to abandon their own build-in parser and move to use libclang, but there are still a lot of IDE or software using their own parser. For example, the "Cppcheck" has its own parser.

Leave a Reply

Your email address will not be published. Required fields are marked *