A tiny GCC front end – Part 3
Now that the minimal infrastructure is in place, we can start with the implementation of our tiny front end. Today we will talk about the lexer.
Typical compiler architecture
Since compilation is a lowering process, it makes sense to split the whole process into several steps, each of which usually performs a single task. This is done because compilers quickly become untameable beasts. By splitting the compiler into several steps, it is easier to assign each step a specific and narrow task. This makes it easier to reason about the code, which minimizes the chance of introducing bugs (and helps fix them) and eases extending the compiler (this is commonplace good software engineering practice, but it obviously applies to compilers as well).
Almost every compilation infrastructure like GCC, LLVM, Open64, ... is split into two big parts: the front end (what we are implementing) and the back end. The front end is responsible for recognizing the source language and diagnosing syntactic and semantic errors. The back end is responsible for generating code for the specific target (i.e. the environment and architecture). Nowadays compilers are in general optimizing compilers (a misnomer, since they should rather be called enhancing compilers). This means that they try to generate code that makes the most of the properties of the program (machine-independent optimizations) and of the target where the program is going to run (machine-dependent optimizations). While machine-independent optimizations could be done in the front end, that would make them language-specific. This is undesirable, so most compilation infrastructures have an extra part between the front end and the back end that is sometimes called the middle end.
Since the process flows sequentially from the front end to the middle end and from there to the back end, there must be a way for each pair of consecutive parts to communicate. Compilers use what are commonly called intermediate representations. An intermediate representation is a more or less abstract representation of the program. At the early stages of the compilation process, the representation is very abstract and high level, pretty close to the programming language. As the compilation process goes on, the intermediate representation becomes lower and lower level, until it is almost assembler or machine code. It is not unusual for compilers to have several intermediate representations. Sometimes it is the same representation simply restricted to a lower-level subset; sometimes it is an entirely different representation. It may even happen that some optimizations require additional intermediate representations that are built only for them.
In GCC, a front end can use whatever representation it wants, but at the end, before handing the code over to the rest of the compilation process, it has to be lowered into something called GENERIC. GENERIC is a tree-like representation used in the front end and in some early steps of the middle end of GCC. As a front end, our only concern is to deliver sensible GENERIC to the rest of the compiler.
But before we can create any GENERIC we first have to deal with the source code.
Open the file
Before we do anything else, let's modify our parse hook in gcc-src/gcc/tiny/tiny1.cc so that it opens the input file, which we will later analyze. This still does nothing useful, but at least it defines a placeholder where the analysis will start. Note the usage of the globals num_in_fnames and in_fnames in tiny_langhook_parse_file, which contain the number of files we are going to compile and their names. Although tiny_parse_files does not mind how many files it receives, our examples will have just a single file. In function tiny_parse_file we now just open the input file; the analysis will happen there.
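A minimal sketch of what this placeholder can look like, assuming the usual GCC includes already present in tiny1.cc; the exact error handling (fatal_error with UNKNOWN_LOCATION) is an assumption of the sketch:

static void tiny_parse_file (const char *filename);

static void
tiny_parse_files (int num_files, const char **files)
{
  for (int i = 0; i < num_files; i++)
    {
      tiny_parse_file (files[i]);
    }
}

static void
tiny_parse_file (const char *filename)
{
  FILE *file = fopen (filename, "r");
  if (file == NULL)
    {
      fatal_error (UNKNOWN_LOCATION, "cannot open filename %s: %m", filename);
    }

  // The real analysis will start here; for now just close the file again.
  fclose (file);
}

static void
tiny_langhook_parse_file (void)
{
  tiny_parse_files (num_in_fnames, in_fnames);
}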
Typical front end design
Nowadays front ends are commonly designed with three components: the lexical analyzer, the syntactic analyzer and the semantic analyzer. This is consistent with the strategy of splitting the compiler into smaller parts so that they stay manageable.
The lexical analyzer takes the input and splits it into the constituent parts of the language (i.e. the lexical elements), called tokens. The lexical analyzer is commonly called the lexer, the scanner or the tokenizer. Conceptually, it constructs a sequence of tokens of the program and nothing else.
This sequence of tokens is consumed by the syntactic analyzer. The syntactic analyzer verifies that the order of the tokens is such that it represents a (syntactically) valid program. The semantic analyzer still has to verify that a (syntactically) valid sequence of tokens adheres to the rules of the language. The term parser is commonly used to encompass the three processes of lexical, syntactic and semantic analysis.
Lexical analysis
Back to tiny: the tokens of the language more or less correspond to the elements shown in black bold face in the rules of part 1, that is, everything that appears in the right-hand side of a rule and is not like 〈this〉. The following elements of the program will be analyzed by the lexer as tokens.
var
write
;
+
Some other tokens will not directly correspond to these elements, but to a sequence of them. Identifiers, integer literals and string literals can be understood as tokens on their own, because there is little value in keeping them split into their constituents.
123 123.456 "hello" foo
For instance, the first three lines of the following tiny program (assume it is in a file sum.tiny)
var i : int;
var s : int;
s := 0;
for i := 1 to 10 do
s := s + i;
end
write s;
will be tokenized like this.
VAR
IDENTIFIER
COLON
INT
SEMICOLON
VAR
IDENTIFIER
COLON
INT
SEMICOLON
IDENTIFIER
ASSIG
INTEGER_LITERAL
SEMICOLON
For each token we will want a bit more information. In particular, IDENTIFIER and INTEGER_LITERAL tokens, among others, should have at least their associated text. For diagnostic purposes, we will also want to keep track of the location of each token inside the input. So we will also associate a file, line and column (called a location or locus) with each token. The above sequence of tokens would actually be more like
id=VAR,             file=sum.tiny, line=1, col=1
id=IDENTIFIER,      file=sum.tiny, line=1, col=5, text=i
id=COLON,           file=sum.tiny, line=1, col=7
id=INT,             file=sum.tiny, line=1, col=9
id=SEMICOLON,       file=sum.tiny, line=1, col=12

id=VAR,             file=sum.tiny, line=2, col=1
id=IDENTIFIER,      file=sum.tiny, line=2, col=5, text=s
id=COLON,           file=sum.tiny, line=2, col=7
id=INT,             file=sum.tiny, line=2, col=9
id=SEMICOLON,       file=sum.tiny, line=2, col=12

id=IDENTIFIER,      file=sum.tiny, line=3, col=1, text=s
id=ASSIG,           file=sum.tiny, line=3, col=3
id=INTEGER_LITERAL, file=sum.tiny, line=3, col=6, text=0
id=SEMICOLON,       file=sum.tiny, line=3, col=7
Tokens
What is a token? Conceptually it is a tuple of everything a token can have: the kind of the token, its location and an associated text (if any). Field token_id will store the kind of the token, locus will keep the location (more on this below) and str will keep the associated text of the token, if any. Field token_id has type TokenId, which is nothing but an enum of the kinds of tokens we have.
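A sketch of the class, just to fix ideas (the real one in gcc-src/gcc/tiny/tiny-token.h also declares the factory functions described below; the accessor names here are just the obvious choices and may differ from the actual code):

struct Token
{
private:
  TokenId token_id; // what kind of token this is
  location_t locus; // where in the input it starts
  std::string str;  // associated text, if any (empty otherwise)

  // Constructors are private: tokens are built through the factory
  // functions described below.
  Token (TokenId token_id_, location_t locus_)
    : token_id (token_id_), locus (locus_), str ()
  {
  }

  Token (TokenId token_id_, location_t locus_, const std::string &str_)
    : token_id (token_id_), locus (locus_), str (str_)
  {
  }

public:
  TokenId
  get_id () const
  {
    return token_id;
  }

  location_t
  get_locus () const
  {
    return locus;
  }

  const std::string &
  get_str () const
  {
    return str;
  }

  // ... factory functions, shown below ...
};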
The enum would contain all the token kinds as enumerators. We could write them manually, but this quickly becomes tedious. Instead we will use X-Macros: this way we describe our tokens in a single place and the data structures are updated automatically. We will use this technique several times in our front end to ease maintenance. Of course other code-generating approaches (such as a small DSL written in GNU M4) could be used instead, but this one is enough for most of our needs.
What we do is define a macro TINY_TOKEN_LIST that uses two macros, left undefined, inside it. Right before using the list we define these two macros, in this case TINY_TOKEN and TINY_TOKEN_KEYWORD, to whatever we need. For this specific case, this will fill our enum TokenId with the first parameter of each macro, which we chose to use as the enumerator name. Note that we distinguish plain tokens from keyword tokens by using TINY_TOKEN_KEYWORD instead of TINY_TOKEN. Later on we will see why.
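A shortened sketch of the list and of the enum declaration that uses it (the real list in tiny-token.h has many more entries; the descriptions in the second parameter are illustrative):

// TINY_TOKEN(name, description) and TINY_TOKEN_KEYWORD(name, keyword)
// are deliberately left undefined: each user of the list defines them
// to whatever it needs and then undefines them again.
#define TINY_TOKEN_LIST                                                        \
  TINY_TOKEN (END_OF_FILE, "end of file")                                      \
  TINY_TOKEN (ASSIG, ":=")                                                     \
  TINY_TOKEN (COLON, ":")                                                      \
  TINY_TOKEN (SEMICOLON, ";")                                                  \
  TINY_TOKEN (PLUS, "+")                                                       \
  TINY_TOKEN (IDENTIFIER, "identifier")                                        \
  TINY_TOKEN (INTEGER_LITERAL, "integer literal")                              \
  /* ... more plain tokens ... */                                              \
  TINY_TOKEN_KEYWORD (FOR, "for")                                              \
  TINY_TOKEN_KEYWORD (INT, "int")                                              \
  TINY_TOKEN_KEYWORD (VAR, "var")                                              \
  TINY_TOKEN_KEYWORD (WHILE, "while")                                          \
  TINY_TOKEN_KEYWORD (WRITE, "write")                                          \
  /* ... more keywords, sorted by their keyword text ... */

enum TokenId
{
#define TINY_TOKEN(name, descr) name,
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
  TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};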
The expansion of the above X-Macro will generate something like the following. This is what we wanted but we do not have to write it manually.
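For the shortened list sketched above, the generated enum would be roughly:

enum TokenId
{
  END_OF_FILE,
  ASSIG,
  COLON,
  SEMICOLON,
  PLUS,
  IDENTIFIER,
  INTEGER_LITERAL,
  // ...
  FOR,
  INT,
  VAR,
  WHILE,
  WRITE,
  // ...
};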
We will also use this technique to create a function that returns a descriptive text for a given token kind.
And we will also create a debugging function that, given a token id, returns its token kind as a string (i.e. given the token kind ASSIG this function will return the string "ASSIG").
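Both can be written with the same X-Macro trick. A sketch, using the names get_token_description and token_id_to_str (assumed here); gcc_unreachable is GCC's "this cannot happen" assertion:

const char *
get_token_description (TokenId tid)
{
  switch (tid)
    {
#define TINY_TOKEN(name, descr)                                                \
  case name:                                                                   \
    return descr;
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
      TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
    default:
      gcc_unreachable ();
    }
}

const char *
token_id_to_str (TokenId tid)
{
  switch (tid)
    {
#define TINY_TOKEN(name, descr)                                                \
  case name:                                                                   \
    return #name;
#define TINY_TOKEN_KEYWORD(x, y) TINY_TOKEN (x, y)
      TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
    default:
      gcc_unreachable ();
    }
}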
Since we do not want to create tokens directly using the constructors, we will create them using factory functions. Most tokens will simply be created with a token id and a location, but some of them will have an associated text. The factories are make, make_identifier, make_integer, make_real and make_string.
These factories return smart pointers to Token because the lifetime of a token is not obvious and we want them to be cleaned up when they are not used anymore. Type const_TokenPtr will be used later in the lexer. We have to use C++03 TR1 because GCC is written in C++03, not C++11 (in C++11 we would use the standard memory header and the std::shared_ptr template instead).
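A sketch of the smart pointer typedefs and of a couple of the factories (the remaining ones follow the same pattern):

// In tiny-token.h. GCC is built as C++03, hence TR1 rather than the
// C++11 <memory> header.
#include <tr1/memory>

typedef std::tr1::shared_ptr<Token> TokenPtr;
typedef std::tr1::shared_ptr<const Token> const_TokenPtr;

struct Token
{
  // ... fields, private constructors and accessors as sketched above ...
public:
  static TokenPtr
  make (TokenId token_id, location_t locus)
  {
    return TokenPtr (new Token (token_id, locus));
  }

  static TokenPtr
  make_identifier (location_t locus, const std::string &str)
  {
    return TokenPtr (new Token (IDENTIFIER, locus, str));
  }

  static TokenPtr
  make_integer (location_t locus, const std::string &str)
  {
    return TokenPtr (new Token (INTEGER_LITERAL, locus, str));
  }

  // make_real and make_string are analogous, using the real and string
  // literal token ids.
};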
The complete implementation of the class Token is in the files gcc-src/gcc/tiny/tiny-token.h and gcc-src/gcc/tiny/tiny-token.cc.
Lexer operation
Conceptually a lexer returns the sequence of tokens of our input. Building such a sequence as a whole would force us to temporarily store it in memory. This would work, but it is not particularly efficient: input files may easily have thousands of tokens and we are going to analyze them more or less sequentially. A lexer that returns a stream of tokens will have lower memory requirements.
This means that the lexer will always return a single token. It will keep returning the current one unless we tell the lexer to advance to the next token (going backwards is out of the question). This suggests that our lexer must support a get operation that returns the current token and a skip operation that advances the current token to the next one (skip does not return anything). Sometimes the two are even mixed in a single get operation that returns the current token and skips it at the same time.
This stream-like approach saves memory, but now we have made our life a bit harder. Sometimes (this will become more evident during syntactic analysis) we may need to peek a few tokens ahead. With the get/skip interface above it would be the responsibility of the user of the lexer to keep track of tokens peeked ahead. While this is doable, it may be a bit unwieldy. So our lexer should support peeking n tokens ahead. If n is zero this is the same as the current token, so we can make our lexer have two operations: peek(n) and skip. Note that there is no real need for skip(n), since it can easily be achieved by calling skip n times (although it may be handy to have it).
For theoretical reasons beyond the scope of this post, our front end should not do unbounded peeks. This means that, ideally, all peek(n) operations should receive an n value known at compile time. For the sake of simplicity, though, our implementation will allow unbounded peeks.
The two operations that the lexer can do (peek and skip) remind us of a queue. When we peek a token, if it is not in the queue it will be added to the back of the queue; otherwise the token already in the queue will be used. Skip simply removes an item from the front of the queue. Recall that peek(0) is the current token and skip advances the token stream. This means that the token returned by peek(n+1) before a skip will be returned by peek(n) after the skip.
Lexer interface
Our lexer interface will look like this.
Basically we will pass the lexer the input file (and the filename, for location tracking purposes), and then we have the two operations described above, now named peek_token and skip_token.
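A sketch of the class, limited to the parts discussed in this post (details may differ from the actual code in the repository):

struct Lexer
{
public:
  Lexer (const char *filename, FILE *input);
  ~Lexer ();

  // peek_token (0) is the current token; peek_token (n) looks n tokens
  // ahead without consuming anything.
  const_TokenPtr peek_token (int n);
  const_TokenPtr peek_token ();

  // Advance the stream of tokens by one.
  void skip_token ();

private:
  location_t get_current_location ();

  // The same pair of operations, but on the stream of input characters.
  int peek_input (int n);
  int peek_input ();
  void skip_input ();

  // Form the next token from the input characters.
  TokenPtr build_token ();

  // ... data members (the two buffered queues, the current line and
  // column), shown below ...
};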
Where do we get the tokens from? Well, before we can form a token we have to somehow read the input file. We can view the input file in a similar way to the stream of tokens, but this time it is a stream of characters that we will group into tokens. This suggests that the idea of peek(n) and skip can also be applied to the FILE*. So it seems a good idea to abstract this away in a template class buffered_queue.
The template parameter T will be the type of the items stored in our queue. For the input file it will be char (although we will use int for a reason explained below); for the tokens themselves it will be TokenPtr. Source is a class type that implements the function call operator. This operator must return a T value and will be invoked during peek(n) when n refers to an element that is not yet in the queue.
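A sketch of its skeleton (the full implementation is in gcc-src/gcc/tiny/tiny-buffered-queue.h and differs in details):

#include <vector>

template <typename T, typename Source>
struct buffered_queue
{
public:
  buffered_queue (Source &src) : source (src), start (0), end (0), buffer ()
  {
  }

  // Return the item n positions ahead of the current one, reading new
  // items from the Source if they are not in the queue yet.
  T peek (int n);

  // Drop the current item; what peek(1) returned becomes the current one.
  void skip ();

private:
  Source &source;        // where new items come from
  size_t start, end;     // the queue lives in buffer[start, end)
  std::vector<T> buffer; // underlying storage
};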
The function call operator for the input file will basically invoke fgetc. This function returns an unsigned char cast to an int, because of the EOF marker used to signal the end of the file. This is the reason why our buffered_queue for the input file will store int rather than char. For the stream of tokens, the function call operator will just build the next token from the input; we will see later how we build the token.
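For the input file the functor can be as simple as this (a sketch; InputSource is the name used later in this post):

struct InputSource
{
  FILE *input;

  InputSource (FILE *input_) : input (input_)
  {
  }

  // Called by buffered_queue::peek when it needs a new element.
  // We return int (not char) so that EOF can be represented.
  int
  operator() ()
  {
    return fgetc (input);
  }
};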
Data members input_queue and token_queue implement, respectively, the buffered queues of the input file and of the stream of tokens.
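Sketched inside the Lexer class, together with the token-building functor SourceToken mentioned below (making SourceToken a nested class is an assumption of the sketch; it lets it call the private Lexer::build_token):

// Inside class Lexer.
private:
  struct SourceToken
  {
    Lexer *lexer;

    SourceToken (Lexer *lexer_) : lexer (lexer_)
    {
    }

    // Called by buffered_queue::peek for the token queue.
    TokenPtr
    operator() ()
    {
      return lexer->build_token ();
    }
  };

  InputSource input_source;
  buffered_queue<int, InputSource> input_queue;

  SourceToken token_source;
  buffered_queue<TokenPtr, SourceToken> token_queue;

  int current_line;
  int current_column;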
The implementation of buffered_queue is a bit too long to paste here; it is in gcc-src/gcc/tiny/tiny-buffered-queue.h. It is implemented using a std::vector and two position markers, start and end. When start == end the queue is empty. Member function skip just advances start, calling peek if we are actually skipping more than what was peeked so far; if start becomes the same as end, it moves both to the beginning of the vector, so the vector does not have to grow indefinitely. Member function peek checks whether the requested n is already in the queue; if it is, it just returns it. If it is not, it will call Source::operator(), but before that it checks whether there is enough room in the vector. If there is not, a larger vector is allocated, the data is copied and the new vector is used as the buffer from then on. There is room for improvement in this class: for instance, we could try to compact the vector before reallocating, because the free space may already be there and we may just be too near the end of the vector. But I think it is good enough for us now.
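A compressed sketch of that logic (the real code is more careful about growth and compaction):

template <typename T, typename Source>
T
buffered_queue<T, Source>::peek (int n)
{
  gcc_assert (n >= 0);
  // Read new items from the source until buffer[start + n] exists.
  while (start + (size_t) n >= end)
    {
      if (end >= buffer.size ())
        // Not enough room: allocate a larger buffer (this copies the data).
        buffer.resize (2 * buffer.size () + 1);
      buffer[end] = source ();
      end++;
    }
  return buffer[start + n];
}

template <typename T, typename Source>
void
buffered_queue<T, Source>::skip ()
{
  // Make sure the item we are about to drop has actually been read.
  peek (0);
  start++;
  // If the queue became empty, reuse the buffer from its beginning.
  if (start == end)
    start = end = 0;
}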
Now we can see the implementation of peek_token and skip_token.
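They are just thin wrappers over the token queue; a sketch:

const_TokenPtr
Lexer::peek_token (int n)
{
  return token_queue.peek (n);
}

const_TokenPtr
Lexer::peek_token ()
{
  return peek_token (0);
}

void
Lexer::skip_token ()
{
  token_queue.skip ();
}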
This way the token will be formed during the call to token_queue.peek: it will invoke SourceToken::operator(), which ends up calling Lexer::build_token. So now it is time to see how a token is formed.
Before we discuss the main loop, note that the body of the loop calls peek_input and skip_input. These two functions use the input_queue.
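Again thin wrappers, this time over the input queue (a sketch):

int
Lexer::peek_input (int n)
{
  return input_queue.peek (n);
}

int
Lexer::peek_input ()
{
  return peek_input (0);
}

void
Lexer::skip_input ()
{
  input_queue.skip ();
}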
Similar to what happened with token_queue, input_queue will invoke InputSource::operator(), which simply calls fgetc, effectively returning us the current character of the file (until we skip it, of course).
So, why is there a loop in Lexer::build_token? Because we may have to advance several characters of the input before we can form a token. When a token is formed, we simply return from the function. While we cannot form one (for instance when we encounter whitespace or newlines) we just keep requesting characters. Of course there must be a way of finishing the loop: when we find the end of the file we just build the END_OF_FILE token and stop processing the input.
TokenPtr
Lexer::build_token ()
{
  for (;;)
    {
      location_t loc = get_current_location ();
      int current_char = peek_input ();
      skip_input ();

      if (current_char == EOF)
        {
          return Token::make (END_OF_FILE, loc);
        }

      // ...
    }
If the character read from the input is not the end of the file, we can start tokenizing it. We have to ignore whitespace without forming a token, but we still have to update the location information.
switch (current_char)
  {
  // **************
  // * Whitespace *
  // **************
  case '\n':
    current_line++;
    current_column = 1;
    linemap_line_start (::line_table, current_line, max_column_hint);
    continue;
  case ' ':
    current_column++;
    continue;
  case '\t':
    // Width of a tab is not well defined, let's assume 8 for now
    current_column += 8;
    continue;
As you can see, we have two data members, current_line and current_column, that we have to update for proper location tracking. Let's ignore the call to linemap_line_start for now; more on this below.
Now we can start matching punctuation. Some tokens are straightforward.
case '=':
  current_column++;
  return Token::make (EQUAL, loc);
case '(':
  current_column++;
  return Token::make (LEFT_PAREN, loc);
case '-':
  current_column++;
  return Token::make (MINUS, loc);
case '+':
  current_column++;
  return Token::make (PLUS, loc);
case ')':
  current_column++;
  return Token::make (RIGHT_PAREN, loc);
case ';':
  current_column++;
  return Token::make (SEMICOLON, loc);
Some others may require a bit of peeking, but that's all.
case '<':
  if (peek_input () == '=')
    {
      skip_input ();
      current_column += 2;
      return Token::make (LOWER_OR_EQUAL, loc);
    }
  else
    {
      current_column++;
      return Token::make (LOWER, loc);
    }
  break;
If you wonder how comments are implemented: the lexer just skips over them.
case '#': /* comment */
  current_column++;
  current_char = peek_input ();
  // Skip everything until the end of the line. Checking for EOF as well
  // avoids looping forever on a comment in the very last line of a file
  // that lacks a final newline.
  while (current_char != '\n' && current_char != EOF)
    {
      skip_input ();
      current_column++; // won't be used
      current_char = peek_input ();
    }
  continue;
  break;
If we reach the end of the loop and the character has not been handled, it means that the character is invalid. Function error_at is a diagnostic utility from GCC that we will see again during syntactic analysis, so let's ignore it for now.
// Martians
error_at (loc, "unexpected character '%x'", current_char);
current_column++;
And so on. See the full listing here. It may seem tedious but it only has to be written once and after that it is relatively easy to extend.
Identifiers and keywords
Identifiers and keywords are interesting because they share lexical form. As we specified in part 1, if an identifier could be a keyword, it is tokenized as a keyword. This means that whil is an IDENTIFIER but while is the token WHILE. So we just gather the text of the token and check whether it is a keyword. If it is, we form the corresponding keyword token; otherwise we form an identifier token.
// ***************************
// * Identifiers or keywords *
// ***************************
if (ISALPHA (current_char) || current_char == '_')
  {
    std::string str;
    str.reserve (16); // some sensible default
    str += current_char;

    int length = 1;
    current_char = peek_input ();
    while (ISALPHA (current_char) || ISDIGIT (current_char)
           || current_char == '_')
      {
        length++;
        str += current_char;
        skip_input ();
        current_char = peek_input ();
      }

    current_column += length;

    TokenId keyword = classify_keyword (str);
    if (keyword == IDENTIFIER)
      {
        return Token::make_identifier (loc, str);
      }
    else
      {
        return Token::make (keyword, loc);
      }
  }
Macros ISALPHA and ISDIGIT are provided by the GCC header safe-ctype.h and check whether a character is a letter or a decimal digit, respectively. The function classify_keyword is implemented by doing a binary search in a sorted array of keywords. This sorted array is defined using X-Macros: here we only use TINY_TOKEN_KEYWORD and ignore the remaining tokens.
namespace
{
const std::string keyword_index[] = {
#define TINY_TOKEN(x, y)
#define TINY_TOKEN_KEYWORD(name, keyword) keyword,
TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};
TokenId keyword_keys[] = {
#define TINY_TOKEN(x, y)
#define TINY_TOKEN_KEYWORD(name, keyword) name,
TINY_TOKEN_LIST
#undef TINY_TOKEN_KEYWORD
#undef TINY_TOKEN
};
const int num_keywords = sizeof (keyword_index) / sizeof (*keyword_index);
}
What we do here is declare two parallel arrays. The first one, keyword_index, is just an array of std::string with one element per keyword. Since the TINY_TOKEN_KEYWORD entries are sorted by keyword inside TINY_TOKEN_LIST, this array is sorted too. The second array, keyword_keys, contains the token ids of the corresponding tokens in keyword_index. Our function classify_keyword then just looks up a string in keyword_index; if it finds it, it uses the position of the keyword to index keyword_keys.
TokenId
Lexer::classify_keyword (const std::string &str)
{
  const std::string *last = keyword_index + num_keywords;
  const std::string *idx = std::lower_bound (keyword_index, last, str);

  if (idx == last || str != *idx)
    return IDENTIFIER;
  else
    {
      return keyword_keys[idx - keyword_index];
    }
}
Tracking location
Tracking location can be implemented manually but GCC has good support in this area so it would be a pity not to use it.
GCC has a global variable called line_table that is responsible for tracking locations. We have to tell line_table when we enter a file and when we leave it (this is useful for include-like constructions, since line_table keeps track of them). We do this in the constructor of Lexer.
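A sketch of the constructor (and of a matching destructor that leaves the file), consistent with the data members sketched earlier; the exact LC_LEAVE call in the destructor is an assumption:

Lexer::Lexer (const char *filename, FILE *input)
  : input_source (input), input_queue (input_source),
    token_source (this), token_queue (token_source),
    current_line (1), current_column (1)
{
  // Tell the global line table that we are entering a new file (LC_ENTER)
  // and that we start at line 1 of it.
  linemap_add (::line_table, LC_ENTER, /* sysp */ 0, filename, /* to_line */ 1);
}

Lexer::~Lexer ()
{
  // Tell the line table that we are done with this file.
  linemap_add (::line_table, LC_LEAVE, 0, NULL, 0);
}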
Function linemap_add informs line_table that we are entering a file (LC_ENTER) and that we are at line 1. When forming a token we have to get its location; we did this by calling Lexer::get_current_location (see above), which simply requests a new location_t for the current column.
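A sketch of it, using libcpp's linemap_position_for_column:

location_t
Lexer::get_current_location ()
{
  return ::linemap_position_for_column (::line_table, current_column);
}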
When a newline is encountered, we have to tell line_table that a new line starts. We already did this in build_token.
// **************
// * Whitespace *
// **************
case '\n':
  current_line++;
  current_column = 1;
  linemap_line_start (::line_table, current_line, max_column_hint);
  continue;
Current layout
Our gcc-src/gcc/tiny directory now looks like this:
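Based on the files mentioned so far in this series, the directory contains at least the following (the names of the lexer files are an assumption; the original post shows the exact listing):

config-lang.in
Make-lang.in
tiny1.cc
tiny-buffered-queue.h
tiny-lexer.cc      (assumed name)
tiny-lexer.h       (assumed name)
tiny-token.cc
tiny-token.h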
Trying our lexer
Ok, now we have a lexer, let's try it. First let's build the new files by updating gcc-src/gcc/tiny/Make-lang.in.
Now let's change tiny_parse_file in gcc-src/gcc/tiny/tiny1.cc.
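One simple way to try the lexer is to make tiny_parse_file dump every token it reads. A sketch, reusing the accessors and the token_id_to_str helper sketched above (the exact output format and accessor names are assumptions):

static void
tiny_parse_file (const char *filename)
{
  FILE *file = fopen (filename, "r");
  if (file == NULL)
    {
      fatal_error (UNKNOWN_LOCATION, "cannot open filename %s: %m", filename);
    }

  Lexer lexer (filename, file);

  // Dump every token until the end of the file, one per line.
  const_TokenPtr tok = lexer.peek_token ();
  while (tok->get_id () != END_OF_FILE)
    {
      fprintf (stderr, "%s", token_id_to_str (tok->get_id ()));
      if (!tok->get_str ().empty ())
        fprintf (stderr, " '%s'", tok->get_str ().c_str ());
      fprintf (stderr, "\n");

      lexer.skip_token ();
      tok = lexer.peek_token ();
    }

  fclose (file);
}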
After the customary make and make install we can test it with the example sum.tiny.
Yay!
Wrap-up
Now that we are able to create the sequence of tokens of our input program, we still have to verify that the sequence forms a syntactically valid program. But this will be in the next chapter. That's all for today.