A tiny GCC front end – Part 8
Now that we have the basic language set implemented we can consider adding new features to it. Today we will add arrays.
Array type and array values
An important element of programming languages is their type system. Type systems are crucial in the semantics of programming languages and are an actively researched topic nowadays. tiny, so far, has a very simple type system: there are only four types (int, float, boolean and string). We can express lots of things already with those types but it may fall short in some contexts.
A type system is a set of types along with the rules that govern them. An element of the type system, i.e. a type, will be denoted by τ. As we said, tiny has four types.
τ → int
| float
| bool
| string
A type is a set of values: int
values are the 32 bit signed integers, float
values are the reals encoded by IEEE 754 Binary32, bool has only two values true or false and values of string type are (possibly empty) finite sequences of characters.
Now we want to add an array type. An array type has a size and an element type. The size is an integer expression of the language, that we will denote as ε that evaluates to a positive (nonzero) integer.
τ → array ε τ
This means that our typesystem has a type array constructed using an integer expression ε (the size) and a type τ (the element type).
After this addition, our typesystem looks like this.
τ → int
| float
| bool
| string
| array ε τ
What are, thus, the values of a type array ε τ? A value of array type is a set of values of type τ called the elements of the array. There is an integer associated to each element, called the index. The set of indexes of the elements is such that they form an ascending sequence, where each index is the previous one plus one. The first index is called the lower bound (say it L) and the last one is the upper bound (say it U). This way it holds that U - L + 1 = ε.
I know that at this point this seems unnecessarily theoretic but let's make a simple example. Consider array 3 float. A possible array value could be the following one, where L = 0 and U = 2.
〈0 → 1.2, 1 → 2.3, 2 → 2.3〉
for another example where L = 4 and U = 6
〈4 → 1.2, 5 → 2.3, 6 → 2.3〉
The indexes form a growing sequence wherer each index equals the previous one plus one. The following would not be the value of an array.
〈12 → 1.2, 25 → 2.3, 42 → 2.3〉
Syntax
We will extend the rule of types of tiny to let us define a variable of array type.
〈type〉 → int
| float
| 〈type〉[
〈expression〉]
| 〈type〉(
〈expression〉:
〈expression〉)
We will also need to extend expressions so we can designate one of the elements of the array.
〈primary〉 → (
expression )
| 〈identifier〉
| 〈integer-literal〉
| 〈float-literal〉
| 〈string-literal〉
| 〈array-element〉
〈array-element〉 → 〈primary〉[
〈expression〉]
Semantics
A 〈type〉 of the form
〈type〉[
〈expression〉]
designates an array type. If 〈type〉 is not an array then the designated type is just array
〈type〉 〈expression〉. The set of indexes range from 0 to 〈expression〉 minus one.
Things are a bit more complicated if 〈type〉 is an array because now there are two possible interpretations. In the comments below, parentheses are used only to express grouping
We will chose the first interpretation. Some programming languages, like Fortran, choose the second one.
For the case when 〈type〉 is an array, let's assume it is of the form array ε0 τ0. Then the designated type will be array ε0 (array τ0 〈expression〉)
The other syntax is similar.
〈type〉 → 〈type〉(
〈expression0〉:
〈expression1〉)
Now ε is 〈expression1〉 - 〈expression0〉 + 1 and the indexes of the array range from 〈expression0〉 to 〈expression1〉 (both ends included). 〈expression1〉 must be larger or equal than 〈expression0〉, otherwise this is an error.
A 〈primary〉 of the form
〈array-element〉 → 〈primary〉[
〈expression〉]
designates a single element of 〈primary〉. The type of 〈primary〉 must be array, otherwise this is an error. The 〈expression〉 must be an expression of integer type the value of which must be contained in the range of indexes of the array type, otherwise this is an error. The type of an array element is the same as the element type of the array.
Given the declarations of a1
, b1
, c1
, d1
above, valid array elements are.
Primaries of the form 〈identifier〉 and 〈array-element〉 can be used in the left hand side of an assignment and in the read statement. We will call this subset of expressions as variables. Some programming languages, like C and C++, name these expressions lvalues (or L-values) for historical reasons: an lvalue can appear in the left hand side of an assignment.
〈assignment〉 → 〈variable〉 :=
〈expression〉 ;
〈read〉 → read
〈variable〉 ;
〈variable〉 → 〈identifier〉
| 〈array-element〉
This opens up many possibilities. For instance now we can write a tiny program (bubble.tiny
) that sorts a given set of numbers.
Implementation
Adding support for arrays to our front end is not too hard.
Minor issue first
Before we proceed we need to fix an issue that may cause us problems when we play with arrays: We want all the declarations have a DECL_CONTEXT
. Current code only sets it for LABEL_DECL
but all declarations (except those that are global) should have some DECL_CONTEXT
. In our case VAR_DECL
s and the RESULT_DECL
of main are missing the DECL_CONTEXT
. We have to set it to the FUNCTION_DECL
of the main function (this effectively makes them local variables of the main function).
Lexer
For the lexer we only have to add three tokens [
and ]
. The remaining punctuation required for arrays (
, )
and :
were already in tiny.
Parser
Array type
First let's see how to parse a type that designates an array. In member function Parser::parse_type we cannot just return the parsed type. Instead we will keep it.
Now we will start parsing the indexes ranges. We will have a list of pairs of expressions, each pair denoting the lower and the upper indexes of the array type. For arrays of the form [e]
we will set the lower bound to zero and the upper bound to the e - 1
. For arrays of the form (e0:e1)
, the lower and the upper will be e0
and e1
respectively.
Now we can start building the array type.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
+ for (Dimensions::reverse_iterator it = dimensions.rbegin ();
+ it != dimensions.rend (); it++)
+ {
+ it->first = Tree (fold (it->first.get_tree ()), it->first.get_locus ());
+ it->second
+ = Tree (fold (it->second.get_tree ()), it->second.get_locus ());
+
+ Tree range_type
+ = build_range_type (integer_type_node, it->first.get_tree (),
+ it->second.get_tree ());
+ type = build_array_type (type.get_tree (), range_type.get_tree ());
+ }
+
+ return type;
Due to the semantics of the array types described above, we have to traverse the list in reverse order. We get the lower and upper expressions and we fold it (lines 4 to 5). This GCC function will attempt to simplify the expression if possible. For instance 1+2*3 will become 7. Now we build a GCC range type. A range type is a type the values of which are integers in the specified range. In this case we use the lower and the upper to create the range type (lines 8 to 10). A range type is represented as a GENERIC tree with tree code RANGE_TYPE
. Once we have this range type, we take the current type (which may be at this point an integer type, a float type or another array type) and the range type to build an array type (line 11). An array type is represented as a GENERIC tree with three code ARRAY_TYPE
.
Note that we currently do not check that the ε of the array type is actually a positive, nonzero, integer value. If the bounds of the array are constant, such error can be detected at compile time (the earlier an error is detected the better). If the bounds are non-constant then the semantics of the language should specify what to do during the execution of the program. Tiny semantics simply say that it is an error. Since we have not clarified what to be an error
is, we will not do anything special yet.
Array element
Now we have to add support for array elements in expressions. Recall that we use a Pratt parser to recognize them. We can recognize an array element by just acting as if [
were a binary operation with very high priority.
This will require a binary handler, like other infix operators.
The binary handler is actually rather straightforward.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Tree Parser::binary_array_ref(const const_TokenPtr tok, Tree left) {
Tree right = parse_integer_expression();
if (right.is_error())
return Tree::error();
if (!skip_token(Tiny::RIGHT_SQUARE))
return Tree::error();
if (!is_array_type(left.get_type())) {
error_at(left.get_locus(), "does not have array type");
return Tree::error();
}
Tree element_type = TREE_TYPE(left.get_type().get_tree());
return build_tree(ARRAY_REF, tok->get_locus(), element_type, left, right,
Tree(), Tree());
}
Recall that a binary handler has the lexer positioned right after the infix operator. This means that we have already consumed [
. So we have to parse the integer expression enclosed by the square brackets (line 4). Recall that any token unknown to the Pratt parser has the lowest possible binding power, this means that parsing the integer expression will stop when it encounters the ]
. This behaviour is actually the one we want. We still have to consume the ]
(line 8). Now we verify if the left operand has array type (line 9). If it does not, this is an error. If it does, we compute the type of the array element. To do this we have to use the accessor TREE_TYPE
from GCC which given an ARRAY_TYPE
will return its element type (line 14). Finally we build the GENERIC tree ARRAY_REF
that repreents an access the array element (line 16).
Checking if a tree in GENERIC represents an array type is done using this auxiliar function.
Likewise with ε, we are not verifying that the expression of the array element evaluates to an integer contained in the range of indexes of the declared array. Recall that the semantics of tiny are not complete enough regarding errors.
Final touches
As we said above we allow variables and array elements in the expression of a read statement and in the left hand side of an assignment. Let's first create a couple of functions that expression r that check this for us.
Since we allow the same thing in both cases, parse_lhs_assignment_expression
just forwards to parse_expression_naming_variable
. Now we can update parse_assignment.
Language hook
If we want to use arrays with non-constant size, GCC will invoke a language hook when internally computing the size of the array. This is for those cases where the language supports variable-sized types in a global scope. In this case the hook must return true, false otherwise.
Since in tiny where everything is conceptually inside an implicit main function, the binding must return false.
Our hook, currently crashes the compiler, so we need to adjust it first. Recall that this hook is in tiny1.cc
.
Trying it
Let's try the bubble.tiny
program shown earlier.
Yay!
That's all for today.