A tiny GCC front end – Part 11
Our tiny language features a few types: int
, float
, bool
, string
and arrays of those types. We can even declare new type names based on other types but it still missing a record type. Today we will address this.
Record types
Before we added arrays to tiny, the value of our variables was simple, atomic, non-structured at all. When we introduced arrays we let a single variable denote several values, but all the values have to be of the same type. We are missing the possibility of representing several values of different types: we need a record type (or tuple type).
Record types are constructed using an ordered set of types. We will need to refer the elements of such set in some way. We could access the field using some index derived from its order (e.g. the first element could be indexed using a zero, the second element using one). Usually, though, the elements of the record are given names, so the name can be used to refer to the element. While both approaches are feasible using names is probably easier for the programmer. In tiny we will use names.
Syntax
Similar to arrays, we will need special syntax to define a record type. A record type is made of a sequence of pairs of names and types that we will call fields. First let's see how to declare a field.
〈field-declaration〉 → 〈identifier〉 :
〈type〉 ;
A field declaration has the same syntax as a variable/type declaration but without an initial var
/type
. A keyword is not needed because a field declaration will always appear inside a record type.
Recall that * means the previous element of the language repeated zero or more times
〈type〉 → record
〈field-declaration〉* end
For instance we can declare a variable with a record type made of two floats x and y, like this.
In chapter 10 we introduced type declarations, so we can declare a point
type of record type.
This way we can now declare several points without having to repeat the record type. There is a reason for this that we will discuss below.
We need a way to access the field of a record type inside an expression or in the left hand side of an assignment. We will add a new 〈primary〉 expression.
〈primary〉 → (
expression )
| 〈identifier〉
| 〈integer-literal〉
| 〈float-literal〉
| 〈string-literal〉
| 〈array-element〉
| 〈field-access〉
〈field-access〉 → 〈primary〉 .
〈identifier〉
For instance, fields of p1
and p2
can be accessed like this.
We still need to clarify the priority between an 〈array-element〉 and a 〈field-access〉. The following expressions (assuming they are valid given appropiate types for the variable a
and the field b
)
are to be interpreted like
Semantics
A record type is a type the values of which is the cartesian product of the values that can be represented by the fields of a record type. A record type with n fields named φ0, φ1, …, φn-1 where each field φi has an associated type τi will be able to represent a value (ε0, ε1, …, εn-1) where each εi is a value of the type τi. Given a value of record type, we can select a single value of it, in our case using its name.
Two values of record type are the same only if they come from the same declaration. This means that p1
and p2
below have different types because their record types come from different declarations, even if their sequences of fields are the same.
Conversely, below p1
and p2
are the same, because their record type comes from the same declaration.
This kind of type equivalence is called equivalence by name
in contrast to equivalence by structure
. Both have advantages and drawbacks but most programming languages choose the former.
As we discussed in chapter 10, in a type declaration, we cannot use the type being declared in the type part. So this will be invalid.
The name of each field must be unique inside a record. It is possible to have a field with record type.
An expression of the form 〈primary〉 .
〈identifier〉 is a field access. The primary expression must have record type and 〈identifier〉 must be the name of a field of that record type. The type of a field access is the 〈type〉 as the type of the corresponding field declaration. A field access can be used as the left hand side operator of an assignment and can be used as the operand of a read
statement.
Implementation
Now that we have a specification of this extension, we can start implementing it.
Lexer
We need to recognize two new tokens: a new keyword record
and the dot operator. So we add bot to our set of tokens.
Our existing machinery will handle record
, so only the dot must be tokenized. Given the current specification, the dot is relatively simple as long as it is not followed by a number, a .
in the code will be the token DOT
. This restriction makes sense as we want .1
to be a FLOAT_LITERAL
not a DOT
followed by an INTEGER_LITERAL
.
Parse a record type
To parse a record type we first have to be able to parse a field declaration. It is pretty straightforward. GENERIC represents field declarations using a FIELD_DECL
tree which simply has the name of the field and its type. We also have to make sure to mark the field addressable otherwise the read
statement will not work on fields. Note also that we pass a vector of field names so we can diagnose repeated field names (I'm using a vector because the number of fields is usually small and it does not pay to use a more sophisticated data structure).
Now that we can parse a field declaration, let's parse a record type. First lets extend parse_type
so it forwards to parse_record
when it finds the token RECORD
.
Parsing a record type is not particularly complex. Once we have skipped the record
keyword we keep parsing field declarations until we find an end
keyword. A record type in GENERIC is represented using a RECORD_TYPE
tree, so we will have to create first an empty RECORD_TYPE
tree. Field declarations must have their DECL_CONTEXT
set to this RECORD_TYPE
(so they know of which record type they are fields). The set of fields in a RECORD_TYPE
is chained using TREE_CHAIN
. The code simply remembers the first field and the last one so it can chain each field with the previous one. Finally the first field is used to set the TYPE_FIELDS
attribute of the RECORD_TYPE
. At this point we also need to request to GCC to lay out this type. The reason is that a RECORD_TYPE
will have to be represented in memory in a way that can hold all the field values, the function layout_type
makes sure each field gets the appropiate location in the record type.
Parse a field access
Parsing a field access is done by handling the dot as a binary operator with very high priority. So we assign it a high left binding power.
We will use a convenience function is_record_type
with the obvious meaning.
In GENERIC a field access is represented with a tree of kind COMPONENT_REF
, where the first tree is an tree expression of record type and the second tree is a FIELD_DECL
. Parsing a field access involves just checking that the left expression has a record type and the dot is followed by an identifier that is the name of a field of that record type. Recall that the list of fields of a RECORD_TYPE
is available in the TYPE_FIELDS
attribute. We traverse each FIELD_DECL
chaining through TREE_CHAIN
. Like all other declarations in GENERIC, a FIELD_DECL
has a DECL_NAME
which contains an attribute IDENTIFIER_POINTER
where we will find the name of the field. If we do not find a field with the given name, then this is an error, otherwise we create a tree COMPONENT_REF using the left tree (that we checked it is of record type) and the appropiate FIELD_DECL
.
Finally we must update parse_expression_naming_variable
because a COMPONENT_REF tree also names a variable. This way we can put it in the left hand side of an assignment or as the operand of a read
statement.
Smoke test
And we are done. Let's try a simple program.
Yay!
That's all for today