Walk-through flang – Part 5
In the previous installment of this series we saw how flang parses the statements using an LR(1) algorithm. As the parser recognized the parts of the statements it invokes semantic actions. Today we’re going to talk more about them.
What is not being "parsed" yet
In part 4 we saw that the parser uses an LR(1) algorithm to parse the Fortran statements. But Fortran has constructs that require a specific sequence of statements: for instance a (block) do-construct is formed by a do-statement, followed by one or more statements and constructs and ended with an end-do-statement. Also not all statements are not useable in all contexts, for instance we cannot use an assignment-statement in the non-executable part of a MODULE program unit, use more than once the CONTAINS
statement in any program unit or declare a variable once we have assigned a variable (unless we have started a new BLOCK construct).
The examples above are just a few of the things that the parser, because of the kind of grammars used today, can't really check by itself. We have to do these syntactical checks elsewhere. But there are other aspects of the language that the parser cannot check, for instance whether we're declaring a variable twice or using it in a way that would render an ill-formed program. For instance one cannot pass a REAL
variable as argument of a function expecting an INTEGER
parameter.
Luckily as we recognize the statements of the program we can react to them and assess their validity and meaning. What has to be done when a part of the language is recognized can be named as a semantic action. Although not all actions are strictly related to semantics because some of them will simply validate syntactical properties (they will be concerned only about the form instead of the meaning).
Displeasure
It is precisely this semantic analysis of front-ends and parsers the one that most people find annoying. The reason is that the solutions are usually very ad-hoc, extremely tailored to the language, rarely general or systematic enough. Don't get surprised if this part of flang is precisely an example of this. Of course it does not have to be like this but it is hard (and difficult to motivate) to do otherwise.
Semantic actions and values
Semantic actions are a way of implementing a theoretical formalism called attribute grammars. Attribute grammars associate properties to terminals and non-terminals called attributes. These attributes can be synthesized or inherited. Purely synthesized attributes are those that are defined strictly by the part of the language. For instance an expression 1.0 + 1 we know it will have type REAL
and an expression 1 + 1
will have a type INTEGER
. We can easily identify the type of an expression as a synthesized attribute of expressions. An inherited attribute, though, is one that depends on the specific situation of the part of the language. These are attributes that somehow denote context. For instance in the following example, the expression A + 1
has type REAL
. And we know this because A has type REAL
. If we redeclare A
as an INTEGER
the expression A + 1
has type INTEGER
. This is because the occurrence of A
in an expression may have a type or another depending on the "declared type" of A
, that declared type is an inherited attribute at this point.
You will see that implementing attribute grammars in an LR(1) parser will end in mixed feelings.
For the synthesized attributes, we use semantic values. Each part of the language (token or rule) is informally (i.e. enforced by checks/assumptions in the code) assigned a semantic value. There are many types of semantic values defined in semant.h
but the 4 most used ones are the following:
S_EXPR
is used for general expressions. S_CONST
is used for expressions that denote constants or for contexts where a constant is required. S_LVALUE
and S_IDENT
represent variables in Fortran parlance, this is, expressions that denote some storage in memory. Usually these expressions appear at the left of an assigment hence the name lvalue. S_IDENT
may also be used for places where the name of an identifier is required.
All semantic values, regardless of their particular type share a few attributes that are always available. The semantic value is defined in semstk.h
and their type in C is SST
but in general only pointers to it will be used.
The field id
stores the type (e.g. S_CONST
or S_EXPR
). Another important field is ast
where an identifier to an AST tree will be stored, ASTs will represent computation itself. The other fields are used less often as they have narrower meanings.
A SST
also has a type-specific fields but they are not commonly accessed directly but using a bunch of macros which encapsulate the attributes. These macros use the following naming convention: if they end with G
the macro just returns the attribute (i.e. it gets it). If the macro ends with P
, then the macro receives two arguments, the first one being the semantic value and the second one being the value to which we want assign the attribute (i.e. it puts it). Most of the time, the attributes that we can get and set are integers.
Recall that a semantic action is executed when a reduction happens. Each reduction is a rule alternative with a left-hand side being the rule name and the right-hand side a sequence of tokens and rules. Each of these tokens and rules will have their semantic value. We can read the semantic values of the elements of the right hand side using the macro RHS(n)
where n
is the number of the element that we want to access (starting from 1). We can use the macro LHS to access the semantic value of the left hand side, typically to assign its attributes. Be careful to read RHS(1)
first before setting LHS
as they actually refer to the same memory. This might be handy if LHS
is just propagating RHS(1)
, in a unary reduction of the form X ::= Y
in cases where nothing else must have to be checked.
What about inherited attributes? Well, it is not possible to implement them nicely, so a way to do this is using global variables or data structures that represent the environment. In the case of flang there is a giant structure in tools/flang1/flang1exe/semant.h
called sem with lots of fields that will be used to represent inherited information. This global variable called sem
has so many fields that is not worth detailing them here.
Example
Let's go through a very simple example to see what the parser does. Recall that the parser does two passes through the input and in each pass different semantic actions can be run for the same reduction. We start with the first pass but later on we will have to switch to the second one.
1
2
3
4
PROGRAM MAIN
INTEGER :: X
X = 1
END PROGRAM MAIN
When the parser sees the token PROGRAM it already knows it can perform the following reduction.
This is a nullary reduction so no RHS
will be used. Setting LHS
is possible in these cases though. If you recall the part 4, reductions are classified in blocks. This one belongs to the first block and is implemented in semant.c
. The code it does is surprisingly long so I'm not pasting all of it here.
First it checks if we are inside an enum definition (a new feature of Fortran 2003) which is kept in sem.in_enum
. sem
is a global variable that keeps a vast amount of state regarding the parsing. It is defined in tools/flang1/flang1exe/semant.h
. Inside an enum definition only an ENUMERATOR constructor or END ENUM can appear so these are the two valid tokens at this point, otherwise this statement is wrong and has to be ignored (sem.ignore_stmt = TRUE
). break
will finish the semantic processing.
If you recall part 2 you may remember that Fortran has a strict ordering of statements. If we are still in a position to accept USE statements and the next token suggests a USE-statement we just proceed. Otherwise, no more USE statements can appear in which case their effect can be executed at this point: basically adding the imported names. Also functions may have some attributes that use names in these USE-statements, so the frontend has had to defer them up to this point.
There are a few more checks but most of them concerned to parallel constructs that we will omit here. Now the following reduction is executed.
As we explained in part 3, scn.id.name
is a buffer of null-ended strings for the current statement. Each identifier tokenized by the parser is given as a value the offset (in bytes) of that buffer where the identifier is stored (i.e the buffer is a sequence of null ended strings). When doing a shift, the parser has put the value passed by the tokenizer (although it used SST_SYMP
that macro uses the same storage as SST_CVALx
). So in np
we will have the current identifier. The next step is getting a symbol pointer (it is actually a 32-bit integer) of it in the symbol table. We will talk about the symbol table in the next chapter. Once we get the symbol very little has to happen at this level: just store it in LHS using SST_SYMP
(put the symbol lptr
in LHS). SST_ACLP
just clears a field that is used for array-constructors (it is unclear to me why it has to be done for every identifier at this point).
Some other checks happen but are not very relevant at this point. The next reduction is
The semantic action starts by marking in a field of gbl.rutype
that this is a program. Other valid values of this field are RU_FUNC
(for FUNCTION), RU_SUBR
(for PROGRAM) and RU_BDATA
for (BLOCK DATA). To be honest I have no idea what the prefix RU_
means here but apparently is related to subprogram. Maybe it started as a way to distinguish the routines and RO
might be confused with read-only.
Next is checking that we're not inside a module. This macro expands to (sem.mod_cnt && gbl.internal == 0)
which means we have seen zero modules so far and we are not a sub-program. Certainly a PROGRAM
statement cannot appear inside a MODULE
.
This code is shared among several semantic functions so not all the checks will make much sense for the PROGRAM
statement itself. Here we acknowledge that this is not an ENTRY
statement. Confusingly is_entry
is a static variable of semant.c
so we may just be resetting it here to false. Also a PROGRAM statement cannot appear inside an INTERFACE. A few lines below we see some interesting stuff:
Where the code retrieves the symbol from RHS(2)
, so the <id>
. This is the attribute we synthesized in the previous reduction.
The next reduction is
Here we would check generic things related to PROGRAM, FUNCTION, SUBROUTINE, MODULE and BLOCKDATA statements. For FUNCTION and SUBROUTINE at this point we would keep the dummy arguments (formal parameters) that would have been synthesized in another semantic function.
Next reduction is
This semantic function does final processing, most of it only of concern of parallel constructs.
More finish processing, mostly cleanups so the state of the next statement is ready.
The semantic function associated to this reduction does nothing.
So far we have just parsed PROGRAM MAIN
. Now we move onto INTEGER :: X
. I'll skip the boring reductions this time.
The semantic just synthesizes an inherited attribute for this one that will be used later.
Basically we remember that there is not any kind of length specifier like INTEGER * N
or CHARACTER * N
. Technically the Fortran standard only allows this for CHARACTER under the LEN attribute (i.e. CHARACTER * N
means CHARACTER(LEN=N)
) but historically this has been accepted for non-CHARACTER types to mean KIND (i.e INTEGER * 8 means
INTEGER(KIND=8)
). Apparently Flang also accepts a colon (:) instead of *, so INTEGER : 8
would also be accepted.
Here the semantic function synthesizes the final type using <base type>
and <opt len spec>
.
Here we would handle the attributes like DIMENSION
, POINTER
, ALLOCATABLE
, etc. But in this declaration there is none, so nothing of interest happens (only minor bookkeeping).
Here we check that we are inside some program unit. In this case this will do nothing, but for historical reasons it is possible to start an implicit PROGRAM program unit by using any statement that may follow a PROGRAM
statement. This means, among other consequences, that the minimal valid Fortran program is just END
and a "hello world" can be as short as PRINT *, "HELLO WORLD"
statement followed by an END
statement: no need for an initial PROGRAM
(though omitting it is currently deprecated).
The first production we already saw it for the name of PROGRAM. The second one just marks the semantic value of LHS as an identifier.
Ok, again a length specifier here. In this case it would be like INTEGER :: variable * N
which again is an extension, in standard Fortran this is only allowed for CHARACTER. Our declaration has none so again nothing happens here.
Here we gather the length information and array information (none for our declaration). We take the symbol we stored in <ident>
and create a variable for it.
create_var
(defined in semant.c
) makes a few checks and calls refsym_inscope
(defined in semsym.c
) to insert the symbol in the current scope. After this, the semantic function checks the type of the symbol and does several extra checks on the type of the declaration. If all goes well then the symbol is given the type. The code is a bit involved at this point because Fortran has several non-executable statements that can set the same attributes. The following code shows two ways of declaring two arrays A
and B
of 10 INTEGER elements.
Several statements allow setting the dimension attribute for historical reasons as well:
At this point some checks regarding the initialization of the entity happen.
This does nothing in the context of a type declaration but it just synthesizes a list of entities in other cases (the reduction used is the same in both).
This is the whole declaration. The semantic function just cleans up a few variables and sets a null AST to the LHS semantic value.
At this point the code updates the program phase, this is, the position that determines which statements inside the program unit are allowed and which are not.
Then, if the declaration entails any computation, it keeps the AST that represents it. This might happen for instance with a POINTER that must be initialized in a way that is not ASSOCIATED to anything, likewise for ALLOCATABLES and other entities.
At this point we have finished with INTEGER :: X
and we can proceed to parse X = 1
. Again I will omit repeated reductions.
Here we check that we are not inside an INTERFACE. Certainly an assignment is not valid happen there.
Here we check that we are not inside a MODULE. Again, an assignment is no valid there. Several of the reductions in this pass of the parsing do not have any interesting semantic value, so let's assume we're already in the second pass.
According to the code, this rule just toggles a flag that says whether we are at the left of a =
. psfunc
apparently means possible statement function
. A statement function is like a small declaration that totally looks like an assignment except that it takes scalar parameters and returns a scalar value. For obvious reasons this statement is deprecated as of Fortran 90.
We're at the left of the assignment, among other things, this function will mark <var ref>
as a ST_IDENT
.
Now we mark that we are not anymore at the left of an assignment, we can't see it yet through the reductions but we have already consumed the =
token.
So this is the token for the 1. The semantic function will basically set the LHS to no symbol, its type to be integer and will store the constant, this is done in ast_conval
.
(UMR means uninitialized memory read, and it is a warning printed by Purify, a tool similar to Valgrind to detect memory errors.)
The usage of top
vs LHS
is a bit confusing at this point. My understanding is that they are the same.
These two reductions do nothing or simply propagate values upwards the expression hierarchy.
Here we will build the assignment itself. This is a long function, but at some point we will get the AST of RHS(5) and the location in RHS(2). Here RHS(2)
is exactly <var ref>
and RHS(5) is <expression>
. An AST representing the assignment is then created which is stored in LHS. The code also makes sure we know we are in the executable part of the program unit (only executable statements or FORMAT statements are allowed in that part).
The semantic function of <simple stmt>
does nothing. The semantic function of <statement>
takes the AST from <simple stmt>
and adds it to the global AST.
Finally we are at the END PROGRAM statement.
This does nothing in our case, in other contexts may have to finish the INTERFACE blocks seen so far.
Several final clean ups happen here. Too complicated at this point to describe them.
Wrap-up
Ok, this was very long and sadly we could not get much detail of this. In the next chapter we will see how the symbol table works.