Walk-through flang – Part 8
In the last installment of this series we started to look at the AST and the symbol table by examining the compiler dumps of these two data structures. In this chapter we are going to explore a bit more the AST for the control flow statements.
Statement descriptor
At the end of the previous installment we learnt about the Statement Descriptor (STD
). STDs allow us to order the ASTs that represent statements (the ordering is required to make sure we don't alter the semantics of the program). An AST node that represents a statement is linked to an STD and each STD is linked back to that AST. An STD includes extra information about what other STD precede or follow it.
We can dump all the STDs passing -Hq,5,2
to flang. Because the second argument is a bitmask, we can use -Hq,5,3
to print both the symtab and the STD (they will be emitted in this order).
For instance, the function of the last chapter
1
2
3
4
5
FUNCTION ADD(A, B)
INTEGER :: A, B, ADD
ADD = A + B
END
has the following STD information
The field std
is the identifier of the STD (the number of the identifier itself does not convey any order). lineno
represents the line of the statement. The field label
is used for labeled statements, see later. Finally ast
is the AST identifier. The AST identifier can be matched in the AST dump (that we can obtain passing -Hq,4,256
to flang).
You may be wondering where have the two first statements gone. Well, the truth is that, a compiler only needs to emit code for anything that may have a side-effect (or visible behaviour). The non-executable statements by themselves won't cause any code be emitted (they will impact it, of course): the raw material here are the executable statements.
Equipped with this tools now we can look a bit more with detail some of the control flow statements of Fortran.
Examples of control flow in Fortran
Due to the long story of Fortran, the language has many colourful features that nowadays are regarded as obsolete or just weird. From a point of view of the evolution of programming languages (PL), Fortran is very interesting. Having been there for so long means that we can see the different trends an ideas in PL design impacting the evolution of the language.
If statement
Consider the following function that computes the absolute value of A
1
2
3
4
5
6
FUNCTION MY_ABS(A)
INTEGER :: A
IF (A < 0) A = -A
MY_ABS = A
END
The statement in line 4 is called an if-statement and it "gates" the execution of the statement A = -A
, to the truth value of the expression A < 0
.
From the two dumps above we can see that the if-statement is represented using an AST (14) of type if-then
. The field ifexpr
represents the control expression of the if-statement. In this case it is the AST 13 which is a binary operation, binop
, (of type 23 representing the "lower than" operation) with expression type logical
. The left operand is a reference to the variable a
, represented in the AST 12. As we saw in the previous chapter, references to variables are represented with a tree type ident
. The right operand is just the constant 0, represented in the AST 2 of type constant
.
The conditionally executed statement A = -A
is an assignment-statement represented in the AST 16. Note that the destination is the same AST 2 we used in left operand of the AST 13 above: rarely there is a need to repeat references to the same variable.
The AST 17 represents the END IF
statement. It may come as a surprising thing to you, but flang is not nesting the control flow structures in any way at this level. So to represent the nesting, flang sees itself forced to create regions which are naturally delimited in Fortran using many kinds of end-statements (in this case a end-if-statement). Because of the parser and semantic checks, these regions will be well-formed and will honour the proper nesting, but nothing in the in-memory representation is explicitly forcing this.
If-then construct
Consider the following function that does an addition or a subtraction depending on a logical parameter.
1
2
3
4
5
6
7
8
9
10
FUNCTION ADD_OR_SUB(L, A, B)
LOGICAL :: L
INTEGER :: A, B, ADD_OR_SUB
IF (L) THEN
ADD_OR_SUB = A + B
ELSE
ADD_OR_SUB = A - B
END IF
END
Again we need to delimit the regions corresponding to the "then" and the "else" via specific statements. This actually means that flang lowers the if-statement of the previous section as it would lower an if-construct with a single statement in the then part and no else part.
Let's see what happens a more complex example:
1
2
3
4
5
6
7
8
9
10
11
12
13
SUBROUTINE SUB(L1, L2, A, B)
LOGICAL :: L1, L2
INTEGER :: A, B
IF (L1) THEN
A = 1
B = 3
ELSE
IF (L2) THEN
A = 2
B = 4
END IF
END IF
END SUBROUTINE SUB
The STDs look like this
We can tally the control flow STDs with their ASTs. This time I have indented the dump to make the nesting explicit.
Do-while-statement
Fortran does not have a dedicated while statement but instead it is a special case of the do-statement. Consider the following program that computes the greater common divisor using the Euclid algorithm.
1
2
3
4
5
6
7
8
9
10
11
12
FUNCTION GCD(A, B)
INTEGER :: A, B, GCD
INTEGER :: T
DO WHILE (B /= 0)
T = B
B = MOD(A, B)
A = T
END DO
GCD = A
END FUNCTION GCD
Let's dump the STD to see what has happened now:
Take a moment to appreciate what flang did here. It introduced an artificial continue-statement (CONTINUE
). The execution of a continue-statement has no effect but flang attached an artificial label %L99999
to it. Now we see that our DO WHILE
has been replaced by an if-then-statement. The then part of this if-then-statement executes the body of the DO WHILE
. Because this is a loop, flang has introduced a GOTO
statement back to the label of the continue-statement that it emitted before the loop body.
So flang has lowered the input above into something like this.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
FUNCTION GCD(A, B)
INTEGER :: A, B, GCD
INTEGER :: T
99999: CONTINUE
IF (B /= 0) THEN
T = B
B = MOD(A, B)
A = T
GOTO 99999
END IF
GCD = A
END FUNCTION GCD
This kind of early lowerings come with some tradeoffs. The upside of lowering early is that the construct can be supported using the existing infrastructure for more general constructs. This way, the rest of the compiler does not need to be taught about the new construct. The downside is that sometimes these constructs can be lowered into executable code in a more efficient way. By lowering to existing, more general, constructs we may lose this ability or we are forced to invest some effort to recover the lost information. Also, tooling that needs to work at the source-level (e.g. refactoring tools) may be hindered by these lowerings as now the AST is not representing with enough fidelity what was in the input, so rewriting based on the AST is not possible or it will be impacted by this kind of lowering. A way to mitigate this problem is to annotate in the lowered construct extra information that can be used by source-level processing tools or the code generation phases of the compiler.
Do construct
One of the most important constructs of Fortran is the do-construct. Along with the assignment statement it is probably one of the oldest statements in Fortran. There are two forms of the do-construct: the non-block do-construct and the block one. The block do-construct is delimited by a do-statement and an end-do-statement.
The non-block one is delimited using a first do-statement including a label and any statement with the same label. This was a design sensible in the 60s but it is now regarded as deprecated.
The fun part of non-block constructs is that the label can be shared with more than one loop in a nested way.
Because this syntax is a bit awkward in the current world of structured programming, it is not uncommon to use a labeled CONTINUE
statement to as the last-statement of a non-block do-construct (in the unlikely case we're forced to use one).
The good news are that flang will lower the non-block loops into block ones. Consider the following program that computes the sum of integers 1 to N.
1
2
3
4
5
6
7
8
9
10
11
12
FUNCTION SUMMATION(N)
IMPLICIT NONE
INTEGER :: SUMMATION
INTEGER :: N
INTEGER :: I, S
S = 0
DO 50 I = 1, N
50 S = S + I
SUMMATION = S
END FUNCTION SUMMATION
The do-construct is a very high-level construct: a loop like DO var = start, end, stride
iterates M
times where M = (end - start + stride) / stride
(if M
is negative or zero no iterations happen). Before starting the first iteration, var
will be assigned the value of start
, and at the end of each iterationvar
is incremented with the value of stride
(which if not specified is 1 and it can be a negative value).
Technically in the Fortran standard, DO WHILE
and DO
belong to the same do-construct with different loop-control parts. It just makes sense, though, to consider them separately here.
"Do forever"
There is a form of the do-construct that may never end. In practice, this is just a dated form of the DO WHILE
because any sensible program written in Fortran will eventually terminate: this means there will be some check inside the loop body that will make the loop terminate. We can rewrite the Euclid algorithm above using this form. The EXIT
statement causes the termination of the innermost enclosing loop.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
FUNCTION GCD(A, B)
INTEGER :: A, B, GCD
INTEGER :: T
IF (B /= 0) THEN
DO
T = B
B = MOD(A, B)
A = T
IF (B /= 0) EXIT
END DO
END IF
GCD = A
END FUNCTION GCD
If we look at the STDs of the code above we see that the EXIT
statement is lowered to a GOTO
branch to a statement right after the loop (in this case the STD 12
). Note also how the DO "forever" is just implemented as if the user had written DO WHILE (.TRUE.)
(check STDs 2
, 3
and 10
).
Curtailing do constructs
It is possible to "curtail" an iteration of a loop using the CYCLE
statement. For instance, we use that statement to sum only the positive numbers of a given array.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
FUNCTION POSITIVE_SUMMATION(N, A)
IMPLICIT NONE
INTEGER :: POSITIVE_SUMMATION
INTEGER :: N
INTEGER :: A(N)
INTEGER :: S, I
S = 0
DO I = 1, N
IF (A(I) < 0) CYCLE
S = S + A(I)
END DO
POSITIVE_SUMMATION = S
END FUNCTION POSITIVE_SUMMATION
Flang does the obvious and the cycle becomes a GOTO to an artifically-introduced last statement of the do-construct. Again the continue-statement comes handy to add new statements with no effect only to be able to attach a label.
Under the hood
We've seen several examples of how these statements are represented in the AST and its STDs. Let's dive a bit deeper to see when this happens in flang. We have to understand we are at relatively early steps in the compilation pipeline, the parser is checking our syntax and semantics and if it is correct it will build the appropriate trees.
For this post I think it is interesting to look how a DO WHILE
is currently lowered. Because this is an involved process, we will focus only on the specific construct itself.
We are handling semantic actions. We know from previous chapters that they are split in several files depending on the specific construct recognised by the parser. In the case of Fortran executable statements, most of them are found in the file tools/flang1/flang1exe/semant3.c
.
We will first encounter a DO
token. Fortran allows some constructs to have non-numeric labels. These labels precede the keyword (DO
in this case) and are followed by a colon. In this case there isn't any construct-name, so flang just remembers this fact.
2896
2897
2898
2899
2900
2901
/*
* <do construct> ::= DO |
*/
case DO_CONSTRUCT1:
named_construct = 0;
break;
We mentioned earlier about non-block constructs, for those a label would follow the DO token. In this case no label follows it, so flang just remembers this fact too.
2887
2888
2889
2890
2891
2892
/*
* <do begin> ::= <do construct>
*/
case DO_BEGIN2:
do_label = 0;
break;
What may follow a <do begin>
in the flang parser parlance is called a <loop control>
so we will recognise that first. A <loop control>
includes several syntaxes but in this case our will be <dowhile> <etmp lp> <expression> )
. So we need to first recognise a <dowhile>.
3108
3109
3110
3111
3112
3113
3114
3115
3116
3117
3118
3119
3120
3121
3122
3123
3124
3125
/*
* <dowhile> ::= <opt comma> WHILE
*/
case DOWHILE1:
NEED_DOIF(doif, DI_DOWHILE);
DI_DO_LABEL(doif) = do_label;
doinfo = get_doinfo(1);
doinfo->index_var = 0; /* marks doinfo for a DOWHILE */
DI_DOINFO(doif) = doinfo;
DI_NAME(doif) = named_construct;
if (scn.currlab)
DI_TOP_LABEL(doif) = scn.currlab;
else
scn.currlab = DI_TOP_LABEL(doif) = getlab();
ast = mk_stmt(A_CONTINUE, 0);
(void)add_stmt(ast);
direct_loop_enter();
break;
Things have suddenly become much more complicated. NEED_DOIF
macro makes sure that the doif
variable will contain a fresh new nesting information for the do-construct that we are about to handle. It is called doif
because flang uses it both for do- and if-constructs. In the existing style of flang, it is just an identifier that will be used in other macros (in this case prefixed with DI_
) to associate information to the do-if nesting. And you can see that in the DI_DO_LABEL
where we just keep the do_label
we noted earlier (there is none in our input so do_label
is just 0 here).
Function get_doinfo
returns a DOINFO
pointer. This is used for the specific bits of the do-construct we have encountered. If you wonder about the 1 argument, this is used as the area to get the memory (flang has 30 memory areas and each one has some specific usage and behaviour, area 1 is used for doinfo, but flang creates DOINFOs in other areas too). Setting doinfo->index_var
to zero is used to mean that this do-construct
is a DO WHILE
. We link then the current doif
identifier with the doinfo
using DI_DOINFO
. We also link the doif with the construct name (which there was none much like the label) using DI_NAME
.
Now the code checks if the current statement had a label. Before the DO token and even before the construct-name, this do-statement, like any other Fortran statement, can have a label. If it doesn't, and in our case it doesn't, we just get a "compiler created" (i.e. artificial) label by calling getlab
. We then remember it as the top level label of this doif using DI_TOP_LABEL
. We also bluntly set the label identifier of the scanner, I'm pretty sure there is an (obscure) reason to do this.
Now we create the CONTINUE
statement we saw flang emitted above! The function direct_loop_enter
is called to state that we are entering a loop, there is some bookkeeping related to directives (e.g. OpenMP or OpenACC) that is required. This function takes care of it. In our case this function is mostly a no-operation.
Great, we semantically handled a <dowhile>. Now the next item is <etmp lp>
. This is a bit of an odd non-terminal that means "expression temporary and left parenthesis". Because this is not strictly related to statements, it is found in semant2.c.
531
532
533
534
535
536
537
/*
* <etmp lp> ::= (
*/
case ETMP_LP1:
sem.use_etmps = TRUE;
sem.etmp_list = NULL;
break;
Basically this is bookkeeping used by the parser to mean "we may have to allocate temporary expressions". After that an <expression> will follow, let's ignore that and presume everything works as expected and then a closing parenthesis. At this point we have recognised the <loop control> of a DO WHILE
.
3084
3085
3086
3087
3088
3089
3090
3091
3092
/*
* <loop control> ::= <dowhile> <etmp lp> <expression> ) |
*/
case LOOP_CONTROL2:
ast2 = gen_logical_if_expr(RHS(3));
ast = mk_stmt(A_IFTHEN, 0);
A_IFEXPRP(ast, ast2);
SST_ASTP(LHS, ast);
break;
The function gen_logical_if_expr
basically gathers the logical expression found in the parentheses (synthesized during the semantic checking of <expression> above). Now we create the if-then statement using mk_stmt
(the second argument is the type which there is none for a control statement like this one, so we just set it to 0). This gives us an incomplete if-then statement node so we set its IFEXPR
attribute using the macro A_IFEXPRP
with the logical expression. Finally, new statement is now set as the AST attribute of <loop control>
itself, this is done by SST_ASTP
.
Now a <loop control>
has been recognized so we're done with the <control stmt>
1832
1833
1834
1835
1836
1837
/*
* <control stmt> ::= <do begin> <loop control> |
*/
case CONTROL_STMT7:
SST_ASTP(LHS, SST_ASTG(RHS(2)));
break;
The code above just keeps the AST of <loop control>
(in our case an A_IFTHEN
) in <control stmt>
. These are almost top level-statements from the point of view of the parser, except for a few things.
1305
1306
1307
1308
1309
1310
1311
/*
* <statement> ::= <nii> <nim> <control stmt> |
*/
case STATEMENT6:
prevphase = sem.pgphase;
SST_ASTP(LHS, SST_ASTG(RHS(3)));
goto executable_shared;
If you wonder about <nii>
and <nim>
above, they just mean "not inside interface" and "not inside module" (certainly a DO WHILE
can't appear directly in those contexts). This semantic action just forwards the AST in control <control stmt>
into the <statement> itself. Finally executable_shared
performs common stuff for statements, including adding it to the STD list.
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
executable_shared:
sem.pgphase = PHASE_EXEC;
sem.temps_reset = FALSE;
/* fall thru to 'statement_shared' */
statement_shared:
if ((ast = SST_ASTG(LHS))) {
(void)add_stmt(ast);
SST_ASTG(LHS) = 0;
}
Ok, so what has happened is that now we have just added an if-then-statement
in the STD. Then the statements making up the body of the DO WHILE
will be added too and eventually we will reach the end-do-statement.
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
/*
* <control stmt> ::= ENDDO <construct name> |
*/
case CONTROL_STMT9:
share_do:
SST_ASTP(LHS, 0);
if (sem.doif_depth <= 0) {
error(104, ERR_Severe, gbl.lineno, "- mismatched ENDDO", CNULL);
(void)add_stmt(mk_stmt(A_ENDDO, 0));
break;
}
Here the construct name can be empty (and in our input it is). First we check if doif_depth
is not zero: if it is then this is a stray end-do-statement that does not follow any if-then-statement or do-statement.
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
doif = sem.doif_depth;
if (scn.currlab && DI_DO_LABEL(doif) == scn.currlab)
/*
* the enddo is labeled and the label matches the do label.
*/
;
else if ((DI_ID(doif) != DI_DO && DI_ID(doif) != DI_DOWHILE &&
DI_ID(doif) != DI_DOCONCURRENT) || DI_DO_LABEL(doif)) {
error(104, 3, gbl.lineno, "- mismatched ENDDO", CNULL);
SST_ASTP(LHS, 0);
break;
}
if (DI_NAME(doif) != named_construct)
err307("DO [CONCURRENT|WHILE] and ENDDO", DI_NAME(doif),
named_construct);
Now we check if this end-do-statement actually finishes a do-construct: it could well be that the input is an if-then-statement followed by and end-do-statement. The doif
data structure is used for if-then-statement
and do-statement
, so we need to check if this ends a (plain) DO
, a DO WHILE
or a DO CONCURRENT
. In our case it will be a DO WHILE
. Note that in our input DI_DO_LABEL(doif)
is going to be zero (and no label can be zero in Fortran) so the first check won't trigger. Note also that the first check may look a bit weak but it is correct because DI_DO_LABEL
only concerns to non-block do-constructs (and never to if-then-statements). After this check we verify that if the END DO
contains a construct name it matches the one we used in the do-construct, we didn't use any in our input so this branch will be trivially satisfied (both DI_NAME(doif)
and named_construct
will be 0).
1889
do_end(doinfo);
The function do_end
does a lot of things which are required upon completion of a do-contruct.
6000
6001
6002
6003
6004
6005
6006
case DI_DOWHILE:
ast = mk_stmt(A_GOTO, 0);
A_L1P(ast, mk_label(DI_TOP_LABEL(orig_doif)));
RFCNTI(DI_TOP_LABEL(orig_doif));
(void)add_stmt(ast);
(void)add_stmt(mk_stmt(A_ENDIF, 0));
break;
For the particular case of our DO WHILE
, we emit the GOTO
statement (A_GOTO
) that will branch back to the DI_TOP_LABEL
of this do-construct. The macro A_L1P
is used to set the label of the GOTO
(the attribute is called A_L1
because some statements have more than one label-reference, like the arithmetic if-statement that has three of them). Note also how we increment the reference counter of the top label (haven't verified it but looks like flang will later remove labels that are defined-but-not-used, so we have to prevent that). The GOTO
statement is added to the STD. Finally we create an end-if-statement and we also add it to the STD. This way we keep the AST well-formed by nesting everything as expected.
Summary
This has been another long post but we can summarise it in:
- Some constructs are lowered into sequences of if-construct and goto-statements.
- Structured programming structures are represented in a flattened way using the natural delimiter statements of Fortran. The in-memory representation does not provide nesting of any form and the proper nesting must be explicitly represented and maintained.
- The do-construct that is not a
DO WHILE
is lowered preserving the original loop-control part.
In the next chapter we will go beyond the AST, and related intermediate representation, and we will start looking at what happens after the initial AST has been formed.