In the last installment we saw how flang splits the input in tokens. Once we have the tokens identified we need to parse them.
Top level parser routines
The top-level parser routine in flang1 is called, not unexpectedly,
parser. It is invoked by main right after all the initialization has happened for each program unit in the input file.
Parsing as we mentioned in the last chapter happens in two passes. A first one gets the token from the user code and as a side effect generates an intermediate file. This intermediate file is stored in the global variable
astb.astfil, contrarily to what the name suggests it does not contain an AST per se but a simplified rendition of the tokens we found in the user code. The second pass gets the tokens from this intermediate file.
The first step is identified by the global variable
sem.which_pass being set to
0. After a few more initialization steps,
_parser. Where all the interesting stuff happens.
_parser is basically a big loop which parses a statement at a time. In the code below
LOOP is just a macro that expands to
while(TRUE). It starts by reading the first token of the statement. Recall that
get_token will either tokenize (scan) the user code or get the token from the intermediate file depending on which parsing pass we are (we can expect, except for errors, that in the two passes the sequence of tokens obtained will be the same).
Parsing is commonly used to mean a language recognizer that is able to associate some actions, usually called semantic actions, when parts of the language are recognized. If our recognizer is only able to say whether the input is a valid Fortran program, then it is of little use for a compiler. We need to be able to react to the parts of the language because the designers have assigned meaning (semantics) to them. And this is what the Fortran parser has to do as well.
The syntax of programming languages can be described using a grammar and Fortran is no exception. A grammar in this context is basically a set of rules (called productions) of the form
Variable ::= Sequence-of-Tokens-and-Variables. For instance we can define the form of an assignment-statement in Fortran using the following rule.
In the rule above for
= is just a token and
<var ref> and
<expressions> are productions.
At this point we can unveil where the Fortran grammar used by flang is defined. It is in the file
flang/tools/flang1/utils/prstab/gram.txt. This file does not define a grammar for the whole Fortran language but it does this at the level of statement. This is, the grammar faithfully describes the form of each statement but does not describe the overall form of the program unit or constructs that are compound of several statements (like an
if-construct or a
do-construct). The syntax of those will have to be handled elsewhere.
The top-level rule of the grammar is
<stmt>, because as stated above the grammar only describes the syntax of a single statement.
As you can see a rule may have several variants or alternatives. These alternatives are numbered starting from one, this is
<statement> ::= <prog title> is the alternative 1, and <nii> <nim> <entry statement> is the alternative 2. This numbering schema will be relevant when we talk about semantic analysis later.
You may be wondering how the tokens used in the grammar (like
CONTAINS above) are related to the internal tokens used by the scanner (i.e.
TK_CONTAINS). The relationship is defined in the file flang/tools/flang1/utils/prstab/gram.tki</code>.
A few files are generated at build time using these two files via the tool
prstab found in
tools/flang1/utils/prstab (we are going to ignore that tool because it is written in a style that makes it frankly unfathomable, we may go through it some day but not in a near future).
gramtk.his a bunch of defines for the tokens used by the scanner.
gramnt.his a bunch of defines for the rules of the grammar. Here
ntmeans non-terminals as the tokens are often called terminals (as they cannot generate anything). Those identifiers are not directly used by the code but the parser, as we will see below, will need to internally represent them.
gramsm.his a bunch of defines for the different alternatives of the rules of the grammar, as mentioned above the different alternatives of a rule are numbered starting from 1.
gramdf.his a set of tables used by the parsing algorithm that we will discuss below.
Parser algorithm: LR(1)
As a I mentioned above the parser will use a grammar to parse each statement. In computer science, grammars are generative devices, i.e. they define a set of words that will ultimately be the language. What we really want, though is, given a grammar that generates Fortran statements obtain an algorithm that recognizes Fortran statements. And not only this, but we want the recognizer also allows us to associate semantic actions to the different parts of the statement (i.e. parse properly).
There is an abundant literature of algorithms that given a grammar can in many cases generate an algorithm that can recognize the language generated by the grammar. In the particular case of flang the algorithm used, as I mentioned above, is a bottom-up parser. In a bottom-up parser we read tokens but we don't immediately react to them until we identify that they form a part of the language (this step is called a reduction). For instance if we see the following sequence of tokens (end-of-statement is represented with ↲).
we know that after the token ↲ we have seen an
assignment-statement and we want to react accordingly. A compiler typically will create some internal representation that means
assign 1 to the variable A (how it does this is not important at this point yet). Recognizing this fact means that we can reduce A = 1 into the
<assignment>used in the examples is not exactly the one used by flang. The real one includes a couple more elements that I omitted for the sake of the explanation.
This reduction process happens as the input is read. Advancing through the input is called shift (which explains why bottom-up parsers are often called shift/reduce parsers). In the example below ↓ will mean the position of the input, which starts at the beginning of the statement and ends after the end-of-statement. I am using a simplified form of what happens really in the flang grammar, but should be enough to understand the process.
The key in a bottom-up algorithm is knowing whether we have to shift or reduce and if reducing, which rule of the grammar has to be used. Once a reduction has been chosen, all the elements of the right hand side of the rule (recall it is a sequence of tokens and rules) will appear immediately before (to the left of) the ↓ so we just replace all them with the left hand side of the rule used for the reduction. The exact details are beyond the scope of this post. But suffice to know for now that the algorithm used by flang is a LR(1) (you may have heard of LALR(1) because of
bison, LR(1) is just a more general version of it).
So now _parser uses the LR(1) algorithm to parse the statement. We can complete a bit more the function
The complicated part above is determining if there is a reduction to be done. To do this the LR(1) algorithm uses a finite automaton that represents all the possible things that can be before the ↓. The particular state is represented in the flang parser using the variable
cstate. The initial state for parsing is the state 1 and is reinitialized per every statement in the function
parse_init (among other stuff used by the LR(1) algorithm itself).
Note that a rule with alternatives like
<rule> ::= A | B | C | … will actually imply several possible reductions
<rule> ::= A,
<rule> ::= B,
<rule> ::= C,
…. So the set of all possible reductions in the parser is all the alternatives of all the rules encoded in the defines of
gramsm.h above (e.g.
In each state of the LR(1) automaton there will be a number of potential reductions (I repeat it here: each reduction is one rule alternative of the grammar) that can be done. The table
fred tells us the set of those reductions. The 1 in LR(1) means that we use one token of look ahead to know exactly which of the reductions is to be done. The table
fred is actually a table of pairs
(start, end) that define the potential reductions associated with the current state.
Now the parser needs to check each reduction against the current token (in
tkntyp). So it loops each possible reduction.
nset maps from the reduction alternative to a look ahead set. I'm not exactly sure but I think this allows mapping different reduction alternatives to the same set of look ahead tokens (it looks plausible to me having different reductions in different states that end using the same set of look ahead tokens). With that look ahead set, we can query the
lset table. This is a table of tuples again (like
fred above) indexed by the look ahead set that denote a range of tokens in the table of look ahead tokens
ls (more on this table below). Now the code does a binary search where it searches the token
tkntyp in the look ahead set of the current reduction.
The code above uses the table
ls. The table
ls is a sequence of sorted sequences of token identifiers. So the tuple in
lset tuple points us to the start and end of one of those sorted sequences. That range is then used for the binary search itself. If the token is found the parser goes to
perform_reduction. Otherwise it tries all the other reductions. If no reduction is found then it will just proceed to do a shift.
A non-obvious precondition here is that the grammar fulfills the requirements of LR(1) which means that, for a given state, it is not possible both to have to do a reduction or a shift or two different reductions. This is why the code can eagerly assume that if a reduction is found that one is the right one to execute. Conversely, if no reduction is possible the only thing left to do is to shift.
Perform a reduction
Performing a reduction entails replacing a number of elements before ↓ with the right hand side of the reduced rule. So we keep in
rednum the reduction identifier (the look ahead is not relevant anymore here). The table
prod will map the current reduction with a rule alternative as defined above in
STATEMENT2, etc.). We also need to keep the current token in a global variable for communication purposes with other parts of the code that we will see below.
One of the nice things of the LR(1) algorithm is that it is possible to use a simple stack to represent the sequence of elements before ↓. Shifts will push new values onto the stack and a reduce will pop 0 values or more from it. flang1 actually uses two stacks (both with the same size):
pstack where the states of the LR(1) automaton are stored and
sst where the semantic values are stored. More on this later. The variable
stktop tracks the top of the two stacks. Table
len contains the number of elements on the right hand side of the rule alternative. For the example above, this means going from
<var ref> = <expression>↓ to
<assignment>↓. A table called
len gives us the length of the current reduction. For the case of
<assignment> its length would be 3.
Now that we have identified some part of the language a semantic action can be run.
Flang classifies the reductions in 5 sets depending on their number. Each set is defined in the grammar via the special separator
.B (possibly means block but whatever). Variables
p_semsmp are a tuple of pointer to functions. There is a pointer per parsing pass (first or second pass). The pointer points to a semantic function that is given the rule alternative and the new top of the stack (before we add the rule of the reduction to it).
After the semantic action has been executed (which is not strictly related to the LR(1) algorithm itself per se, just a process that fits very conveniently so we can actually parse) we have to update the state of the LR(1) automaton to represent the new state. The automaton must now represent the state that results from having applied the reduction. The table
lhs returns the rule identifier (the one that contains the particular rule alternative
rednum) as defined in
gramnt.h above. Function next_state computes the next state given the current state (in
pstack) and the next symbol. For shifts, as we will see below, this symbol is a token but for reductions is a rule because we have somehow read a whole rule (this is, something that the rule can generate).
If the transition is invalid for the current state and symbol,
NOSTATE which means a syntactic error. This is handled later.
Regardless of whether a reduction has happened or not a shift always happens. Shifting is much simpler because in practice means just updating the LR(1) automaton to the new state.
At this point the parser has to handle a syntactic infelicity in the flang grammar: complex constants. Complex constants in Fortran 90 are of the form
(part, part) where part is either a constant name (like FOO) or a signed integer or real constant. So complex constants like
(-1.2, +3.4) are valid but not
(-1.2 + 3.4, 3.4). Flang, though, accepts the latter case as an extension. Or maybe because it is not feasible to constrain the grammar for this case (the result might not be a grammar valid for LR(1)) or it is too difficult to change the scanner to handle a complex constant as a whole token. The net result is that in some cases a comma (
,) that does not have a valid transition may have one if a complex comma is used instead.
So the parser gives a second chance to a troubled comma.
NOSTATE at this point just means a syntactical error that must be handled at this point (note that this label is inside the block of the
if (nstate == NOSTATE)).
sem.psfunc is used for a deprecated feature in Fortran called statement-function-statement which we can ignore for now. It is just enough to know that they look like assignments so they have to be handled very carefully. Not sure, but probably the parser disables them at this point for sanity later.
If the transition is a valid one we just increment the stacks, allocating more space for them if needed. We keep the new state and the semantic value of the current token. We finally move to the new state by updating
At this point we may have to end the parsing for the current statement in case we reach an end of line. The flag
endflg is set to 1 if we have just finished the current program unit, in which case the parsing ends (
main will reinvoke
parser to parse the next program units). Also the code checks whether we have reached the end of the file when the program unit has not ended yet: this is a severe error reported using
errsev. The extra check for
!scn.multiple_stmts is there because
gbl.eof_flag may have been set when reading a line with more than one statement (separated by
;). It will have been unset when the scanner moves on to the last statement of the line, though.
When the parser recognizes syntactically a part of the program we want to be able to run some semantic actions with that part. These semantic actions include extra checks that cannot be represented in the grammar (i.e. redeclaring a variable) and also to do something with these parts of the language like creating an intermediate representation that will be later used to generate code.
Semantic actions are executed when a reduction happens. Conceptually it is like calling a function in which we pass the semantic values of the right hand side of the reduced rule. We can also understand that the result of the semantic function will be another semantic value, computed as the semantic value of the left hand side rule of the reduced rule.
In practice there is not one function per reduction. If you recall above we saw this code:
They are defined earlier in parser.c like this. Note that
p_semant2 has an empty field for the first component of the tuple.
Also right after the first token of the statement (before we enter the loop for the rest of the tokens of the statement),
p_semant2 are updated (not sure why
p_semant1 needs any fixing because apparently it is not modified anywhere else in the code).
psemant2 pointer is just a function that does nothing. It is used for all the statements in the first parsing pass that are not declarations (i.e. executable statements like an assignment or an if-statement starting an if-construct). Similarly for functions
psemsmp used above in the definition of p_semant3, p_semantio and p_semsmp respectively.
These functions all receive the reduction rule as the first parameter and the stack of semantic actions as the second one, always called
top (or otherwise some macros will stop
LHS above is just an alias of top using a macro). They are defined in the files with their own name (except for
semant1 which is defined in
semant.c). These routines can access the right hand side semantic values of the reduction using the macro
RHS(n) where n starts from 1.
Typically these routines have a giant switch statement with the value of rednum to handle each reduction specifically. Recall that rednum will be one of the values defined in
The semantic value itself is a generic structure that allows encoding several types of data. We will see with more detail how these routines work in the next chapter.
It is possible to see the parser in action passing the following flags
-Hq,0,1 -Hq,1,1 -Hq,2,1. Consider the following silly program:
The first statement
PROGRAM MAIN is parsed with the following debug output.
Lines starting with
tnktyp is the token returned by the scanner as requested by the parser. Every line of the form
prod( n) is a reduction (shifts are not shown in the debug because are usually uninteresting as nothing happens but updating the LR(1) automaton and the stacks).
END is a special token that means end-of-statement and as such is not encoded in the rules.
As you can see many rules are reduced by they have empty right hand sides. These are called nullary reductions and their sole purpose is to cause a side-effect (e.g. changes in the internal semantic state or emitting diagnostics) when calling the semantic routines. Nullary productions are reduced in the right order by the parser once a valid token is found. You can also see that the parser may need to reduce several times before advancing the next token. This happens here once the token
END has been scanned.
The next statement
X = 1 performs a similar process.
Except for the nullary reductions, the process looks relatively similar to the reduction process we showed above when talking about the parsing. As I mentioned above, the nullary productions are used for the semantic process. Here
nii means "not inside interface" (certainly you cannot have an assignment inside an INTERFACE construct) and
nim means "not inside module" (an assignment is not valid in the specification part of a MODULE). The semantic actions of these nullary reductions will diagnose these cases.
A similar process happens with the rest of the statements: the line
X = 2.3; PRINT *, X has two statements so you will see two parses. One for
X = 2.3 and another for
PRINT *, X.
At some point you will see
And the parser will apparently parse exactly the same statements as in the first pass with the same tokens. This time, though, the semantic actions called are different.
- The top-level parser process is invoked for each program unit.
- Each statement of the program unit is then parsed individually using a LR(1) parsing algorithm.
- The LR(1) parsing algorithm is generated using the grammar in the file
gram.txtand the tokens in
gram.tki. The tables required by the algorithm itself are generated by the tool
prstab. This grammar describes the syntax of a single statement.
- When the LR(1) parser reduces the input to a rule it invokes a semantic function.
- different semantic functions may be used in the first and second pass of the parser.
- The LR(1) parser only parses a single statement at a time. Thus, something else (spoiler: the semantic functions) will have to make sure that the sequence of statements make sense for a Fortran program.
In the next chapter we will see in more detail the semantic functions and all the semantic information used by the parser itself.