In previous chapters we saw how the input source was lexed, parsed and semantically analysed and we looked at how the symbols and data types are represented. But we haven't looked at what happens once the semantic analysis finishes. In this installment we're going to talk about the AST.
Abstract Syntax Tree
Fortran has two kinds of statements: non-executable and executable. The former declare properties about entities of the program and we saw in the previous chapter that these entities or symbols are represented in a symbol table. There is usually not much more to do beyond registering and remembering the attributes implied by a non-executable statement.
Executable statements, on the other hand, describe the computation that our Fortran program has to do. So we need a mechanism to represent that computation. At some point we will use this representation to generate a program that does exactly the intended computation by the Fortran source. A common way to represent this is using an AST, or abstract syntax tree.
ASTs are called abstract because they do not represent all the lexical and syntactical details of the source. Instead, they represent the fundamental parts of the language that are going to be relevant for the computation. The flang AST does exactly that.
Let's consider a very simple function: an integer FUNCTION named add that receives two integer arguments, a and b, and returns their sum.
In the last chapter we talked about the symbol table, but we didn't consider any specific example, nor did we attempt to dump any of the symbol tables. Now is a good opportunity to do this.
The flang driver allows us to pass debug flags to flang1 using -Hq,x,y (and to flang2 using -Mq,x,y). The numbers x and y are documented in the flang documentation (file coding.html). We can dump the symbol table by passing -Hq,5,1.
When we request debug information, flang will create a file named <name>.qdbf for every given <name>.f90 compiled. In our case, test.qdbf.
The dump of the table is not super easy to read but basically we see four entries. Each entry starts with the name of the symbol and the kind of symbol. The next lines are several attributes of the symbol. The set of attributes dumped changes for each kind of symbol.
In our example we only have two kinds of symbols: a first add, which is an entry (the way flang names a FUNCTION), and then a, b and another add, which are variables. The second add is there because it represents the result-name of the FUNCTION (had we used RESULT(myname) in the FUNCTION statement we would not have had repeated symbol names). In general repeated names are not a problem (it sometimes happens in Fortran), but in this case note that the scope of the variables is 624, which is the id of the entry symbol. The id of symbols that entail scopes, like FUNCTION, SUBROUTINE and MODULE, is used to define a scoping relationship. Also note that the variables are marked as having a storage-class of SC_DUMMY. In Fortran parlance a dummy-argument is a formal parameter.
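The scoping relationship can be pictured with a small toy model in C. This is only a sketch and not flang's symbol table: the struct, the enum and toy_lookup are made up for illustration, the ids 625 and 626 for a and b are assumed (only 624 and 627 are mentioned in this chapter), and so is the scope 0 given to the entry.

```c
#include <assert.h>
#include <string.h>

/* Toy symbol kinds: an entry (how flang names a FUNCTION) and a variable. */
enum toy_stype { TOY_ST_ENTRY, TOY_ST_VAR };

struct toy_sym {
  int id;               /* id of the symbol */
  const char *name;
  enum toy_stype stype; /* kind of symbol */
  int scope;            /* id of the symbol that defines the scope, 0 if none */
};

/* 624 and 627 come from the dump; 625, 626 and the scope 0 are assumed. */
const struct toy_sym toy_syms[] = {
    {624, "add", TOY_ST_ENTRY, 0},
    {625, "a", TOY_ST_VAR, 624},
    {626, "b", TOY_ST_VAR, 624},
    {627, "add", TOY_ST_VAR, 624},
};

/* Returns the id of the symbol with this name and kind in the given scope,
   or 0 if there is no such symbol. */
int toy_lookup(const char *name, enum toy_stype stype, int scope) {
  size_t n = sizeof toy_syms / sizeof toy_syms[0];
  assert(name != NULL);
  for (size_t i = 0; i < n; i++) {
    const struct toy_sym *s = &toy_syms[i];
    if (s->stype == stype && s->scope == scope && strcmp(s->name, name) == 0)
      return s->id;
  }
  return 0;
}
```

In this model the two symbols named add do not clash: one is an entry and the other is a variable whose scope is the entry itself.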
OK, now we have seen what flang knows (or at least shows in the dumps) about the symbols. Let's see the AST it generates. We can get the AST using -Hq,4,256. It is possible to combine several -Hq flags so we can see several internal dumps at the same time.
For reasons I have not looked into yet, the AST contains references to a couple of preregistered symbols, .sqrt and .dsqrt (they look to be related to the corresponding Fortran intrinsics), and a couple of constants, 0 and 1. Why no other preregistered symbols or constants appear is a bit puzzling to me. That said, the interesting bits of this dump are the following
An AST is made of nodes. A node represents some computation. Sometimes the computation is almost no computation, like just a constant or a reference to a variable. Other times the computation is composed of other computations, that is, of other nodes. Recall that in line 4 our Fortran program performs the assignment ADD = A + B.
It is possible to relate that statement to the AST dump above. Let's see, that statement is in Fortran parlance an assignment statement. Flang represents it using a node for assignments.
assign is the kind of the node: in this case the node represents an assignment. The type of the operation is integer. aptr is the id of this node (remember that these ids, even if they are just numbers, are unrelated to the ids used for symbols or data types, so the same number could appear in each of them). Next come the operands of this node, which are other computations, i.e. other AST nodes. In this case the destination (dest) is encoded in node 15 and the source (src) in node 14. What are the destination and the source of an assignment? They are the left and right-hand sides of the assignment, respectively.
The left-hand side is ADD in our Fortran program. It should be encoded in node 15.
Indeed, we can check the symbol table above and see that 627 is exactly the result name add. So far so good.
The right-hand side of our assignment is A + B and it is encoded in the tree 14.
This node is a bit more interesting because it is a binary operation (binop), again of type integer. Flang names the left and right-hand sides of a binary operation lop and rop, respectively. We can check the corresponding AST nodes.
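To make the shape of this tree concrete, here is a toy model in C of the nodes involved in ADD = A + B. It is a sketch under assumptions, not flang code: every name starting with toy_ is invented, only three kinds of node exist, and the symbol ids passed to toy_build_add are up to the caller. As in flang, a node refers to its operands through integer ids rather than pointers.

```c
#include <assert.h>
#include <stdio.h>

/* Toy node kinds; flang's real kinds are far more numerous. */
enum toy_kind { TOY_ID, TOY_BINOP, TOY_ASSIGN };

/* One node. Operands are ids (indices into the table), not pointers. */
struct toy_node {
  enum toy_kind kind;
  int sptr;      /* TOY_ID: id of the referenced symbol */
  int lop, rop;  /* TOY_BINOP: left and right operands */
  int dest, src; /* TOY_ASSIGN: destination and source */
};

#define TOY_MAX_NODES 16
struct toy_node toy_nodes[TOY_MAX_NODES];
int toy_avail = 1; /* id 0 is reserved to mean "no node" */

int toy_new(enum toy_kind kind) {
  assert(toy_avail < TOY_MAX_NODES);
  toy_nodes[toy_avail].kind = kind;
  return toy_avail++;
}

int toy_id(int sptr) {
  int n = toy_new(TOY_ID);
  toy_nodes[n].sptr = sptr;
  return n;
}

int toy_binop(int lop, int rop) {
  int n = toy_new(TOY_BINOP);
  toy_nodes[n].lop = lop;
  toy_nodes[n].rop = rop;
  return n;
}

int toy_assign(int dest, int src) {
  int n = toy_new(TOY_ASSIGN);
  toy_nodes[n].dest = dest;
  toy_nodes[n].src = src;
  return n;
}

/* Builds the tree for ADD = A + B and returns the id of the assign node. */
int toy_build_add(int sptr_add, int sptr_a, int sptr_b) {
  int sum = toy_binop(toy_id(sptr_a), toy_id(sptr_b));
  return toy_assign(toy_id(sptr_add), sum);
}

/* Prints a node and, recursively, its operands. */
void toy_print(int n) {
  const struct toy_node *p = &toy_nodes[n];
  switch (p->kind) {
  case TOY_ID:
    printf("id(sptr=%d)", p->sptr);
    break;
  case TOY_BINOP:
    printf("binop(");
    toy_print(p->lop);
    printf(", ");
    toy_print(p->rop);
    printf(")");
    break;
  case TOY_ASSIGN:
    printf("assign(dest=");
    toy_print(p->dest);
    printf(", src=");
    toy_print(p->src);
    printf(")");
    break;
  }
}
```

Calling toy_build_add(627, 625, 626) and passing the result to toy_print walks the same path we followed by hand in the dump: from the assignment to its destination and source, and from the source to its two operands.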
Kinds and attributes of AST
Flang has about 160 kinds of AST nodes. They are defined in the file tools/flang1/utils/ast/ast.n. This file uses a troff-like syntax which is not particularly readable, but its form is relatively simple and we can extract information using grep.
The file tools/flang1/utils/ast/ast.n is also used to generate a similar one in the build directory, named tools/flang1/utils/ast/ast.out.n. The latter includes extra information about the memory layout of the nodes. However, the relevant attributes should already be documented in the former. In theory it should be possible to generate Sphinx documentation from this file (as is already done with other similar files) but this does not seem to be implemented yet.
For each .SM entry in ast.n, look at the entries .SI, .SE, .FL and .OV that follow it. Together they provide this information, listed here with the corresponding accessors:
Kind of AST node (BINOP, UNOP, ID, etc.). A_TYPEG(ast) will return NAME. A_TYPEP(ast, NAME) is possible but less common.
Textual description of the node, for debugging only.
Name of a flag (boolean value) of this node. Accessors: A_NAMEG(ast) and A_NAMEP(ast, bool-value).
Name of an attribute of this AST, usually returning an id of another AST or data type. Accessors: A_NAMEG(ast) and A_NAMEP(ast, id).
Name of an attribute of integer nature that is not an id of an entity (like a kind or a hash link). Accessors: A_NAMEG(ast) and A_NAMEP(ast, value).
For instance, AST nodes that represent a binary operation have this information, according to ast.n.
Kind of node: BINOP
CALLFG (flag). Set if a function reference appears in the left and/or right operand. (for instance in 2.3 + SQRT(1.2))
DTYPE. Data type of the result of the operation
LOP. AST pointer to left (first) operand
OPTYPE. Type of operator (see OP_ macros in ast.h)
ROP. AST pointer to right (second) operand
ALIAS. If node evaluates to a constant, this field locates the constant node representing this value
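The getter and setter convention can be imitated in a few lines of C. The sketch below is a toy reimplementation and not the contents of ast.h: the macro names follow the A_NAMEG / A_NAMEP pattern described above, but the struct layout, the table, the TOY_ constants and toy_fill_binop are all assumptions made for the example.

```c
#include <assert.h>

enum { TOY_A_ID = 1, TOY_A_BINOP }; /* toy kinds of node */
enum { TOY_OP_ADD = 1 };            /* toy operator code */

/* Toy layout holding the BINOP information listed above. */
struct toy_ast {
  int type;   /* kind of node */
  int callfg; /* flag: a function reference appears in an operand */
  int dtype;  /* data type of the result */
  int lop;    /* left operand (id of another node) */
  int optype; /* operator */
  int rop;    /* right operand (id of another node) */
  int alias;  /* constant node if this node evaluates to a constant */
};

#define TOY_MAX 16
struct toy_ast toy_tab[TOY_MAX];

/* Toy accessors: the G suffix gets the value, the P suffix puts it. */
#define A_TYPEG(a) (toy_tab[a].type)
#define A_TYPEP(a, v) (toy_tab[a].type = (v))
#define A_CALLFGG(a) (toy_tab[a].callfg)
#define A_CALLFGP(a, v) (toy_tab[a].callfg = (v))
#define A_DTYPEG(a) (toy_tab[a].dtype)
#define A_DTYPEP(a, v) (toy_tab[a].dtype = (v))
#define A_LOPG(a) (toy_tab[a].lop)
#define A_LOPP(a, v) (toy_tab[a].lop = (v))
#define A_OPTYPEG(a) (toy_tab[a].optype)
#define A_OPTYPEP(a, v) (toy_tab[a].optype = (v))
#define A_ROPG(a) (toy_tab[a].rop)
#define A_ROPP(a, v) (toy_tab[a].rop = (v))

/* Fills node `ast` as a binary operation. In this toy, CALLFG is set when
   either operand already has it set. */
void toy_fill_binop(int ast, int optype, int lop, int rop, int dtype) {
  assert(ast > 0 && ast < TOY_MAX);
  A_TYPEP(ast, TOY_A_BINOP);
  A_OPTYPEP(ast, optype);
  A_LOPP(ast, lop);
  A_ROPP(ast, rop);
  A_DTYPEP(ast, dtype);
  A_CALLFGP(ast, A_CALLFGG(lop) || A_CALLFGG(rop));
}
```

Keeping every access behind a macro is what allows the node layout to be generated from a description file: code that uses the accessors does not need to know where each field lives.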
The gory details
If you are really interested in seeing how an AST is stored, just take a look at tools/flang1/utils/ast/ast.h in the build directory. You will see that there is a global variable astb which represents the global block that stores all the ASTs. A single AST node is represented using a struct of the type AST.
The names of the nodes are stored in astb.atypes, indexed by the type of the node (i.e. A_TYPEG(ast)).
The field astb.attr stores some attributes intended to simplify some checks.
The astb global variable also stores other info. We will discuss some of that extra info later in this chapter.
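A rough picture of such a global block, as a C sketch. Only the field names atypes and attr come from the text above. Everything else is an assumption for illustration: the toy_ names, the fixed-size arrays, the name strings other than "assign" and "binop" (which are the ones seen in the dump), and in particular the TOY_ATTR_EXPR bit, which is a hypothetical example of an attribute that simplifies a check.

```c
#include <assert.h>
#include <string.h>

enum toy_type { TOY_A_NULL, TOY_A_ID, TOY_A_BINOP, TOY_A_ASN, TOY_A_COUNT };

/* Hypothetical attribute bit: this kind of node is an expression. */
#define TOY_ATTR_EXPR 0x1

struct toy_ast {
  int type; /* kind of node; the remaining fields are omitted here */
};

#define TOY_NODES 8

/* Toy global block: the nodes plus per-kind tables indexed by node type. */
struct toy_astb {
  struct toy_ast nodes[TOY_NODES];
  const char *atypes[TOY_A_COUNT]; /* printable name of each kind */
  int attr[TOY_A_COUNT];           /* attribute bits of each kind */
};

struct toy_astb toy_astb = {
    .nodes = {{TOY_A_NULL}, {TOY_A_ID}, {TOY_A_ID}, {TOY_A_BINOP}, {TOY_A_ASN}},
    .atypes = {"null", "ident", "binop", "assign"},
    .attr = {0, TOY_ATTR_EXPR, TOY_ATTR_EXPR, 0},
};

/* Name of the kind of node `ast`, looked up through its type. */
const char *toy_type_name(int ast) {
  assert(ast >= 0 && ast < TOY_NODES);
  return toy_astb.atypes[toy_astb.nodes[ast].type];
}

/* Nonzero if node `ast` is of a kind marked as an expression. */
int toy_is_expr(int ast) {
  assert(ast >= 0 && ast < TOY_NODES);
  return (toy_astb.attr[toy_astb.nodes[ast].type] & TOY_ATTR_EXPR) != 0;
}
```

Because both tables are indexed by the kind of node, a check on a node is just one lookup through its type, with no need to enumerate kinds in a switch.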
Creation of an AST
ASTs are created in many places of the compiler but semantic analysis is going to create the initial set that will be used by further analysis and transformations in the compiler.
Low level creation
To create an AST node we can use the function new_node. This is a low level function that will return a new node with the given kind.
This function basically allocates a new AST in astb. The macro STG_NEXT expands to the call stg_next((STG *)&name.stg_base, 1). The function stg_next is a bit too long to explain today, but basically it reserves a number of elements and returns the first element allocated, as an integer of course. It is a generic function, but as it is applied to astb here, it is used to allocate ASTs.
Due to implementation details elsewhere there is a maximum number of ASTs we can allocate, so we check this. The number is fairly large, 0x0400_0000 (and it also gives us 5 extra bits at the top of the identifier if we ever need them).
The next thing this function does is set the kind of the node. The kind of the node is called TYPE in flang. A_TYPEP is the setter we need to use.
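The three steps described above (reserve an element, check the limit, set the kind) can be sketched in C as follows. This is a toy under assumptions, not flang's new_node: the toy_ names are invented, the field names stg_avail and stg_size are guesses (only stg_base appears in the macro expansion above), and the growth policy and the use of assert for the limit check are choices made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAXAST 0x04000000 /* the limit mentioned in the text */

struct toy_ast {
  int type;        /* kind of node */
  int operands[4]; /* room for the remaining fields */
};

/* Toy storage block: a growable array plus the next free index. */
struct toy_block {
  struct toy_ast *stg_base;
  int stg_avail; /* next free index (assumed name) */
  int stg_size;  /* allocated elements (assumed name) */
};

struct toy_block toy_astb = {NULL, 1, 0}; /* index 0 stays unused */

/* Reserves n consecutive elements and returns the index of the first. */
int toy_stg_next(int n) {
  int first = toy_astb.stg_avail;
  int needed = first + n;
  if (needed > toy_astb.stg_size) {
    int new_size = toy_astb.stg_size ? toy_astb.stg_size : 8;
    while (new_size < needed)
      new_size *= 2;
    struct toy_ast *p =
        realloc(toy_astb.stg_base, (size_t)new_size * sizeof *p);
    assert(p != NULL);
    /* Zero the newly added elements so that fresh nodes start empty. */
    memset(p + toy_astb.stg_size, 0,
           (size_t)(new_size - toy_astb.stg_size) * sizeof *p);
    toy_astb.stg_base = p;
    toy_astb.stg_size = new_size;
  }
  toy_astb.stg_avail = needed;
  return first;
}

/* Allocates a bare node of the given kind and returns its id. */
int toy_new_node(int type) {
  int nd = toy_stg_next(1);
  assert(nd <= TOY_MAXAST);          /* check the maximum number of ASTs */
  toy_astb.stg_base[nd].type = type; /* what A_TYPEP does in flang */
  return nd;
}
```

Because nodes are identified by an index and not by a pointer, the array can be moved by realloc without invalidating any id that other nodes already store.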
While new_node will give us a node, it is pretty bare. Most of the time our nodes have essential information that we will always need in order to create them correctly. For instance, a binary operation will always have the kind of operation performed, the left-hand side, the right-hand side and the type of the operation. As such, flang provides many helpers that ease creating those nodes. Their names are of the form mk_*.
Flang seems to distinguish, by convention, the creation of statement ASTs from other ASTs. So there is a generic mk_stmt. The code uses this to signal that the AST is actually a statement. It receives a DTYPE that states the data type of the statement. Statements that do not compute any actual data, like control flow statements, are created with a data type of 0.
Some statements have dedicated constructors, like mk_assn_stmt, which is used to create an assignment statement. It receives the data type, the destination (the left-hand side of the assignment) and the source (the right-hand side). The destination and the source are ASTs (although this is not made obvious in the prototype).
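The layering of these helpers can be sketched in C like this. It is a toy and not flang's ast.c: the functions carry a toy_ prefix on purpose, the parameter order is an assumption, and the kinds and data type codes are invented constants.

```c
#include <assert.h>

enum { TOY_A_ID = 1, TOY_A_ASN, TOY_A_CONTINUE }; /* toy kinds */
enum { TOY_DT_NONE = 0, TOY_DT_INT = 6 };         /* toy data types */

struct toy_ast {
  int type;  /* kind of node */
  int dtype; /* data type, 0 if no data is computed */
  int sptr;  /* TOY_A_ID: referenced symbol */
  int dest;  /* TOY_A_ASN: left-hand side (id of a node) */
  int src;   /* TOY_A_ASN: right-hand side (id of a node) */
};

#define TOY_MAX 32
struct toy_ast toy_tab[TOY_MAX];
int toy_avail = 1; /* id 0 means "no node" */

/* Low level: a bare node that only has its kind set. */
int toy_new_node(int type) {
  assert(toy_avail < TOY_MAX);
  toy_tab[toy_avail].type = type;
  return toy_avail++;
}

/* Reference to a symbol. */
int toy_mk_id(int sptr, int dtype) {
  int ast = toy_new_node(TOY_A_ID);
  toy_tab[ast].sptr = sptr;
  toy_tab[ast].dtype = dtype;
  return ast;
}

/* Generic statement: a kind of statement plus its data type. */
int toy_mk_stmt(int stmt_type, int dtype) {
  int ast = toy_new_node(stmt_type);
  toy_tab[ast].dtype = dtype;
  return ast;
}

/* Dedicated constructor for assignments. dest and source are node ids. */
int toy_mk_assn_stmt(int dest, int source, int dtype) {
  int ast = toy_mk_stmt(TOY_A_ASN, dtype);
  toy_tab[ast].dest = dest;
  toy_tab[ast].src = source;
  return ast;
}
```

The dedicated constructor only adds the operands on top of the generic one, so a statement without computed data, such as the toy TOY_A_CONTINUE, can be created directly with toy_mk_stmt and a data type of 0.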
Flang ASTs are going to represent two different things: expressions and statements.
Expressions will have a type and denote a computation which mostly revolves about evaluating the expression to determine its value. Statements are instead executed, rather than evaluated. Or put differently, they are evaluated just for their side-effects. Because statements have side-effects the order in which we execute them may be important.
Sometimes it does not matter because there is no way to tell. For instance, take an assignment A = 3 that sits between two PRINT statements, where the first PRINT does not use A and the second one does. It does not matter whether A = 3 is executed before or after the first PRINT statement: we simply can't tell the order. But we can't run the second PRINT before the first PRINT, or execute the assignment after the second PRINT, as that would be an observably different execution of the program.
Because not all ASTs are statements, flang decouples information that is strictly for statements in another structure called the Statement Descriptor (STD). It is an attribute of an AST that represents a statement and it is accessible via A_STDP and A_STDG as expected.
The STD data structure is relatively simple. The most relevant attributes are the following.
The AST that is described by this STD. Contains an AST identifier
The STD that follows this STD. This is a STD identifier
The STD that precedes this STD. This is a STD identifier
The label symbol of this statement if it has one. This is a SPTR
The line number of this statement
This is the file index. It identifies the file that contains the statement. It changes because of INCLUDE lines and preprocessor #include directives. The identifier is a FIH (File Information Header) that we haven't discussed yet.
Creation of an STD
To create an STD for an AST the function mk_std can be used. This will only link (bidirectionally) the STD and the AST, but we still need to place the statement itself.
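As a C sketch, an STD with the attributes listed earlier and a function that links it with its AST could look like the following. The field names and the toy_ functions are assumptions for illustration and do not come from flang's headers.

```c
#include <assert.h>

struct toy_ast {
  int type; /* kind of node */
  int std;  /* STD of this node if it is a statement, 0 otherwise */
};

/* Toy statement descriptor with the attributes listed above. */
struct toy_std {
  int ast;    /* AST described by this STD */
  int next;   /* STD that follows this one */
  int prev;   /* STD that precedes this one */
  int label;  /* label symbol, 0 if the statement has none */
  int lineno; /* line number of the statement */
  int findex; /* index of the file that contains the statement */
};

#define TOY_MAX 16
struct toy_ast toy_asts[TOY_MAX];
struct toy_std toy_stds[TOY_MAX];
int toy_std_avail = 1; /* id 0 means "no STD" */

/* Creates an STD for `ast` and links both in the two directions.
   The new STD is not placed anywhere yet: next and prev stay 0. */
int toy_mk_std(int ast) {
  assert(ast > 0 && ast < TOY_MAX);
  assert(toy_std_avail < TOY_MAX);
  int std = toy_std_avail++;
  toy_stds[std].ast = ast; /* STD -> AST */
  toy_asts[ast].std = std; /* AST -> STD */
  return std;
}
```

After the call, the statement can be reached from its descriptor and the descriptor from its statement, but the descriptor is still not part of the sequence of statements of the program.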
In the early stages of the compiler, the semantic actions of the parser are responsible for most of the creation of ASTs and STDs. Because statements are often created in the order of the program, there is a convenient function add_stmt that will call mk_std on the new AST and then place the STD as the last one. Sometimes the ASTs of statements have to be placed before or after other STDs, and this can be achieved with helpers like add_stmt_after and add_stmt_before.
The file ast.c contains many other functions useful to manipulate the placement of statements. Note that where the code expects a statement it means an identifier of an STD, not of an AST.
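The placement helpers can be modelled as operations on a doubly linked list of STDs. The sketch below is a toy under assumptions: the toy_ functions are invented, and using the STD with id 0 as the header of the list (its next is the first statement and its prev is the last one) is a design choice of this example.

```c
#include <assert.h>

struct toy_std {
  int ast;  /* AST described by this STD */
  int next; /* STD that follows this one */
  int prev; /* STD that precedes this one */
};

#define TOY_MAX 16
/* toy_stds[0] is the header of the list: next is the first statement and
   prev is the last one. An empty list has both set to 0. */
struct toy_std toy_stds[TOY_MAX];
int toy_std_avail = 1;

int toy_mk_std(int ast) {
  assert(toy_std_avail < TOY_MAX);
  toy_stds[toy_std_avail].ast = ast;
  return toy_std_avail++;
}

/* Links `std` right after `where` (which may be the header, 0). */
void toy_insert_after(int std, int where) {
  int next = toy_stds[where].next;
  toy_stds[std].prev = where;
  toy_stds[std].next = next;
  toy_stds[where].next = std;
  toy_stds[next].prev = std;
}

/* Places the statement as the last one. */
int toy_add_stmt(int ast) {
  int std = toy_mk_std(ast);
  toy_insert_after(std, toy_stds[0].prev);
  return std;
}

/* Places the statement after / before an existing STD (not an AST). */
int toy_add_stmt_after(int ast, int std_after) {
  int std = toy_mk_std(ast);
  toy_insert_after(std, std_after);
  return std;
}

int toy_add_stmt_before(int ast, int std_before) {
  int std = toy_mk_std(ast);
  toy_insert_after(std, toy_stds[std_before].prev);
  return std;
}

/* Copies the ASTs in statement order into `out`; returns how many. */
int toy_collect(int *out, int max) {
  int n = 0;
  for (int std = toy_stds[0].next; std != 0 && n < max;
       std = toy_stds[std].next)
    out[n++] = toy_stds[std].ast;
  return n;
}
```

Note how the second argument of toy_add_stmt_after and toy_add_stmt_before is the id of an STD, while the first one is the id of an AST, which mirrors the remark above about what the code means by a statement.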
In this chapter we have seen the structure that flang uses to represent the computations performed by our Fortran program.
The computation is represented using an AST. An AST is a data structure formed by AST nodes.
There are about 160 different kinds of AST nodes. Each kind represents a different form of computation, ranging from simple ones like references to variables or constants to more complex ones like binary operations and statements.
Each kind of node has different attributes. Each attribute has its own accessors. Some of these attributes are ASTs, SPTRs or DTYPEs.
Some ASTs represent Fortran statements, where the order is important. The extra information of statements is decoupled from regular ASTs in a data structure called STD.
STDs are concerned with the ordering of the Fortran statements.
In the next installment of this series we will look at ASTs a bit more. In particular, we have not yet considered statements that change the control flow.