Walk-through flang

In the previous installment of this series we saw how flang parses the statements using an LR(1) algorithm. As the parser recognized the parts of the statements it invokes semantic actions. Today we’re going to talk more about them.

What is not being "parsed" yet

In part 4 we saw that the parser uses an LR(1) algorithm to parse the Fortran statements. But Fortran has constructs that require a specific sequence of statements: for instance a (block) do-construct is formed by a do-statement, followed by one or more statements and constructs and ended with an end-do-statement. Also not all statements are not useable in all contexts, for instance we cannot use an assignment-statement in the non-executable part of a MODULE program unit, use more than once the CONTAINS statement in any program unit or declare a variable once we have assigned a variable (unless we have started a new BLOCK construct).

The examples above are just a few of the things that the parser, because of the kind of grammars used today, can't really check by itself. We have to do these syntactical checks elsewhere. But there are other aspects of the language that the parser cannot check, for instance whether we're declaring a variable twice or using it in a way that would render an ill-formed program. For instance one cannot pass a REAL variable as argument of a function expecting an INTEGER parameter.

Luckily as we recognize the statements of the program we can react to them and assess their validity and meaning. What has to be done when a part of the language is recognized can be named as a semantic action. Although not all actions are strictly related to semantics because some of them will simply validate syntactical properties (they will be concerned only about the form instead of the meaning).

Displeasure

It is precisely this semantic analysis of front-ends and parsers the one that most people find annoying. The reason is that the solutions are usually very ad-hoc, extremely tailored to the language, rarely general or systematic enough. Don't get surprised if this part of flang is precisely an example of this. Of course it does not have to be like this but it is hard (and difficult to motivate) to do otherwise.

Semantic actions and values

Semantic actions are a way of implementing a theoretical formalism called attribute grammars. Attribute grammars associate properties to terminals and non-terminals called attributes. These attributes can be synthesized or inherited. Purely synthesized attributes are those that are defined strictly by the part of the language. For instance an expression 1.0 + 1 we know it will have type REAL and an expression 1 + 1 will have a type INTEGER. We can easily identify the type of an expression as a synthesized attribute of expressions. An inherited attribute, though, is one that depends on the specific situation of the part of the language. These are attributes that somehow denote context. For instance in the following example, the expression A + 1 has type REAL. And we know this because A has type REAL. If we redeclare A as an INTEGER the expression A + 1 has type INTEGER. This is because the occurrence of A in an expression may have a type or another depending on the "declared type" of A, that declared type is an inherited attribute at this point.

REAL :: A
A = 42
PRINT *, A + 1

You will see that implementing attribute grammars in an LR(1) parser will end in mixed feelings.

For the synthesized attributes, we use semantic values. Each part of the language (token or rule) is informally (i.e. enforced by checks/assumptions in the code) assigned a semantic value. There are many types of semantic values defined in semant.h but the 4 most used ones are the following:

tools/flang1/flang1exe/semant.h

#define S_CONST 1    /* scalar constant */
#define S_EXPR 2     /* expression */
#define S_LVALUE 3   /* non-whole variable */
#define S_IDENT 7    /* identifier, possibly a whole variable */

S_EXPR is used for general expressions. S_CONST is used for expressions that denote constants or for contexts where a constant is required. S_LVALUE and S_IDENT represent variables in Fortran parlance, this is, expressions that denote some storage in memory. Usually these expressions appear at the left of an assigment hence the name lvalue. S_IDENT may also be used for places where the name of an identifier is required.

All semantic values, regardless of their particular type share a few attributes that are always available. The semantic value is defined in semstk.h and their type in C is SST but in general only pointers to it will be used.

tools/flang1/flang1exe/semstk.h

typedef struct sst {
  short id;           /* type of this stack entry */
  unsigned char flag; /* general flag */
  unsigned f1 : 1;    /* plain expr flag - 0 => no parens */
  unsigned f2 : 1;    /* id is an alias */
  int ast;
  int mnoff; /* derived type flag & information */
  int sr;    /* save & restore word */
  /* value of this stack entry */
  ...
} SST;

The field id stores the type (e.g. S_CONST or S_EXPR). Another important field is ast where an identifier to an AST tree will be stored, ASTs will represent computation itself. The other fields are used less often as they have narrower meanings.

A SST also has a type-specific fields but they are not commonly accessed directly but using a bunch of macros which encapsulate the attributes. These macros use the following naming convention: if they end with G the macro just returns the attribute (i.e. it gets it). If the macro ends with P, then the macro receives two arguments, the first one being the semantic value and the second one being the value to which we want assign the attribute (i.e. it puts it). Most of the time, the attributes that we can get and set are integers.

Recall that a semantic action is executed when a reduction happens. Each reduction is a rule alternative with a left-hand side being the rule name and the right-hand side a sequence of tokens and rules. Each of these tokens and rules will have their semantic value. We can read the semantic values of the elements of the right hand side using the macro RHS(n) where n is the number of the element that we want to access (starting from 1). We can use the macro LHS to access the semantic value of the left hand side, typically to assign its attributes. Be careful to read RHS(1) first before setting LHS as they actually refer to the same memory. This might be handy if LHS is just propagating RHS(1), in a unary reduction of the form X ::= Y in cases where nothing else must have to be checked.

What about inherited attributes? Well, it is not possible to implement them nicely, so a way to do this is using global variables or data structures that represent the environment. In the case of flang there is a giant structure in tools/flang1/flang1exe/semant.h called sem with lots of fields that will be used to represent inherited information. This global variable called sem has so many fields that is not worth detailing them here.

Example

Let's go through a very simple example to see what the parser does. Recall that the parser does two passes through the input and in each pass different semantic actions can be run for the same reduction. We start with the first pass but later on we will have to switch to the second one.

1
2
3
4
PROGRAM MAIN
    INTEGER :: X
    X = 1
END PROGRAM MAIN

When the parser sees the token PROGRAM it already knows it can perform the following reduction.

prod(   2) <stbeg> ::=

This is a nullary reduction so no RHS will be used. Setting LHS is possible in these cases though. If you recall the part 4, reductions are classified in blocks. This one belongs to the first block and is implemented in semant.c. The code it does is surprisingly long so I'm not pasting all of it here.

tools/flang1/flang1exe/semant.c

/* ------------------------------------------------------------------ */
/*
 *    <stbeg> ::=
 */
case STBEG1:
  if (sem.in_enum) {
    switch (scn.stmtyp) {
    case TK_ENUMERATOR:
    case TK_ENDENUM:
      break;
    default:
      error(155, 3, gbl.lineno, "ENUMERATOR statement expected", CNULL);
      sem.ignore_stmt = TRUE;
      break;
    }
  }

First it checks if we are inside an enum definition (a new feature of Fortran 2003) which is kept in sem.in_enum. sem is a global variable that keeps a vast amount of state regarding the parsing. It is defined in tools/flang1/flang1exe/semant.h. Inside an enum definition only an ENUMERATOR constructor or END ENUM can appear so these are the two valid tokens at this point, otherwise this statement is wrong and has to be ignored (sem.ignore_stmt = TRUE). break will finish the semantic processing.

tools/flang1/flang1exe/semant.c

if (sem.pgphase == PHASE_USE) {
  switch (scn.stmtyp) {
  case TK_USE:
  case TK_INCLUDE:
    break;
  default:
    apply_use_stmts();
    if (sem.deferred_func_kind) {
      get_retval_KIND_value();
    }
    if (sem.deferred_func_len) {
      get_retval_LEN_value();
    }
    if (sem.deferred_dertype) {
      get_retval_derived_type();
    }
    break;
  }
}

If you recall part 2 you may remember that Fortran has a strict ordering of statements. If we are still in a position to accept USE statements and the next token suggests a USE-statement we just proceed. Otherwise, no more USE statements can appear in which case their effect can be executed at this point: basically adding the imported names. Also functions may have some attributes that use names in these USE-statements, so the frontend has had to defer them up to this point.

There are a few more checks but most of them concerned to parallel constructs that we will omit here. Now the following reduction is executed.

prod(  33) <id> ::= <id name>

tools/flang1/flang1exe/semant.c

/* ------------------------------------------------------------------ */
/*                                                                      
 *    <id> ::= <id name>                                                
 */                                                                     
case ID1:                                                               
  np = scn.id.name + SST_CVALG(RHS(1));                                 
  sptr = getsymbol(np);                                                 
  if (sem.in_dim && sem.type_mode && !KINDG(sptr) &&                    
      STYPEG(sptr) != ST_MEMBER) {                                      
    /* possible use of a type parameter in the dimension field          
     * of an array type component declaration                           
     */                                                                 
    KINDP(sptr, -1);                                                    
  }                                                                     
  SST_SYMP(LHS, sptr);                                                  
  SST_ACLP(LHS, 0);

As we explained in part 3, scn.id.name is a buffer of null-ended strings for the current statement. Each identifier tokenized by the parser is given as a value the offset (in bytes) of that buffer where the identifier is stored (i.e the buffer is a sequence of null ended strings). When doing a shift, the parser has put the value passed by the tokenizer (although it used SST_SYMP that macro uses the same storage as SST_CVALx). So in np we will have the current identifier. The next step is getting a symbol pointer (it is actually a 32-bit integer) of it in the symbol table. We will talk about the symbol table in the next chapter. Once we get the symbol very little has to happen at this level: just store it in LHS using SST_SYMP (put the symbol lptr in LHS). SST_ACLP just clears a field that is used for array-constructors (it is unclear to me why it has to be done for every identifier at this point).

Some other checks happen but are not very relevant at this point. The next reduction is

prod(  45) <routine id> ::= PROGRAM <id>

tools/flang1/flang1exe/semant.c

/*                                                      
 *      <routine id> ::= PROGRAM <id>                   
 */                                                     
case ROUTINE_ID4:                                       
  gbl.rutype = RU_PROG;                                 
  rhstop = 2;                                           
  if (IN_MODULE)                                        
    ERR310("PROGRAM may not appear in a MODULE", CNULL);

The semantic action starts by marking in a field of gbl.rutype that this is a program. Other valid values of this field are RU_FUNC (for FUNCTION), RU_SUBR (for PROGRAM) and RU_BDATA for (BLOCK DATA). To be honest I have no idea what the prefix RU_ means here but apparently is related to subprogram. Maybe it started as a way to distinguish the routines and RO might be confused with read-only.

Next is checking that we're not inside a module. This macro expands to (sem.mod_cnt && gbl.internal == 0) which means we have seen zero modules so far and we are not a sub-program. Certainly a PROGRAM statement cannot appear inside a MODULE.

tools/flang1/flang1exe/semant.c

routine_id:                                                
  is_entry = FALSE;                                        
  if (sem.interface && gbl.currsub) {                      
    error(303, 2, gbl.lineno, SYMNAME(gbl.currsub), CNULL);
    pop_subprogram();                                      
    pop_scope_level(SCOPE_NORMAL);                         
  }

This code is shared among several semantic functions so not all the checks will make much sense for the PROGRAM statement itself. Here we acknowledge that this is not an ENTRY statement. Confusingly is_entry is a static variable of semant.c so we may just be resetting it here to false. Also a PROGRAM statement cannot appear inside an INTERFACE. A few lines below we see some interesting stuff:

tools/flang1/flang1exe/semant.c

  sptr1 = sptr = SST_SYMG(RHS(rhstop));

Where the code retrieves the symbol from RHS(2), so the <id>. This is the attribute we synthesized in the previous reduction.

The next reduction is

prod(  23) <prog title> ::= <routine id> |

Here we would check generic things related to PROGRAM, FUNCTION, SUBROUTINE, MODULE and BLOCKDATA statements. For FUNCTION and SUBROUTINE at this point we would keep the dummy arguments (formal parameters) that would have been synthesized in another semantic function.

Next reduction is

prod(   4) <statement> ::= <prog title>  |

This semantic function does final processing, most of it only of concern of parallel constructs.

prod(   3) <stend> ::=

More finish processing, mostly cleanups so the state of the next statement is ready.

prod(   1) <stmt> ::= <stbeg> <statement> <stend>

The semantic function associated to this reduction does nothing.

So far we have just parsed PROGRAM MAIN. Now we move onto INTEGER :: X. I'll skip the boring reductions this time.

prod( 154) <base type> ::= INTEGER |

The semantic just synthesizes an inherited attribute for this one that will be used later.

tools/flang1/flang1exe/semant.c

      /* ------------------------------------------------------------------ */
      /*                                                                      
       *      <base type> ::= INTEGER  |                                      
       */                                                                     
      case BASE_TYPE1:                                                        
        sem.gdtype = sem.ogdtype = stb.user.dt_int;                           
        sem.gty = TY_INT;                                                     
        break;

prod( 163) <opt len spec> ::=    |

Basically we remember that there is not any kind of length specifier like INTEGER * N or CHARACTER * N. Technically the Fortran standard only allows this for CHARACTER under the LEN attribute (i.e. CHARACTER * N means CHARACTER(LEN=N)) but historically this has been accepted for non-CHARACTER types to mean KIND (i.e INTEGER * 8 means INTEGER(KIND=8)). Apparently Flang also accepts a colon (:) instead of *, so INTEGER : 8 would also be accepted.

tools/flang1/flang1exe/semant.c

/* ------------------------------------------------------------------ */
/*
 *      <opt len spec> ::= |
 */
case OPT_LEN_SPEC1:
  SST_IDP(LHS, 0);
  SST_SYMP(LHS, -1);
  SST_ASTP(LHS, 0);
  SST_DTYPEP(LHS, sem.gdtype);
  break;

prod( 145) <data type> ::=  <base type> <opt len spec> |

Here the semantic function synthesizes the final type using <base type> and <opt len spec>.

prod( 298) <opt attr list> ::=  |

Here we would handle the attributes like DIMENSION, POINTER, ALLOCATABLE, etc. But in this declaration there is none, so nothing of interest happens (only minor bookkeeping).

prod(  21) <pgm> ::=

Here we check that we are inside some program unit. In this case this will do nothing, but for historical reasons it is possible to start an implicit PROGRAM program unit by using any statement that may follow a PROGRAM statement. This means, among other consequences, that the minimal valid Fortran program is just END and a "hello world" can be as short as PRINT *, "HELLO WORLD" statement followed by an END statement: no need for an initial PROGRAM (though omitting it is currently deprecated).

prod(  33) <id> ::= <id name>
prod(  32) <ident> ::= <id>

The first production we already saw it for the name of PROGRAM. The second one just marks the semantic value of LHS as an identifier.

prod( 163) <opt len spec> ::=    |

Ok, again a length specifier here. In this case it would be like INTEGER :: variable * N which again is an extension, in standard Fortran this is only allowed for CHARACTER. Our declaration has none so again nothing happens here.

prod( 360) <entity id> ::= <ident> <opt len spec>  |

Here we gather the length information and array information (none for our declaration). We take the symbol we stored in <ident> and create a variable for it.

tools/flang1/flang1exe/semant.c

      entity_id_shared:         
        sptr = SST_SYMG(RHS(1));
        /* .. several lines omitted .. */
        sptr = create_var(sptr);
        /* .. several lines omitted .. */
        SST_SYMP(LHS, sptr);
        stype1 = STYPEG(sptr);

create_var (defined in semant.c) makes a few checks and calls refsym_inscope (defined in semsym.c) to insert the symbol in the current scope. After this, the semantic function checks the type of the symbol and does several extra checks on the type of the declaration. If all goes well then the symbol is given the type. The code is a bit involved at this point because Fortran has several non-executable statements that can set the same attributes. The following code shows two ways of declaring two arrays A and B of 10 INTEGER elements.

INTEGER :: A       ! Sets the type attribute of A to be INTEGER
DIMENSION :: A(10) ! Sets the dimension attribute so A is an array 1 to 9
                   ! Ultimately A is an array 1 to 9 of INTEGER
INTEGER :: B(10)   ! Sets the type and dimension attributes at once.
                   ! B is an array 1 to 9 of INTEGER

Several statements allow setting the dimension attribute for historical reasons as well:

INTEGER :: A       ! Sets the type attribute of A to INTEGER
SAVE :: A(10)      ! Sets the save and dimension attributes

INTEGER :: B       ! Sets the type attribute of B to INTEGER
TARGET :: B(10)    ! Sets the target and dimension attributes

prod( 357) <entity decl> ::= <entity id> |

At this point some checks regarding the initialization of the entity happen.

prod( 356) <entity decl list> ::= <entity decl>

This does nothing in the context of a type declaration but it just synthesizes a list of entities in other cases (the reduction used is the same in both).

prod( 104) <declaration> ::= <data type> <opt attr list> :: <pgm> <entity decl list>

This is the whole declaration. The semantic function just cleans up a few variables and sets a null AST to the LHS semantic value.

tools/flang1/flang1exe/semant.c

/*                                                                       
 *    <declaration> ::= <data type> <opt attr list> :: <pgm> <entity decl
 *list> |                                                                
 */                                                                      
case DECLARATION26:                                                      
  if (entity_attr.exist & ET_B(ET_PARAMETER)) {                          
    seen_parameter = TRUE;                                               
    if (sem.interface == 0)                                              
      end_param();                                                       
  }                                                                      
  SST_ASTP(LHS, 0);                                                      
  in_entity_typdcl = FALSE;                                              
  entity_attr.exist = 0;                                                 
  entity_attr.access = ' ';                                              
  bind_attr.exist = -1;                                                  
  bind_attr.altname = 0;                                                 
  break;

prod(   6) <statement> ::= <declaration> |

At this point the code updates the program phase, this is, the position that determines which statements inside the program unit are allowed and which are not.

tools/flang1/flang1exe/semant.c

case STATEMENT3:
  sem.class = 0;
  prevphase = sem.pgphase;
  if (scn.stmtyp == TK_IMPLICIT) {
    if (sem.pgphase > PHASE_IMPLICIT)
      errsev(70);
    else
      sem.pgphase = PHASE_IMPLICIT;
  } else if (scn.stmtyp == TK_DATA || scn.stmtyp == TK_NAMELIST) {
    if (sem.pgphase > PHASE_EXEC)
      errsev(70);
    else if (sem.pgphase < PHASE_SPEC)
      sem.pgphase = PHASE_SPEC;
  } else if (scn.stmtyp == TK_INTERFACE || scn.stmtyp == TK_ABSTRACT) {
    sem.pgphase = PHASE_INIT;
    prevphase = PHASE_INIT;
  } else if (scn.stmtyp == TK_PARAMETER) {
    if (sem.pgphase > PHASE_SPEC)
      errsev(70);
    else if (sem.pgphase < PHASE_IMPLICIT)
      sem.pgphase = PHASE_IMPLICIT;
  } else if (scn.stmtyp == TK_USE) {
    if (sem.pgphase > PHASE_USE)
      errsev(70);
    else if (sem.pgphase < PHASE_USE)
      sem.pgphase = PHASE_USE;
  } else if (scn.stmtyp == TK_IMPORT) {
    if (sem.pgphase > PHASE_IMPORT)
      errsev(70);
    else if (sem.pgphase < PHASE_IMPORT)
      sem.pgphase = PHASE_IMPORT;
  } else {
    if (sem.pgphase > PHASE_SPEC)
      errsev(70);
    /* allow for attributes before a use statement */
    else if (scn.stmtyp != TK_ATTRIBUTES && scn.stmtyp != TK_MP_DECLARESIMD)
      sem.pgphase = PHASE_SPEC;
  }

Then, if the declaration entails any computation, it keeps the AST that represents it. This might happen for instance with a POINTER that must be initialized in a way that is not ASSOCIATED to anything, likewise for ALLOCATABLES and other entities.

tools/flang1/flang1exe/semant.c

if ((ast = SST_ASTG(LHS))) {
  (void)add_stmt(ast);      
  SST_ASTG(LHS) = 0;        
}

At this point we have finished with INTEGER :: X and we can proceed to parse X = 1. Again I will omit repeated reductions.

prod(  19) <nii> ::=

Here we check that we are not inside an INTERFACE. Certainly an assignment is not valid happen there.

prod(  20) <nim> ::=

Here we check that we are not inside a MODULE. Again, an assignment is no valid there. Several of the reductions in this pass of the parsing do not have any interesting semantic value, so let's assume we're already in the second pass.

prod( 765) <psfunc> ::=

According to the code, this rule just toggles a flag that says whether we are at the left of a =. psfunc apparently means possible statement function. A statement function is like a small declaration that totally looks like an assignment except that it takes scalar parameters and returns a scalar value. For obvious reasons this statement is deprecated as of Fortran 90.

PROGRAM FOO
 IMPLICIT NONE
 INTEGER :: X
 INTEGER :: F
 F(X) = X + 1     ! Looks like an assignment
                  ! but is a function statement
                  ! They can only appear before
                  ! the first executable statement.
 PRINT *, F(10)   ! Will print 11
END PROGRAM FOO

prod(  33) <id> ::= <id name>
prod(  32) <ident> ::= <id>
prod( 627) <var ref> ::=    <ident>  |

We're at the left of the assignment, among other things, this function will mark <var ref> as a ST_IDENT.

prod( 765) <psfunc> ::=

Now we mark that we are not anymore at the left of an assignment, we can't see it yet through the reductions but we have already consumed the = token.

prod( 665) <constant> ::=   <integer>  |

So this is the token for the 1. The semantic function will basically set the LHS to no symbol, its type to be integer and will store the constant, this is done in ast_conval.

tools/flang1/flang1exe/semant2.c

SST_LSYMP(LHS, 0);      
SST_DTYPEP(LHS, DT_INT);
/* value set by scan */ 
ast_conval(top);

tools/flang1/flang1exe/semant2.c

static void                                                     
ast_conval(SST *top)                                            
{                                                               
  SST_ACLP(LHS, 0); /* prevent UMR */                           
  SST_ASTP(LHS, mk_cval1(SST_CVALG(LHS), (int)SST_DTYPEG(LHS)));
  SST_SHAPEP(LHS, 0);                                           
}

(UMR means uninitialized memory read, and it is a warning printed by Purify, a tool similar to Valgrind to detect memory errors.) The usage of top vs LHS is a bit confusing at this point. My understanding is that they are the same.

prod( 597) <primary> ::= <constant> |
prod( 584) <expression> ::= <primary>   |

These two reductions do nothing or simply propagate values upwards the expression hierarchy.

prod( 764) <assignment> ::= <psfunc> <var ref> <psfunc> = <expression>

Here we will build the assignment itself. This is a long function, but at some point we will get the AST of RHS(5) and the location in RHS(2). Here RHS(2) is exactly <var ref> and RHS(5) is <expression>. An AST representing the assignment is then created which is stored in LHS. The code also makes sure we know we are in the executable part of the program unit (only executable statements or FORMAT statements are allowed in that part).

tools/flang1/flang1exe/semant3.c

   /* ... */
   ast = assign(RHS(2), RHS(5));
   *LHS = *RHS(2);
   /* ... */
   sem.pgphase = PHASE_EXEC;
   /* ... */
   SST_ASTP(LHS, ast);

prod( 742) <simple stmt> ::= <assignment> |
prod(   7) <statement> ::= <nii> <nim> <simple stmt> |

The semantic function of <simple stmt> does nothing. The semantic function of <statement> takes the AST from <simple stmt> and adds it to the global AST.

Finally we are at the END PROGRAM statement.

prod(  22) <end> ::=

This does nothing in our case, in other contexts may have to finish the INTERFACE blocks seen so far.

prod(  33) <id> ::= <id name>
prod(  32) <ident> ::= <id>
prod(  78) <opt ident> ::= <ident>
prod(  73) <end stmt> ::= ENDPROGRAM    <opt ident> |

Several final clean ups happen here. Too complicated at this point to describe them.

Wrap-up

Ok, this was very long and sadly we could not get much detail of this. In the next chapter we will see how the symbol table works.