At this point we should have a reasonable picture of how flang parses and generates semantic information. So now it is time to explore with more detail what is actually synthesized and how it can be used later in the compiler. In this chapter we are going to see the symbol table.


Our Fortran programs deal with abstract, symbolic entities. These entities, which often have a name, are called collectively symbols. Those symbols are created by the front end as it finds declarative statements in our code. Declarations assigns several properties, or attributes, to the symbols. The front end also needs to look up these symbols in order to check their properties. For instance, in a derived type construct like

  REAL :: Y(2)
TYPE(T) :: W

The front end will need symbols for T, X, Y and W. T is a derived type. X and Y are components of the derived type T. X is the first component of T and has type INTEGER. Y is the second component of T and has type REAL and rank 1 with size 2. W is a variable whose type is TYPE(T).

Symbol table

For resource-management purposes we want to group the symbols under a single data structure usually called the symbol table.

Flang symbol table


Flang has symbol table is in a global variable called stb, declared in symacc.h (I think symacc means symbol access). This variable has type STB which is a struct with several fields described below.

The variable is (apparently) defined twice, an empty one in symacc.c and another one in tools/flang2/utils/symtab/symtabdf.h (this is valid in C code but incredibly confusing, don't do it). The latter is a file generated when compiling flang itself. This is the real definition and initializes the first 5 fields which are made of tables that won't change during the compilation (no idea at this why these tables are not decoupled from the STB).

</tbody> </table>

Integers. Integers everywhere

If the flang code weren't stuck in the 1980s the design would look a bit different nowadays. Each kind of entity (symbols, data types, trees, etc) would be represented using a pointer of a particular type. Unfortunately this is not how the flang front end works. Instead all entities are designated using an index or an offset. This offset is meaningful only within its associated table. Thus is a SPTR is just an integer that indexes the stb.stg_base and a DTYPE indexes stb.dt_base. A similar thing happens with the trees (AST).

That said, the flang code rarely uses these tables directly, instead it uses macros to access the properties given one of those integers. Get the wrong integer and you will be accessing the wrong table and using wrong data. The whole picture is a bit grim at this point from a maintenance point of view. Probably there are technical reasons in the past that justified a design like this (limited memory, compilers were also more limited at that time, reek of Fortran-style engineering, etc). None of these reasons seem to have been revisited very seriously in the code which explains its unpleasant dated look and feel. The code shows some shy attempts to move to a more declarative setting by using enums, see below, but this is still not enough: in C enums and integers are almost interchangeable, and many places the code still gives up and uses plain int.

If moving to a more "graph-like" approach using pointers for the entities is not feasible in this code base, an intermediate solution could be what in modern programming languages is often called a "new type". Here we want to use the same physical representation (integers indexing arrays) to mean different things. So we need a different type. Typically in C this is achievable by wrapping the type (in this case an integer) in a struct and passing that struct by value (as we were doing with the original integer). Most convention calls are designed in a way that passing a struct like this is equivalent to pass the original integer.Type checking rules of C will catch many wrong uses and the macros can be reused by just readjusting them. Often, though, more strictness may raise issues when the original code was doing dubious things that even if out of the type discipline, are still correct. So this process is likely to be painful anyways: the technical debt repayment is going to involve a fat bill.

Static tables

From the definition above we see that flang uses several tables. A first set of them are static and won't change (stypes, ovclass, ocnames, scnames and tynames) and their sizes are known upfront (ST_MAX, OC_MAX, SC_MAX and TY_MAX).

A second set of them are dynamic and may grow depending on the program unit being compied.

Data types

There is a table of data types in stb.dt_base which can contain up to stb.dt_size entries. The next available slot for a new datatype (when we need a new one) is represented in stb.dt_avail. This table is reallocated dynamically as needed. The table is indexed by integers (of type ISZ_T which happens to be a long). That table is a bit special because it has entries of variable length. While scalar types like INTEGER will just take one entry, types associated to arrays will take up to three entries. The first entry is always of type TY_KIND which is a set of kinds of types. That type determines whether more data follows. The set of kinds of types is static and predefined in symtab.h.

 typedef enum TY_KIND {
     TY_NONE = 0,     // No type in this entry of the table
     TY_WORD = 1,     // 32-bit value. Used internally by flang.
     TY_DWORD = 2,    // 64-bit value. Used internally by flang.
     TY_HOLL = 3,     // Type of Hollerith constants.
     TY_BINT = 4,     // Type of a BYTE (flang extension?).
     TY_SINT = 5,     // Type of an INTEGER(KIND=2)
     TY_INT = 6,      // Type of an INTEGER(KIND=4)
     TY_INT8 = 7,     // Type of an INTEGER(KIND=8)
     TY_REAL = 8,     // Type of a REAL(KIND=4)
     TY_DBLE = 9,     // Type of a REAL(KIND=8) or DOUBLE PRECISION
     TY_QUAD = 10,    // Type of a REAL(KIND=16)
     TY_CMPLX = 11,   // Type of a COMPLEX(KIND=4) (also called COMPLEX*8)
     TY_DCMPLX = 12,  // Type of a COMPLEX(KIND=8) (also called COMPLEX*16)
     TY_QCMPLX = 13,  // Type of a COMPLEX(KIND=16) (also called COMPLEX*32)
     TY_BLOG = 14,    // Type of a LOGICAL(KIND=1)
     TY_SLOG = 15,    // Type of a LOGICAL(KIND=2)
     TY_LOG = 16,     // Type of a LOGICAL(KIND=4)
     TY_LOG8 = 17,    // Type of a LOGICAL(KIND=8)
     TY_CHAR = 18,    // Type of CHARACTER(KIND=1)
     TY_NCHAR = 19,   // Type of CHARACTER(KIND=2)
     TY_PTR = 20,     // Type of pointers. Used internally by flang.
     TY_ARRAY = 21,   // Type of an array
     TY_STRUCT = 22,  // Type of a struct
     TY_UNION = 23,   // Type of a UNION (flang extension)
     TY_DERIVED = 24, // Type of a derived type (i.e. TYPE(T))
     TY_NUMERIC = 25, // Any numeric type. Internally used for intrinsics.
     TY_ANY = 26,     // Any type. Internally used for intrinsics.
     TY_PROC = 27,    // Procedure pointers and dummy procedures
     TY_128 = 28,     // A vector of 128-bit (e.g. SSE, Neon, Altivec)
     TY_256 = 29,     // A vector of 256-bit (e.g AVX-2)
     TY_512 = 30,     // A vector of 512-bit (e.g. AVX-512)
     // A few more that are unused apparently
     // related to __float128 and __int128

Because some of the kinds are fundamental to the language and do not need anything else to be represented, the data type table is prepopulated with a few DTYPEs which match the above types.

typedef enum DTYPE {
    DT_NONE = 0,          // TY_NONE
    DT_WORD = 1,          // TY_WORD
    DT_DWORD = 2,         // TY_DWORD
    DT_HOLL = 3,          // TY_HOLL
    DT_BINT = 4,          // TY_BINT
    DT_SINT = 5,          // TY_SINT
    DT_INT4 = 6,          // TY_INT
    DT_INT8 = 7,          // TY_INT8
    DT_REAL4 = 8,         // TY_REAL
    DT_REAL8 = 9,         // TY_DBLE
    DT_QUAD = 10,         // TY_QUAD
    DT_CMPLX8 = 11,       // TY_CMPLX
    DT_CMPLX16 = 12,      // TY_DCMPLX
    DT_QCMPLX = 13,       // TY_QCMPLX
    DT_BLOG = 14,         // TY_BLOG
    DT_SLOG = 15,         // TY_SLOG
    DT_LOG4 = 16,         // TY_LOG
    DT_LOG8 = 17,         // TY_LOG8
    DT_ADDR = 18,         // TY_PTR
    DT_CHAR = 20,         // TY_CHAR
    DT_NCHAR = 22,        // TY_NCHAR
    DT_ANY = 24,          // TY_ANY
    DT_NUMERIC = 25,      // TY_NUMERIC
    DT_ASSNCHAR = 26,     // Assumed-size array of TY_NCHAR
    DT_ASSCHAR = 28,      // Assumed-size array pf TY_CHAR
    DT_IARRAY = 30,       // Integer array of size 1.
                          // Used internally for array 
                          // descriptors of array sections.
    // ... other types for SIMD and __float128, __int128

The macro DTY is used to obtain the kind (this is one of the TY_ above) of a given DTYPE.

#define DTY(dt) (stb.dt_base[dt])

Each kind of type has a few properties related, these are defined in a table generated when compiling flang. These properties can be queried with the following macros (dttypes is a table that associates each TY_KIND with its properties).

#define DT_ISINT(dt)	(dttypes[DTY(dt)]&_TY_INT)
#define DT_ISREAL(dt)	(dttypes[DTY(dt)]&_TY_REAL)
#define DT_ISCMPLX(dt)	(dttypes[DTY(dt)]&_TY_CMPLX)
#define DT_ISNUMERIC(dt) (dttypes[DTY(dt)]&(_TY_INT|_TY_REAL|_TY_CMPLX))
#define DT_ISBASIC(dt)	(dttypes[DTY(dt)]&_TY_BASIC)
#define DT_ISUNSIGNED(dt) (dttypes[DTY(dt)]&_TY_UNSIGNED)
#define DT_ISSCALAR(dt)	(dttypes[DTY(dt)]&_TY_SCALAR)
#define DT_ISVEC(dt)	(dttypes[DTY(dt)]&_TY_VEC)
#define DT_ISLOG(dt)	(dttypes[DTY(dt)]&_TY_LOG)  // Is LOGICAL
#define DT_ISWORD(dt)	(dttypes[DTY(dt)]&_TY_WORD)
#define DT_ISDWORD(dt)	(dttypes[DTY(dt)]&_TY_DWORD)
#define DT_ISVECT(dt)   (dttypes[DTY(dt)]&_TY_VECT)

We add new types in the table using the function get_type (note that intrinsic types are already there and don't need registering).

Field declaration Meaning
Static arrays defined when building flang
const char *stypes[ST_MAX + 1]; This an array indexed by the different symbol kinds. Each entry in this array contains a human-readable name of the symbol kind.
OVCLASS ovclass[ST_MAX + 1]; This an array indexed by the different symbol kinds. Each entry in this array contains an overload class for the symbol kind. We will see below what is this. </tr>
const char *ocnames[OC_MAX + 1]; This an array indexed by the different overload clases. Each entry in this array contains a debug name of the overload class itself. </tr>
const char *scnames[SC_MAX + 1]; This an array indexed by the different storage clases. Each entry in this array contains a debug name of the storage class itself. We will see below that is this.
const char *tynames[TY_MAX + 1]; This an array indexed by the different types. Each entry in this array contains a human-readable name of the type itself.
Distinguished constants
int i0, i1; These represent the symbols for the constants 0 and 1 of type INTEGER. Because these are very common values, having a symbol for them is handy in the compiler.
int k0, k1; These represent the symbols for the constants 0_8 and 1_8 respectively.
SPTR flt0, dbl0, quad0; These represent the symbols for the constants 0.0_4, 0.0_8 and 0.0_16 respectively.
SPTR fltm0, dblm0, quadm0; These represent the symbols for the constants -0.0_4, -0.0_8 and -0.0_16. Because of the numeric model used by Fortran involves a sign plus magnitude representation, there is a negative zero (or signed zero) distinct to the regular zero.
SPTR flt1, dbl1, quad1; These represent the symbols for the constants 1.0_4, 1.0_8 and 1.0_16 respectively.
SPTR flt2, dbl2, quad2; These represent the symbols for the constants 2.0_4, 2.0_8 and 2.0_16 respectively.
SPTR flthalf, dblhalf, quadhalf; These represent the symbols for the constants 0.5_4, 0.5_8 and 0.5_16 respectively.
Data type table
INDEX_BY(ISZ_T, DTYPE) dt_base; This is a table, that may grow, indexed by an integer and keeps information of the different types
int dt_size; Number of elements that can be stored in the table dt_base.
int dt_avail; Highest index valid in the table dt_base.
int curr_scope; This is an integer that represents the current scope.
Hash table
SPTR hashtb[HASHSIZE + 1]; This is a hash table of symbols.
The symbol table itself
SPTR firstusym, firstosym; Contain the first symbol that is not an intrinsic or a predefined one. I'm not sure about the difference between the two apparently firstusym may change when finishing the handling of internal subprograms (probably because we want subsequent sibling internal subprograms start with an "empty" table).
INDEX_BY(SYM, SPTR) stg_base; Table of symbols (of type SYM) indexed by SPTR. A SPTR is an integer (actually an enumerator but used everywhere else as an index). This is the table proper. When adding a symbol it is added in this table.
int stg_size; Size of stg_base.
int stg_avail; Next symbol free in stg_base.
Strings table
char *n_base; Table of strings. This is just a sequence of null-terminated strings. When using strings, we will find their contents by having the offset in n_base stored somewhere else.
int n_size; Size of n_base.
int namavl; Next position in n_base available.
Labels table
int lbavail; This is the next label available, when the compiler needs to emit an internal label in the code. flang1 starts from 99999 and decreases this value for each label. flang2 starts from zero and increases the value instead
int lb_string_avail; This does not seem to be used anywhere(?)
Seemingly unused things
INT *w_base; This is not used anywhere. Apparently it is allocated and then deallocated. But nothing else uses it. flang2 also does not seem to bother to deallocate it.
int w_size; The size of w_base but otherwise not used for anything.
int wrdavl; Not used for anything apparently.
Distinguished data types
DTYPE dt_int; This is the type index for the default INTEGER of the target.
DTYPE dt_real; This is the type index for the default REAL of the target.
DTYPE dt_cmplx; This is the type index for the default COMPLEX of the target.
DTYPE dt_log; This is the type index for the default LOGICAL of the target.
DTYPE dt_dble; This is the type index for the default DOUBLE PRECISION of the target.
DTYPE dt_dcmplx; This is the type index for the default DOUBLE COMPLEX of the target. This is an extension of flang.
DTYPE dt_ptr; This is the INTEGER type index for the type used for Cray pointers. This is an extension of flang.
Kind Extra slots Description
  1. Length of the string (represented by a tree).
  1. DTYPE of the element.
  2. Index inside aux.arrdsc_base.
There is a function get_array_dtype to create DTYPEs of this kind.
  1. Linked list of components of the derived type.
  2. Size in bytes.
  3. Symbol T associated to a TYPE(T)
  4. Alignment.
  5. Tree for constant initialization.

So the way to access to the extra information of a DTYPE is basically doing DTY(d+k) to access the extra slot k. If we k == 0 then DTY(d) is just its TY_KIND. Note that creating a new TY_CHAR or TY_DERIVED involves creating a new type and then manually setting the various DTY(d+k).


Symbols are all stored in stb.stg_base. A symbol is added to the symbol table as needed using the function getsymbol.

getsymbol(const char *name)
  return getsym(name, strlen(name));

This function calls the function getsym which uses the macro installsym which ends calling the function installsym_ex.

getsym(const char *name, int olength)
  int sptr; /* pointer to symbol table entry */

 sptr = installsym(name, olength);
 // ... postprocess some symbol kinds ...
 return sptr;
#define installsym(name, olength) installsym_ex(name, olength, IS_GET)

This function always creates a new symbol, using the macro NEWSYM and it usually adds the new symbol to the hash table.

installsym_ex(const char *name, int olength, IS_MODE mode)
  int length;
  SPTR sptr;     /* pointer to symbol table entry */
  bool toolong;

   * Trim identifier if it is too long.
  // ... omitted ...

  if (mode != IS_QUICK) {
     * Loop thru the appropriate hash link list to see if symbol is
     * already in the table.
    // ... omitted ...

  /* Symbol not found.  Create a new symbol table entry. */

  if (mode != IS_QUICK) {
    // Add the new symbol to the hash table if requested.
    // .. omitted ..

  // Register the name in the string table.
  // .. omitted ..

  if (toolong) {
    report_too_long_identifier(name, olength, max_idlen);

  return sptr;
/* symbol creation macros */
#define NEWSYM(sptr)         \
  sptr = (SPTR)stb.stg_avail++; \
  if (sptr >= stb.stg_size)    \
    realloc_sym_storage();   \
  BZERO(&stb.stg_base[sptr], char, sizeof(SYM))

If you look closely the macro above you will see how it uses all the stg_* fields related to the (dynamic) symbol table itself. Also not how it zeroes the new symbol.

The string table

Symbols have names. Or if they don't have a real name in the program (e.g. temporaries) we just make one for them up. Those names are often repeated (you would be surprised how often the variable names I, J, K, II, JJ, KK... are used in a Fortran program) so it does not make sense to waste memory storing the same string several times.

The function putsname is used to add a new string in the table.

putsname(const char *name, int length)
  int nptr; /* index into character storage area */
  char *np; /* pointer into character storage area */
  int i;    /* counter */

  nptr = stb.namavl;
  stb.namavl += (length + 1);
  while (stb.namavl > stb.n_size) {
    /* To avoid quadratic behavior, we increase the storage area size
       by a factor, not a constant.  Use unsigned arithmetic here
       to avoid risk of overflow. */
    unsigned n = 2u * stb.n_size;
    if (n > MAX_NMPTR) {
      n = MAX_NMPTR;
      if (stb.namavl > n)
        symini_errfatal(7); /* names table overflow */
    NEED(stb.namavl, stb.n_base, char, stb.n_size, n);
  np = stb.n_base + nptr;
  for (i = 0; i < length; i++)
    *np++ = *name++;
  *np = '\0';

  return nptr;

This function always adds a string so we still need a mechanism to be able to reuse some of these strings.

The hash table

For now we will assume that the name of a symbol is enough to identify it (we will see it may not really be but bear with me for now). When the compiler is checking the semantic correctness of the input it will repeatedly need to check symbols. As the name of a symbol identifies it in the symbol table we need a way to quickly retrieve it. If we don't do anything else, we would have to traverse stb.stg_base from 0 to stb.stg_avail and check its name.

This is where the hash table comes into play. This table associates strings with symbols and it is stored in stb.hashtb. Given a string we can obtain its hash using the macro HASH_ID (which I presume means "hash the given identifier").

#define HASHSIZE 9973
#define HASH_ID(hv, p, len)                            \
  hv = p[(len)-1] | (*p << 16) | (p[(len) >> 1] << 8); \
  if ((int)(len) > 3)                                  \
    hv ^= (p[1] << 4);                                 \
  hv %= HASHSIZE;

Once we have the value of the hash in hv we can now traverse the hash table. The values in stb.hashtb are an index in the table stb.stg_base (this is, stb.hashtb is a SPTR). Because hash functions can collide (e.g. two different strings can map to the same hash value) we need a way to handle this. There are several techniques but flang uses separate chaining using an intrusive list. Basically each symbol contains the link to the next colliding symbol. So the way to traverse all the elements given a hash value is like this

SPTR sptr;
for (sptr = stb.hashtb[hashval]; sptr != 0; sptr = HASHLKG(sptr)) {
  // do something with sptr

this works because HASHLKG is the hash link getter for a given SPTR. As I mentioned above, all properties of a symbol are accessed via macros and this one is not an exception.

#define HASHLKG(s)   (stb.stg_base[s].hashlk)

The function installsym_ex can check the hash table before creating a new symbol, this way the existing symbol is returned.

installsym_ex(const char *name, int olength, IS_MODE mode)
  int length;
  SPTR sptr;     /* pointer to symbol table entry */
  bool toolong;

   * Trim identifier if it is too long.
  // ... omitted ...

  if (mode != IS_QUICK) {
     * Loop thru the appropriate hash link list to see if symbol is
     * already in the table.
    // Here length == min(strlen(name), max-length-allowed-by-fortran)
    HASH_ID(hashval, name, length);
    for (sptr = stb.hashtb[hashval]; sptr != 0; sptr = HASHLKG(sptr)) {
      const char *sname;
      int np = NMPTRG(sptr);  // This returns the index within the string
                              // table that contains the name of this symbol.
      if (np + length >= stb.namavl) // Defensive programming I think.
      sname = stb.n_base + np; // Get the proper string of this symbol
      // Quick check with the first letter of the identifier.
      // Also check same length (although "sname" could have other null bytes inbetween!)
      if (sname[0] != name[0] || sname[length] != '\0')
      // Check the first "length" characters
      if (strncmp(name, sname, length) != 0)
      nmptr = np;
      if (HIDDENG(sptr))

      /* Matching entry has been found in symbol table. Return it. */

      return sptr;

 // .. rest of the function omitted ..

Symbol constants

The stb structure has a section where it keeps distinguished constants i0, i1, k0, k1, ... These distinguished constants are registered as symbols using the getcon function.

getcon(INT *value, DTYPE dtype)
  int sptr;    /* symbol table pointer */
  int hashval; /* index into hashtb */

   * First loop thru the appropriate hash link list to see if this constant
   * is already in the symbol table:

  hashval = HASH_CON(value);
  if (hashval < 0)
    hashval = -hashval;
  for (sptr = stb.hashtb[hashval]; sptr != 0; sptr = HASHLKG(sptr)) {
    if (DTY(dtype) == TY_QUAD) {
      if (DTYPEG(sptr) != dtype || STYPEG(sptr) != ST_CONST ||
          CONVAL1G(sptr) != value[0] || CONVAL2G(sptr) != value[1] ||
          CONVAL3G(sptr) != value[2] || CONVAL4G(sptr) != value[3])

      /* Matching entry has been found.  Return it:  */
      return (sptr);
    if (DTYPEG(sptr) != dtype || STYPEG(sptr) != ST_CONST ||
        CONVAL1G(sptr) != value[0] || CONVAL2G(sptr) != value[1])

    /* Matching entry has been found.  Return it:  */
    return (sptr);

  /* Constant not found.  Create a new symbol table entry for it: */

  ADDSYM(sptr, hashval);
  CONVAL1P(sptr, value[0]);
  CONVAL2P(sptr, value[1]);
  if (DTY(dtype) == TY_QUAD) {
    CONVAL3P(sptr, value[2]);
    CONVAL4P(sptr, value[3]);
  DTYPEP(sptr, dtype);
  SCOPEP(sptr, 1);

  return (sptr);

As symbol constants do not have a name, the compiler uses a different hash function, HASH_CON, for them based on the contents of the constant. We will look at constants in a later chapter but enough to know for now that they can hold up to 4 integers (but most of them will just hold 2 values), but note how the hash function just checks the first two (I think this has some interesting consequences with small constants of kind TY_QUAD).

#define HASH_CON(p) ((p[0] ^ p[1]) % HASHSIZE)

If you're wondering about ADDSYM it is just a convenience that creates the symbol and immediately links it in the hash table.

#define ADDSYM(sptr, hashval) \
  NEWSYM(sptr);               \
  LINKSYM(sptr, hashval)

What the integers of a constant mean depends on the particular DTYPE used. For integers values are represented in a "big endian" fashion. Just take a look how flang initializes stb.i0 and stb.i1.

  INT tmp[2];
  // .. omitted ..
  tmp[0] = tmp[1] = (INT)0;
  stb.i0 = getcon(tmp, DT_INT);
  // .. non-relevant code related to k0 and k1 omitted ..
  tmp[1] = (INT)1;
  stb.i1 = getcon(tmp, DT_INT);
  // .. rest of the function omitted ..

If you match that code with the getcon function above then you will infer that CONVAL1G(stb.i1) == 0 and CONVAL2G(stb.i1) == 1.


This was again a bit long, the key points of this chapter are

  • A lot of symbolic information is represented in the stb global variable.
  • Symbolic entities are represented as indexes of tables.
  • A symbol, represented by SPTR, is an index in stb.stg_base.
    • Given a SPTR, macros exist to get and set (put) attributes of the symbol.
  • A data type, represented by DTYPE, is an index in stb.dt_base.
    • Entries in that table are compound of several items. The first item of which (of type TY_KIND) determines the length of the table.
    • The first item given a DTYPE is accessible using the macro DTY.
  • There is a hash table that is used both for identifiers and constant symbols.
    • The hash table is intrusively linked as an attribute of the symbols.

That's all for today :)