At this point we should have a reasonable picture of how flang parses and generates semantic information. So now it is time to explore with more detail what is actually synthesized and how it can be used later in the compiler. In this chapter we are going to see the symbol table.
Our Fortran programs deal with abstract, symbolic entities. These entities, which often have a name, are called collectively symbols. Those symbols are created by the front end as it finds declarative statements in our code. Declarations assigns several properties, or attributes, to the symbols. The front end also needs to look up these symbols in order to check their properties. For instance, in a derived type construct like
The front end will need symbols for T, X, Y and W. T is a derived type. X and Y are components of the derived type T. X is the first component of T and has type INTEGER. Y is the second component of T and has type REAL and rank 1 with size 2. W is a variable whose type is TYPE(T).
For resource-management purposes we want to group the symbols under a single data structure usually called the symbol table.
Flang symbol table
Flang has symbol table is in a global variable called stb, declared in symacc.h (I think symacc means symbol access). This variable has type STB which is a struct with several fields described below.
The variable is (apparently) defined twice, an empty one in symacc.c and another one in tools/flang2/utils/symtab/symtabdf.h (this is valid in C code but incredibly confusing, don't do it). The latter is a file generated when compiling flang itself. This is the real definition and initializes the first 5 fields which are made of tables that won't change during the compilation (no idea at this why these tables are not decoupled from the STB).
Static arrays defined when building flang
const char *stypes[ST_MAX + 1];
This an array indexed by the different symbol kinds. Each entry in this array contains a human-readable name of the symbol kind.
OVCLASS ovclass[ST_MAX + 1];
This an array indexed by the different symbol kinds. Each entry in this array contains an overload class for the symbol kind. We will see below what is this.
const char *ocnames[OC_MAX + 1];
This an array indexed by the different overload clases. Each entry in this array contains a debug name of the overload class itself.
const char *scnames[SC_MAX + 1];
This an array indexed by the different storage clases. Each entry in this array contains a debug name of the storage class itself. We will see below that is this.
const char *tynames[TY_MAX + 1];
This an array indexed by the different types. Each entry in this array contains a human-readable name of the type itself.
int i0, i1;
These represent the symbols for the constants 0 and 1 of type INTEGER. Because these are very common values, having a symbol for them is handy in the compiler.
int k0, k1;
These represent the symbols for the constants 0_8 and 1_8 respectively.
SPTR flt0, dbl0, quad0;
These represent the symbols for the constants 0.0_4, 0.0_8 and 0.0_16 respectively.
SPTR fltm0, dblm0, quadm0;
These represent the symbols for the constants -0.0_4, -0.0_8 and -0.0_16. Because of the numeric model used by Fortran involves a sign plus magnitude representation, there is a negative zero (or signed zero) distinct to the regular zero.
SPTR flt1, dbl1, quad1;
These represent the symbols for the constants 1.0_4, 1.0_8 and 1.0_16 respectively.
SPTR flt2, dbl2, quad2;
These represent the symbols for the constants 2.0_4, 2.0_8 and 2.0_16 respectively.
SPTR flthalf, dblhalf, quadhalf;
These represent the symbols for the constants 0.5_4, 0.5_8 and 0.5_16 respectively.
Data type table
INDEX_BY(ISZ_T, DTYPE) dt_base;
This is a table, that may grow, indexed by an integer and keeps information of the different types
Number of elements that can be stored in the table dt_base.
Highest index valid in the table dt_base.
This is an integer that represents the current scope.
SPTR hashtb[HASHSIZE + 1];
This is a hash table of symbols.
The symbol table itself
SPTR firstusym, firstosym;
Contain the first symbol that is not an intrinsic or a predefined one. I'm not sure about the difference between the two apparently firstusym may change when finishing the handling of internal subprograms (probably because we want subsequent sibling internal subprograms start with an "empty" table).
INDEX_BY(SYM, SPTR) stg_base;
Table of symbols (of type SYM) indexed by SPTR. A SPTR is an integer (actually an enumerator but used everywhere else as an index). This is the table proper. When adding a symbol it is added in this table.
Size of stg_base.
Next symbol free in stg_base.
Table of strings. This is just a sequence of null-terminated strings. When using strings, we will find their contents by having the offset in n_base stored somewhere else.
Size of n_base.
Next position in n_base available.
This is the next label available, when the compiler needs to emit an internal label in the code. flang1 starts from 99999 and decreases this value for each label. flang2 starts from zero and increases the value instead
This does not seem to be used anywhere(?)
Seemingly unused things
This is not used anywhere. Apparently it is allocated and then deallocated. But nothing else uses it. flang2 also does not seem to bother to deallocate it.
The size of w_base but otherwise not used for anything.
Not used for anything apparently.
Distinguished data types
This is the type index for the default INTEGER of the target.
This is the type index for the default REAL of the target.
This is the type index for the default COMPLEX of the target.
This is the type index for the default LOGICAL of the target.
This is the type index for the default DOUBLE PRECISION of the target.
This is the type index for the default DOUBLE COMPLEX of the target. This is an extension of flang.
This is the INTEGER type index for the type used for Cray pointers. This is an extension of flang.
Integers. Integers everywhere
If the flang code weren't stuck in the 1980s the design would look a bit different nowadays. Each kind of entity (symbols, data types, trees, etc) would be represented using a pointer of a particular type. Unfortunately this is not how the flang front end works. Instead all entities are designated using an index or an offset. This offset is meaningful only within its associated table. Thus is a SPTR is just an integer that indexes the stb.stg_base and a DTYPE indexes stb.dt_base. A similar thing happens with the trees (AST).
That said, the flang code rarely uses these tables directly, instead it uses macros to access the properties given one of those integers. Get the wrong integer and you will be accessing the wrong table and using wrong data. The whole picture is a bit grim at this point from a maintenance point of view. Probably there are technical reasons in the past that justified a design like this (limited memory, compilers were also more limited at that time, reek of Fortran-style engineering, etc). None of these reasons seem to have been revisited very seriously in the code which explains its unpleasant dated look and feel. The code shows some shy attempts to move to a more declarative setting by using enums, see below, but this is still not enough: in C enums and integers are almost interchangeable, and many places the code still gives up and uses plain int.
If moving to a more "graph-like" approach using pointers for the entities is not feasible in this code base, an intermediate solution could be what in modern programming languages is often called a "new type". Here we want to use the same physical representation (integers indexing arrays) to mean different things. So we need a different type. Typically in C this is achievable by wrapping the type (in this case an integer) in a struct and passing that struct by value (as we were doing with the original integer). Most convention calls are designed in a way that passing a struct like this is equivalent to pass the original integer.Type checking rules of C will catch many wrong uses and the macros can be reused by just readjusting them. Often, though, more strictness may raise issues when the original code was doing dubious things that even if out of the type discipline, are still correct. So this process is likely to be painful anyways: the technical debt repayment is going to involve a fat bill.
From the definition above we see that flang uses several tables. A first set of them are static and won't change (stypes, ovclass, ocnames, scnames and tynames) and their sizes are known upfront (ST_MAX, OC_MAX, SC_MAX and TY_MAX).
A second set of them are dynamic and may grow depending on the program unit being compied.
There is a table of data types in stb.dt_base which can contain up to stb.dt_size entries. The next available slot for a new datatype (when we need a new one) is represented in stb.dt_avail. This table is reallocated dynamically as needed. The table is indexed by integers (of type ISZ_T which happens to be a long). That table is a bit special because it has entries of variable length. While scalar types like INTEGER will just take one entry, types associated to arrays will take up to three entries. The first entry is always of type TY_KIND which is a set of kinds of types. That type determines whether more data follows. The set of kinds of types is static and predefined in symtab.h.
Because some of the kinds are fundamental to the language and do not need anything else to be represented, the data type table is prepopulated with a few DTYPEs which match the above types.
The macro DTY is used to obtain the kind (this is one of the TY_ above) of a given DTYPE.
Each kind of type has a few properties related, these are defined in a table generated when compiling flang. These properties can be queried with the following macros (dttypes is a table that associates each TY_KIND with its properties).
We add new types in the table using the function get_type (note that intrinsic types are already there and don't need registering).
Length of the string (represented by a tree).
DTYPE of the element.
Index inside aux.arrdsc_base.
There is a function get_array_dtype to create DTYPEs of this kind.
Linked list of components of the derived type.
Size in bytes.
Symbol T associated to a TYPE(T)
Tree for constant initialization.
So the way to access to the extra information of a DTYPE is basically doing DTY(d+k) to access the extra slot k. If we k == 0 then DTY(d) is just its TY_KIND. Note that creating a new TY_CHAR or TY_DERIVED involves creating a new type and then manually setting the various DTY(d+k).
Symbols are all stored in stb.stg_base. A symbol is added to the symbol table as needed using the function getsymbol.
This function calls the function getsym which uses the macro installsym which ends calling the function installsym_ex.
This function always creates a new symbol, using the macro NEWSYM and it usually adds the new symbol to the hash table.
If you look closely the macro above you will see how it uses all the stg_* fields related to the (dynamic) symbol table itself. Also not how it zeroes the new symbol.
The string table
Symbols have names. Or if they don't have a real name in the program (e.g. temporaries) we just make one for them up. Those names are often repeated (you would be surprised how often the variable names I, J, K, II, JJ, KK... are used in a Fortran program) so it does not make sense to waste memory storing the same string several times.
The function putsname is used to add a new string in the table.
This function always adds a string so we still need a mechanism to be able to reuse some of these strings.
The hash table
For now we will assume that the name of a symbol is enough to identify it (we will see it may not really be but bear with me for now). When the compiler is checking the semantic correctness of the input it will repeatedly need to check symbols. As the name of a symbol identifies it in the symbol table we need a way to quickly retrieve it. If we don't do anything else, we would have to traverse stb.stg_base from 0 to stb.stg_avail and check its name.
This is where the hash table comes into play. This table associates strings with symbols and it is stored in stb.hashtb. Given a string we can obtain its hash using the macro HASH_ID (which I presume means "hash the given identifier").
Once we have the value of the hash in hv we can now traverse the hash table. The values in stb.hashtb are an index in the table stb.stg_base (this is, stb.hashtb is a SPTR). Because hash functions can collide (e.g. two different strings can map to the same hash value) we need a way to handle this. There are several techniques but flang uses separate chaining using an intrusive list. Basically each symbol contains the link to the next colliding symbol. So the way to traverse all the elements given a hash value is like this
this works because HASHLKG is the hash link getter for a given SPTR. As I mentioned above, all properties of a symbol are accessed via macros and this one is not an exception.
The function installsym_ex can check the hash table before creating a new symbol, this way the existing symbol is returned.
The stb structure has a section where it keeps distinguished constants i0, i1, k0, k1, ... These distinguished constants are registered as symbols using the getcon function.
As symbol constants do not have a name, the compiler uses a different hash function, HASH_CON, for them based on the contents of the constant. We will look at constants in a later chapter but enough to know for now that they can hold up to 4 integers (but most of them will just hold 2 values), but note how the hash function just checks the first two (I think this has some interesting consequences with small constants of kind TY_QUAD).
If you're wondering about ADDSYM it is just a convenience that creates the symbol and immediately links it in the hash table.
What the integers of a constant mean depends on the particular DTYPE used. For integers values are represented in a "big endian" fashion. Just take a look how flang initializes stb.i0 and stb.i1.
If you match that code with the getcon function above then you will infer that CONVAL1G(stb.i1) == 0 and CONVAL2G(stb.i1) == 1.
This was again a bit long, the key points of this chapter are
A lot of symbolic information is represented in the stb global variable.
Symbolic entities are represented as indexes of tables.
A symbol, represented by SPTR, is an index in stb.stg_base.
Given a SPTR, macros exist to get and set (put) attributes of the symbol.
A data type, represented by DTYPE, is an index in stb.dt_base.
Entries in that table are compound of several items. The first item of which (of type TY_KIND) determines the length of the table.
The first item given a DTYPE is accessible using the macro DTY.
There is a hash table that is used both for identifiers and constant symbols.
The hash table is intrusively linked as an attribute of the symbols.