Walk-through flang – Part 6
At this point we should have a reasonable picture of how flang parses and generates semantic information. So now it is time to explore with more detail what is actually synthesized and how it can be used later in the compiler. In this chapter we are going to see the symbol table.
Symbols
Our Fortran programs deal with abstract, symbolic entities. These entities, which often have a name, are called collectively symbols. Those symbols are created by the front end as it finds declarative statements in our code. Declarations assigns several properties, or attributes, to the symbols. The front end also needs to look up these symbols in order to check their properties. For instance, in a derived type construct like
The front end will need symbols for T
, X
, Y
and W
. T
is a derived type. X
and Y
are components of the derived type T
. X
is the first component of T
and has type INTEGER
. Y
is the second component of T
and has type REAL
and rank 1 with size 2. W
is a variable whose type is TYPE(T)
.
Symbol table
For resource-management purposes we want to group the symbols under a single data structure usually called the symbol table.
Flang symbol table
Definition
Flang has symbol table is in a global variable called stb
, declared in symacc.h
(I think symacc
means symbol access). This variable has type STB
which is a struct with several fields described below.
The variable is (apparently) defined twice, an empty one in symacc.c
and another one in tools/flang2/utils/symtab/symtabdf.h
(this is valid in C code but incredibly confusing, don't do it). The latter is a file generated when compiling flang itself. This is the real definition and initializes the first 5 fields which are made of tables that won't change during the compilation (no idea at this why these tables are not decoupled from the STB
).
Field declaration | Meaning | |
---|---|---|
Static arrays defined when building flang | ||
const char *stypes[ST_MAX + 1]; |
This an array indexed by the different symbol kinds. Each entry in this array contains a human-readable name of the symbol kind. | |
OVCLASS ovclass[ST_MAX + 1]; |
This an array indexed by the different symbol kinds. Each entry in this array contains an overload class for the symbol kind. We will see below what is this. </tr> | |
const char *ocnames[OC_MAX + 1]; |
This an array indexed by the different overload clases. Each entry in this array contains a debug name of the overload class itself. </tr> | |
const char *scnames[SC_MAX + 1]; |
This an array indexed by the different storage clases. Each entry in this array contains a debug name of the storage class itself. We will see below that is this. | |
const char *tynames[TY_MAX + 1]; |
This an array indexed by the different types. Each entry in this array contains a human-readable name of the type itself. | |
Distinguished constants | ||
int i0, i1; |
These represent the symbols for the constants 0 and 1 of type INTEGER . Because these are very common values, having a symbol for them is handy in the compiler. |
|
int k0, k1; |
These represent the symbols for the constants 0_8 and 1_8 respectively. |
|
SPTR flt0, dbl0, quad0; |
These represent the symbols for the constants 0.0_4 , 0.0_8 and 0.0_16 respectively. |
|
SPTR fltm0, dblm0, quadm0; |
These represent the symbols for the constants -0.0_4 , -0.0_8 and -0.0_16 . Because of the numeric model used by Fortran involves a sign plus magnitude representation, there is a negative zero (or signed zero) distinct to the regular zero. |
|
SPTR flt1, dbl1, quad1; |
These represent the symbols for the constants 1.0_4 , 1.0_8 and 1.0_16 respectively. |
|
SPTR flt2, dbl2, quad2; |
These represent the symbols for the constants 2.0_4 , 2.0_8 and 2.0_16 respectively. |
|
SPTR flthalf, dblhalf, quadhalf; |
These represent the symbols for the constants 0.5_4 , 0.5_8 and 0.5_16 respectively. |
|
Data type table | ||
INDEX_BY(ISZ_T, DTYPE) dt_base; |
This is a table, that may grow, indexed by an integer and keeps information of the different types | |
int dt_size; |
Number of elements that can be stored in the table dt_base . |
|
int dt_avail; |
Highest index valid in the table dt_base . |
|
Scope | ||
int curr_scope; |
This is an integer that represents the current scope. | |
Hash table | ||
SPTR hashtb[HASHSIZE + 1]; |
This is a hash table of symbols. | |
The symbol table itself | ||
SPTR firstusym, firstosym; |
Contain the first symbol that is not an intrinsic or a predefined one. I'm not sure about the difference between the two apparently firstusym may change when finishing the handling of internal subprograms (probably because we want subsequent sibling internal subprograms start with an "empty" table). |
|
INDEX_BY(SYM, SPTR) stg_base; |
Table of symbols (of type SYM ) indexed by SPTR . A SPTR is an integer (actually an enumerator but used everywhere else as an index). This is the table proper. When adding a symbol it is added in this table. |
|
int stg_size; |
Size of stg_base . |
|
int stg_avail; |
Next symbol free in stg_base . |
|
Strings table | ||
char *n_base; |
Table of strings. This is just a sequence of null-terminated strings. When using strings, we will find their contents by having the offset in n_base stored somewhere else. |
|
int n_size; |
Size of n_base. | |
int namavl; |
Next position in n_base available. |
|
Labels table | ||
int lbavail; |
This is the next label available, when the compiler needs to emit an internal label in the code. flang1 starts from 99999 and decreases this value for each label. flang2 starts from zero and increases the value instead | |
int lb_string_avail; |
This does not seem to be used anywhere(?) | |
Seemingly unused things | ||
INT *w_base; |
This is not used anywhere. Apparently it is allocated and then deallocated. But nothing else uses it. flang2 also does not seem to bother to deallocate it. |
|
int w_size; |
The size of w_base but otherwise not used for anything. |
|
int wrdavl; |
Not used for anything apparently. | |
Distinguished data types | ||
DTYPE dt_int; |
This is the type index for the default INTEGER of the target. |
|
DTYPE dt_real; |
This is the type index for the default REAL of the target. |
|
DTYPE dt_cmplx; |
This is the type index for the default COMPLEX of the target. |
|
DTYPE dt_log; |
This is the type index for the default LOGICAL of the target. |
|
DTYPE dt_dble; |
This is the type index for the default DOUBLE PRECISION of the target. |
|
DTYPE dt_dcmplx; |
This is the type index for the default DOUBLE COMPLEX of the target. This is an extension of flang. |
|
DTYPE dt_ptr; |
This is the INTEGER type index for the type used for Cray pointers. This is an extension of flang. |
Kind | Extra slots | Description |
---|---|---|
TY_CHAR |
1 |
|
TY_ARRAY |
2 |
get_array_dtype to create DTYPE s of this kind.
|
TY_DERIVED |
5 |
|
So the way to access to the extra information of a DTYPE
is basically doing DTY(d+k)
to access the extra slot k
. If we k == 0
then DTY(d)
is just its TY_KIND
. Note that creating a new TY_CHAR
or TY_DERIVED
involves creating a new type and then manually setting the various DTY(d+k)
.
Symbols
Symbols are all stored in stb.stg_base.
A symbol is added to the symbol table as needed using the function getsymbol
.
getsym
which uses the macro installsym
which ends calling the function installsym_ex
.
This function always creates a new symbol, using the macro NEWSYM
and it usually adds the new symbol to the hash table.
If you look closely the macro above you will see how it uses all the stg_*
fields related to the (dynamic) symbol table itself. Also not how it zeroes the new symbol.
The string table
Symbols have names. Or if they don't have a real name in the program (e.g. temporaries) we just make one for them up. Those names are often repeated (you would be surprised how often the variable names I
, J
, K
, II
, JJ
, KK
... are used in a Fortran program) so it does not make sense to waste memory storing the same string several times.
The function putsname
is used to add a new string in the table.
This function always adds a string so we still need a mechanism to be able to reuse some of these strings.
The hash table
For now we will assume that the name of a symbol is enough to identify it (we will see it may not really be but bear with me for now). When the compiler is checking the semantic correctness of the input it will repeatedly need to check symbols. As the name of a symbol identifies it in the symbol table we need a way to quickly retrieve it. If we don't do anything else, we would have to traverse stb.stg_base
from 0 to stb.stg_avail
and check its name.
This is where the hash table comes into play. This table associates strings with symbols and it is stored in stb.hashtb
. Given a string we can obtain its hash using the macro HASH_ID
(which I presume means "hash the given identifier").
Once we have the value of the hash in hv
we can now traverse the hash table. The values in stb.hashtb
are an index in the table stb.stg_base
(this is, stb.hashtb
is a SPTR
). Because hash functions can collide (e.g. two different strings can map to the same hash value) we need a way to handle this. There are several techniques but flang uses separate chaining using an intrusive list. Basically each symbol contains the link to the next colliding symbol. So the way to traverse all the elements given a hash value is like this
this works because HASHLKG
is the hash link getter for a given SPTR
. As I mentioned above, all properties of a symbol are accessed via macros and this one is not an exception.
The function installsym_ex can check the hash table before creating a new symbol, this way the existing symbol is returned.
Symbol constants
The stb
structure has a section where it keeps distinguished constants i0
, i1
, k0
, k1
, ... These distinguished constants are registered as symbols using the getcon
function.
As symbol constants do not have a name, the compiler uses a different hash function, HASH_CON
, for them based on the contents of the constant. We will look at constants in a later chapter but enough to know for now that they can hold up to 4 integers (but most of them will just hold 2 values), but note how the hash function just checks the first two (I think this has some interesting consequences with small constants of kind TY_QUAD
).
If you're wondering about ADDSYM
it is just a convenience that creates the symbol and immediately links it in the hash table.
What the integers of a constant mean depends on the particular DTYPE
used. For integers values are represented in a "big endian" fashion. Just take a look how flang initializes stb.i0
and stb.i1
.
If you match that code with the getcon
function above then you will infer that CONVAL1G(stb.i1) == 0
and CONVAL2G(stb.i1) == 1
.
Wrap-up
This was again a bit long, the key points of this chapter are
- A lot of symbolic information is represented in the
stb
global variable. - Symbolic entities are represented as indexes of tables.
- A symbol, represented by
SPTR
, is an index instb.stg_base
. - Given a
SPTR
, macros exist to get and set (put) attributes of the symbol. - A data type, represented by
DTYPE
, is an index instb.dt_base
. - Entries in that table are compound of several items. The first item of which (of type
TY_KIND
) determines the length of the table. - The first item given a
DTYPE
is accessible using the macroDTY
. - There is a hash table that is used both for identifiers and constant symbols.
- The hash table is intrusively linked as an attribute of the symbols.
That's all for today :)