Now that we have a stream of tokens, we can start performing syntactic analysis. The lexer provides us with a stream of tokens, but we still have to check whether such a sequence forms a valid program. Consider this example.
```
var i : int;
i := 1;
i := )i + 1);
```
All the tokens in the program above are valid tokens, but the second assignment statement (line 3) is not a valid assignment statement because there is an incorrect token ) right after :=. On the other hand, the following program
if 3 then end
is syntactically valid but semantically incorrect because the condition expression does not have boolean type but integer type.
The syntactic analyzer will detect invalid token sequences but will not assess their semantic validity.
There is an extensive and interesting bibliography about language recognition, but this is a blog, not a compilers crash course, so we will have to cut the explanation short: we will build a recursive descent recognizer based on the syntax description of part 1. If you are interested in parsing techniques, an excellent (and a tad bit expensive too) reference is Parsing Techniques: A Practical Guide by Grune and Jacobs.
The strategy is very intuitive. Recognizing a rule like

〈declaration〉 → var 〈identifier〉 : 〈type〉 ;

requires recognizing the elements in the right hand side of the rule. When an element in the right hand side is just a token (like var or ;), we only have to verify that the next token is of the expected kind at that point. When the element in the right hand side is another rule (like 〈identifier〉 or 〈type〉), we just recursively recognize that rule. An element of the form w* is equivalent to a rule like 〈rule〉 → ϵ | w〈rule〉 (where ϵ means an empty sequence of elements). An element of the form w+ is equivalent to a rule like 〈rule〉 → w | w〈rule〉.
Some rules are of the form 〈rule〉 → A | B | C. In such a case we have to choose which of A, B or C we are going to recognize. We can break the tie by peeking at the current token. If A, B and C each start with a different token, our problem is solved. If this is not the case, we may have to peek at some more tokens. Tiny is such a simple language that no more than one token will ever have to be peeked. When an alternative is an empty sequence of tokens (like the ϵ case of w* above), we will consider that ϵ matches if the current token cannot start any of the other alternatives.
In our implementation the syntactic and semantic analysis will be performed at the same time. It does not have to be this way: for bigger languages splitting these two steps may be more appropriate, but tiny is so small that doing so is not worth it. That said, in this post we will only focus on the syntactic recognition. For now we only need to know that, upon recognizing a rule, a semantic value may be computed. What exactly this semantic value is, is not important now.
Since we will combine syntactic and semantic analysis, we will call this step parsing. A Parser class will drive the parsing process. The parser will get tokens from the lexer, so we pass it a reference to the lexer (recall that the lexer synthesizes tokens from the input file).
The main goal of our parser is parsing a program, so we will have a parse_program member function. parse_program should return a semantic value, but at this moment we do not care about it.
Let's recall the syntax of 〈program〉
〈program〉 → 〈statement〉*
As said above, 〈statement〉* is equivalent to 〈rule〉 → ϵ | 〈statement〉〈rule〉. We will call this rule 〈statement-seq〉. Like this.
〈statement-seq〉 → ϵ | 〈statement〉〈statement-seq〉
Inside a 〈program〉, the 〈statement-seq〉 ends when the end of file is found. This suggests that we just have to keep parsing statements until we find an end of file, and a first implementation of parse_program does exactly this.
This is fine, but if you check the syntax of tiny you will see that the finalization condition of a 〈statement-seq〉 is not always the end of file. Sometimes it can be end (in the then or else part of an if statement, in the body of a for statement and in the body of a while statement) and sometimes it is else (in the then part of an if statement). This means that parse_statement_seq can be reused if we parameterize the finalization condition. Something like this.
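The real front end code is not reproduced here, but the idea can be sketched with a self-contained toy (the string-based token stream, ToyParser and all its members are inventions for illustration, not the real Parser class): the finalization condition becomes a callable parameter.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy sketch: a token stream modelled as a vector of strings plus a cursor.
// The last token ("eof") acts as an end-of-file sentinel.
struct ToyParser
{
  std::vector<std::string> tokens;
  size_t pos = 0;

  const std::string &peek () { return tokens[pos]; }
  void skip () { if (pos + 1 < tokens.size ()) ++pos; }

  // Consume one statement: in this toy, just a word followed by ";".
  void parse_statement () { skip (); if (peek () == ";") skip (); }

  // The finalization condition is a parameter, so the same function can
  // stop at end of file, "end" or "else" depending on the context.
  int parse_statement_seq (std::function<bool ()> done)
  {
    int count = 0;
    while (!done ())
      {
        parse_statement ();
        ++count;
      }
    return count;
  }

  int parse_program ()
  {
    return parse_statement_seq ([this] { return peek () == "eof"; });
  }
};
```

For the body of a while statement, the caller would instead pass a predicate that stops at an "end" token; for the then part of an if statement, one that stops at "end" or "else".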
And now we rewrite parse_program in terms of this parameterized parse_statement_seq.
Now we can proceed to parse a statement. Let's recall the syntax of a statement.
〈statement〉 → 〈declaration〉 | 〈assignment〉 | 〈if〉 | 〈while〉 | 〈for〉 | 〈read〉 | 〈write〉
Now we have to choose one of those alternatives. Fortunately tiny is so simple that it is easy to tell, just by peeking at the current token, which kind of statement it can be. We peek at the current token and check which statement it can initiate. If no statement can be initiated given the current token, then we call a diagnostic function with the unexpected token. We do some minimal error recovery by skipping all tokens until a semicolon is found.
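A self-contained sketch of this dispatch-plus-recovery logic (again a toy model with made-up names and string tokens, not the real front end code):

```cpp
#include <string>
#include <vector>

// Toy sketch of peeking the current token to dispatch a statement,
// with panic-mode recovery on an unexpected token.
struct StmtDispatcher
{
  std::vector<std::string> tokens;   // "eof" sentinel at the end
  size_t pos = 0;
  int errors = 0;

  const std::string &peek () { return tokens[pos]; }
  void skip () { if (peek () != "eof") ++pos; }

  // Panic mode: skip tokens until just past a ";" (or end of file).
  void skip_after_semicolon ()
  {
    while (peek () != "eof" && peek () != ";")
      skip ();
    skip ();  // also skip the ";" itself
  }

  void parse_statement ()
  {
    const std::string &t = peek ();
    if (t == "var" || t == "if" || t == "while" || t == "for"
        || t == "read" || t == "write")
      {
        // Here the real code dispatches to parse_variable_declaration,
        // parse_if_statement, etc.; the toy just consumes the statement.
        skip_after_semicolon ();
      }
    else
      {
        ++errors;                 // the real code diagnoses the token here
        skip_after_semicolon ();  // error recovery
      }
  }
};
```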
error_at is a function that tells GCC to emit a diagnostic at the given location; here we just complain about an unexpected token. For instance, the following erroneous program
will emit the following diagnostic.
If the front end has signaled any error, once it finishes, GCC will stop and return a non-zero error code. So no assembler is emitted at all for erroneous inputs.
A user-friendly front end, though, should attempt to continue in order to diagnose more errors for the user. A front end that stops at the first error may be OK, but it forces the user to repeatedly invoke the compiler to discover the remaining errors. It seems sensible, thus, to try to diagnose as many errors as possible in each invocation of the compiler (some compilers have a configurable error limit to avoid spending more time diagnosing errors than doing useful work). This implies that after an error has been diagnosed the front end has to recover from it. To do this the front end will have to use some error recovery strategy.
The strategy that we will use for tiny is rather simple and is commonly known as panic mode. When an unexpected token appears, the parser attempts to advance the input to some sensible position. Here we skip past a semicolon in the hope that a correct statement will start there. Note that error recovery is always a best effort. Until the compiler is able to read the mind of the programmer, it can only guess where the real error happened. It is not unlikely that a cascade of errors is generated because the parsing restarts in the wrong place. This is not the case for tiny, but some programming languages are noticeably hard when it comes to diagnosing syntactic errors.
Ok, now we can parse a program and its statement sequence. Let's see how we parse each individual statement.
A variable declaration statement has the following form.

〈declaration〉 → var 〈identifier〉 : 〈type〉 ;

So a straightforward implementation of a parser for this statement is the one below.
Here we use a function skip_token that, given a token id, checks if the current token has that same id. If it does, it skips the token and returns true; otherwise it diagnoses an error and returns false. When skip_token fails (i.e. returns false) we immediately go into panic mode and give up parsing the current statement. As you can see, this code quickly becomes tedious and repetitive. No wonder there exist tools, like ANTLR by Terence Parr, that automate the generation of recursive descent recognizers.
skip_token simply forwards to expect_token. expect_token checks the current token: if its id is the same as the one we expect, it skips the token and returns it; otherwise it diagnoses an error and returns an empty pointer (i.e. a null pointer).
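Here is a minimal standalone model of the expect_token / skip_token pair and how a declaration parser chains them (DeclParser and its string tokens are illustrative stand-ins for the real TokenPtr-based code, where an empty string plays the role of the null pointer):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy model of expect_token / skip_token and their use when parsing
//   <declaration> -> var <identifier> : <type> ;
struct DeclParser
{
  std::vector<std::string> tokens;   // "eof" sentinel at the end
  size_t pos = 0;

  const std::string &peek () { return tokens[pos]; }

  // skip_token: match an exact token and reduce the result to a flag.
  bool skip_token (const std::string &id)
  {
    if (peek () == id) { ++pos; return true; }
    return false;   // the real code diagnoses via expect_token/error_at
  }

  // Stands in for expect_token (IDENTIFIER): return the matched token,
  // or an empty string (the toy's "null pointer") on failure.
  std::string expect_identifier ()
  {
    if (!peek ().empty () && std::isalpha ((unsigned char) peek ()[0]))
      return tokens[pos++];
    return "";
  }

  bool parse_variable_declaration ()
  {
    if (!skip_token ("var")) return false;
    if (expect_identifier ().empty ()) return false;  // variable name
    if (!skip_token (":")) return false;
    if (expect_identifier ().empty ()) return false;  // type name
    return skip_token (";");
  }
};
```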
When parsing a variable declaration we invoke a parse_type function that parses the rule 〈type〉.
Its associated parsing function is rather obvious too.
Note that we return a boolean because we want the caller to know whether the parsing of the type succeeded.
Another interesting statement is the if-statement. Let's recall its syntax definition.
As shown, deriving a parse function for the rule 〈if〉 is not obvious because the two forms share a lot of elements. It may help to split the rule 〈if〉 in two rules as follows.
〈if〉 → 〈if-then〉 end | 〈if-then〉 else 〈statement〉* end
〈if-then〉 → if 〈expression〉 then 〈statement〉*
From this definition it is clear that we first have to parse an if token, followed by an expression, followed by a then token, followed by a statement sequence. In this case the statement sequence finishes when we encounter an end or an else token. If we find an end, we are done parsing the if statement. If we find an else, it means that we still have to parse another statement sequence (this time the sequence finishes only when we encounter an end) and then the final end token.
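As a standalone sketch of this structure (toy code with invented names, where the condition is a single token and statements are single words):

```cpp
#include <string>
#include <vector>

// Toy sketch of parsing
//   <if> -> <if-then> end | <if-then> else <statement>* end
struct IfParser
{
  std::vector<std::string> tokens;   // "eof" sentinel at the end
  size_t pos = 0;

  const std::string &peek () { return tokens[pos]; }
  bool skip_token (const std::string &id)
  { if (peek () == id) { ++pos; return true; } return false; }

  // Consume statements until one of the two stop tokens (or end of file).
  void parse_statement_seq_until (const std::string &a, const std::string &b)
  { while (peek () != a && peek () != b && peek () != "eof") ++pos; }

  bool parse_if_statement ()
  {
    if (!skip_token ("if")) return false;
    ++pos;                                        // toy condition: one token
    if (!skip_token ("then")) return false;
    parse_statement_seq_until ("end", "else");    // then part
    if (skip_token ("else"))
      parse_statement_seq_until ("end", "end");   // else part: only "end" stops
    return skip_token ("end");
  }
};
```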
skip_after_end is similar to skip_after_semicolon but skips past an end token instead. Note that these skip_x functions must protect themselves from an unexpected end of file.
The remaining statements are parsed likewise and do not bear any special complexity, except for a pervasive rule that appears in several of the statements: 〈expression〉. This rule is so special that it has its own parsing technique.
Parsing expressions is complex because the sublanguage of expressions must be flexible enough to express lots of different kinds of computations. Expressions can be understood as being formed by two kinds of elements: operators, which most of the time correspond to some punctuation (or keywords like not), and operands, which correspond to other expressions (usually a subset of the expression sublanguage). Operators have an arity, which is the number of operands they operate on, and a fixity, which defines the position of the operator with respect to its operands in the syntax. The arity of most operators is either unary, a single operand, or binary, two operands (some languages also have ternary operators, like the conditional operator, though they may need extra tokens in their syntax). When it comes to fixity, operators can be prefix, where the operands appear after the operator, or postfix, where the operands appear before the operator. For binary operators an extra fixity is possible, called infix: the operator appears between the two operands.
Some programming languages have only prefix operators (in some form the LISP family works this way). This greatly simplifies the syntactic analysis, as all unary expressions are of the form 〈op〉 〈operand1〉 and all binary expressions are of the form 〈op〉 〈operand1〉 〈operand2〉. Some notations (like the Reverse Polish notation) use only postfix operators, which has the same advantage.

While using prefix or postfix notation may be OK, most programming languages, including tiny, choose a notation closer, though not exactly identical, to the mathematical notation of arithmetic, where most operators are infix. Infix notation introduces an additional problem though: it is ambiguous unless we define some operator priority and associativity. Operator priority, following the rules of basic arithmetic, is what tells us that a * b + c is equivalent to (a * b) + c and not a * (b + c). Associativity is what tells us that a + b + c is (a + b) + c and not a + (b + c). Associativity is most of the time left-to-right, like in the case of a + b + c, but it can be right-to-left, like in exponentiation. Tiny does not have exponentiation, so all binary operators will associate left-to-right. In addition, some operators will be unary, like -x, +x or not x. Parentheses ( and ) can be used to change the priority of operations if needed.
Let's recall first the definition of expressions in tiny.
〈expression〉 → 〈primary〉 | 〈unary-op〉 〈expression〉 | 〈expression〉 〈binary-op〉 〈expression〉
This definition is not very useful because it does not define the priority of the operators. We defined, though, the priority of the operators in a table.
By following the table of priorities above, it is possible to derive the following syntax, where each 〈n-level-op〉 stands for the binary operators of that priority level in the table. The lower the level, the higher the priority of the operator.

〈expression〉 → 〈sixth-level〉
〈sixth-level〉 → 〈fifth-level〉 | 〈sixth-level〉 〈sixth-level-op〉 〈fifth-level〉
〈fifth-level〉 → 〈fourth-level〉 | 〈fifth-level〉 〈fifth-level-op〉 〈fourth-level〉
〈fourth-level〉 → 〈third-level〉 | 〈fourth-level〉 〈fourth-level-op〉 〈third-level〉
〈third-level〉 → 〈second-level〉 | 〈third-level〉 〈third-level-op〉 〈second-level〉
〈second-level〉 → 〈first-level〉 | 〈unary-op〉 〈second-level〉
〈first-level〉 → 〈primary〉
By allowing only strictly higher priority subexpressions in the right hand side of an operator (while allowing subexpressions of the same priority in its left hand side) we automatically force left-to-right association. This is why a + b + c cannot be parsed as a + (b + c): it would mean that the right hand side of the first + directly contains another + expression, which is not possible because + has the same priority and we explicitly disallowed that in the syntax above.
Unfortunately we cannot apply our algorithm here, because some of the rules are left-recursive. A left-recursive rule is of the form 〈rule〉 → 〈rule〉X. Our algorithm to parse such a rule would first need to parse the rule itself without having consumed any token from the input, which leads us into an infinite recursion. It is, indeed, possible to rewrite the rule so it is not left-recursive. For instance, 〈third-level〉 (and similarly the other left-recursive rules) can be rewritten as
〈third-level〉 → 〈second-level〉 | 〈second-level〉 〈third-level-op〉 〈third-level〉
but unfortunately this would change the associativity of the expressions: now they would associate right-to-left. Most tiny operators behave associatively (because of the mathematical properties of the operations), so it would not make much difference in terms of evaluation, but the integer division operator is not associative. Consider 100 / 10 / 2. If we evaluate it as (100/10)/2 the result is 5. If we evaluate it as 100/(10/2) the result is 20. Since the semantics of the language call for left-to-right association, the result in tiny must be 5.
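The two groupings are easy to check with plain integer arithmetic:

```cpp
// Left-to-right versus right-to-left association of integer division.
int div_left_assoc ()  { return (100 / 10) / 2; }  // what tiny must compute
int div_right_assoc () { return 100 / (10 / 2); }
```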
Clearly we need another strategy: priority parsing.
The notion of priority appears more or less naturally in the syntax of expressions. Can we use it to get a sensible algorithm? The answer is yes: it is called a Pratt parser and it is surprisingly simple yet powerful.
Pratt parser for expressions
A Pratt parser defines the concept of binding power, some sort of priority number: the higher the binding power, the higher the priority of the operator. This parser associates three extra values to the tokens of expressions: a left binding power, a null denotation function and a left denotation function.
Parsing an expression requires a right binding power. A top level expression will use the lowest priority possible. The parser starts by peeking at the current token t1 and skipping it. Then it invokes the null denotation function of t1. If this token cannot appear at this point, its null denotation function will diagnose an error and the parsing ends here. Otherwise the null denotation function will do something (which may include advancing the token stream, more on this later). Once we are back from the null denotation, the parser checks whether the current right binding power is lower than the left binding power of the current token (call it t2, but note that it may not be the one right after t1). If it is not, parsing ends here. Otherwise the parser skips the token and invokes the left denotation function of t2. The left denotation function will do something as well (again, possibly advancing the token stream). Once we are back from the left denotation, we again check whether the left binding power of the current token is higher than the current right binding power, and proceed likewise.
Ok, I tried, but the explanation above is rather dense. Behold the stunning simplicity of this parser at its core.
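To make the loop concrete, here is a complete toy Pratt parser that evaluates integer expressions. This calculator is an illustration of the technique, not the tiny front end code: tokens are single characters, the denotations return values instead of semantic trees, and all names are made up.

```cpp
#include <cctype>
#include <string>

// Standalone Pratt parser over a tiny calculator language.
struct PrattCalc
{
  std::string in;
  size_t pos = 0;

  char peek () { return pos < in.size () ? in[pos] : '\0'; }
  char skip () { return in[pos++]; }

  int left_binding_power (char tok)
  {
    switch (tok)
      {
      case '+': case '-': return 10;
      case '*': case '/': return 20;
      default: return 0;          // anything else stops the loop
      }
  }

  // Null denotation: a token at the beginning of a (sub)expression.
  int null_denotation (char tok)
  {
    if (tok == '(')
      {
        int v = parse_expression (0);  // top-level priority inside parens
        skip ();                       // consume ')' (no error checking: toy)
        return v;
      }
    int v = tok - '0';                 // an integer literal
    while (std::isdigit ((unsigned char) peek ()))
      v = v * 10 + (skip () - '0');
    return v;
  }

  // Left denotation: a token in infix position.
  int left_denotation (char tok, int left)
  {
    int right = parse_expression (left_binding_power (tok));
    switch (tok)
      {
      case '+': return left + right;
      case '-': return left - right;
      case '*': return left * right;
      default:  return left / right;
      }
  }

  // The core of the Pratt parser.
  int parse_expression (int right_binding_power)
  {
    int left = null_denotation (skip ());
    while (right_binding_power < left_binding_power (peek ()))
      left = left_denotation (skip (), left);
    return left;
  }
};

int pratt_eval (const std::string &s) { return PrattCalc{s}.parse_expression (0); }
```

Note that the strict comparison in the loop is what makes binary operators of equal binding power associate left-to-right.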
Intuitively, the idea is that while we encounter tokens with higher priority than that of the expression being parsed, we have to parse them first; when we find a token with lower priority, we stop. This only makes sense if we recursively invoke parse_expression, which we will.
First let's see the null denotations. They represent the action to perform when a token is found at the beginning of an expression.
```cpp
bool
Parser::null_denotation (const_TokenPtr tok)
{
  switch (tok->get_id ())
    {
    case Tiny::IDENTIFIER:
    case Tiny::INTEGER_LITERAL:
    case Tiny::REAL_LITERAL:
    case Tiny::STRING_LITERAL:
      return true;                      // literals: nothing else to do
    case Tiny::LEFT_PAREN:              // a parenthesized subexpression
      if (!parse_expression ())
        return false;
      tok = lexer.peek_token ();
      return skip_token (Tiny::RIGHT_PAREN);
    case Tiny::PLUS:                    // unary operators
      return parse_expression (LBP_UNARY_PLUS);
    case Tiny::MINUS:
      return parse_expression (LBP_UNARY_MINUS);
    case Tiny::NOT:
      return parse_expression (LBP_LOGICAL_NOT);
    default:
      unexpected_token (tok);
      return false;
    }
}
```
There is little to do for identifiers and for integer, real and string literals, so they trivially return true.
If the current token is ( it means that we have to parse a whole expression, so we recursively invoke parse_expression (with the lowest priority possible, as if it were a top-level expression). When we return from parse_expression we have to make sure that the current token is ).
If the current token is +, - or not, it means this is a unary operator. We invoke parse_expression recursively with the appropriate priority for each operator (LBP_UNARY_PLUS, LBP_UNARY_MINUS and LBP_LOGICAL_NOT respectively, more on this later).
It may not be obvious now, but tok is not the current token in the input stream: it is the previous one, because parse_expression already skipped tok before calling null_denotation.
The left denotation will be called for each token that can appear in an infix position. In tiny these will just be binary operators, but in other languages some other punctuation may appear there too.
Rather than writing another relatively large switch (like we did in null_denotation), here we call a function that, given a token, returns a pointer to the member function implementing the left denotation of that token. We could have taken the same approach in the null_denotation function, but given that there are far fewer unary operators it seemed unnecessary.
By using X-Macros again, we define our binary handlers for further consumption. The function get_binary_handler is implemented using them.
Now we can provide implementations of the binary operators. At this point all of them will look the same, so let's consider only the binary addition.
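The X-Macro machinery can be sketched in a self-contained way as follows (the handler list, the trivial handler bodies and the BinOps struct are simplified stand-ins for the real Parser members, whose handlers call parse_expression first):

```cpp
#include <string>

// One X-Macro list drives both the declarations and the lookup table.
#define BINARY_HANDLER_LIST \
  BINARY_HANDLER (plus, "+") \
  BINARY_HANDLER (minus, "-") \
  BINARY_HANDLER (mult, "*")

struct BinOps
{
  int left, right;

  // Declare one member function per entry in the list.
#define BINARY_HANDLER(name, tok) bool binary_##name ();
  BINARY_HANDLER_LIST
#undef BINARY_HANDLER

  typedef bool (BinOps::*BinaryHandler) ();

  // get_binary_handler maps a token to the corresponding member function.
  BinaryHandler get_binary_handler (const std::string &t)
  {
#define BINARY_HANDLER(name, tok) if (t == tok) return &BinOps::binary_##name;
    BINARY_HANDLER_LIST
#undef BINARY_HANDLER
    return nullptr;   // not a binary operator
  }
};

// Trivial handler bodies; the real ones parse the right operand first.
bool BinOps::binary_plus ()  { left = left + right; return true; }
bool BinOps::binary_minus () { left = left - right; return true; }
bool BinOps::binary_mult ()  { left = left * right; return true; }
```

A caller would fetch the handler and invoke it through the member pointer: `(parser.*handler) ()`.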
Finally, we only have to define the left binding powers of our tokens. Recall that the higher this number is, the higher the priority. These numbers fulfill the priorities defined in the table above.
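A sketch of what such a table of binding powers can look like (the concrete numbers are made up for illustration, since only their relative order matters, and the exact placement of some operators, such as not, is an assumption here):

```cpp
// Illustrative left binding powers; higher number = higher priority.
enum binding_powers
{
  LBP_LOWEST = 0,          // used for top-level expressions
  LBP_LOGICAL_OR = 10,
  LBP_LOGICAL_AND = 11,
  LBP_LOGICAL_NOT = 12,
  LBP_COMPARISON = 20,     // ==, !=, <, <=, >, >=
  LBP_PLUS = 30,           // binary + and -
  LBP_MINUS = 30,
  LBP_MUL = 40,            // *, /, %
  LBP_UNARY_PLUS = 50,     // unary operators bind tightest
  LBP_UNARY_MINUS = 50
};
```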
Phew. This has been long. But now we are in a position to recognize the syntax of tiny. In the next chapter we will assess the semantic validity of the input.
That's all for today.