Compiler Syntax Analysis

Syntax analysis is the second phase of a compiler.

  • Take a stream of tokens from the lexer as input.
  • Analyze the structure of the tokens.
  • Detect syntax errors.
  • Output a parse tree to represent the structure.

Syntax analysis uses a context-free grammar (CFG) for the analysis.

Context-free grammar (CFG)

Why CFG?
  • CFGs are more powerful than regular expressions.
  • CFGs are not the most powerful class of formal languages (see the Chomsky hierarchy),
  • but CFGs are enough to describe the structures of most programming languages.

CFG

Context-free grammar

A context-free grammar is a 4-tuple (V, Σ, R, S), where

  • V is a finite set called the variables,
  • Σ is a finite set, disjoint from V, called the terminals,
  • R is a finite set of production rules, with each rule being a variable and a string of variables and terminals, in the form A → α with A ∈ V and α ∈ (V ∪ Σ)*, and
  • S ∈ V is the start variable.

The production rules play the same role as the rules in a regular definition.

We can use the rules to derive the strings in a context-free language.

Terminals are the symbols which cannot produce more symbols. In a compiler, they are tokens.

Variables only exist while a derivation is unfinished.
They are not symbols in the alphabet, nor tokens. Thus, they do not appear in the strings of the language.

CFG example

For example, consider the grammar with the production rules

  1. E → E + E
  2. E → E * E
  3. E → ( E )
  4. E → - E
  5. E → id

  • Variables: {E}
  • Terminals: {+, *, (, ), -, id}

The production rules can also be "combined" and rewritten as

E → E + E | E * E | ( E ) | - E | id

  • The symbol "|" is the same as in regular expressions, meaning that there are multiple options.

The rules can be combined only if their LHSs are the same.
Sometimes the set of production rules alone is enough to define a grammar because

  • the variables can be obtained by looking at the LHS of each production rule,
  • the terminals are the symbols on the RHS of the rules excluding the variables, and
  • the start symbol is usually the variable on the LHS of the first rule.

Then, the above grammar can be written as just

E → E + E | E * E | ( E ) | - E | id

Derivations


Given a grammar G = (V, Σ, R, S), we can generate strings in the language by using a derivation.
A derivation (informally) is a procedure that applies the production rules, starting from the start symbol, until we reach a string which only contains terminals.

Each derivation step is denoted by ⇒.

For example, given the grammar above, here is a left-most derivation of id + id * id, meaning that in each iteration we replace the left-most nonterminal.

| Derivation | Rule used |
|---|---|
| E | Start symbol |
| E + E | Rule 1 |
| id + E | Rule 5 |
| id + E * E | Rule 2 |
| id + id * E | Rule 5 |
| id + id * id | Rule 5 |

The corresponding parse tree:

E
 ├─ E ─ id
 ├─ +
 └─ E
     ├─ E ─ id
     ├─ *
     └─ E ─ id

Similarly, the right-most derivation replaces the right-most nonterminal in each iteration.

| Derivation | Rule used |
|---|---|
| E | Start symbol |
| E + E | Rule 1 |
| E + E * E | Rule 2 |
| E + E * id | Rule 5 |
| E + id * id | Rule 5 |
| id + id * id | Rule 5 |

Each sequence of nonterminals and tokens that we derive at each step is called a sentential form.

  • The last sentential form only contains tokens and is called a sentence, which is a syntactically correct string in the programming language.

  • If w is a sentence and S is the start symbol, we can write S ⇒* w.

  • ⇒* means "derives in zero or more steps". It also means the RHS is derivable from the LHS.

  • From the above example, E ⇒* id + id * id.

Ambiguity

A grammar is ambiguous if there is a sentence in its language for which it is possible to construct multiple parse trees (using any type of derivation).

For example, id + id * id has a second parse tree under the same grammar:

E
 ├─ E
 │   ├─ E ─ id
 │   ├─ +
 │   └─ E ─ id
 ├─ *
 └─ E ─ id

Some properties
  • Each derivation (left-most, right-most, or otherwise) corresponds to exactly one parse tree, whether the grammar is ambiguous or not.
  • Each parse tree corresponds to possibly multiple derivations, whether the grammar is ambiguous or not.
  • Each parse tree corresponds to exactly one left-most derivation and exactly one right-most derivation, whether the grammar is ambiguous or not.
  • All derivations of the same sentence correspond to the same parse tree if the grammar is not ambiguous.
  • Multiple derivations of the same sentence may not correspond to the same parse tree if the grammar is ambiguous.
  • In general, deciding whether a grammar is ambiguous is undecidable, or uncomputable (like the halting problem).

Ambiguity Elimination

A compiler cannot use an ambiguous grammar, because each input program must be parsed into a unique parse tree that shows its structure.

To remove ambiguity from a grammar, we can transform it by hand into an unambiguous grammar.
This method is possible in theory but not widely used in practice because of the difficulty.

Most compilers use additional information (such as operator precedence and associativity) to avoid ambiguity.

CFG vs. Regular Expressions

| Feature | Context-free grammar | Regular expression |
|---|---|---|
| Expressive power | Can handle nested/recursive structures | Cannot handle nested structures |
| Grammar structure | Production rules with nonterminals | Concatenation, union, Kleene star |
| Applications | Programming language syntax, parsing | Lexical analysis, simple text matching |
| Recognizer | Pushdown automaton (PDA) | Finite automaton (DFA/NFA) |
| Example language | Balanced parentheses | Simple keywords, identifiers |

Limit of CFG

  • To prove that a language is not context-free, you need the pumping lemma for context-free languages.

  • To power up the CFG/PDA, we can try once more to add another stack to the machine, the same as what we did to finite automata.

  • This upgrade is the last one possible: a finite automaton with two stacks is as powerful as a Turing machine, which is the most powerful computational model that humans can implement right now.

Pushdown Automata

Intuitively, a PDA is a finite automaton plus a stack, with some modifications on the transitions to fit stack behaviors.

  • An NPDA is (formally) a 6-tuple (Q, Σ, Γ, δ, q₀, F), where
    • Q is a finite set of states,
    • Σ is a finite set, the input alphabet,
    • Γ is a finite set, the stack alphabet,
    • δ : Q × Σ_ε × Γ_ε → P(Q × Γ_ε) is the transition function,
    • q₀ ∈ Q is the start state, and
    • F ⊆ Q is the set of accept states.

For example, the context-free language {0ⁿ1ⁿ | n ≥ 0} has grammar S → 0S1 | ε and is recognized by a PDA that pushes a symbol for each 0 and pops one for each 1.

Top-Down Parsing

Parser

  • Now, we want to implement a parser which

    • takes a sequence of tokens as input;
    • analyzes the structure of the input;
    • generates a parse tree; and
    • indicates the syntax errors if there are any.
  • There are two types of parsers:

    • top-down and
    • bottom-up.

Top-down parsing tries to construct, from the grammar, a sequence of tokens which is the same as the input.
Bottom-up parsing tries to match the input tokens with grammar rules.

Top-down Parsing

Top-down parsing is only for a subset of context-free grammars.

  • Top-down parsing means that we generate the parse tree from top to bottom.

  • Remember that the internal vertices in a parse tree are the nonterminals (variables), while the leaves are the terminals (tokens).

  • The parse tree construction depends on the production rules in the grammar.

  • If we want to construct the parse tree from top to bottom, the production rules have to be nicely designed.

Leftmost derivation

Suppose the RHS of each production rule starts with a different terminal.
Then the parsing of the left branch always finishes before the right branch.

For example, the following grammar parses single variable declarations (if we do not care about types here):

E → id T
T → ;

Suppose the input is id ;.
The input tokens can be uniquely parsed into

E
 ├─ id
 └─ T ─ ;

Recursive-Descent Parser

However, our life cannot be that easy in most cases.

  • The previous grammar does not allow multiple variable declarations in one statement.

So, consider this grammar:

E → id ;
E → id , E

and try to parse id , id ;.

We have two different choices for the first derivation: E → id ; and E → id , E.

We cannot decide which production rule should be used for the derivation.
The parser will pick a valid rule arbitrarily.

If the guess is wrong, the parser cannot proceed on some input tokens, and it will backtrack.

Actually, you have seen backtracking elsewhere: depth-first search (DFS).

We can express the parsing by using a DFS tree.

  • Each vertex in the tree is a valid sentential form (a string of terminals and nonterminals) reached after some number of derivation steps.
  • Each edge is defined by applying a production rule.

The minor difference here is that we don't need to go back to the root (as in DFS) when the parsing is finished.

Note that this DFS tree is different from the parse tree, because a vertex represents a sentential form, which is equivalent to a (partial) parse tree.

DFS

Each nonterminal is implemented by an individual function.

  • Iterate over every rule with that nonterminal on the LHS.
  • Remember to restore the input tokens when a derivation fails.

Implementation

The parse tree generation can be done whenever the algorithm successfully applies a rule and continues parsing.

You can also record the derivation rules and generate the parse tree when parsing is finished.

Restoring the parsed tokens after a failed derivation can be implemented by a stack. When a token is matched with a terminal, push the token onto the stack. When an error occurs, pop the tokens and put them back into the input.

The main routine starts from matching the start variable.

The parser reports a syntax error when the main routine has tried all possible production rules. A syntax error can be

  • the parser receives a token which cannot be parsed;
  • after processing all tokens, the parse is incomplete (some leaves are nonterminals); or
  • when the parsing is complete (all leaves are terminals), there are some tokens remaining in the input.
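To make this concrete, here is a minimal backtracking recursive-descent parser in Python for the grammar E → id ; | id , E. This is a sketch, not the exact algorithm from the slides: the token spellings and function names are assumptions, and passing the position functionally means a failed attempt simply never advances it, which plays the role of the token-restoring stack described above.

```python
# A sketch of a backtracking recursive-descent parser for
#   E -> id ;  |  id , E
# Tokens are plain strings; all names here are illustrative assumptions.

def parse_E(tokens, pos):
    """Try each production for E in order; return the position after a
    successful derivation, or None if every rule fails."""
    # Rule 1: E -> id ;
    if tokens[pos:pos + 2] == ["id", ";"]:
        return pos + 2
    # Rule 2: E -> id , E  (tried only if Rule 1 failed: backtracking)
    if tokens[pos:pos + 2] == ["id", ","]:
        return parse_E(tokens, pos + 2)
    return None  # derivation failed; the caller resumes at its old position

def parse(tokens):
    end = parse_E(tokens, 0)
    return end == len(tokens)  # success only if every token is consumed

print(parse(["id", ",", "id", ";"]))  # True
print(parse(["id", ","]))             # False: incomplete declaration
```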

Disadvantages of Recursive-Descent Parser

  • The implementation (in code) can be very complicated/ugly because it needs to try many production rules; imagine many try-catch scopes nested within each other.
  • It wastes a lot of time on backtracking.
  • It needs additional space to store the tokens for restoring purposes.
  • In practice, backtracking recursive-descent parsers are rarely used.


Predictive Parser

To avoid backtracking, we want to design a parser which can predict the correct production rule by peeking at the upcoming input.

A correct guess lets the parser uniquely choose a production rule for parsing.

The idea is to use a temporary space to store some tokens in advance, like a buffer. Then, use the buffered tokens to make decisions.

This technique is very similar to the space-time trade-off in dynamic programming, which we have learned in algorithms.

The time complexity of recursion with backtracking can grow exponentially in the number of tokens.

But, if we can choose rules uniquely, the time complexity becomes O(n).

Consider the grammar

E → id T
T → ;
T → , E

which is equivalent to the previous one, and try to parse id , id ;.

  • In the first iteration, the parser has to use the rule E → id T.

  • In the second iteration, there are two production rules: "T → ;" or "T → , E".

  • This uncertainty can be easily solved by just "looking" at the next token from the lexer. This token is called the lookahead token.

  • Normally, when the lexer returns a token to the parser, the token is consumed. But the lookahead token remains in the input.

  • The lookahead token in this example is ",". So, the parser knows the rule is T → , E.

Because the parser reads tokens from Left to right, does Leftmost derivation, and looks at most 1 lookahead token (which cannot be a nonterminal), this parser/grammar is called LL(1).

Some grammars may not be LL(1).

  • For example, the one we have seen:

E → id ;
E → id , E

  • Looking one token ahead is insufficient for this grammar.

  • E → id ; and E → id , E agree on the first token.

  • If the lookahead token is "id", we cannot decide which rule will be used. Thus, to parse this grammar without backtracking, the parser needs to look 2 tokens ahead.

    • This is an LL(2) grammar.

In general, we can design LL(k) grammars, for constant k.

In practice, more lookahead tokens make no big difference but increase the implementation difficulty.

More importantly, LL(1) grammars are already powerful enough.

Left factoring

Left factoring is a grammar transformation that converts a grammar to LL(1).

Back to the example:

E → id ;
E → id , E

is not LL(1) because the first token of the two RHSs is the same. One lookahead token is not enough to distinguish the two productions.

We can solve this issue by introducing a new nonterminal T that factors out the common token.

And convert the grammar to

E → id T
T → ;
T → , E

The new grammar is LL(1), and it is equivalent to the original one.

This conversion is called left factoring.

More formally, for each nonterminal A which has multiple productions starting with the same prefix α:

A → α β₁ | α β₂ | … | α βₙ

where α is a non-empty sequence of grammar symbols (terminals and nonterminals), and β₁, …, βₙ are (possibly empty) sequences of grammar symbols.

Create a new nonterminal A′ and transform the rules as

A → α A′
A′ → β₁ | β₂ | … | βₙ
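A sketch of left factoring as code may help. The snippet below assumes a grammar represented as a dict from a nonterminal to a list of RHSs (each RHS a list of symbols), factors only a shared first symbol (a full implementation would factor the longest common prefix), and names the new nonterminal by priming; all of these are assumptions of this sketch.

```python
# A sketch of left factoring: group a nonterminal's productions by their
# first symbol and factor each shared symbol into a new primed nonterminal.
from collections import defaultdict

def left_factor(grammar):
    result = {}
    for head, bodies in grammar.items():
        groups = defaultdict(list)
        for body in bodies:
            groups[body[0] if body else ""].append(body)
        new_bodies = []
        for first, group in groups.items():
            if len(group) == 1:
                new_bodies.append(group[0])
            else:
                new_head = head + "'"        # A -> a A',  A' -> b1 | b2 | ...
                new_bodies.append([first, new_head])
                result[new_head] = [body[1:] for body in group]  # [] stands for ε
        result[head] = new_bodies
    return result

grammar = {"E": [["id", ";"], ["id", ",", "E"]]}
print(left_factor(grammar))
# {"E'": [[';'], [',', 'E']], 'E': [['id', "E'"]]}
```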

Left Recursion Elimination

Some grammars are not good enough even after left factoring.

  • For example, consider again a grammar for variable declarations:

E → E , id
E → id

  • No two production rules share a common prefix on their RHSs, but lookahead tokens still do not work.

Suppose the parser at some point needs to derive the nonterminal E.

One may try: if the lookahead token is id, the parser uses E → id.
If the lookahead token is E, it uses E → E , id.

Be careful! E is a nonterminal, not a token. A lexer can never produce such a lookahead token.

As a result, E → E , id will never be used. Obviously, this is wrong.

The problem is caused by E → E , id: the RHS starts with a nonterminal, which cannot be matched against a lookahead token.

This type of grammar is called left recursive.

Note that this is still left-most derivation. Left-most or right-most derivation is not related to the grammar itself; it only relates to the order in which we derive the nonterminals.

Formally, a left recursive grammar has a valid derivation A ⇒⁺ A α, where A is a nonterminal and α is a string of grammar symbols.

In general, a left recursive production rule has the form

A → A α | β

This production can derive

A ⇒ A α ⇒ A α α ⇒ … ⇒ β α … α

The parsing stops at β, and α can be produced as many times as needed.

So, we can let A derive β first, followed by some α's:

A → β A′
A′ → α A′ | ε

Note that the rule for A′ could also have been written as A′ → A′ α | ε, but that form is not acceptable: it is still left recursive.

Formally, for each nonterminal A which has one or more productions whose RHSs start with A itself:

A → A α₁ | A α₂ | … | A αₙ | β₁ | β₂ | … | βₘ

where the αᵢ and βⱼ are (possibly empty) sequences of grammar symbols, and no βⱼ starts with A.

Create a new nonterminal A′ and transform the grammar as

A → β₁ A′ | β₂ A′ | … | βₘ A′
A′ → α₁ A′ | α₂ A′ | … | αₙ A′ | ε
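The same transformation can be sketched in code. As before, the grammar representation, the priming convention, and the use of an empty RHS for ε are assumptions made for illustration.

```python
# A sketch of immediate left recursion elimination:
#   A -> A a1 | ... | A an | b1 | ... | bm
# becomes
#   A -> b1 A' | ... | bm A'      A' -> a1 A' | ... | an A' | ε
def eliminate_left_recursion(head, bodies):
    recursive = [b[1:] for b in bodies if b and b[0] == head]  # the a_i parts
    base = [b for b in bodies if not b or b[0] != head]        # the b_j parts
    if not recursive:
        return {head: bodies}
    new_head = head + "'"
    return {
        head: [b + [new_head] for b in base],
        new_head: [a + [new_head] for a in recursive] + [[]],  # [] stands for ε
    }

# E -> E - T | T  becomes  E -> T E',  E' -> - T E' | ε
print(eliminate_left_recursion("E", [["E", "-", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['-', 'T', "E'"], []]}
```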

Exercise

Given the following grammar, try to do left factoring and eliminate left recursion.


Ans:


Limitations

In the last example, the grammar is left associative before the conversion, but becomes right associative after the conversion.

This totally changes the structure of the parse tree for some expressions, like id - id - id.

In fact, top-down parsing cannot solve this issue. We need to either use a bottom-up parser or get stuck in the implementation details.

There are also some unambiguous context-free grammars that cannot be converted into LL(1) even after doing left factoring and left recursion elimination.

  • For example, the one we used to show ambiguity elimination.

After left factoring, it is still not LL(1).

Here are some last words about (recursive) predictive parsers.

  • A predictive parser can always correctly predict what it has to do next.
  • A predictive parser can always be implemented as a recursive parser that uses lookahead tokens instead of backtracking.
  • Without further specification, we consider recursive parsers and predictive parsers to be the same.
  • One major disadvantage of (recursive) predictive parsers is that they are not very efficient in implementation. Each production rule is implemented as a function. The parser needs to make many function calls and returns, which consume a lot of resources.

To avoid the function calls, we introduce nonrecursive predictive parsers.

Non-recursive Parser

To avoid recursions, we introduce non-recursive predictive parsers.

Intuitively, predictions are required when a nonterminal has multiple production rules.

The predictions are based on the next token. One token is enough because we are parsing LL(1) grammars.

However, this method runs into trouble when the RHSs of some rules start with nonterminals.

  • For example, suppose A → B c | D e.
  • We cannot decide whether we are going to use B c or D e without knowing which tokens B and D can derive first.

Thus, we can "preprocess" the grammar to find the first tokens derived by each nonterminal before parsing the tokens.

The result of the preprocessing is represented by a table, called the parsing table.
The rows are the nonterminals and the columns are the terminals.
The entry M[A, a] is a production rule, meaning that if the current nonterminal is A and the input token is a, we will apply that rule.
The parser can parse the input tokens by checking the parsing table and using a stack.

This method is similar to the pushdown automaton but presented differently.

  • Initially, push $ and S onto the stack (S on top of $), where S is the start symbol.
    Also append $ to the end of the input stream of tokens.

Parsing using a parsing table

  • In each iteration,

    • assume the next input token is a;
    • pop the top item from the stack, denoted by X;
    • if X is a terminal, then try to match it with the next input token a;
    • if X is a nonterminal, then the entry M[X, a] in the parsing table is a production rule, denoted by X → Y₁Y₂…Yₖ. Push Y₁Y₂…Yₖ onto the stack from right to left, so that Y₁ ends up on top.
  • When the stack pops $ and all input tokens are consumed, the parsing halts.

  • $ is an artificial token that marks the end of the stream.
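Here is a sketch of this table-driven loop in Python, using the parsing table of the expression grammar from the example that follows. The dict-based table encoding and the names are assumptions; an empty RHS stands for an ε-production.

```python
# A sketch of the non-recursive (table-driven) LL(1) parsing loop.
TABLE = {
    ("E",  "id"): ["T", "E'"],      ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],      ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"):  [], ("T'", "$"): [],
    ("F",  "id"): ["id"],           ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]            # $ marks the end of the input
    stack = ["$", "E"]                 # start symbol on top of $
    pos = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            rule = TABLE.get((top, tokens[pos]))
            if rule is None:
                return False           # blank entry: syntax error
            stack.extend(reversed(rule))   # push the RHS right to left
        elif top == tokens[pos]:
            pos += 1                   # terminal matched: consume the token
        else:
            return False               # terminal mismatch: syntax error
    return pos == len(tokens)

print(ll1_parse(["(", "id", "+", "id", ")", "*", "id"]))  # True
```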

Example

Given the following parsing table, try to parse ( id + id ) * id.

| | id | + | * | ( | ) | $ |
|---|---|---|---|---|---|---|
| E | E → TE′ | | | E → TE′ | | |
| E′ | | E′ → +TE′ | | | E′ → ε | E′ → ε |
| T | T → FT′ | | | T → FT′ | | |
| T′ | | T′ → ε | T′ → *FT′ | | T′ → ε | T′ → ε |
| F | F → id | | | F → (E) | | |
| Stack (top on the right) | Input | Action |
|---|---|---|
| $E | ( id + id ) * id $ | E → TE′ |
| $E′T | ( id + id ) * id $ | T → FT′ |
| $E′T′F | ( id + id ) * id $ | F → (E) |
| $E′T′)E( | ( id + id ) * id $ | match ( |
| $E′T′)E | id + id ) * id $ | E → TE′ |
| $E′T′)E′T | id + id ) * id $ | T → FT′ |
| $E′T′)E′T′F | id + id ) * id $ | F → id |
| $E′T′)E′T′id | id + id ) * id $ | match id |
| $E′T′)E′T′ | + id ) * id $ | T′ → ε |
| $E′T′)E′ | + id ) * id $ | E′ → +TE′ |
| $E′T′)E′T+ | + id ) * id $ | match + |
| $E′T′)E′T | id ) * id $ | T → FT′ |
| $E′T′)E′T′F | id ) * id $ | F → id |
| $E′T′)E′T′id | id ) * id $ | match id |
| $E′T′)E′T′ | ) * id $ | T′ → ε |
| $E′T′)E′ | ) * id $ | E′ → ε |
| $E′T′) | ) * id $ | match ) |
| $E′T′ | * id $ | T′ → *FT′ |
| $E′T′F* | * id $ | match * |
| $E′T′F | id $ | F → id |
| $E′T′id | id $ | match id |
| $E′T′ | $ | T′ → ε |
| $E′ | $ | E′ → ε |
| $ | $ | accept |

Some entries in the parsing table are empty, meaning that if the parser tries to parse the token a with the nonterminal A, it returns an error message.

For example, if the input contains a syntax error, the parsing will get stuck somewhere in the middle.

The error handling will be discussed at the end.

Intuitively, a parsing table enumerates all possible tokens that can be derived first by each nonterminal.

When the parser reads a token from the input, it has a unique production rule to apply.

Thus, we need to analyze the grammar and find the prefix (the first token) of all possible sentential forms derived by each nonterminal.

First

By putting all the firstly produced tokens into a set, we define FIRST(X).

  • For example, for the grammar above:
  • FIRST(F) = {(, id} because F → (E) | id.
  • FIRST(T′) = {*, ε} because T′ → *FT′ | ε.
  • FIRST(E′) = {+, ε} because E′ → +TE′ | ε.
  • FIRST(T) = FIRST(E) = {(, id} because E → TE′, T → FT′, and ε ∉ FIRST(F).

Formally, for each grammar symbol X, FIRST(X) is computed by the following procedure.

Find_First

  1. If X is a terminal, then FIRST(X) = {X}.
  2. If X → ε is a production, then add ε to FIRST(X).
  3. If X → Y₁Y₂…Yₖ is a production, then add FIRST(Y₁) \ {ε} to FIRST(X); if ε ∈ FIRST(Y₁), also add FIRST(Y₂) \ {ε}; and so on. If ε ∈ FIRST(Yᵢ) for all i, then add ε to FIRST(X).

Note: each Yᵢ in rule 3 can be either a terminal or a nonterminal.
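These rules turn directly into a fixed-point computation. The snippet below is a sketch using the same dict-based grammar representation as the earlier snippets, with an empty RHS standing for ε; it computes the FIRST sets of the expression grammar.

```python
# A fixed-point sketch of the FIRST computation.
def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}          # rule 1: FIRST(a) = {a}
    first.update({n: set() for n in grammar})
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                nullable_prefix = True
                for sym in body:                 # rule 3, symbol by symbol
                    first[head] |= first[sym] - {"ε"}
                    if "ε" not in first[sym]:
                        nullable_prefix = False
                        break
                if nullable_prefix:              # rule 2, and rule 3's ε case
                    first[head].add("ε")
                changed |= len(first[head]) != before
    return first

grammar = {
    "E": [["T", "E'"]], "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]], "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
first = compute_first(grammar, {"id", "+", "*", "(", ")"})
print(first["E"])   # {'(', 'id'}  (set order may vary)
print(first["E'"])  # {'+', 'ε'}
```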

Follow

When a nonterminal produces ε, things get complicated.

  • For example, E′ → ε.
  • The parser needs to use the rule E′ → ε when the E′ on top of the stack should derive nothing.

However, ε is not an input token which can be used for making predictions.

To handle this case, we also define FOLLOW(A): the set of tokens that can possibly be derived, by some rules, right after A.

  • In this example, FOLLOW(E′) = {), $} because E′ only appears at the very end of the RHSs of E → TE′ and E′ → +TE′, so it is followed by whatever can follow E, namely ) (from F → (E)) and $.

  • When the parser receives ) or $ as the next token, it knows E′ → ε is the correct rule.

Find_Follow

The FOLLOW sets are computed by the following rules.

  1. Add $ to FOLLOW(S), where S is the start symbol.
  2. If there is a production A → α B β, then add FIRST(β) \ {ε} to FOLLOW(B).
  3. If there is a production A → α B, or a production A → α B β with ε ∈ FIRST(β), then add FOLLOW(A) to FOLLOW(B).
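FOLLOW can be computed by a similar fixed point. This sketch reuses the grammar and first values from the FIRST snippet above; the encoding is the same assumption.

```python
# A fixed-point sketch of the FOLLOW computation, reusing `grammar` and
# `first` from the FIRST snippet.
def compute_follow(grammar, first, start):
    follow = {n: set() for n in grammar}
    follow[start].add("$")                       # rule 1: $ follows the start
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:       # only nonterminals get FOLLOW
                        continue
                    before = len(follow[sym])
                    rest_nullable = True
                    for nxt in body[i + 1:]:     # rule 2: FIRST of what follows
                        follow[sym] |= first[nxt] - {"ε"}
                        if "ε" not in first[nxt]:
                            rest_nullable = False
                            break
                    if rest_nullable:            # rule 3: inherit FOLLOW(head)
                        follow[sym] |= follow[head]
                    changed |= len(follow[sym]) != before
    return follow

follow = compute_follow(grammar, first, "E")
print(follow["E'"])  # {')', '$'}
```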

Parsing Table Construction

Recall that the parsing table has one row for each nonterminal and one column for each terminal.

The entry M[A, a] is the production rule that the parser will apply when the current nonterminal is A and the next token is a.

For each production A → α in the (left-factored, non-left-recursive) grammar, do the following:

  1. For each token a in FIRST(α), add the grammar production A → α to M[A, a].
  2. If ε ∈ FIRST(α), then for each token b in FOLLOW(A), add A → α to M[A, b].
  3. All other entries in the table are left blank and correspond to a syntax error.

Note that Rule 2 is applied when α is ε because ε is in FIRST(ε).

To create a parsing table for a non-recursive parser,

  • eliminate any ambiguity from the grammar,
  • eliminate left recursion from the grammar,
  • left factor the grammar,
  • compute the FIRST sets for all tokens and nonterminals,
  • compute the FOLLOW sets for all nonterminals, and
  • use those FIRST and FOLLOW sets to construct the parsing table.
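Putting the two sets together, the table construction itself is short. This sketch reuses grammar, first, and follow from the snippets above; a real implementation would report a conflict if two productions land in the same entry (which would mean the grammar is not LL(1)) instead of silently overwriting it.

```python
# A sketch of LL(1) parsing table construction from FIRST and FOLLOW.
def first_of_string(body, first):
    """FIRST of a whole RHS (rule 3 applied across the string)."""
    out = set()
    for sym in body:
        out |= first[sym] - {"ε"}
        if "ε" not in first[sym]:
            return out
    out.add("ε")                         # every symbol was nullable
    return out

def build_ll1_table(grammar, first, follow):
    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            f = first_of_string(body, first)
            for a in f - {"ε"}:          # rule 1: FIRST-based entries
                table[(head, a)] = body
            if "ε" in f:                 # rule 2: FOLLOW-based entries
                for b in follow[head]:
                    table[(head, b)] = body
    return table

table = build_ll1_table(grammar, first, follow)
print(table[("E'", "+")])  # ['+', 'T', "E'"]
print(table[("E'", ")")])  # []  (the ε-production)
```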

Bottom-Up Parsing

Limit of Top-Down Parsing

Back to the example in Top-Down Parsing. The following grammar is left-recursive:

E → E - T | T
T → id

Thus, top-down parsing needs to eliminate the left recursion first.

However, the conversion changes the structure of some sentences, like id - id - id.

Before the conversion, the parse tree groups the expression as (id - id) - id:

E
 ├─ E
 │   ├─ E ─ T ─ id
 │   ├─ -
 │   └─ T ─ id
 ├─ -
 └─ T ─ id

After the conversion to E → T E′, E′ → - T E′ | ε, the parse tree groups it as id - (id - id):

E
 ├─ T ─ id
 └─ E′
     ├─ -
     ├─ T ─ id
     └─ E′
         ├─ -
         ├─ T ─ id
         └─ E′ ─ ε

Bottom-Up Parsing

Bottom-up parsing is more natural to humans.

Think about how we analyze arithmetic expressions.
First, we calculate the subexpressions in parentheses or with the operator of the highest precedence.
Then, the intermediate results are involved in the following calculations.
The procedure is not from left to right.

The same thing happens when we analyze sentences in a program.

In terms of parse tree construction, the bottom-up procedure starts from isolated leaves, merges some of them to form a subtree, and eventually constructs the entire tree by placing a root.

The parse tree generated bottom-up is no different from one generated top-down: the internal vertices are nonterminals and the leaves are tokens.

In fact, you have seen a similar procedure in Algorithms: the construction of Huffman codes.

Because bottom-up parsing is an inverse of top-down parsing, the basic strategy is to use the LHS of some production rules to replace the RHS after reading some input tokens.

Thus, in bottom-up parsing, there are two basic operations.

  • Shift
    • The parser needs to read more tokens from the input.
    • The tokens it has already read are insufficient for any production rule.
  • Reduce
    • The parser has already read enough tokens.
    • It can replace the RHS of some production rule with the LHS.

Such a parser is called a shift-reduce parser.

The parser also needs a stack to hold the tokens which have been read but not consumed (reduced) yet.

Example of Bottom-Up Parsing

  • Let's look at a small example first. Given the following grammar, try to parse abbcbcde.

S → aABe
A → Abc | b
B → d

| No. | Stack | Input | Action |
|---|---|---|---|
| 0 | | abbcbcde | |
| 1 | a | bbcbcde | shift |
| 2 | ab | bcbcde | shift |
| 3 | aA | bcbcde | reduce using A → b |
| 4 | aAb | cbcde | shift |
| 5 | aAbc | bcde | shift |
| 6 | aA | bcde | reduce using A → Abc |
| 7 | aAb | cde | shift |
| 8 | aAbc | de | shift |
| 9 | aA | de | reduce using A → Abc |
| 10 | aAd | e | shift |
| 11 | aAB | e | reduce using B → d |
| 12 | aABe | | shift |
| 13 | S | | reduce using S → aABe |

The above example shows how bottom-up parsing works on a small instance.

However, to formally define a bottom-up parser, we need to generalize this procedure to all grammars. Many details will be discussed.

Even within this small example, a careful reader may find that the above procedure has some problems.
In Step 2, we read b. Then, we immediately reduce b to A in Step 3.
But things are different in Step 4: we read b again, but the parser waits until it reads another c and reduces Abc to A.

In general, similar to the predictive parser, there might be multiple options in bottom-up parsing. We need to clearly define which option is used under a certain condition.

LR Parsing Algorithm

LR Parser

There are many different ways to solve the above problem, such as using recursion (same as in predictive parsing, recursion simply enumerates all possible options).

In this course, we only introduce LR(k) parsers (Left-to-right scan, Rightmost derivation in reverse, k lookahead tokens; LR parsers for short) because

  • LR parsers can be used for almost all (but not all) context-free grammars;
  • they are the most powerful non-backtracking shift-reduce parsers;
  • they can be implemented very efficiently; and
  • they are strictly more powerful than LL(k) grammars.

When a reduction is performed, the RHS of the production rule is already on the stack; that is some additional information to help decision making.

The parser puts state symbols onto the stack to speed up checking the content of the stack.

Imagine that you want to check what's on a given stack. You need to pop everything out, see if the content is the same as what you want, then push everything back. This is very inefficient.

Thus, we use a state symbol to indicate the current content of the stack.

This symbol is called a state symbol because it acts exactly like a state in a DFA.

We can consider each state in a DFA to mean: "To reach this state, the input must be some specific strings."

Also, each state symbol on the stack corresponds to a grammar symbol, like a pair.

To parse a grammar, you are also given a parsing table.
This parsing table again enumerates the actions the parser needs to take in all possible situations.

Each row in the table represents a state in the PDA.
The columns are split into two individual fields: ACTION and GOTO.

  • Each column in ACTION represents a terminal.
  • Each column in GOTO represents a nonterminal.
  • The value of the entry ACTION[s, a] shows the action taken by the parser when the current state is s and the input token is a.
  • The values are of two types:
    • "shift j", meaning that the parser pushes the next input token onto the stack and moves to state j;
    • "reduce A → β", meaning that the parser reduces some (non)terminals using the production rule A → β.
  • The value of the entry GOTO[s, A] is a state symbol j, meaning that the parser goes to state j after it reduces the stack using a production rule A → β, for some β.
LR Parsing Algorithm
  1. Put the special symbol $ at the end of the input.
  2. Put state 0 at the bottom of the stack.
  3. In each iteration, suppose the current state (the one on top of the stack) is s and the next input token is a.
    • If ACTION[s, a] is "shift j", then push a and j onto the stack and consume a.
    • If ACTION[s, a] is "reduce A → β", then pop 2|β| items (β together with its state symbols) off the stack, let t be the state now on top, and push A and GOTO[t, A]. At the same time, the parser outputs A → β.
    • If ACTION[s, a] is "accept" and the stack contains only the start symbol of the grammar (above state 0), the parser accepts the input.
    • For all other cases, e.g., ACTION[s, a] is blank, the parser finds a syntax error and switches to error recovery.
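The driver loop itself is only a few lines. Below is a sketch in Python; the tuple encodings ("shift", j), ("reduce", head, |β|), ("accept",) and the dict-based ACTION/GOTO tables are assumptions of this sketch, with the tables themselves coming from the constructions described below.

```python
# A sketch of the LR driver. The stack interleaves grammar symbols and
# state symbols, with state 0 at the bottom.
def lr_parse(tokens, action, goto):
    tokens = tokens + ["$"]
    stack = [0]
    pos = 0
    while True:
        state, a = stack[-1], tokens[pos]
        act = action.get((state, a))
        if act is None:
            return False                   # blank entry: syntax error
        if act[0] == "shift":
            stack += [a, act[1]]           # push the token and the new state
            pos += 1
        elif act[0] == "reduce":
            _, head, body_len = act
            del stack[len(stack) - 2 * body_len:]   # pop 2|β| items
            stack += [head, goto[(stack[-1], head)]]
            print("reduce", head)          # report the reduction
        else:                              # ("accept",)
            return True
```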

Example

The table below is for the grammar (productions numbered 1-6)

  1. E → E + T
  2. E → T
  3. T → T * F
  4. T → F
  5. F → (E)
  6. F → id

and the input id + id * id.

| State | id | + | * | ( | ) | $ | E | T | F |
|---|---|---|---|---|---|---|---|---|---|
| 0 | s5 | | | s4 | | | 1 | 2 | 3 |
| 1 | | s6 | | | | acc | | | |
| 2 | | r2 | s7 | | r2 | r2 | | | |
| 3 | | r4 | r4 | | r4 | r4 | | | |
| 4 | s5 | | | s4 | | | 8 | 2 | 3 |
| 5 | | r6 | r6 | | r6 | r6 | | | |
| 6 | s5 | | | s4 | | | | 9 | 3 |
| 7 | s5 | | | s4 | | | | | 10 |
| 8 | | s6 | | | s11 | | | | |
| 9 | | r1 | s7 | | r1 | r1 | | | |
| 10 | | r3 | r3 | | r3 | r3 | | | |
| 11 | | r5 | r5 | | r5 | r5 | | | |
| Stack | Input | Output |
|---|---|---|
| 0 | id + id * id $ | |
| 0 id 5 | + id * id $ | shift 5 |
| 0 F 3 | + id * id $ | reduce F → id |
| 0 T 2 | + id * id $ | reduce T → F |
| 0 E 1 | + id * id $ | reduce E → T |
| 0 E 1 + 6 | id * id $ | shift 6 |
| 0 E 1 + 6 id 5 | * id $ | shift 5 |
| 0 E 1 + 6 F 3 | * id $ | reduce F → id |
| 0 E 1 + 6 T 9 | * id $ | reduce T → F |
| 0 E 1 + 6 T 9 * 7 | id $ | shift 7 |
| 0 E 1 + 6 T 9 * 7 id 5 | $ | shift 5 |
| 0 E 1 + 6 T 9 * 7 F 10 | $ | reduce F → id |
| 0 E 1 + 6 T 9 | $ | reduce T → T * F |
| 0 E 1 | $ | reduce E → E + T |
| 0 E 1 | $ | accept |

SLR Parsing

Previously, we introduced the LR parsing algorithm using the parsing table.
Now, the question is how to construct a parsing table for a given context-free grammar.

In fact, there are many ways to construct a parsing table, for different LR parsers:

LR(0), SLR, LALR(1), LR(1), etc. (by increasing power).

LR(k) with k > 1 is not used in practice because of the complexity.

We will focus on SLR (simple LR), then move on to LR(1) and LALR(1).

SLR Parser

To construct an parsing table, we do three things

  • augment the context-free grammar;
  • construct a DFA based on the computation of set of items;
  • represent the DFA using the transition table;

The construction in is very similar to transforming an NFA to a DFA, except that it will be based on sets of items of the same behavior instead of sets of NFA states.

Augmented Grammar

Given a grammar with a start symbol .

we construct the corresponding augmented grammar by artificially introducing a nonterminal and a production rule.

This artificial production rule seems to be useless, but it guarantees that the parser accepts the input when it reduces using .

LR(0) items

The items are originally defined for the LR(0) parser, but they do the same thing in SLR.

An LR(0) item (item, for short) is simply a grammar production with a dot somewhere in its RHS.

  • For example, the production A → XYZ creates 4 items:

A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·

As a special case, the production A → ε only generates A → ·.

Each item represents a state that the parser is currently in.
If the parser is in the state corresponding to A → α·β, it means the parser has already pushed α onto the stack and expects to match β from the input.

If β starts with a nonterminal B and the grammar also has a production B → γ, then the parser also expects to see γ.

Thus, the parser state corresponding to A → α·Bβ also corresponds to B → ·γ.

Closure of items

Keeping this intuition in mind, we find the closure of a set of items I recursively using the following algorithm.

  • Every item in I is in CLOSURE(I).

  • If A → α·Bβ is an item in CLOSURE(I) and B → γ is a production, where A and B are nonterminals and α, β, and γ are any sequences of terminals and nonterminals, then add the item B → ·γ to CLOSURE(I).

  • Repeat the above two steps until CLOSURE(I) does not change.

  • For example, for the augmented expression grammar below,

CLOSURE({E′ → ·E}) = {E′ → ·E, E → ·E + T, E → ·T, T → ·T * F, T → ·F, F → ·(E), F → ·id}

Goto Function

Then, we define function.

For all the items of the form in a set of items , where is an arbitrary grammar symbol (token or nonterminal), we define as the closure of the set of items of the form .

  • For example, given the set of items , to find :

    • the item does not create any new item by taking "+";
    • the item implies the new item
  • Thus,

更新时,当一个token 时non-terminal 时,继续传递
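Both functions are easy to sketch in code. Here items are triples (head, body, dot position) over the augmented expression grammar; this representation is an assumption of the sketch.

```python
# Sketches of CLOSURE and GOTO over LR(0) items.
GRAMMAR = {
    "E'": [["E"]],                       # the artificial start production
    "E":  [["E", "+", "T"], ["T"]],
    "T":  [["T", "*", "F"], ["F"]],
    "F":  [["(", "E", ")"], ["id"]],
}

def closure(items):
    items = set(items)
    while True:
        new = set()
        for head, body, dot in items:
            if dot < len(body) and body[dot] in GRAMMAR:
                for rhs in GRAMMAR[body[dot]]:   # add B -> ·γ for each B-rule
                    new.add((body[dot], tuple(rhs), 0))
        if new <= items:                         # nothing changed: done
            return frozenset(items)
        items |= new

def goto(items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items
             if d < len(b) and b[d] == symbol}   # advance the dot over symbol
    return closure(moved)

I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
print(sorted(I1))  # [('E', ('E', '+', 'T'), 1), ("E'", ('E',), 1)]
```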

Set of items

Set of items construction algorithm

Next, we construct the DFA to decide the tokens in the stack.
This method is called the set-of-items construction algorithm.
Each set of items represents a state in the DFA.

  1. Compute and add the unmarked CLOSURE({S′ → ·S}) to C, where S′ is the start symbol of the augmented grammar and C is the set of DFA states.
  2. While there is an unmarked DFA state I in C, do the following:
    1. For each grammar symbol X, do the following:
      1. Compute the DFA state J = GOTO(I, X).
      2. If J ≠ ∅ and J ∉ C, then add the unmarked J to C.
      3. If J ≠ ∅, then define the DFA transition I → J on X.
    2. Mark I in C.
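The construction loop can be sketched as follows, reusing closure and goto from the previous snippet; the worklist style is an implementation assumption. The state numbering it produces matches the example below only up to renaming, but the collection is the same twelve sets.

```python
# A sketch of the set-of-items (canonical collection) construction.
def canonical_collection():
    symbols = {s for bodies in GRAMMAR.values() for b in bodies for s in b}
    start = closure({("E'", ("E",), 0)})
    states, worklist, transitions = [start], [start], {}
    while worklist:                      # unmarked states
        state = worklist.pop()
        for x in symbols:
            target = goto(state, x)
            if not target:               # GOTO is empty: no transition
                continue
            if target not in states:
                states.append(target)
                worklist.append(target)
            transitions[(states.index(state), x)] = states.index(target)
    return states, transitions

states, transitions = canonical_collection()
print(len(states))  # 12, matching I0 ... I11
```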

Example

For example, consider the augmented expression grammar

E′ → E
E → E + T | T
T → T * F | F
F → (E) | id

Initially,

I₀ = CLOSURE({E′ → ·E}) = {E′ → ·E, E → ·E + T, E → ·T, T → ·T * F, T → ·F, F → ·(E), F → ·id}

Then, we consider the GOTO function from I₀ by taking each grammar symbol.

  • For E: the items E′ → ·E and E → ·E + T have E right after the dot, producing E′ → E· and E → E·+T.

Thus,

GOTO(I₀, E) = I₁ = {E′ → E·, E → E·+T}

Similarly,

GOTO(I₀, T) = I₂, GOTO(I₀, F) = I₃, GOTO(I₀, () = I₄, GOTO(I₀, id) = I₅.

(Figure: the DFA so far, with transitions from I₀ on E, T, F, (, and id to I₁, I₂, I₃, I₄, and I₅, respectively.)

In the following, CLOSURE(·) will be omitted when writing GOTO on a symbol.
Consider the GOTO function for the states found so far:

GOTO(I₁, +) = I₆
GOTO(I₂, *) = I₇
GOTO(I₄, E) = I₈, GOTO(I₄, T) = I₂, GOTO(I₄, F) = I₃, GOTO(I₄, () = I₄, GOTO(I₄, id) = I₅

  • GOTO(I₃, X) and GOTO(I₅, X) are omitted because they are ∅ for all X.
(Figure: the DFA extended with the new states I₆, I₇, and I₈ and the transitions listed above.)

We continue computing the GOTO function from the new states:

GOTO(I₆, T) = I₉, GOTO(I₆, F) = I₃, GOTO(I₆, () = I₄, GOTO(I₆, id) = I₅
GOTO(I₇, F) = I₁₀, GOTO(I₇, () = I₄, GOTO(I₇, id) = I₅
GOTO(I₈, +) = I₆, GOTO(I₈, )) = I₁₁
(Figure: the DFA extended with I₉, I₁₀, and I₁₁.)

In the last iteration, GOTO(I₉, *) = I₇ and nothing new appears; I₁₀ and I₁₁ have no outgoing transitions. The construction halts.

(Figure: the complete DFA over the sets of items I₀ through I₁₁.)

Here is the list of the sets of items.

I₀ = {E′ → ·E, E → ·E + T, E → ·T, T → ·T * F, T → ·F, F → ·(E), F → ·id}
I₁ = {E′ → E·, E → E·+T}
I₂ = {E → T·, T → T·*F}
I₃ = {T → F·}
I₄ = {F → (·E), E → ·E + T, E → ·T, T → ·T * F, T → ·F, F → ·(E), F → ·id}
I₅ = {F → id·}
I₆ = {E → E+·T, T → ·T * F, T → ·F, F → ·(E), F → ·id}
I₇ = {T → T*·F, F → ·(E), F → ·id}
I₈ = {F → (E·), E → E·+T}
I₉ = {E → E + T·, T → T·*F}
I₁₀ = {T → T * F·}
I₁₁ = {F → (E)·}

Set of items construction algorithm

In the DFA we constructed above, we don't need final states, because we use the states to show the content of the stack, and the stack is reduced when the DFA reaches a state containing a completed item.
More interestingly, a state, together with the string consumed to reach it, forms a pair in the configuration.

From the DFA to the SLR parsing table

Parsing Table Construction

First, we construct the ACTION part of the parsing table.

  • If A → α·aβ (where a is a token) is an item in the set of items Iᵢ and GOTO(Iᵢ, a) = Iⱼ, then set ACTION[i, a] to "shift j".
  • If A → α· is an item in the set of items Iᵢ (with A ≠ S′), then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A).
  • If S′ → S· (where S′ is the start symbol of the augmented grammar) is an item in the set of items Iᵢ, then set ACTION[i, $] to accept.

For item 2, we give a number to each grammar production and put in the table "reduce" followed by the production's number.

Then, we construct the GOTO part.

  • If GOTO(Iᵢ, A) = Iⱼ, where A is a nonterminal, then set GOTO[i, A] to j.
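As a sketch, these rules translate directly into code. This reuses GRAMMAR, states, and transitions from the set-of-items sketch, and the lr_parse driver from earlier; the FOLLOW sets are written out by hand for this grammar, and a real implementation would detect conflicting entries rather than overwrite them.

```python
# A sketch of SLR table construction from the canonical collection.
FOLLOW = {"E": {"+", ")", "$"}, "T": {"+", "*", ")", "$"},
          "F": {"+", "*", ")", "$"}}

def build_slr(states, transitions):
    action, goto_tab = {}, {}
    for i, items in enumerate(states):
        for head, body, dot in items:
            if dot < len(body) and body[dot] not in GRAMMAR:
                a = body[dot]                        # rule 1: shift on token a
                action[(i, a)] = ("shift", transitions[(i, a)])
            elif dot == len(body) and head == "E'":
                action[(i, "$")] = ("accept",)       # rule 3: E' -> E·
            elif dot == len(body):
                for a in FOLLOW[head]:               # rule 2: reduce on FOLLOW
                    action[(i, a)] = ("reduce", head, len(body))
    for (i, x), j in transitions.items():
        if x in GRAMMAR:                             # GOTO part: nonterminals
            goto_tab[(i, x)] = j
    return action, goto_tab

action, goto_tab = build_slr(states, transitions)
print(lr_parse(["id", "+", "id", "*", "id"], action, goto_tab))  # True
```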

Summary

Here are the steps to create a parsing table for an SLR parser:

  • Eliminate any ambiguity from the grammar.
    • Unlike top-down parsing, there is no need to eliminate left recursion or to left factor the grammar.
  • Augment the grammar with a new start symbol.
  • Compute the sets of items for the grammar and build the corresponding DFA.
  • Number the grammar productions.
  • Compute the FIRST sets for all tokens and nonterminals.
  • Compute the FOLLOW sets for all nonterminals.
  • Use the sets of items, the DFA, and the FOLLOW sets to construct the parsing table.

Other Bottom-Up Parsing

SLR Parsing Table

Conflicts

In fact, SLR cannot avoid conflicts entirely.

Consider the following (unambiguous) grammar:

S → L = R
S → R
L → * R
L → id
R → L

Constructing the sets of items, one of the states is

I₂ = {S → L·= R, R → L·}

In addition, FOLLOW(R) contains the token "=".

Thus, the parsing table construction algorithm will put both "shift" and "reduce R → L" into the entry ACTION[2, =], which is a conflict.

Sometimes, when the current state is I₂ and the next token is "=", the parser should not reduce using R → L.

To eliminate the conflict, we analyze the above grammar.

Then, we can see that reduce(R → L) is not supposed to be performed in all the cases where the next input token is in FOLLOW(R).

  • If L appears on the LHS of =, reduce(R → L) is only needed when there is a * before L.
  • And if an L is on the RHS of =, this specific L can only be followed by $, but not =.

For example, if the input is * id = id, the parsing is as follows (the dot marks the parser's position; the rows where reduce(R → L) is genuinely needed are marked):

| Configuration | Action just performed |
|---|---|
| · * id = id | (start) |
| * · id = id | shift |
| * id · = id | shift |
| * L · = id | reduce(L → id) |
| * R · = id | reduce(R → L) (needed) |
| L · = id | reduce(L → * R) |
| L = · id | shift |
| L = id · | shift |
| L = L · | reduce(L → id) |
| L = R · | reduce(R → L) (needed) |
| S · | reduce(S → L = R) |

LR(1) Items

In addition to the LR(0) item, an LR(1) item consists of an LR(0) item followed by a token.

For example, [A → α·β, a].

This additional token does not do anything in most cases.

Only for items of the type [A → α·, a]: the parser reduces the stack using the rule A → α only if the input token is a.

Thus, a is a token in FOLLOW(A).

For the other tokens in FOLLOW(A), the parser does not reduce.

Therefore, one state in the DFA for LR(0) items is possibly split into multiple states in the DFA for LR(1) items.

For the previous grammar, the state which has R → L· is split into two states:

  • one has the items [R → L·, =] and [R → L·, $] (reached after a *),
  • and the other has only [R → L·, $] (reached after a =).

However, even when the parsing table is upgraded in this way, some grammars can still create shift-reduce conflicts.

Then, we can let each item be followed by a pair, the combination of two tokens, which results in an LR(2) item using two lookahead tokens.

This upgrading process can continue and define LR(k) grammars for larger k. But for this course, LR(1) is enough, and we stop here.

LALR(1) Items

Instead of trying to solve more problems but creating chaos, sometimes we want to ignore some problems and keep our life easier.
Thus, the LALR(1) item is introduced.

Consider two items [A → α·, a] and [A → α·, b] that differ only in the lookahead token.

  • An LALR(1) item merges these two items as [A → α·, a/b].
  • This undoes the splitting from LR(0) to LR(1).
  • The merging reduces the number of states and loses very little power.

Back to the example, [R → L·, =/$] is the LALR(1) item.