Chapter 6
Intermediate-Code
Generation
In the analysis-synthesis model of a compiler, the front end analyzes a source
program and creates an intermediate representation, from which the back end
generates target code. Ideally, details of the source language are con ned to the
front end, and details of the target machine to the back end. With a suitably
de ned intermediate representation, a compiler for language i and machine j
can then be built by combining the front end for language i with the back
end for machine j . This approach to creating suite of compilers can save a
considerable amount of e ort: m n compilers can be built by writing just m
front ends and n back ends.
This chapter deals with intermediate representations, static type checking,
and intermediate code generation. For simplicity, we assume that a com-
piler front end is organized as in Fig. 6.1, where parsing, static checking, and
intermediate-code generation are done sequentially; sometimes they can be com-
bined and folded into parsing. We shall use the syntax-directed formalisms of
Chapters 2 and 5 to specify checking and translation. Many of the translation
schemes can be implemented during either bottom-up or top-down parsing, us-
ing the techniques of Chapter 5. All schemes can be implemented by creating
a syntax tree and then walking the tree.
Static Intermediate intermediate Code
Parser Checker Code code Generator
Generator
front end back end
Figure 6.1: Logical structure of a compiler front end
Static checking includes type checking, which ensures that operators are ap-
plied to compatible operands. It also includes any syntactic checks that remain
357
,358 CHAPTER 6. INTERMEDIATE-CODE GENERATION
after parsing. For example, static checking assures that a break-statement in
C is enclosed within a while-, for-, or switch-statement; an error is reported if
such an enclosing statement does not exist.
The approach in this chapter can be used for a wide range of intermediate
representations, including syntax trees and three-address code, both of which
were introduced in Section 2.8. The term \three-address code" comes from
instructions of the general form x = y op z with three addresses: two for the
operands y and z and one for the result x.
In the process of translating a program in a given source language into code
for a given target machine, a compiler may construct a sequence of intermediate
representations, as in Fig. 6.2. High-level representations are close to the source
language and low-level representations are close to the target machine. Syntax
trees are high level; they depict the natural hierarchical structure of the source
program and are well suited to tasks like static type checking.
Source High Level Low Level Target
Program Intermediate Intermediate Code
Representation Representation
Figure 6.2: A compiler might use a sequence of intermediate representations
A low-level representation is suitable for machine-dependent tasks like reg-
ister allocation and instruction selection. Three-address code can range from
high- to low-level, depending on the choice of operators. For expressions, the
di erences between syntax trees and three-address code are super cial, as we
shall see in Section 6.2.3. For looping statements, for example, a syntax tree
represents the components of a statement, whereas three-address code contains
labels and jump instructions to represent the ow of control, as in machine
language.
The choice or design of an intermediate representation varies from compiler
to compiler. An intermediate representation may either be an actual language
or it may consist of internal data structures that are shared by phases of the
compiler. C is a programming language, yet it is often used as an intermediate
form because it is exible, it compiles into ecient machine code, and its com-
pilers are widely available. The original C++ compiler consisted of a front end
that generated C, treating a C compiler as a back end.
6.1 Variants of Syntax Trees
Nodes in a syntax tree represent constructs in the source program; the children
of a node represent the meaningful components of a construct. A directed
acyclic graph (hereafter called a DAG ) for an expression identi es the common
subexpressions (subexpressions that occur more than once) of the expression.
As we shall see in this section, DAG's can be constructed by using the same
techniques that construct syntax trees.
,6.1. VARIANTS OF SYNTAX TREES 359
6.1.1 Directed Acyclic Graphs for Expressions
Like the syntax tree for an expression, a DAG has leaves corresponding to
atomic operands and interior nodes corresponding to operators. The di erence
is that a node N in a DAG has more than one parent if N represents a com-
mon subexpression; in a syntax tree, the tree for the common subexpression
would be replicated as many times as the subexpression appears in the original
expression. Thus, a DAG not only represents expressions more succinctly, it
gives the compiler important clues regarding the generation of ecient code to
evaluate the expressions.
Example 6.1: Figure 6.3 shows the DAG for the expression
a + a * (b - c) + (b - c) * d
The leaf for a has two parents, because a appears twice in the expression.
More interestingly, the two occurrences of the common subexpression b-c are
represented by one node, the node labeled ,. That node has two parents,
representing its two uses in the subexpressions a*(b-c) and (b-c)*d. Even
though b and c appear twice in the complete expression, their nodes each have
one parent, since both uses are in the common subexpression b-c. 2
+
+
d
a ,
b c
Figure 6.3: Dag for the expression a + a * (b - c) + (b - c) * d
The SDD of Fig. 6.4 can construct either syntax trees or DAG's. It was
used to construct syntax trees in Example 5.11, where functions Leaf and Node
created a fresh node each time they were called. It will construct a DAG if,
before creating a new node, these functions rst check whether an identical node
already exists. If a previously created identical node exists, the existing node
is returned. For instance, before constructing a new node, Node(op; left; right),
we check whether there is already a node with label op, and children left and
right, in that order. If so, Node returns the existing node; otherwise, it creates
a new node.
Example 6.2: The sequence of steps shown in Fig. 6.5 constructs the DAG
in Fig. 6.3, provided Node and Leaf return an existing node, if possible, as
, 360 CHAPTER 6. INTERMEDIATE-CODE GENERATION
PRODUCTION SEMANTIC RULES
1) E ! E1 + T E:node = new Node(0 +0 ; E1 :node; T:node)
2) E ! E1 , T E:node = new Node(0 ,0 ; E1 :node; T:node)
3) E ! T E:node = T:node
4) T ! ( E ) T:node = E:node
5) T ! id T:node = new Leaf (id; id:entry)
6) T ! num T:node = new Leaf (num; num:val)
Figure 6.4: Syntax-directed de nition to produce syntax trees or DAG's
1) p1 = Leaf (id; entry-a)
2) p2 = Leaf (id; entry-a) = p1
3) p3 = Leaf (id; entry-b)
4) p4 = Leaf (id; entry-c)
5) p5 = Node(0 ,0 ; p3 ; p4 )
6) p6 = Node(0 0 ; p1 ; p5 )
7) p7 = Node(0 +0 ; p1 ; p6 )
8) p8 = Leaf (id; entry-b) = p3
9) p9 = Leaf (id; entry-c) = p4
10) p10 = Node(0 ,0 ; p3 ; p4 ) = p5
11) p11 = Leaf (id; entry-d)
12) p12 = Node(0 0 ; p5 ; p11 )
13) p13 = Node(0 +0 ; p7 ; p12 )
Figure 6.5: Steps for constructing the DAG of Fig. 6.3
discussed above. We assume that entry-a points to the symbol-table entry for
a, and similarly for the other identi ers.
When the call to Leaf (id; entry-a) is repeated at step 2, the node created
by the previous call is returned, so p2 = p1 . Similarly, the nodes returned at
steps 8 and 9 are the same as those returned at steps 3 and 4 (i.e., p8 = p3
and p9 = p4 ). Hence the node returned at step 10 must be the same at that
returned at step 5; i.e., p10 = p5 . 2
6.1.2 The Value-Number Method for Constructing DAG's
Often, the nodes of a syntax tree or DAG are stored in an array of records, as
suggested by Fig. 6.6. Each row of the array represents one record, and therefore
one node. In each record, the rst eld is an operation code, indicating the label
of the node. In Fig. 6.6(b), leaves have one additional eld, which holds the
lexical value (either a symbol-table pointer or a constant, in this case), and