Chapter 1
Introduction
Programming languages are notations for describing computations to people
and to machines. The world as we know it depends on programming languages,
because all the software running on all the computers was written in some
programming language. But, before a program can be run, it rst must be
translated into a form in which it can be executed by a computer.
The software systems that do this translation are called compilers.
This book is about how to design and implement compilers. We shall dis-
cover that a few basic ideas can be used to construct translators for a wide
variety of languages and machines. Besides compilers, the principles and tech-
niques for compiler design are applicable to so many other domains that they
are likely to be reused many times in the career of a computer scientist. The
study of compiler writing touches upon programming languages, machine ar-
chitecture, language theory, algorithms, and software engineering.
In this preliminary chapter, we introduce the di erent forms of language
translators, give a high level overview of the structure of a typical compiler,
and discuss the trends in programming languages and machine architecture
that are shaping compilers. We include some observations on the relationship
between compiler design and computer-science theory and an outline of the
applications of compiler technology that go beyond compilation. We end with
a brief outline of key programming-language concepts that will be needed for
our study of compilers.
1.1 Language Processors
Simply stated, a compiler is a program that can read a program in one lan-
guage | the source language | and translate it into an equivalent program in
another language | the target language; see Fig. 1.1. An important role of the
compiler is to report any errors in the source program that it detects during
the translation process.
1
,2 CHAPTER 1. INTRODUCTION
source program
Compiler
target program
Figure 1.1: A compiler
If the target program is an executable machine-language program, it can
then be called by the user to process inputs and produce outputs; see Fig. 1.2.
input Target Program output
Figure 1.2: Running the target program
An interpreter is another common kind of language processor. Instead of
producing a target program as a translation, an interpreter appears to directly
execute the operations speci ed in the source program on inputs supplied by
the user, as shown in Fig. 1.3.
source program
Interpreter output
input
Figure 1.3: An interpreter
The machine-language target program produced by a compiler is usually
much faster than an interpreter at mapping inputs to outputs . An interpreter,
however, can usually give better error diagnostics than a compiler, because it
executes the source program statement by statement.
Example 1.1: Java language processors combine compilation and interpreta-
tion, as shown in Fig. 1.4. A Java source program may rst be compiled into
an intermediate form called bytecodes. The bytecodes are then interpreted by a
virtual machine. A bene t of this arrangement is that bytecodes compiled on
one machine can be interpreted on another machine, perhaps across a network.
In order to achieve faster processing of inputs to outputs, some Java compil-
ers, called just-in-time compilers, translate the bytecodes into machine language
immediately before they run the intermediate program to process the input. 2
,1.1. LANGUAGE PROCESSORS 3
source program
Translator
intermediate program Virtual
Machine output
input
Figure 1.4: A hybrid compiler
In addition to a compiler, several other programs may be required to create
an executable target program, as shown in Fig. 1.5. A source program may be
divided into modules stored in separate les. The task of collecting the source
program is sometimes entrusted to a separate program, called a preprocessor.
The preprocessor may also expand shorthands, called macros, into source lan-
guage statements.
The modi ed source program is then fed to a compiler. The compiler may
produce an assembly-language program as its output, because assembly lan-
guage is easier to produce as output and is easier to debug. The assembly
language is then processed by a program called an assembler that produces
relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine
code may have to be linked together with other relocatable object les and
library les into the code that actually runs on the machine. The linker resolves
external memory addresses, where the code in one le may refer to a location
in another le. The loader then puts together all of the executable object les
into memory for execution.
1.1.1 Exercises for Section 1.1
Exercise 1.1.1: What is the di erence between a compiler and an interpreter?
Exercise 1.1.2: What are the advantages of (a) a compiler over an interpreter
(b) an interpreter over a compiler?
Exercise 1.1.3: What advantages are there to a language-processing system in
which the compiler produces assembly language rather than machine language?
Exercise 1.1.4: A compiler that translates a high-level language into another
high-level language is called a source-to-source translator. What advantages are
there to using C as a target language for a compiler?
Exercise 1.1.5: Describe some of the tasks that an assembler needs to per-
form.
, 4 CHAPTER 1. INTRODUCTION
source program
Preprocessor
modi ed source program
Compiler
target assembly program
Assembler
relocatable machine code
Linker/Loader library les
relocatable object les
target machine code
Figure 1.5: A language-processing system
1.2 The Structure of a Compiler
Up to this point we have treated a compiler as a single box that maps a source
program into a semantically equivalent target program. If we open up this box
a little, we see that there are two parts to this mapping: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces and
imposes a grammatical structure on them. It then uses this structure to cre-
ate an intermediate representation of the source program. If the analysis part
detects that the source program is either syntactically ill formed or semanti-
cally unsound, then it must provide informative messages, so the user can take
corrective action. The analysis part also collects information about the source
program and stores it in a data structure called a symbol table, which is passed
along with the intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the interme-
diate representation and the information in the symbol table. The analysis part
is often called the front end of the compiler; the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates
as a sequence of phases, each of which transforms one representation of the
source program to another. A typical decomposition of a compiler into phases
is shown in Fig. 1.6. In practice, several phases may be grouped together,
and the intermediate representations between the grouped phases need not be
constructed explicitly. The symbol table, which stores information about the