Chapter 3
Lexical Analysis
In this chapter we show how to construct a lexical analyzer. To implement a
lexical analyzer by hand, it helps to start with a diagram or other description for
the lexemes of each token. We can then write code to identify each occurrence of
each lexeme on the input and to return information about the token identified.
We can also produce a lexical analyzer automatically by specifying the lex-
eme patterns to a lexical-analyzer generator and compiling those patterns into
code that functions as a lexical analyzer. This approach makes it easier to mod-
ify a lexical analyzer, since we have only to rewrite the affected patterns, not
the entire program. It also speeds up the process of implementing the lexical
analyzer, since the programmer specifies the software at the very high level of
patterns and relies on the generator to produce the detailed code. We shall
introduce in Section 3.5 a lexical-analyzer generator called Lex (or Flex in a
more recent embodiment).
We begin the study of lexical-analyzer generators by introducing regular
expressions, a convenient notation for specifying lexeme patterns. We show
how this notation can be transformed, first into nondeterministic automata
and then into deterministic automata. The latter two notations can be used as
input to a "driver," that is, code which simulates these automata and uses them
as a guide to determining the next token. This driver and the specification of
the automaton form the nucleus of the lexical analyzer.
3.1 The Role of the Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to
read the input characters of the source program, group them into lexemes, and
produce as output a sequence of tokens for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis. It is common
for the lexical analyzer to interact with the symbol table as well. When the
lexical analyzer discovers a lexeme constituting an identifier, it needs to enter
that lexeme into the symbol table. In some cases, information regarding the
kind of identifier may be read from the symbol table by the lexical analyzer to
assist it in determining the proper token it must pass to the parser.
These interactions are suggested in Fig. 3.1. Commonly, the interaction is
implemented by having the parser call the lexical analyzer. The call, suggested
by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next
token, which it returns to the parser.
[Diagram: the source program enters the Lexical Analyzer, which passes tokens to the Parser and on to semantic analysis; the Parser issues getNextToken calls back to the Lexical Analyzer; both components consult the Symbol Table.]
Figure 3.1: Interactions between the lexical analyzer and the parser
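The calling pattern of Fig. 3.1 can be sketched as follows. This is a minimal illustration, not a full implementation: the two token classes (identifiers and single-character punctuation) and the method names are assumptions made for the example.

```python
# Sketch of the parser/lexer interaction of Fig. 3.1: the lexer
# exposes get_next_token(), and the parser pulls one token at a
# time until the input is exhausted.

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_next_token(self):
        # Skip whitespace between lexemes.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("eof", None)
        ch = self.text[self.pos]
        if ch.isalpha():
            # An identifier lexeme: letters and digits.
            start = self.pos
            while self.pos < len(self.text) and self.text[self.pos].isalnum():
                self.pos += 1
            return ("id", self.text[start:self.pos])  # lexeme as attribute
        self.pos += 1
        return ("punct", ch)

def tokenize(text):
    # Stand-in for the parser's loop: call get_next_token repeatedly.
    lexer, tokens = Lexer(text), []
    while True:
        tok = lexer.get_next_token()
        if tok[0] == "eof":
            return tokens
        tokens.append(tok)
```

For instance, `tokenize("x = y")` yields an id token for each identifier and a punctuation token for the `=`.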
Since the lexical analyzer is the part of the compiler that reads the source
text, it may perform certain other tasks besides identification of lexemes. One
such task is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input). Another
task is correlating error messages generated by the compiler with the source
program. For instance, the lexical analyzer may keep track of the number
of newline characters seen, so it can associate a line number with each error
message. In some compilers, the lexical analyzer makes a copy of the source
program with the error messages inserted at the appropriate positions. If the
source program uses a macro-preprocessor, the expansion of macros may also
be performed by the lexical analyzer.
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization
of the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis proper is the more complex portion, which produces to-
kens from the output of the scanner.
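The first stage of such a cascade can be sketched as below. The comment syntax (`//` to end of line) is an assumption for illustration; the sketch also tracks newlines, as suggested earlier, so line numbers survive into error messages.

```python
def scan(source):
    # First-stage "scanner" of the two-process cascade: delete
    # comments and compact each run of whitespace into one blank,
    # counting newlines along the way for error reporting.
    out, i, line = [], 0, 1
    while i < len(source):
        if source.startswith("//", i):        # delete a comment
            while i < len(source) and source[i] != "\n":
                i += 1
        elif source[i].isspace():             # compact whitespace
            while i < len(source) and source[i].isspace():
                if source[i] == "\n":
                    line += 1                 # track line numbers
                i += 1
            out.append(" ")
        else:
            out.append(source[i])
            i += 1
    return "".join(out), line
```

The lexical analyzer proper would then produce tokens from the returned string.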
3.1.1 Lexical Analysis Versus Parsing
There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation
of lexical and syntactic analysis often allows us to simplify at least one
of these tasks. For example, a parser that had to deal with comments
and whitespace as syntactic units would be considerably more complex
than one that can assume comments and whitespace have already been
removed by the lexical analyzer. If we are designing a new language,
separating lexical and syntactic concerns can lead to a cleaner overall
language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to
apply specialized techniques that serve only the lexical task, not the job
of parsing. In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can
be restricted to the lexical analyzer.
3.1.2 Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:
• A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
• A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.
Example 3.1: Figure 3.2 gives some typical tokens, their informally described
patterns, and some sample lexemes. To see how these concepts are used in
practice, in the C statement
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for token id, and
"Total = %d\n" is a lexeme matching literal. □
In many programming languages, the following classes cover most or all of
the tokens:
TOKEN         INFORMAL DESCRIPTION                     SAMPLE LEXEMES
if            characters i, f                          if
else          characters e, l, s, e                    else
comparison    < or > or <= or >= or == or !=           <=, !=
id            letter followed by letters and digits    pi, score, D2
number        any numeric constant                     3.14159, 0, 6.02e23
literal       anything but ", surrounded by "'s        "core dumped"
Figure 3.2: Examples of tokens
1. One token for each keyword. The pattern for a keyword is the same as
the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison mentioned in Fig. 3.2.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal
strings.
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
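The five classes can be sketched as a small specification table. The patterns below are illustrative stand-ins written with Python's re module, not the regular-expression notation introduced later in this chapter; the keyword and punctuation sets are assumptions for the example.

```python
import re

# One pattern per token class: keywords, an operator class,
# identifiers, numeric constants, and punctuation, mirroring
# the token names of Fig. 3.2.
TOKEN_SPEC = [
    ("if",         r"if"),
    ("else",       r"else"),
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("number",     r"\d+(\.\d+)?([eE][+-]?\d+)?"),
    ("id",         r"[A-Za-z][A-Za-z0-9]*"),
    ("punct",      r"[(),;]"),
]

def classify(lexeme):
    # Return the first token name whose pattern matches the
    # whole lexeme; keywords are listed before id so that "if"
    # is a keyword, not an identifier.
    for name, pattern in TOKEN_SPEC:
        if re.fullmatch(pattern, lexeme):
            return name
    return None
```

For example, `classify("pi")` falls through the keyword patterns and matches id, while `classify("6.02e23")` matches number.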
3.1.3 Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the par-
ticular lexeme that matched. For example, the pattern for token number
matches both 0 and 1, but it is extremely important for the code generator to
know which lexeme was found in the source program. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an attribute
value that describes the lexeme represented by the token; the token name in-
fluences parsing decisions, while the attribute value influences translation of
tokens after the parse.
We shall assume that tokens have at most one associated attribute, although
this attribute may have a structure that combines several pieces of information.
The most important example is the token id, where we need to associate with
the token a great deal of information. Normally, information about an identi-
fier (e.g., its lexeme, its type, and the location at which it is first found, in
case an error message about that identifier must be issued) is kept in the
symbol table. Thus, the appropriate attribute value for an identifier is a pointer
to the symbol-table entry for that identifier.
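This convention can be sketched as follows: the lexer returns ⟨id, pointer⟩ pairs, where the "pointer" is an index into a symbol table. The entry layout (lexeme plus the line where it was first found) is a minimal assumption; a real entry would also record the type and other information.

```python
class SymbolTable:
    # Minimal symbol table: each distinct identifier gets one
    # entry, and the token's attribute value is the index of
    # that entry.
    def __init__(self):
        self.entries = []   # one dict per identifier
        self.index = {}     # lexeme -> entry position

    def enter(self, lexeme, line):
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            # Record the lexeme and where it was first found,
            # for use in later error messages.
            self.entries.append({"lexeme": lexeme, "first_line": line})
        return self.index[lexeme]

def id_token(table, lexeme, line):
    # Token for an identifier: name "id", attribute = pointer
    # (index) to its symbol-table entry.
    return ("id", table.enter(lexeme, line))
```

Repeated occurrences of the same identifier yield tokens carrying the same pointer, so later phases see one shared entry per name.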