

The Column that Needs a Name

Dan Saks

Parsing C++ Declarations, Part 2

You can gain lots of insight into C++ by writing a program to analyze C++ source code. Dan constructs such a program and kicks off his C++ parser project.



This is the second of two articles that describe the design and implementation of a program that translates C++ declarations into English. In the first part (CUJ, February 1996), I explained the program's general behavior and its high-level design. I also described in detail the interface to the character scanning functions. This month, I'll show the implementation of the scanning functions and, of course, the design and implementation of the parser itself.

A brief recap is probably in order. My decl program reads a sequence of zero or more C++ declarations from the standard input. It checks each declaration for proper syntax, and if the syntax is correct, writes the English translation of the declaration to the standard output. For example, given the declaration:

const char *a[10];

decl responds with:

a is ...
array with 10 elements of type ...
pointer to ...
const char

and given the input:

int T::*pm;

decl responds with:

pm is ...
pointer to member of T with type ...
int


decl recognizes object and function declarations that satisfy the syntax for simple-declaration, as defined by the EBNF in Table 1. EBNF is the Extended Backus-Naur Form, which I described several months ago (see "A Sensible Grammar Notation," CUJ, October 1995). The EBNF grammar for the corresponding tokens appears in Table 2. Since I'm using decl to illustrate how declarations synthesize types, the grammar omits parts of genuine C++ declarations, such as initializers and storage-class-specifiers, that do not participate in forming types.

decl imposes a few other simplifying restrictions on its input. It accepts type keywords, such as char, int, and double, but not adjectives, such as long, short, signed, and unsigned. It also accepts user-defined type names; however, type names must begin with an uppercase letter (see the rule for name in Table 2).

For example,

X f();

is okay, but

x f();

is not, because the type-specifier x begins with a lowercase letter. As I explained in Part 1, the case of the first letter in an identifier gives decl a simple way to distinguish names (previously-declared identifiers) from other (as yet undeclared) identifiers. This avoids the need to maintain a symbol table.

decl does not accept a parameter list between the parentheses of a function declarator. (The rule for function-suffix in Table 1 refers to the non-terminal parameter-clause, but the grammar never defines parameter-clause.) decl does accept an array dimension in an array declarator, but the dimension must be a single integer, as in a[10], or a single name, as in a[N]. The dimension cannot be a full-blown expression, as it can be in real C++. Again, the name in the dimension, if any, must begin with an uppercase letter.

decl is built from two primary components, a scanner and a parser. The scanner is a stream-like class that reads characters from the input, filters out the whitespace characters, and partitions the remaining characters into tokens. The parser acquires tokens from the scanner, pieces them into declarations, and translates them into English.

Listing 1 shows scanner.h, the header that defines the scanner and token classes. The scanner's public interface consists of four functions. The constructor attaches a scanner to an istream. get returns the next token scanned from that istream. unget "pushes" the current token back into the scanner (to be returned by a future call to get), and sets the current token to its prior value. current simply returns the current token.

A token object is a pair of values. The token class has public member functions for inspecting those values: text returns the token as a string object, and kind returns an enumeration that categorizes the token. The public nested type token::category enumerates the different categories.

The token class has a public default constructor so that the parser can declare tokens to hold objects returned from scanner::get. It also has a copy constructor and a copy assignment operator, both compiler-generated and public.

The token class has a third constructor

token(category, const string &);

for manufacturing new token objects. This constructor is only for the scanner's use. Therefore, the token class declares the constructor private and then grants the scanner friendship, just so the scanner can access that one private constructor.

scanner.h also defines one global function, image, which converts a value of type token::category to a null-terminated character string, primarily for display purposes.

scanner.h in Listing 1 is identical to the one from Part 1 except for one detail previously missing from the scanner class. The scanner now explicitly declares a copy constructor:

scanner(const scanner &);

and a copy assignment operator:

scanner &operator=(const scanner &);

as private members. There's a risk that the compiler might generate these functions, and in so doing, sneak some nasty surprises into my program. Since the decl program doesn't need these copy operations, the safest course is to stifle the compiler's urge to create them.

In this particular instance, the private declarations aren't entirely necessary. The scanner class has a data member in_stream of type istream &. C++ will not generate a copy assignment operator for any class that has a data member of reference type. (It will still generate a copy constructor, which simply copies the reference.) However, I think it's better style to disable both copy operations explicitly, rather than rely on such a subtle rule.
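Putting the whole description together, scanner.h comes out something like the following sketch. Listing 1 is the authoritative version; the enumerator names that the text doesn't mention (IDENTIFIER, TYPE_KEYWORD, CV_QUALIFIER, COLON_COLON, END_OF_FILE) and the private member layout are my guesses, and I've used standard headers and a using directive so the sketch compiles today:

#include <iostream>
#include <string>
using namespace std;

class scanner;

class token
    {
public:
    enum category
        {
        NO_VALUE, NO_SUCH, INT_LITERAL, IDENTIFIER, NAME,
        TYPE_KEYWORD, CV_QUALIFIER, COLON_COLON, END_OF_FILE,
        // each one-character operator uses its own character
        // value as its category (see Part 1):
        COMMA = ',', LEFT_PAREN = '(', RIGHT_PAREN = ')',
        LEFT_BRACKET = '[', RIGHT_BRACKET = ']',
        AMPERSAND = '&', STAR = '*'
        };
    token() : kind_(NO_VALUE) { }
    category kind() const { return kind_; }
    string text() const { return text_; }
private:
    token(category k, const string &t) : kind_(k), text_(t) { }
    friend class scanner;       // for the private constructor
    category kind_;
    string text_;
    };

const char *image(token::category);

class scanner
    {
public:
    scanner(istream &s) : in_stream(s) { }
    token get();                // scan and return the next token
    void unget();               // push the current token back
    token current() const { return current_token; }
private:
    scanner(const scanner &);               // not copyable
    scanner &operator=(const scanner &);    // not assignable
    token scan();               // does the real work
    istream &in_stream;
    token previous_token, current_token, next_token;
    };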

Token Look Ahead

Listing 2 contains scanner.cpp, which defines the non-inline member functions and static data members for the scanner class. The private member scan does the bulk of the work. The get and unget functions add a thin layer of code to support one token lookahead. Let's look at get and unget first.

In nearly all cases, decl can make its parsing decisions based on the category of the current token. However, there is one place where the parser needs to see the token beyond the current one before it can make up its mind about the current token. In this case, the parser gets the next token, peeks at it, and then ungets it (puts it back). The parser can then finish processing the current token before getting the next token for keeps.

The decl parser needs to look ahead at most one token. I opted for a simple implementation that meets that requirement, rather than a more general scheme supporting further lookahead. The scanner has three private token objects: previous_token, current_token, and next_token. The scanner constructor (Listing 1) initializes all of these to the token::NO_VALUE category by default.

If the parser never calls unget, then the scanner never uses next_token, and next_token is always the NO_VALUE token. Thus, in the "normal" case, get (Listing 2) skips to its else-part, which simply copies current_token to previous_token, and scans a new current_token from the input stream.

Calling unget backs up the scanner by one token. It copies current_token to next_token, and copies previous_token to current_token. Thus, the next call to get won't scan the input stream, but rather will use the previously "ungot" token sitting in next_token as the current token.
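In code, get and unget come out something like this sketch. Listing 2 has the exact definitions; I'm assuming, as the text implies, that a next_token whose category is NO_VALUE means "nothing ungot":

token scanner::get()
    {
    previous_token = current_token;
    if (next_token.kind() != token::NO_VALUE)
        {
        // reuse the token that unget pushed back
        current_token = next_token;
        next_token = token();       // back to the NO_VALUE state
        }
    else
        current_token = scan();     // read from the input stream
    return current_token;
    }

void scanner::unget()
    {
    next_token = current_token;         // for the next call to get
    current_token = previous_token;     // restore the prior token
    }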

This unget mechanism is adequate to let the parser look ahead one token, but no more. In particular, the mechanism does not work properly if the parser calls unget twice without an intervening call to get. Although I probably should build in some defense against consecutive unget calls, I don't want to get sidetracked on that right now. I may return to this issue in a future article.

Scanning Characters

The scan function does most of the scanner's work. It reads the characters one at a time from in_stream (the scanner's private istream object), pieces them into strings, and categorizes the resulting token.

As I explained in Part 1, not all the productions (grammar rules) in Table 2 represent complete tokens. A production in Table 2 represents a token only if its left-hand-side also appears on the right-hand-side of a production in Table 1. Thus, integer-literal, identifier, name, type-keyword, and cv-qualifier are token categories. The other productions — digit, letter, lowercase-letter, and uppercase-letter — define sets of characters used in composing tokens.

The tokens in Table 2 do not cover all the token categories. Each of the operators appearing as a quoted string in Table 1, namely

,  (  )  [  ]  &  ::  *

is also its own category, as is end-of-file.

Tokens may be separated by whitespace characters, so the scan function begins by reading and skipping characters until it finds a non-space character. Then it selects a preliminary category for the token based on that first non-space character. For example, the only tokens that can begin with a digit are integer-literals, and the only tokens that can begin with an uppercase letter are names. Therefore, the algorithm of the scan function looks something like:

c = the first non-space character
if (c is a digit)
    scan an integer-literal
else if (c is an uppercase letter)
    scan a name
else ...


All of the keywords (type-keywords and cv-qualifiers) match the rule for identifiers. A token with the structure of an identifier is truly an identifier only if it does not match one of the keywords. The scanner distinguishes keywords by table lookup. That is, it scans a string that begins with a lowercase letter as if it were an identifier, and then hunts for that string in the keyword table to see if it's really a keyword. Thus, the trailing else ... in the scan algorithm above is really

else if (c is a lowercase letter)
    scan an identifier or keyword
else ...
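The keyword lookup itself might look something like this sketch. The real table is a static data member of the scanner defined in Listing 2; the enumerator names TYPE_KEYWORD, CV_QUALIFIER, and IDENTIFIER are my guesses:

struct keyword
    {
    const char *text;
    token::category kind;
    };

static const keyword keywords[] =
    {
    { "char",     token::TYPE_KEYWORD },
    { "int",      token::TYPE_KEYWORD },
    { "double",   token::TYPE_KEYWORD },
    { "const",    token::CV_QUALIFIER },
    { "volatile", token::CV_QUALIFIER },
    };

const int n_keywords = sizeof keywords / sizeof keywords[0];

// after collecting a string with the form of an identifier,
// scan calls something like this to categorize it:
token::category identifier_kind(const string &text)
    {
    for (int i = 0; i < n_keywords; ++i)
        if (text == keywords[i].text)
            return keywords[i].kind;
    return token::IDENTIFIER;
    }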


The istream class offers a choice of functions and operators for reading characters from an istream. scan uses

int istream::get();

which behaves much like fgetc and getchar from <stdio.h>. That is, if it successfully reads a character, get returns that character as an int; otherwise it returns EOF.

Using istream::get and the standard isdigit function, a lexical rule such as

integer-literal =
    digit { digit } .

translates fairly directly into C++ code:

if (isdigit(c))
    {
    // set the category to integer-literal
    // do something with c
    c = in_stream.get();
    while (isdigit(c))
        {
        // do something with c
        c = in_stream.get();
        }
    }


The loop reads one character too many. It can't tell that it has read an entire integer-literal until it has read a character that is not a digit. The scanner must therefore put that character back into the input stream so that the next call to scan will resume scanning at the first character after the integer-literal. If it did not put that character back, the next call to scan would blindly get another character, and thus ignore the character after the integer-literal.

The draft C++ Standard offers a choice of functions for putting characters back into an input stream.

istream &istream::unget();

simply backs an istream up by one character without altering the contents of the istream.

istream &istream::putback(char c);

puts any character c back into the istream, not necessarily a character that was actually read from that istream. For this application, unget is more appropriate. However, neither of the compilers that I used to test this program (Borland C++ 4.5 and Watcom C++ 10.5) has an istream with an unget member. Therefore, I had to use putback.

In the code above, "do something with c" simply tacks character c onto the end of the string that is the text image of the token. You can append a single character to a string using

string &string::operator+=(char);

Inside the scan function, c has type int. Although there's a standard conversion from int to char, the compilers I used complained of ambiguities, so I added a cast:

text += char(c);

I don't mind using a cast here; it's probably better style anyway. I only wish I could have used the new cast notation:

text += static_cast<char>(c);

However, of the compilers I used, only Borland supports this notation. I prefer the comfort of compiling my code with more than one compiler, so I stayed with the old-style cast.

Combining all these pieces, the code to scan an integer-literal looks like:

if (isdigit(c))
    {
    kind = token::INT_LITERAL;
    text += char(c);
    c = in_stream.get();
    while (isdigit(c))
        {
        text += char(c);
        c = in_stream.get();
        }
    in_stream.putback(c);
    }

In this code, the statements inside the loop also appear just above the loop. Therefore, I did a little hand optimization, converting the while loop to a do-while loop. The resulting code appears in Listing 2.
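The reworked fragment comes out something like this:

if (isdigit(c))
    {
    kind = token::INT_LITERAL;
    do
        {
        text += char(c);
        c = in_stream.get();
        } while (isdigit(c));
    in_stream.putback(c);
    }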

It turns out that the code to scan an identifier and the code to scan a name share a fair amount of common code. Rather than implement them as distinct paths through the scan function, I folded them into one, as shown in Listing 2.

If scan determines that the first non-space character is neither a letter nor a digit, then it checks for operators. If the character is a colon, then scan gets another character in the hope of finding another colon. If it finds one, the token is the :: operator. If that next character is not a colon, the scanner puts it back and categorizes the token as NO_SUCH (an unrecognizable token).
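That branch of scan might look something like this sketch, where COLON_COLON is my name for the "::" category:

else if (c == ':')
    {
    text += char(c);
    if ((c = in_stream.get()) == ':')
        {
        text += char(c);
        kind = token::COLON_COLON;
        }
    else
        {
        in_stream.putback(c);       // a lone ':' is not a token
        kind = token::NO_SUCH;
        }
    }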

If the first non-space character is not a colon, scan checks to see if it's some other operator. scan uses

strchr("*&[]();,", c)

to test whether c is a member of the set of operators. strchr returns a non-null pointer if character c appears somewhere in the string literal, and a null pointer otherwise. If c is indeed one of the operators, scan computes its token category by simply casting c to the category type using:

kind = token::category(c);

As I explained in Part 1, this little trick works because type token::category (Listing 1) explicitly defines the category for each operator with the value of the character for that operator. Here again, I'd prefer to use a static_cast.

If the first non-whitespace character is not an operator, then scan checks it against EOF. If the character is not EOF either, then it begins an unrecognizable token, and scan sets the category to NO_SUCH. At the very end, scan uses the private token constructor to glue the string and category into a single token object, which it returns.
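All told, the skeleton of scan looks something like the following sketch, with the fragments shown earlier filling in the elided branches. Listing 2 is the real thing; END_OF_FILE is my name for the EOF category:

#include <cctype>
#include <cstdio>
#include <cstring>

token scanner::scan()
    {
    token::category kind = token::NO_SUCH;
    string text;
    int c = in_stream.get();
    while (isspace(c))
        c = in_stream.get();            // skip whitespace
    if (isdigit(c))
        ;                               // scan an integer-literal
    else if (isupper(c))
        ;                               // scan a name
    else if (islower(c))
        ;                               // scan an identifier or keyword
    else if (c == ':')
        ;                               // scan "::"
    else if (c != EOF && strchr("*&[]();,", c) != 0)
        {
        text += char(c);
        kind = token::category(c);      // the character is the category
        }
    else if (c == EOF)
        kind = token::END_OF_FILE;
    return token(kind, text);           // the private constructor
    }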

Parsing Functions

The parser is the main part of the program. It acquires tokens from the scanner, verifies that they form syntactically correct declarations, translates the declarations into English, and writes the translations to the output.

The parser shares many similarities with the scanner. The rules for transforming syntax rules into parsing functions are essentially the same as the rules for transforming lexical rules into scanning functions. The key difference is that the parser gets tokens from a scanner object while the scanner gets characters from an istream. The parser calls its scanner object input.

There are more syntax rules than lexical rules, and the syntax rules are a bit more complicated, so the parser is larger than the scanner. Whereas the lexical rules map into different alternatives in a cascade of if-else statements, each syntax rule maps more-or-less into a distinct parsing function.

I say "more-or-less" because some productions are so simple they don't merit their own parsing function. Often, folding several simple productions into a single production reduces the overall complexity of the parser by reducing the number of parsing functions.

For example, you can replace constant-expression in

array-suffix =
    "[" [ constant-expression ] "]" .

with the right-hand-side of

constant-expression =
    name | integer-literal .

to yield the single production

array-suffix =
    "[" [ name | integer-literal ] "]" .


Similarly, you can fold both

simple-declaration =
    decl-specifier-seq declarator-list .

and

declarator-list =
    declarator { "," declarator } .

into just one rule:

simple-declaration =
    decl-specifier-seq declarator { "," declarator } .


Recall that, by the time the scanner reaches the first of the if statements corresponding to a lexical rule, the first character of the token is already available. Similarly, by the time the parser enters the body of a given parsing function, the first token of the corresponding rule is already available. Therefore, a parsing function doesn't start by getting a token; it starts by acting on the current token.

For example, the body of the parsing function for

array-suffix =
    "[" [ name | integer-literal ] "]" .

begins with something like

token t = input.current();
if (t.kind() == token::LEFT_BRACKET)
    input.get();
else
    error("'[' expected");

In other words, the parser is expecting a [ at this point. If it sees one, it just goes on to the next token. If not, it reports an error. Tests of this sort occur so often in the parser that I wrote a function, must_be, to do the job.
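must_be looks something like this sketch. Listing 3 has the real one; I'm assuming error reports its message and stops the program, since this version of decl makes no attempt to recover:

void parser::must_be(token::category k)
    {
    if (input.current().kind() != k)
        {
        string msg = string("'") + image(k) + "' expected";
        error(msg.c_str());             // report and stop
        }
    input.get();                        // accept the expected token
    }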

Thus, the body of the parsing function for array-suffix becomes something like

must_be(token::LEFT_BRACKET);
token t = input.current();
if (t.kind() == token::NAME)
    // do something with the name
else if (t.kind() == token::INT_LITERAL)
    // do something with the literal
must_be(token::RIGHT_BRACKET);

The if-else if statement in the middle does not end with

else
    error("name or integer expected");


because the name or integer-literal is optional. It is not an error if neither is present.
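Fleshed out to build its piece of the translation, array_suffix might look something like this sketch; the wording of the strings follows the sample output shown earlier, and the newline placement is my guess:

string parser::array_suffix()
    {
    must_be(token::LEFT_BRACKET);
    string dimension;
    token t = input.current();
    if (t.kind() == token::NAME
        || t.kind() == token::INT_LITERAL)
        {
        dimension = " with " + t.text() + " elements";
        input.get();
        }
    must_be(token::RIGHT_BRACKET);
    return "array" + dimension + " of type ...\n";
    }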

The body of the parsing function for

simple-declaration =
    decl-specifier-seq declarator { "," declarator } .

looks something like

decl_specifier_seq();
declarator();
while (input.current().kind() == token::COMMA)
    {
    input.get();
    declarator();
    }

Here, the reference to decl-specifier-seq on the right-hand-side of the syntax rule translates to the call to the parsing function for decl-specifier-seq (ditto for declarator). The iterative part of the rule (the part enclosed in braces) translates into a while statement, whose body calls the declarator parsing function.

There is only one place where the parser must look ahead at an extra token and then back up (using scanner::unget). If the decl-specifier-seq parsing function encounters a type-name, it must look at the token after that name to see if it's a ::. A type-name followed by a :: is not part of the decl-specifier-seq; a type-name followed by anything other than a :: is. Whether or not there's a ::, the parser must back up and finish processing the type-name.

For example, in the declaration

const T n;

T is part of the decl-specifier-seq, and n is an object of type const T. But in

const T::*p;

T is part of a pointer-to-member declarator. p has type "pointer to member of T of type const int" (because of that darned "implicit int" rule). The :: after the type-name has a dramatic effect on the parse.
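In code, the lookahead inside decl_specifier_seq comes out something like this sketch (COLON_COLON again stands for my assumed name for the "::" category):

// the current token is a NAME; is a "::" right behind it?
token next = input.get();       // peek at the following token
input.unget();                  // back up; the name is current again
if (next.kind() == token::COLON_COLON)
    {
    // the name begins a pointer-to-member declarator;
    // it is not part of the decl-specifier-seq
    }
else
    {
    // the name is a type-name in the decl-specifier-seq
    }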

Translating into English

Each parsing function returns a string containing the English translation of the portion of the declaration that it parsed. Each parsing function that calls other parsing functions must splice those strings together to form the complete translation. The parser uses the string operators + and += to do the splicing.

For example, the body of the simple-declaration parsing function is:

string d = decl_specifier_seq();
string sd = declarator() + d + '\n';
while (input.current().kind() == token::COMMA)
    {
    input.get();
    sd += declarator() + d + '\n';
    }
return sd;

Given the input

const int *p, f();

the call to decl_specifier_seq returns the string

const int

which becomes the value of d. The first call to declarator (above the loop) returns

p is ...
pointer to ...

simple_declaration appends d and a newline to this string, and stores the result

p is ...
pointer to ...
const int

in sd. The next call to declarator (inside the loop) returns

f is ...
function returning ...

simple_declaration appends this string, another copy of d, and a newline to sd. The result is

p is ...
pointer to ...
const int
f is ...
function returning ...
const int

Starting the Parser

I packaged the entire parser as class parser. The parser class definition appears near the top of Listing 3. The parser's only explicitly defined public member is its constructor. All of the other parsing functions are private. The parser also has two private data members: input is the scanner object from which it gets tokens, and output is the ostream object to which it writes the translations.

The parser constructor embodies all the work of the decl program. Not only does it attach input and output streams to the parser, it also executes the parse and translation. I considered writing a separate public function to run the parse, but I couldn't see any real reason to do it. I believe using a constructor to initiate an entire process is an established idiom, and it seems to work well in this case. I may change my mind later, but right now I like it.
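A sketch of the class as just described follows. Listing 3 is authoritative; exactly which parsing functions exist beyond the ones discussed in this article is my guess:

class parser
    {
public:
    parser(istream &, ostream &);   // attaches the streams and runs
                                    // the entire parse and translation
private:
    string simple_declaration();
    string decl_specifier_seq();
    string declarator();
    string array_suffix();
    // ...the remaining parsing functions, one per rule...
    void must_be(token::category);
    void error(const char *);
    scanner input;                  // token source
    ostream &output;                // destination for the translations
    };

With that interface, main is presumably little more than the declaration parser p(cin, cout);.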

Coping with Syntax Errors

decl does a decent, but not outstanding, job of detecting and reporting syntax errors. Some of the error messages are right on the money. Others are a bit vague; that's the nature of C and C++ declaration syntax. I invite you to study the error messages, and I welcome suggestions for improvement.

One way to improve the messages would be to display a marker in the output under the offending symbol. This requires a more elaborate scanner — one that would keep track of each token's column position, so that the error reporting routines could place the marker properly.

To put it mildly, this version of decl does not recover from errors very well. In fact, it stops dead in its tracks. It would be better if the program could continue to accept more declarations after an error. There are various ways to do this, one of which uses exception handling. That sounds like a fun topic for next month.

Dan Saks is the president of Saks & Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513) 324-3601, or electronically at dsaks@wittenberg.edu.