Dan walks a different path from C to C++ than he did seven years ago.
Copyright © 1998 by Dan Saks
As I explained last month, I'm revisiting the programming example I used seven years ago in my first few "Stepping Up to C++" columns. (See "C++ Theory and Practice: Basing Style on Design Principles," CUJ, March 1998.) Back then, I used the example to introduce fundamental C++ concepts. This time around, I'll assume you know the basics and use the example to show you how the newer facilities of C++ offer a wider, sometimes confusing, array of choices for rendering designs as code. In the coming months, I'll rewrite the program using alternative styles and try to provide some guidance to help you choose ones that are right for you.
My example starts with a program written in what Tom Plum and I call "Typesafe C," C that also compiles as C++ [1]. Writing in Typesafe C is largely a matter of declaring functions and data in C so that they satisfy the more finicky type-conversion rules of C++. For example, if p is a T * (for any complete object type T), then
p = malloc(sizeof(T));compiles as C, but not as C++. malloc returns void *, and C++ does not allow implicit conversion from void * to any type other than cv-qualified void*.To get it to compile in both C and C++, you must write the statement as
p = (T *)malloc(sizeof(T));Seven years ago, I was showing my readers how to move from C to C++, so starting with a C program made obvious sense. But now I'm trying to make some points about newer C++ features and programming style. I debated whether to start with a C++ program rather than a C program, but decided to stick with C.
C provides a well defined starting point for this exercise. If I had started with a C++ program, I'd have had a hard time deciding which C++ features to use initially. For example, the iostream library uses parameters of reference types. If I had used iostream instead of stdio, should I have also used references extensively? The iostream components are members of namespace std. Had I used iostream, should I have used the explicitly qualified names for the components, or should I have used using-directives? It is my intent to discuss these issues, but I want to take them one at a time.
A Cross-Reference Generator
My example is a program called xr. xr is a cross-reference generator. It reads text and writes an alphabetized list of the words that appeared in the input. Each word in the output appears on a separate line, followed by the sequence of unique line numbers on which that word appeared in the input. To xr, a word is any contiguous sequence of letters, digits, and underscores starting with a letter or underscore. In other words, a word is an identifier as in C or C++.
For example, if the word Jenny appears in the input once on line 8, twice on line 67 and once on line 5,309, then the cross-reference listing entry is
Jenny 8 67 5309My implementation makes the simplifying assumption that every output line will display properly on the output device. It does not try to break long output lines into several shorter ones. Long output lines will probably spill over to another line or truncate.
I got the idea for this program from Exercise 6-3 in Kernighan and Ritchie [2]. A solution to Exercise 6-3 appears in Tondo and Gimpel [3]. That solution behaves a little differently from my xr program:
Although the overall structures of the programs are very similar, they differ on many implementation details. You may find it interesting to compare their program with mine. I will mention some of those differences as appropriate.
- My program treats an underscore as a letter; theirs does not.
- Their program filters out "noise" words such as "a," "an," and "the;" mine does not.
xr is not a big program, but it's complicated enough to demonstrate the problems you can run into when developing and maintaining big programs. In particular, it uses a data structure that is composed from a few simpler structures. Therefore, it forces you to confront the most challenging problem in program design, namely, deciding exactly how to distribute functionality among the program's subcomponents.
xr reads text from standard input and writes text to standard output. Although the input is actually a sequence of individual characters, xr needs to see the input as a sequence of words. Therefore, xr uses an input function that groups characters into words, and discards the uninteresting characters between those words.
Actually, xr must be on the alert for more than just words. It needs to see newline characters so that it can count the lines as they go by. It also needs to see end-of-file so it knows when to quit.
Kernighan and Ritchie called their input function getword. I used that same name in my example seven years ago, but I don't like it anymore. I've caught myself saying awkward things such as "if the word that getword returns is really a word..." A sequence of input characters that forms a unit is commonly called a token or lexeme, so I called the input function get_token. I think a phrase such as "if the token is a word..." makes more sense.
Using get_token, the high-level algorithm for xr is fairly straightforward. It appears in Listing 1.
The first version of xr appears in Listing 2. The program is a single source file containing just four functions:
I believe it's fairly easy to match the statements in the body of main (Listing 2) with the high-level algorithm (Listing 1) .
- main
- get_token: get a token from standard input
- add_tree: add an entry to the cross-reference for a word and its corresponding line number
- put_tree: write the alphabetized cross-reference listing to standard output
This implementation uses a binary tree of tree_node objects to maintain each word in alphabetical order. Each tree_node in the tree holds a word and its associated set of line numbers. Each set of line numbers is implemented as a singly-linked list of list_node objects with separate pointers to the first and last list_node in the list. Each list_node contains one line number.
When add_tree adds a cross-reference entry for a word that is already in the tree, it simply adds the line number to that word's existing list (if the number isn't already there). Maintaining a pointer to the last list_node in each list makes adding a new line number very fast and easy.
Tags and Types
Listing 1 defines list_node as both a typedef and as a struct type:
typedef struct list_node list_node; struct list_node { unsigned number; list_node *next; };It does the same with tree_node. I explained the reasons for this style last year [4]. Here's a brief recap.
In C, a declaration such as
struct T { ... };declares T as a tag. A tag name is not a type name. Thereafter, C won't recognize T for what it is unless it's preceded by the keyword struct. For example,
T *p; // error in C struct T *q; // okay in CThe names of unions and enumerations are also tags rather than types.
If you're like me and prefer to think of struct names as type names, you can effectively turn tags into types using typedefs. For example,
typedef struct T T;defines type T as an alias for struct T. You can then write declarations such as
T x; T *p;C++ eliminates most of this nonsense. Tag names in C automatically become types in C++. Therefore,
struct list_node { unsigned number; list_node *next; // okay };is valid C++ even without the typedef. It's almost as if C++ generated the typedef, but not quite. See [4] for details.
Evidence of Complexity
The main function has just two jobs:
1. Keep track of the input line number.
2. Determine when a word and its line number should go into the cross-reference table.
main does not need to know how the word and line number are actually stored in the table; it only needs to know what values to place in the table and when to place them. main can do this job without knowing that the cross-reference is a tree. In fact, knowing this only makes main a little more complicated than it has to be.
When I say that main "knows" that the cross-reference is a tree, I mean you can tell just by looking at main's function body. main places words and line numbers into the cross-reference table by calling a function named add_tree. It writes the cross-reference to the output by calling a function named put_tree. Both functions accept xr as an argument, and xr has type tree_node *. add_tree also returns a tree_node *.
The knowledge that the cross-reference is a tree clutters main with a little bit of unnecessary detail about other parts of the program. Granted, this particular program is simple enough that this small clutter does little harm. However, writing large programs in this same style leads to software that's overly complicated and difficult to maintain.
In software design parlance, we say that main is "coupled" to the knowledge that the cross-reference is a tree, and unnecessarily so. Coupling is the extent to which one software component depends on another.
Coupling is not bad in and of itself. Two components can't communicate unless they are coupled in some way. For example, main depends on the existence of some function that adds a word and a line number to the cross-reference table. If that function goes away, then you must rewrite main to do the job some other way.
Since some minimal coupling is unavoidable, we might as well accept it as good. However, excess coupling is clearly bad. Excess coupling, such as main's awareness that the cross-reference is a tree, makes for brittle software.
You might discover that a data structure other than a tree is a better implementation for the cross-reference table. In that case, functions with names built around the word "tree" and parameters of type tree_node * would be totally inappropriate. In addition to changing the bodies of add_tree and put_tree to use the new data structure, you would have to eliminate all traces of trees from their function names, parameter types, and return types. This entails rewriting main as well as add_tree and put_tree.
Preserving the Illusion
If you refer back to my description of how xr works, you'll see that it never mentions trees. Neither does the algorithm in Listing 1. At this level in the program design, you don't need to know how the cross-reference table is implemented, but you should know what the table does. This is good design. It creates an illusion of simplicity. Good code preserves that illusion.
You can create an illusion of simplicity for xr by choosing names and placing declaration in ways that hide design details from those that need not know them. This practice is what different people call data hiding, information hiding, data abstraction, or encapsulation. Take your pick.
C++ offers surprisingly many ways to package components as abstractions. To experienced C++ programmers, the most obvious techniques use classes, often spread among headers and source files. Sometimes you can take advantage of incomplete types to produce better isolation. Now you can also mix in namespaces. In this and future articles, I will consider each of these language tools, separately and in combination.
Abstraction by Separate Compilation
Let's start by moving everything that has to do with the cross-reference data structure into a separate source file. Seven years ago, I was working with an operating system that supported file names of only up to eight characters, so I called that source file xrt.c. (That's 'x' as in "cross," 'r' as in "reference," and 't' as in "table.") Now that I can use longer file names, I'll use the more descriptive name cross_reference.c.
Moving the definitions for list_node, tree_node, add_tree, and put_tree to cross_reference.c eliminates all trace of trees from xr.c except for a few remnants in the body of main. Since main deals with only one cross-reference table, it doesn't need to know the table's name or location. Therefore, you can also move the declaration
treenode *xr = NULL;from main into cross_reference.c.
Without xr, main can't call add_tree and put_tree directly, but that's okay because we don't want main calling them. Instead, cross_reference.c provides functions
void cross_reference_add (char const *w, unsigned n); void cross_reference_put();for main to call as intermediaries. cross_reference_add simply calls add_tree(xr, w, n) and cross_reference_put() simply calls put_tree(xr).
Listings 3 through 5 show the implementation of xr using separate compilation. Listing 3 is the main source file, xr.c. Listing 4 is the cross-reference table implementation in cross_reference.c. Listing 5 contains cross-reference.h, a header that declares the interface to cross_reference.c. The program is still in Typesafe C.
Looking at the body of main you can no longer tell that the cross-reference table is a tree. The call
cross_reference_add(token, ln);effectively says "add the word (in token) and line number (in ln) to the cross-reference table, wherever and whatever that table is."
Actually, just moving the definition for variable xr to cross_reference.c does not make xr inaccessible from xr.c. A declaration in file scope (in C) or global namespace scope (in C++) such as
treenode *xr = NULL;has external linkage by default. Therefore, code in another source file, such as xr.c, could declare
extern void *xr;and gain access to xr. In C++, declaring an externally-linked object such as xr so that it has different types in different translation units violates the One Definition Rule (ODR). Strictly speaking, the violation is an error, but it's hard to catch so the C++ Standard does not require that compilers produce a diagnostic. Few, if any, do.
cross_reference.c (Listing 3) prevents this mishap (or malice) by declaring xr as
static treenode *xr = NULL;so that xr has internal linkage. Thus, a declaration such as
extern void *xr;in another source file cannot refer to the xr defined in cross_reference.c. It may refer to a different xr with external linkage elsewhere in the program, if one exists.
cross_reference.c also declares add_tree and put_tree with the keyword static so they too have internal linkage. Again, this has two benefits:
- No code outside cross_reference.c can call add_tree and put_tree.
- The names add_tree and put_tree cannot conflict with names elsewhere in the program.
For the most part, the second version of the program (using separate compilation) is an improvement over the first. The input processing in xr.c is now clearly distinct from the table processing in cross_reference.c. main more clearly describes what it does without unnecessary and intrusive details about the inner workings of the table. Secondly, the program is more maintainable. The structure of the cross-reference table is so well hidden that you can rewrite it using an entirely different data structure without impacting main. As long as you don't change cross_reference.h, you never need to recompile xr.c.
On the other hand, using separate compilation and internal linkage has a drawback. The interface functions cross_reference_add and cross_reference_put add an extra layer of function calls that was not in the original program. They make the program both a little bigger and a little slower.
Giving Namespaces a Try
You can eliminate that overhead by defining cross_reference_add and cross_reference_put as inline functions. You must then move the function definitions to the header file cross_reference.h so that the compiler will have seen the function bodies by the time it compiles the calls to these functions in xr.c. (xr.c #includes cross_reference.h.)
The function body of cross_reference_add (in the header) refers to both xr and add_tree. Therefore, both xr and add_tree must be declared in the header. In turn, the declarations for xr and add_tree refer to tree_node, so it must also be declared in the header. By the time you reach the ends of these threads, the declarations for list_node and put_tree also wind up in the header.
When you move the declarations for xr, add_tree, and put_tree into the header you must remove the keyword static. Otherwise, they will produce link errors. Remember that a name declared at file or namespace scope with the keyword static has internal linkage. A name with internal linkage cannot refer to an object or function defined in another translation unit. In general, headers are supposed to provide declarations that allow one translation unit to share facilities defined in another. However, internal linkage prevents such sharing.
Dumping all those declarations into the header turns all those names back into globals. If you used the cross-reference as part of a larger application, the names in the header might conflict with the same names declared in other headers. Of course, you could add the prefix cross_reference_ to every name that doesn't already have it, but that makes for some very long names. The alternative is to declare all those names in a namespace.
Listing 6 shows the header cross_reference.h with the declarations for the cross-reference table wrapped inside namespace cross_reference. Declaring cross_reference_add as a member of namespace cross_reference made the cross_reference_ prefix in the function name redundant. I shortened the function's name to just add, so its fully-qualified name is now cross_reference::add. cross_reference_put becomes cross_reference::put.
Listings 7 and 8 contain the corresponding source files cross_reference.cpp and xr.cpp, respectively. (The source files now have .cpp extensions because namespaces are strictly C++.)
The declaration for xr in the header (Listing 6) is:
extern tree_node *xr;which is not a definition. The declaration in the source file (Listing 7) is:
tree_node *xr = NULL;which is a definition as well. As I explained last year, an object declaration in C++ is a definition unless it has an explicit extern specifier and no initializer [5]. This is true whether the declaration is in the global namespace, a named namespace such as cross_reference, or in block scope.
The declaration in the header should be just a declaration, not a definition. If the declaration were a definition and you included the header in more than one translation unit, then you would get link errors when you linked those units together because xr would have multiple definitions.
Up Ahead
Wrapping all the declarations for the cross-reference table inside one namespace eliminates all the global names except for the name of the namespace itself. This dramatically reduces the potential for name conflicts, and that's good. However, using namespaces to package abstractions has weaknesses, which I'll discuss next time. o
References
[1] Thomas Plum and Dan Saks. C++ Programming Guidelines (Plum Hall, 1992).
[2] Brian Kernighan and Dennis Ritchie. The C Programming Language, 2nd Edition (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[3] Clovis Tondo and Scott Gimpel. The C Answer Book, 2nd Edition (Prentice-Hall, Englewood Cliffs, NJ, 1989).
[4] Me, "C++ Theory and Practice: Workarounds for a Mistake," CUJ, September 1997.
[5] Me again, "C++ Theory and Practice: Storage Classes and Linkage," CUJ, November 1997.
Dan Saks is the president of Saks & Associates, which offers training and consulting in C++ and C. He is active in C++ standards, having served nearly seven years as secretary of the ANSI and ISO C++ standards committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield, OH 45504-4906 USA, by phone at +1-937-324-3601, or electronically at dsaks@wittenberg.edu.