Sorting in memory is hard enough; when you have to spill data into temporary files, you have a major organizational problem on your hands.
Introduction
Most discussions of sorting found in programming literature center on internal (array) sorting. The principal constraint here, of course, is that the entire set of data must be present in memory during the sort. These discussions usually assume the reader knows what to do when the size of the data set exceeds available memory, or may simply refer to more weighty tomes such as Knuth [1].
In contrast to internal sorting, an external merge sort can handle a data set of arbitrary size, usually (but not necessarily) too large to hold in memory at one time. A number of merge sorting algorithms are presently known [1]. Most take the same general approach. They begin by breaking the raw input into blocks. This initial phase pre-sorts each block internally, forming a sorted run, then writes it to one of a number of work files. (A run is an ordered sequence of data elements, either ascending or descending.) A second (merge) phase then makes one or more passes across the work files, merging runs to form longer runs on each pass. (Merging combines two or more runs, one run from each file, into a single run on another file, preserving sorted order.) The final result consists of a single run on the output file.
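To make the first phase concrete, here is a minimal sketch of run formation, assuming fixed-length long records; the helper names, block size, and the simple round-robin distribution are mine for illustration (a true polyphase sort distributes runs by the Fibonacci scheme described below):

    #include <cstdio>
    #include <cstdlib>

    // compare two longs for qsort (ascending order)
    static int cmp_long(const void *a, const void *b)
    {
        long x = *static_cast<const long *>(a);
        long y = *static_cast<const long *>(b);
        return (x > y) - (x < y);
    }

    // read the source in blocks, pre-sort each block in memory, and
    // append each sorted block (a run) to the work files in rotation
    void form_runs(std::FILE *src, std::FILE *work[], int nwork,
                   std::size_t block)
    {
        long *buf = static_cast<long *>(std::malloc(block * sizeof(long)));
        if (buf == 0)
            return;                        // real code would report the failure
        std::size_t n;
        int target = 0;
        while ((n = std::fread(buf, sizeof(long), block, src)) > 0) {
            std::qsort(buf, n, sizeof(long), cmp_long);   // internal pre-sort
            std::fwrite(buf, sizeof(long), n, work[target]);
            target = (target + 1) % nwork;                // next work file
        }
        std::free(buf);
    }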
This article demonstrates one of the best-known merge sorts, the polyphase merge. I make no sensational claims regarding the performance of this particular version, although the polyphase algorithm seems to be generally recognized as highly efficient by most cognoscenti. With appropriate modifications, this C++ implementation could serve as an industrial-strength, general-purpose sorting package. Rather than go to ultimate lengths to build the fastest sort in town, my primary goal was to formulate a relatively portable and encapsulated design.
The polyphase merge gains maximum efficiency (relative to other merge sorts) from its particular method of distributing data runs. Given N work files, the number of input files during a merge pass is N-1. At the beginning of each pass, the remaining work file starts out empty, ready to receive output from the others. The sort arranges the number of runs in each of the work files so that one becomes empty at the completion of each merge pass. The empty file then becomes the new output file for the next pass.
This arrangement contrasts with other less efficient merge sorts, which must usually transfer the entire set of data records during each pass. A polyphase sort needs to transfer only a portion of the data at each pass, resulting in higher overall efficiency since much less data movement occurs.
Fibonacci Fun
To realize this degree of efficiency, the initial distribution of runs in the work files must conform to a sequence of Fibonacci numbers. Named for an early Italian mathematician, Fibonacci numbers seem to turn up in all sorts of odd places, such as the study of ancient pyramids. For our purpose, however, the polyphase sort must compute the ideal number of runs required for each work file at each level. Working backward from the desired final sequence (a single run on one file, the other files empty), the sort distributes runs to each file until the ideal distribution of runs is reached for that level. If input data remains, the sort must generate a new sequence for the next level and continue until exhausting the source data. By recording these distributions at each level, the sort builds an ideal Fibonacci matrix, an example of which appears in Figure 1.
The preceding discussion ignores reality, at least slightly. Real-world data seldom, if ever, conform to the ideal number of runs required for a perfect merge. However, the sort easily handles this problem when it computes the ideal number of runs needed to reach the next level. Imagine for a moment that the shortfall from these ideal numbers forms a set of imaginary, or dummy, runs. Then, each actual run distributed from input to a work file means one fewer dummy run for that file. When or if all the dummy runs are used up in this manner, the distribution phase proceeds to the next level. Any remaining dummy runs, when added to the actual number of runs generated, still preserve the ideal.
Figure 1 illustrates the distribution phase schematically. Here each succeeding level in the distribution matrix is skewed one place left for clarity. The first number in a new row is taken from the last number in the preceding row. The sort generates each of the succeeding numbers in a new row by adding the number appearing above it to the first number in the new row. As indicated earlier, the sort also maintains a count of the number of ideal runs needed for the next level. Each of these counts is simply the difference between a number appearing in one row and the number in the same position in the preceding row.
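In code, the row rule is only a few lines. This sketch (mine, assuming no more than 16 work files) prints the ideal distribution level by level for files-1 receiving files; the dummy-run count needed to reach the next level falls out as next[j] - prev[j]:

    #include <cstdio>

    // print `levels` rows of the ideal run-distribution matrix for
    // `files` work files (files-1 of them receive runs at distribution)
    void build_ideal(int files, int levels)
    {
        const int P = files - 1;          // receiving files per level
        long prev[16] = { 0 }, next[16];  // hypothetical cap: files <= 16
        for (int j = 0; j < P; ++j)
            prev[j] = 1;                  // first level: one run per file
        for (int level = 1; level <= levels; ++level) {
            for (int j = 0; j < P; ++j)
                std::printf("%6ld", prev[j]);
            std::printf("\n");
            next[0] = prev[P - 1];        // first number: last of prior row
            for (int j = 1; j < P; ++j)
                next[j] = prev[j - 1] + next[0];  // number above + new first
            for (int j = 0; j < P; ++j)
                prev[j] = next[j];
        }
    }

With five work files, for instance, this prints 1 1 1 1, then 1 2 2 2, then 2 3 4 4, and so on, each row ordered with the fewest runs at the head.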
Merging
Conceptually, merging is the reverse of run distribution (Figure 2). Recall that after the completion of the distribution phase for N work files, at most N-1 of them contain sorted runs for input and the remaining (output or target) file is initially empty. Notice also that in Figure 2 the list of input files forms a sequence with increasing numbers of runs. The leftmost file (at list head) always contains the least number of runs.
Given R runs in the head (leftmost) file, a merge pass for that level consists of merging R runs from each input file to the target file. It takes R mini-passes (or subpasses) to complete the entire pass, each merging one run from every input file.
Since the file at list head always contains the least number of runs (R), this file will be empty when R runs have been merged from each input file. The target (output) file now contains R runs also, except its runs are N-1 times longer (ideally) than before the pass. From a practical viewpoint, however, run length is immaterial to these proceedings. The critical factor is the number of runs on each input file. Due to Fibonacci magic, the target file ends up with the greatest number of runs when the pass completes. The merge has now been reduced by one level.
At the end of each pass, the merge routine must rotate the work files. It rewinds the target file and appends it to the tail end of the list of input files. Next, the routine detaches the file at list head, its data having been exhausted. Truncated to zero length, it becomes the new target file for the next merge pass. The new file appearing as list head now has the least number of runs. The merge pass repeats as before with larger runs being formed at each pass until only one run remains.
Those pesky dummy (or null) runs encountered during a merge pass complicate matters, though only slightly. If a file still has null runs at the beginning of a merge mini-pass, the merge responds by decrementing its run count (both null and actual) and ignoring that file during the mini-pass. If all input files contain null runs, the merge ignores them all and produces a single null run on the target file, known as a "dummy merge." Note that a dummy merge is only hypothetical; no actual movement of data occurs. The merge will frequently account for all null runs on the first pass, unless a dummy merge produced a null run on the target. Once all null runs are eliminated, however, the merge proceeds on an ideal basis.
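Gathering the last few paragraphs into one place, a merge pass has roughly the following shape. This is a hypothetical skeleton, not the article's _merge_runs; WorkFile and merge_one_run stand in for the real MergeFile class and its buffered run I/O:

    #include <list>

    struct WorkFile {
        long runs;      // runs credited to this file (real + dummy)
        long dummies;   // dummy (null) runs still outstanding
        // ... buffer, FILE handle, etc.
    };

    void merge_pass(std::list<WorkFile *> &inputs, WorkFile *target)
    {
        long R = inputs.front()->runs;     // head file has the fewest runs
        for (long r = 0; r < R; ++r) {     // R mini-passes complete the pass
            bool all_dummies = true;
            for (std::list<WorkFile *>::iterator it = inputs.begin();
                 it != inputs.end(); ++it) {
                WorkFile *f = *it;
                --f->runs;                 // one run consumed either way
                if (f->dummies > 0)
                    --f->dummies;          // null run: no data to move
                else {
                    all_dummies = false;
                    // merge_one_run(f, target);  // real data participates
                }
            }
            ++target->runs;                // target gains one (longer) run
            if (all_dummies)
                ++target->dummies;         // a "dummy merge": nothing moved
        }
        // the head file is now empty; rotate the list (see The Merge List)
    }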
A Practical Merge Sort
A realistic formulation of the foregoing algorithm must contend with a number of pragmatic concerns and limitations. Foremost among these are constraints on resource availability, file I/O and buffering considerations, as well as container usage. I set out to design the sort by breaking the task into a number of major parts:
- a file I/O interface, extendible to handle data in files of unknown composition and structure
- a file buffering scheme to handle data in discrete chunks of any size
- forming and pre-sorting data runs read from input
- maintaining a list of merge (work) files, with a container library interface
- a high-level unit that implements the distribution and merge phases of the algorithm
All of this is in keeping with an encapsulated class design. I tried to design each part so that it could stand on its own and have a potential for other uses. This keeps maintenance costs (hopefully) to a minimum and allows you to indulge yourself with other options, like endlessly optimizing the sort, or experimenting with other merge sorting algorithms.
You will frequently need to push memory allocation to its limit to obtain as much sorting efficiency as possible. I chose to rely on default memory allocators, primarily to keep code to a minimum. operator new holds memory allocation to UINT_MAX bytes by default, unless overloaded to use non-standard allocators. All code presented here uses the default operator new. Additionally, I have not tested the internal pre-sort function used here (qsort) on an array larger than UINT_MAX bytes. (When compiling for DOS, you need the huge memory model for an array of this size.) However, the documentation I have for qsort (Borland C++ 4.0) mentions no upper limit on array size.
In keeping with my general-purpose goals, I wanted the sort to employ generic (void *) pointers to data elements. Templates are an option here, offering type security at the expense of hefty code bloat. However, I view type security as a marginal issue in this case. The sort has no control over the actual contents of the source file submitted to it, templates or not. Not using templates also lets you decide at run time what type of data to sort. If you still want to use templates and avoid bloat, it's more realistic to surround the main PolySort class with a template-based wrapper class (see class TPolySort in file psort_t.h on the CUJ ftp site).
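The idea behind such a wrapper can be sketched briefly; the names below are illustrative, not the actual TPolySort interface. Only the thin typed facade is stamped out per instantiation, while the untemplated engine compiles once:

    #include <cstddef>

    class SortCore {                     // stand-in for the void*-based engine
    public:
        explicit SortCore(std::size_t item_size) : size(item_size) {}
        void submit(const void *item) { (void)item; /* buffer it... */ }
    private:
        std::size_t size;
    };

    template <class T>
    class TSort {                        // typed facade over the engine
    public:
        TSort() : core(sizeof(T)) {}
        void submit(const T &item) { core.submit(&item); }  // type-safe entry
    private:
        SortCore core;                   // all real work stays untemplated
    };

A client would then write TSort<long> s; s.submit(42L); and get compile-time type checking at the boundary without duplicating the sorting machinery per type.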
File I/O and Buffering
Listing 1 (ffstream.h) presents code for a simple FILE-based stream interface class. I chose to use FILE streams because fread and fwrite can process data in chunks, a natural for the chunk-style buffering scheme I devised. The name ffstream comes from an earlier attempt to derive the class from fstream. This works fine, but I had to write separate iostream-based versions of fread and fwrite (these appear on the ftp site). I saved some code space by using the standard library versions instead. Both ffstream::read and ffstream::write are virtual to allow you the option of extending them to handle data from files of unknown structure, such as database files. An example on the code disk includes an extended version of ffstream specialized to extract words from a text file for sorting. The main sorting routine is none the wiser.
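The essence of that interface can be reduced to a few lines. The sketch below is my reduction, not Listing 1 itself; the real ffstream adds construction checks and other conveniences:

    #include <cstddef>
    #include <cstdio>

    class ChunkStream {
    public:
        ChunkStream(const char *name, const char *mode)
            : fp(std::fopen(name, mode)) {}
        virtual ~ChunkStream() { if (fp) std::fclose(fp); }
        bool ok() const { return fp != 0; }

        // virtual, so a derived class can repackage a foreign file format
        // (extracting words from text, say) while the sort stays unaware
        virtual std::size_t read(void *buf, std::size_t size,
                                 std::size_t count)
            { return std::fread(buf, size, count, fp); }
        virtual std::size_t write(const void *buf, std::size_t size,
                                  std::size_t count)
            { return std::fwrite(buf, size, count, fp); }
    protected:
        std::FILE *fp;
    };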
Listings 2 and 3 present MergeFile, a file buffering class. The sort uses this class to interface with its temporary work files as well as the (permanent) source file. Most MergeFile members are probably self-explanatory, but a few merit some comment.
I configured the type conversion operator int() to assert valid construction for MergeFile objects. This allows you to use the returned object in conditional statements, as if the constructor returned a value. operator int() returns zero (not okay) if a buffer couldn't be allocated or a temporary work file couldn't be created due to a lack of available file handles. When object construction fails, the client routine must delete the invalid object.
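In isolation, the idiom looks like this (a sketch with hypothetical members, not MergeFile's actual internals; the nothrow form substitutes for the article's null-returning default operator new):

    #include <cstddef>
    #include <cstdio>
    #include <new>

    class Checked {
    public:
        explicit Checked(std::size_t bufsize)
            : buf(new (std::nothrow) char[bufsize]),   // null on failure
              tmp(std::tmpfile()) {}                   // null if no handles
        ~Checked() { delete [] buf; if (tmp) std::fclose(tmp); }
        operator int() const { return buf != 0 && tmp != 0; }
    private:
        char *buf;
        std::FILE *tmp;
    };

    // client code tests the object as if the constructor returned a value:
    //     Checked *m = new Checked(8192);
    //     if (!*m) { delete m; /* construction failed */ }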
Member functions Get and Put extract and insert single items of buffered data. Nextg samples the buffer for the next item that Get will extract. AttachStream lets you attach a source ffstream to a merge buffer for run pre-sorting and distribution. Once attached, a MergeFile won't delete or modify the source file in any way. RenameOutput closes and renames the final output (work) file to a desired result, then creates a new temporary file for the next sort run, if any.
ResetIOMode prepares a work file for input or output. Note that input and output operations on a MergeFile are not mutually exclusive: Put calls can follow Get (or vice versa) without resetting the I/O mode, an action that may corrupt the buffer. Adequate safeguards against this potential loss of integrity should be provided, but I saved some code space by leaving this alone.
Generating Data Runs
Class PreSort (also in Listings 2 and 3) breaks input data into blocks and sorts them into runs with the supplied internal sort routine. Class MergeFile grants friendship to PreSort. The PreSort constructor takes references to the source MergeFile and an InternalSort object (discussed later).
The PreSort constructor must allocate memory for a sort buffer, which holds pointers to the actual source data, not the data itself. Member function GenerateRun thus works indirectly, sorting pointers instead of actual data. This optimizes the process considerably, since void * pointers are almost always smaller than the actual data. When finished, GenerateRun outputs the source data in sorted order to the supplied destination MergeFile by dereferencing the sorted pointers in its buffer.
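A stripped-down rendering of this indirection, assuming fixed eight-byte records and memcmp ordering (both purely illustrative), might read:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    static int cmp_via_ptr(const void *a, const void *b)
    {
        // qsort hands us the addresses of the pointers in the sort buffer
        const char *x = *static_cast<const char * const *>(a);
        const char *y = *static_cast<const char * const *>(b);
        return std::memcmp(x, y, 8);         // assume 8-byte records
    }

    void write_run(char *data, std::size_t nrec, std::FILE *dest)
    {
        char **ptrs =
            static_cast<char **>(std::malloc(nrec * sizeof(char *)));
        if (ptrs == 0)
            return;
        for (std::size_t i = 0; i < nrec; ++i)
            ptrs[i] = data + i * 8;          // point at each record in place
        std::qsort(ptrs, nrec, sizeof(char *), cmp_via_ptr);  // sort pointers
        for (std::size_t i = 0; i < nrec; ++i)
            std::fwrite(ptrs[i], 1, 8, dest);  // emit records in sorted order
        std::free(ptrs);
    }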
For source data of sizes less than or equal to sizeof (void *), sorting efficiency drops rapidly. It then becomes faster to sort the data directly, instead of through indirect pointers. For simplicity I've not provided code to handle this situation. I assume source data of this size would rarely be encountered in practice, except for testing (see Listing 10).
I should mention one caveat in this regard. If you run the sort on source data smaller than sizeof (void *) (on int data, say, when compiled for DOS huge model), the size of the sort buffer needed becomes larger than the data buffer it points to. If the source buffer is at or near the UINT_MAX limit, PreSort will attempt to allocate a sort buffer larger than this limit. With system defaults in effect, you'll end up with only a portion of what you need, leading to memory overwrites. You could either force the size of the source buffer within limits, or overload operator new to handle non-standard allocation. As mentioned earlier, I'm not sure qsort can handle a larger buffer. Again, I included no code to effect either solution.
Class InternalSort (Listing 4) handles data comparison and sorting for PreSort. InternalSort maintains pointers to a non-member (or static member) function for data comparison and an internal sort function (qsort by default). Listing 5 contains typedefs for both functions. InternalSort also holds the primary record of the size (in bytes) of the data items to sort. The polyphase sort offers only this minimal level of type safety.
You may think you could more elegantly implement the comparison function as a virtual member function, but you run into problems. For one, qsort can't take a pointer to a non-static member function. qsort must also pass indirect (void **) data pointers to the comparison function, but the compiler won't let you easily dereference them to access the data you need to compare, due to strong type checking.
One solution is to cast the function pointer argument to fool qsort into thinking it's getting the proper type of comparison function. Although theoretically unsafe, so far it has worked fine. For the present, there appears to be no other practical solution, but you shouldn't lull yourself into complacency [2] . The technique provisionally allows you to write a comparison function that takes arguments of the specific type you're interested in, without resorting to casting.
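In outline, the technique looks like this; the typedef name and element type here are mine for illustration (the article's actual typedefs appear in Listing 5):

    #include <cstdlib>

    typedef int (*qsort_cmp)(const void *, const void *);

    // a comparison written against the type we actually care about;
    // qsort passes addresses of the (long *) elements in the sort buffer
    static int cmp_long_indirect(const long * const *a, const long * const *b)
    {
        return (**a > **b) - (**a < **b);
    }

    void sort_pointers(long **ptrs, std::size_t n)
    {
        // the cast persuades qsort to accept the strongly-typed function;
        // theoretically unsafe, as noted, but it has worked in practice
        std::qsort(ptrs, n, sizeof(long *),
                   reinterpret_cast<qsort_cmp>(cmp_long_indirect));
    }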
The Merge List
The polyphase merge algorithm is commonly expressed using array terminology [1]. However, in developing my version of the polyphase sort I found it more intuitive to think in terms of linked lists. Lists proved to be somewhat more flexible than arrays when managing a set of MergeFiles. This is particularly true during a merge when work files must be rotated at the end of a merge pass. A list need only adjust a few pointers. To effect the same result with an array, you need to shift all array elements down one place, requiring a bit more overhead (which is probably negligible in this case).
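With a standard library list standing in for the Borland container the code actually uses, the rotation reduces to a few pointer adjustments:

    #include <list>

    struct MergeFileStub { /* buffers, FILE handle, run counts... */ };

    // move the rewound target to the tail of the inputs; the exhausted
    // head file, truncated to zero length, becomes the new target
    void rotate(std::list<MergeFileStub *> &inputs, MergeFileStub *&target)
    {
        inputs.push_back(target);
        target = inputs.front();
        inputs.pop_front();
    }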
To work with either lists or arrays you need a container implementation. My code uses an indirect, double-linked list from the Borland Container Library. Class MergeList (Listings 6 and 7) provides an interface to the template-based container library. Locating all container-related content in one module allows you to easily modify MergeList to use practically any container variant, even an optimized, roll-your-own version.
MergeList maintains pointers to two separate lists, both of which contain MergeFile objects. Also included are two iterator objects, one for each list. One list (the merge list) holds the basic assemblage of work files (review Figures 1 and 2). This list changes only at the end of a pass, when files are rotated.
The other (active) list contains only those MergeFiles that actually participate in a merge mini-pass. Recall that if a MergeFile contains null runs, it's not included in the mini-pass. In that case, the merge routine simply decrements the null run count for that file and does not add it to the active list. In contrast, all MergeFiles without null runs are added to the active list at the beginning of each mini-pass. (Recall that a mini-pass merges one run from each input file to the target. A number of mini-passes constitute a single merge pass.)
Using an active merge list helps to optimize the merge procedure. It eliminates tedious inspections for presence or absence of data by the merge routine. When a file reaches the end of each run, it simply drops out of the active list. Without an active list, the merge would need to inspect an end-of-run flag each time around the merge list while scanning for the next element to remove. This is potentially a time-consuming procedure, especially for large source files.
Sort and Merge
Listings 8 and 9 present class PolySort, the high-level realization of the polyphase algorithm. The PolySort class constructor requires a reference to an InternalSort object (Listing 4), the number of work files desired, and the size (in bytes) of the merge buffers. The sort needs at least three work files to proceed. It may also allocate fewer than the number of work files you specified, subject to resource availability (see the MergeList constructor in Listing 7). Be careful when opening the maximum number of work files; you may not perceive at first why you can't open others. (Something like a TV repairman finding your set unplugged!)
The constructor also initializes operator new to retain its default behavior, i.e., to return a null pointer on allocation failure instead of throwing an exception. I chose this method to simplify the code, even though this may not be the best way of handling exceptions of this type.
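As a porting note (my addition, not part of the original code): under a modern standard compiler, where plain operator new throws bad_alloc, the equivalent null-returning behavior comes from the nothrow form:

    #include <cstddef>
    #include <new>

    // allocate a merge buffer, receiving a null pointer (not an
    // exception) on failure, as the article's compiler did by default
    char *alloc_buffer(std::size_t n)
    {
        return new (std::nothrow) char[n];
    }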
PolySort::Sort executes the actual sort and merge. This function requires a pointer to an open source ffstream, and a name for the final output file. Note that if a file of the same name exists before calling the sort, RenameOutput will not succeed (Listing 3). The sort will proceed normally, but the work file containing the sorted result will retain its temporary file name. You can also execute Sort on any number of input files, one at a time, as long as the size of the input data elements remains constant.
PolySort member functions _distribute_runs and _merge_runs implement the polyphase algorithm essentially as described earlier. One point to note is that _distribute_runs must watch for natural run continuations. This occurs when two consecutive runs written to an output file happen to form a single run on that file without a merge taking place. The same situation may also arise, though rarely, on the target file during a pass in _merge_runs. This will ultimately produce an input file that contains fewer than the expected number of runs. As critical as this may seem, though, the merge routine covers its tracks by following a "merge until empty" strategy when reading input files. The routine still has the ideal number of runs to work from, even though the actual run count is short.
The sort calls special logging functions (all prefixed by d_) that write to standard output so that you can review its progress. All logging functions appear on the ftp site in files sortlog.h and sortlog.cpp. You can conditionally eliminate these functions from the compiled result to realize higher sorting efficiency.
Testing
A simple example of using the package to sort a file of random long integers appears in Listing 10. A more extensive example, sorting words in an ASCII text file, appears on the ftp site.
Extensive testing (and debugging) consumed much of the time needed to develop this sorting package. I ran the sort under difficult conditions with nasty input in an effort to break the code [3]. So far, the package has been tested on input files ranging from a single word up to 3 million randomly generated "words" (roughly 24 MB of fixed-length source data).
To give you some idea of performance, I offer a few test results (Figures 3 and 4). All tests were performed (compiled for MS-DOS huge model, with sort logging disabled) on an 80486 CPU running at 66 MHz. All result times are in seconds. In all cases, the source file contained "words" one to seven bytes long made of randomly generated characters. I extended the input routine ffstream::read to convert variable-length words from the source file into constant-length eight-byte records for sorting.
Conclusion
This article has presented a flexible C++ implementation of the polyphase merge sorting algorithm. As a general-purpose number cruncher, it allows you to sort data of any type from files of arbitrary structure, without serious code bloat from templates.
Most of the deficiencies in this sorting package result from coding shortcuts to fit publication requirements. The code needs appropriate error and exception handling, a primary requirement for real-world applications. To raise sorting efficiency to acceptable levels for large source files (depending on the operating environment), it must manage memory allocation more effectively to use larger merge buffers. Modifying the sort for use in a multi-threading or multi-tasking environment may also be considered, but is beyond the scope of this discussion.
References
[1] D.E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[2] Refer to CUJ, October 1994 (p. 95) and February 1995 (p. 85) for discussions regarding the use of qsort in a C++ program.
[3] See CUJ, "Testing Sort Functions," July 1995, p. 33 for various ways to torture test a sort routine.
Tom Nelson is an independent programmer and technical writer whose current interests center on PC systems programming and OO design in C++. He welcomes comments and suggestions at 5004 W. Mt. Hope Rd., Lansing, MI 48917, or via e-mail at tnelso39@alliance.net. All source code appearing in this article is copyright © 1996 T.W. Nelson. Permission is hereby granted to use this code (including derivations) in any manner provided this copyright notice appear appropriately in source.