durham.rtf

Optimization & Efficiency

A Test Jig Tool for Pentium Optimization

Steve Durham

You have to measure performance from time to time, but that doesn't mean you have to do it all from scratch every time.

When It Has to Be Measured

Suppose you've come up with some great ideas on how to speed up a piece of code. In fact, you think the ideas are so good, so foolproof that you want to immediately replace the existing code with yours. Then some killjoy manager says, "Yes, very nice, but we need proof." It might be painful to hear, but let's face it — the manager is probably doing you a favor. You can't very well brag about performance optimization unless you can measure your progress.

I've slowly evolved a test jig (imaginatively dubbed TestJig) to handle many of the menial tasks involved in aligning data and timing code on Pentiums. TestJig runs under Windows. I have included a scaled-down version with this article. (My personal version carries along a great deal of dead code — which I keep around to scare the other code — and a few other features I never use. For example, making the sound card audibly scream and moan as the performance goes up and down is really neat but gets on people's nerves.)

Implementation

Most optimizations involve loops, and most loops operate over buffers. Therefore, the major components of TestJig, are the CMemBlock (for buffers) and CTestObject classes. CTestObject contains several CMemBlock instances, which you can use to hold your test data. CMemBlock will ensure that the data is 32-byte aligned; 32 bytes is the size of one cache line.

CTestObject becomes the base class for your test code. You instantiate CTestObject and use it to wrap the code in question (CIQ). As it stands, CTestObject is kind of a one-time-use object. You declare it, run the test, then delete it. I haven't included a method to deallocate buffers, so it's easiest to just destroy the object and create a new one.

When CTestObject::RunTest is called, it will run the CIQ once, read through the input and output buffers to make sure everything is in the cache [1], and then immediately run the CIQ again with the timer running. Alternately, if you set CTestObject::m_bColdCache to TRUE, RunTest instead goes into its patented Easter-egg mode and stuffs the cache with pure garbage, thus forcing your code to hunt down its data before it eats it.

The Timing Mechanism

All Pentiums come with a handy built-in cycle counter. The ineptly named Time Stamp Counter counts all CPU cycles since the Pentium was powered up. A cycle count is more useful than a real-time clock because you can add up the cycles in your loop and directly compare them to your measurements. They can be different when caching is involved, as is often the case with a Pentium. With cycle counts, you also don't have to worry about the speed of the CPU when timing your code. This is really helpful if you're working on a fast new system and its heartless owner comes back from lunch and railroads you off to somewhere else. Wherever you take your code (as long as it's a Pentium system), the measurements will look the same.

You can read the cycle counter through the ReaD Time Stamp Counter (RDTSC) instruction. Note that the RDTSC instruction is not recognized by the compiler, so you must _emit the instruction bytes to read the counter:
int startHi, startLo, endHi, endLo;
_asm
{
_emit 0Fh
_emit 31h
_mov startHi, edx
_mov startLo, eax
}
To measure elapsed cycles, just run this code before and after your test code. You may want to run the code once with no loop innards to measure the loop overhead. Once you time the overhead, fill in the loop details and run the timer again, subtracting out the overhead. Usually the overhead is smaller than your measurement uncertainty, so you can ignore it. (Remember, measuring code performance is not an exact science.)

In the TestObj.h header file (available on the code disk and online sources — see p. 3 for details), I've included three macros, TIMERDECL, TIMERSTART, and TIMERSTOP, that declare the start and end variables and _emit the opcodes. I just point this out so you can use these macros elsewhere if you want; CTestObject will actually call them for you.

Using the TestJig

I've compiled this code with Visual C++ 4.1. If you're running on a Pentium with Multimedia Extensions (MMX), use the /GM compiler option to compile the MMX opcodes. The source code should be self-explanatory, but here are some more detailed steps:

1. Copy an existing implementation (CT_MemCpyF, for example). Erase the existing non-member functions containing the CIQ and replace them with new ones. Edit RunC, RunPent, and RunMMX to call these functions. CTestObject will call these methods at the appropriate times while running the tests.

2. Edit RunTest to allocate memory to the correct buffer pointers. Note that CTestObject declares several pointers of different types for convenience. For example, the m_pucC1, m_psC2, and m_pfC1 variables are pointers to unsigned char, short, and float buffers, respectively. Also, several CMemBlock instances have already been declared for you.

If you pass one of the pointers to the overloaded CMemBlock::Allocate method, it will be magically altered to point to a 32-byte-aligned buffer. For example, try this:
m_MBC1.Allocate(&m_psC3, 1024);
The CMemBlock instance m_MBC1 will allocate a 32-byte aligned buffer of 1024 shorts (2048 bytes). The instance will set m_psC3 to point to that buffer. Note that a CMemBlock instance allocates a buffer only when Allocate is called. There is no Deallocate, as the buffer will be destroyed when the CMemBlock class is destroyed.

You can now pass m_psC3 into your test function(s). I've included the m_MBC1, 2, and 3 instances for the C version of your test code, m_MBP1,2,3 for the (regular) Pentium version, and m_MBM1,2,3 for the MMX version. I've included the C version so you can get the algorithm right and compare your Pentium and MMX functions to see if they are still producing the same output after you've expertly optimized them.

3. Edit RunTest to output the results. First, call CTestObject::CompareBuffer to add up the cumulative difference between two buffers. You can run this to ensure that your optimized version produced the same output as the C version. If CompareBuffer returns zero, you are probably okay, but you will want to try several different input buffers to be sure. Second, set the proper flags to indicate what output should be shown in the result window. Look at CTestObject::ShowResults and follow the examples for details. The most complicated TestJig output would look something like this:
ScreamingDeathSnippet(): C(1.32) P(0.78) : (0) M(0.32) :
    (1222)
This indicates that the C implementation took 1.32 cycles/byte, the optimized-Pentium version took 0.78 cycles/byte, and the MMX optimized version took 0.32 cycles/byte. The output of the optimized-Pentium version matched the output of the C version, but the MMX output was incorrect (but fast!).

4. Add a radio button to CTestJigDlg. (Follow the example of the existing tests.)

Conclusion

There are many ways to improve TestJig, but it works as it is, and the customer will never see it. You may want to add code to run tests on widely varying input buffers, or to test different buffer sizes. I sometimes use a list box to shovel the input and output buffers into so I can compare them visually. o

Reference

[1] This is the L1 cache. I discuss this cache in the companion article, "A Faster memcpy for the Pentium, " elsewhere in this issue.

Steve Durham, a.k.a. Bull, is a senior design engineer for Codesign, an embedded systems design and consulting firm. He has a BSEE (computer design) from the United States Military Academy. Steve is currently doing Pentium optimization work and can be reached at sdurham@cdcorp.com.