When your code attempts to fetch data from a memory address, the Pentium processor first checks whether the data at that address resides in the cache. If the data is in the cache, the processor gets it from there, not main memory, and your code is none the wiser. If the data is not in the cache, the cache first grabs a "line" of 32 data bytes, including the data requested, from main memory. The cache can read 32 bytes using burst mode, which is about twice as fast as normal consecutive reads. The processor then reads the requested data from the cache. (Note: a line always starts on a 32-byte boundary. If you request data from anywhere in the line, the entire line gets loaded into the cache.)
When a cache contains none of the data you need to access, it is said to be "cold." Pre-warming the cache consists of first loading it with the data to be copied, in this case, the contents of the source buffer. It may not be immediately clear how pre-loading the source buffer into the cache could save time. Remember, however, that when you read one byte from memory, the cache automatically loads 32 bytes at a time, and it does so very quickly. So you can pre-warm the cache by reading every 32nd byte of the source buffer. The pre-warming code pays a time penalty for only one byte out of 32. Then, when it's time to copy the data from source to destination, it will be in the cache. The faster access from the cache more than makes up for the time penalty of pre-warming.
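The idea of reading every 32nd byte can be sketched in C. This is only an illustration under stated assumptions, not the article's own pre-warming code: the function name `prewarm` and the `CACHE_LINE` constant are hypothetical, and the `volatile` qualifier is there only to stop a compiler from discarding reads whose values are never used. The function returns the number of lines it touched so the behavior can be checked.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* line size given in the text; an assumption elsewhere */

/* Read one byte from every 32-byte line that overlaps [src, src + len).
   Returns the number of lines touched. Hypothetical helper for illustration. */
size_t prewarm(const void *src, size_t len)
{
    const volatile unsigned char *p = (const volatile unsigned char *)src;
    size_t lines = 0;

    /* Distance from p to the next 32-byte boundary (32 if p is aligned). */
    size_t step = CACHE_LINE - (size_t)((uintptr_t)p & (CACHE_LINE - 1));

    while (len > 0) {
        (void)*p;          /* one byte read makes the cache load the whole line */
        lines++;
        if (len <= step)
            break;         /* the rest of the buffer lies in this line */
        p += step;         /* move to the start of the next line */
        len -= step;
        step = CACHE_LINE;
    }
    return lines;
}
```

For a 128-byte buffer that starts on a 32-byte boundary, this issues four byte reads, one per line, instead of 128.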
The goal in pre-warming is to force the cache to load as many cache lines as possible in the shortest time possible. There are two main optimizations for doing this. The first is to do one read every 32 bytes across the source buffer. Note that this is of maximal benefit only if the data requested lies on a 32-byte boundary. If you try reading a word that straddles a 32-byte boundary, the cache will have to read in two cache lines (64 bytes), and the processor will have to wait for 33 bytes to arrive before it can continue.
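The boundary arithmetic behind this can be made concrete with a small C sketch. The helper names `lines_touched` and `align_up` are hypothetical and exist only for illustration; the 32-byte line size is taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* line size given in the text */

/* Number of 32-byte cache lines covered by an access of `size` bytes
   starting at address `addr`. Hypothetical helper for illustration. */
size_t lines_touched(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + size - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}

/* Round an address up to the next 32-byte boundary (unchanged if aligned). */
uintptr_t align_up(uintptr_t addr)
{
    return (addr + (CACHE_LINE - 1)) & ~(uintptr_t)(CACHE_LINE - 1);
}
```

A two-byte word at offset 31 of a line gives `lines_touched(0x101F, 2) == 2`, which is the straddling case described above, while the same word at offset 30 stays within one line.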
The second pre-warming optimization is to read only one byte at a time. The cache makes data available to the processor as soon as it arrives from main memory. Thus, when the code requests one byte, the cache does not make the processor wait until all 32 bytes of the cache line have arrived. But if the pre-warming code asks for a dword, the cache will make the processor wait for four bytes to arrive before it hands them over. If the code asks for only one byte, it has to wait for only one byte to arrive before it can continue working. The rest of the bytes will load into the cache on their own. (This may not be entirely correct, as bytes are read into the processor and into the cache a dword at a time. However, I have timed both cases, and the code runs faster if I ask for only one byte instead of four.)
In addition to these two main optimizations, there are two little ones. First, if the code reads into two different registers, it can get two read instructions to pair. This is possible because the cache is dual-ported, allowing two different locations to be read simultaneously. Second, unrolling the pre-warming loop a bit will reduce the per-byte overhead. Taking all four optimizations into account, +. 3 shows what seems to be the fastest pre-warming code. [2]
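The shape of an unrolled, two-destination loop can be sketched in C. This is a rough structural illustration, not the article's code: the name `prewarm_unrolled` is hypothetical, the function assumes `src` starts on a 32-byte boundary, and the variables `a` and `b` merely stand in for two registers. A C compiler chooses its own register allocation and instruction scheduling, so pairing on a Pentium is only guaranteed by hand-written assembly.

```c
#include <stddef.h>

#define CACHE_LINE 32u  /* line size given in the text */

/* Touch one byte per 32-byte line, four lines per loop iteration.
   Assumes src is 32-byte aligned (an assumption of this sketch).
   Returns the number of byte reads issued. */
size_t prewarm_unrolled(const unsigned char *src, size_t len)
{
    const volatile unsigned char *p = src;
    unsigned char a = 0, b = 0;   /* two independent read destinations */
    size_t reads = 0;

    /* Unrolled body: 4 lines (128 bytes) per pass, alternating a and b. */
    while (len >= 4 * CACHE_LINE) {
        a = p[0 * CACHE_LINE];
        b = p[1 * CACHE_LINE];
        a = p[2 * CACHE_LINE];
        b = p[3 * CACHE_LINE];
        p += 4 * CACHE_LINE;
        len -= 4 * CACHE_LINE;
        reads += 4;
    }

    /* Remaining lines, one read each. */
    while (len > 0) {
        a = *p;
        reads++;
        if (len <= CACHE_LINE)
            break;
        p += CACHE_LINE;
        len -= CACHE_LINE;
    }

    (void)a;   /* values are never used; only the reads matter */
    (void)b;
    return reads;
}
```

Unrolling by four means the loop counter update and branch are paid once per 128 bytes rather than once per 32.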
You may have noticed that I do not warm the destination buffer. The cache loads a line only when you read from memory. You can write all day to a memory location and it will never get loaded into the cache. It would seem that without first reading from the destination buffer, you would take a penalty every time you write to memory. However, the write buffers alleviate this problem somewhat by allowing the processor to continue working after a write. The write buffers flush themselves out to main memory on their own. TestJig has shown that it is faster to pre-warm the source buffer instead of the destination buffer.