Fixed-Point Arithmetic for Embedded Systems

Embedded Systems

Fixed-Point Arithmetic for Embedded Systems

Jean J. Labrosse

Fixed-point arithmatic is fast you just have to worry more about overflow and significance loss.

Most low-end microprocessors (typical of embedded processors) do not provide hardware-assisted floating-point math. This is probably because floating-point math typically doubles the cost of a microprocessor. ANSI C compilers allow you to use floating-point math, but it comes at a cost: floating-point libraries require extra ROM and RAM; but most importantly, they require more processing time than integer math. For example, floating-point addition could take hundreds of microseconds on a low-end eight-bit microprocessor, whereas it typically takes only a few microseconds to perform a 16-bit integer addition on the same processor. Multiplications and especially divisions are even worse.
Embedded systems programmers are often confronted with the task of writing the fastest and smallest possible code for real-time operations. This article will show how to perform basic arithmetic operations on fractional numbers using only integers. In other words, this article will answer the questions: "Without using floating-point arithmetic, how would you add 12.34 and 987.654, multiply 3.1416 by 5.4, or divide 0.00456 by 98.7?"
Throughout this article, I use 16-bit integers, but most of the concepts presented here apply to any integer size.

Understanding Fixed-Point Math

Fixed point is an alternative form for expressing numerical values. Fixed-point math is integer math, but because it allows fractions, it is much more versatile and often can substitute for slower and more cumbersome floating-point operations. The idea of fixed-point math is to trick the computer into thinking you are talking about an integer when in fact you, the programmer, know that you are dealing with a number that has a fractional component.
Figure 1 shows a 16-bit integer as it is most commonly represented, with the least-significant bit (LSB) on the right. In integer arithmetic, each bit position represents a power of 2, starting with 2⁰ at the LSB and representing progressively higher powers of 2 as you move from right to left. The bit string 0000000000010000, therefore, represents the number 16.
In integer arithmetic there is an implied decimal point (called a radix point) to the right of the rightmost bit position. Fixed-point math basically involves moving the radix point somewhere else. For example, Figure 2 shows the same 16-bit string. In this case, the programmer decides to place the radix point between the 5th and 6th bit positions, which makes the rightmost bit 2^-5. The string 0000000000010000 no longer represents 16, but 0.5. Another way to look at this is to say that the integer 16 has been scaled by 2^-5 (multiplied by 2^-5, or 0.03125):

16 x 2^-5 = 0.5

The computer still thinks it is working with the integer 16, but the programmer independently maintains a record of how the 16 should be scaled.
By manipulating the position of the radix point, a programmer can scale integers into fractional values. The location of the radix point defines a convention for how the program will interpret a string of 16 bits. As the radix point moves to the left (increasing the portion of the string used to represent the number's fractional part) the fraction is represented more precisely and the overall range of the number diminishes (because there are fewer whole-number places left to represent it).
The unsigned integer of Figure 2 can represent a number having a range of 0.0 to 2047.96875, while the signed integer can represent numbers between -1024.0 to 1023.96875. (The latter figures assume that the computer stores signed integers in two's complement format. The leftmost bit is the sign bit and the second-leftmost bit is the most significant bit of the integer. The value -1 has all bits set.) Both the signed and unsigned numbers have a resolution of 1/32 or 0.03125.
You can use fixed-point to represent distances, surfaces, volumes, temperatures, pressures, etc. Depending on the application, you can fix the position of the radix point elsewhere to suit the range of numbers you have to deal with.
Figure 3 shows how you can represent temperatures from -459.67 F (which equals 0 Kelvin, or absolute zero) to +2048 F, by using an 11-bit integer and a 4-bit fraction. In this format, the integer is said to be scaled by 2^-4; so, for example, 11528 represents a temperature of 11528 x 2^-4, or 720.5 F. In this format temperatures can be represented with a 1/16 F resolution. The temperature scale is an ideal use for fixed-point math because the range is well defined. The programmer can easily set the location of the radix point in advance.

Notation for Fixed-Point Numbers

When a program performs arithmetic operations (add, subtract, multiply, or divide) on fixed-point numbers, it actually manipulates integers. Microprocessors do not provide mechanisms to represent fixed-point numbers, so the programmer must personally keep track of the position of the radix points. To represent fixed-point numbers, I use the following notation:
Fixed-point number = <mantissa>S<exponent>
where S means that to determine the value of the fixed-point number, you must scale the mantissa by 2^exponent. The quantity <exponent> (which represents 2^exponent) is sometimes called the scale factor. The mantissa is always an integer number. I use this notation to differentiate the fixed-point notation from the floating-point notation <mantissa>E<exponent>. Following are some example uses of fixed-point notation:
5S-3	represents 5 x 2^-3
		= 5/8 = 0.6250
31S-8	represents 31 x 2^-8
		= 31/256 = 0.1211
-123S-16	represents -123 x 2^-16
		= (-123)/65536 = -0.001877
In this scheme, the fixed-point number is actually represented using an integer, whereas the exponent is maintained mentally by the programmer.

Calculating Scale Factor and Mantissa

You can represent a wide range of numbers in fixed-point by applying the appropriate scaling. The trick is in knowing where to put the radix point. Therefore, in this section I provide formulas for calculating the exponent (scale factor) and mantissa for 16-bit fixed-point implementations.

Positive Numbers (0.0 < x <= 65535.0)

Click here for Picture
where INT() indicates truncation of the quantity to an integer. log() is the logarithm of the number in parentheses — you can use a logarithm of any base (such as log₂(), log₁₀(), log_n()) as long as you use the same base in both the numerator and denominator. When x is 0.0, both the factor and mantissa are 0.

To represent the number 1.2345 using fixed-point notation, substitute 1.2345 in the above formulas as follows:
Click here for Picture
Thus, the number 1.2345 is written as 40452S-15.

Positive Numbers (x > 65535.0)

Click here for Picture

To represent the number 107573, substitute it into the above formulas:
Click here for Picture

The fixed-point representation is 53786S1. Note that in this case, we lost some resolution because we would actually need 17 bits to represent 107573.

Signed Numbers (-32767.0 <= x <= +32767, except 0.0)
Click here for Picture

|x| represents the absolute value of the number to be scaled. When x is 0.0, both the factor and mantissa are 0.

Signed Numbers (x < -32767.0 or x > +32767.0)
Click here for Picture
Performing Fixed-Point Calculations

Addition and Subtraction

To add or subtract two fixed-point numbers, the exponent of both numbers must be the same. For example, you cannot add the signed fixed-point number 20480S-15 (0.6250) with 31745S-18 (0.1211) because they do not represent the same order of magnitude. To add these numbers, you must first convert the number with the smaller exponent (in this case 2^-18) to the same order of magnitude as the other number. To do this, add 3 to the small number's exponent (which is the same as multiplying the number by 2³, or 8) and divide its mantissa by 8. The new representation is 3968S-15. The result of the addition is thus 24448S-15, or 0.746094.
While this procedure is pretty simple, things get a little trickier when you add two numbers and the result exceeds unity. For example:
0.99 + 0.99 = 1.98 or
32440S-15 + 32440S-15 = 64880S-15
What actually happens here is that the addition overflows because the maximum value for a signed 16-bit fixed number is 32767. In this case, you can avoid the overflow by scaling both numbers to S-14 instead of S-15, as shown below:
0.99 + 0.99 = 1.98
16220S-14 + 16220S-14 = 16220S-14
When you add or subtract two fixed-point numbers, you must be careful to ensure that the mantissa doesn't overflow.

Multiplication

To multiply fixed-point numbers, simply multiply their mantissas and add their exponents. The following shows how to multiply two signed 16-bit fixed-point numbers:
0.6250 x 0.1211 = 0.075688 or
20480S-15 x 31745S-18 = 650137600S-33
One thing to note here is that when you multiply two signed 16-bit numbers, the result is a 30-bit number. The above operation won't work unless your C compiler supports signed longs (32-bit numbers). To convert this 30-bit number back into a signed 16-bit format, divide it by 32768-15. Since 32768-15 is equivalent 1.0, dividing by it doesn't change the value represented, but divides the mantissa by 32768 while adding 15 to the exponent. (Note that dividing the mantissa by 32768 is easily accomplished by shifting it to the right 15 places.) In this case the result is 19840S-18 (or 0.75684).
For unsigned fixed-point numbers, the multiplication yields a 32-bit result. For example, 0.6250 x 0.1211 looks like this:
40960S-16 x 63491S-19 = 2600591360S-35
A division by 65536S-16 (also equivalent to 1.0) will make the above number fit back into an unsigned integer: 39681S-19 (or 0.075686). The result is more accurate than the signed version because more bits were used in the unsigned multiplication.

Division

Divisions are always trickier (and slower) than multiplications. For example, instead of dividing a number by 10, you should consider multiplying the number by 0.1 (or 26214S-18 signed). If you have to perform a division, however, simply divide the mantissas and subtract the exponents:
0.2345/(-10.987) = -0.021343 or
30763S-17/(-22501S-11) = -1S-6 (or -0.015625)
Note that the above result is totally incorrect. This is because the division produced a quotient of -1 and a remainder of 8235. C compilers don't know what to do with remainders. To avoid this problem, scale the dividend first by 32768S-15:
(30763S-17 x 32768S-15)/(-22501S-11) =
    44760S-21 (or -0.021343)
Now the mantissa of the result doesn't fit in a 16-bit signed number, so you'll have to adjust it as follows:
-44760S-21/2S-1 = 22380S-20 (or -0.021343)
2S-1 is another representation of 1.0.
The overflow problem will occur whenever the mantissa of the numerator is greater than the mantissa of the denominator. Your code will have to check for this situation.

Comparisons

Comparing two fixed-point numbers presents a problem similar to the problem of adding and subtracting: the exponent of both numbers must be the same. For example, comparing 20480S-15 with 31745S-18 requires that you adjust the number with the smallest exponent to match the scale of the larger. 31745S-18 would thus become 3968S-15. Once both numbers represent the same order of magnitude, comparing the two numbers is simply a matter of comparing the mantissas.

Examples of Using Fixed-Point Arithmetic

Circumference of a Circle

Suppose you need to compute the circumference of a circle that can vary in diameter from 1.22 to 20.8 inches. The circumference of the circle is given by:
Circumference = p x Diameter
Because diameters are positive quantities, you can use unsigned fixed-point numbers. [pi] can be represented as 517472S-14 (which comes out to 3.141602). Plugging 20.8 into Formula 1a gives a scaling factor of 11, so numbers for diameter will be represented as <mantissa>S-11. In other words, 11 bits of a 16-bit unsigned integer will be used to represent the fractional part of 20.8, and 5 bits will be used to represent the 20.
The circumference of the circle is computed in C as follows:
UWORD Circumference(UWORD diameter)
{
  UWORD x;
  x = (UWORD)((51472L * (ULONG)diameter) >> 16);
  return (x);
}
UWORD is a 16-bit unsigned integer; ULONG is a 32-bit unsigned long. [pi], which appears in the snippet as 51472L, and the diameter are represented by 16-bit unsigned integers. (Technically, they are both stored in longs prior to multiplication, but they still contain less than 16 significant bits.) Multiplying two unsigned integers yields a 32-bit result, so you must adjust the resultant mantissa by dividing by 65536 (that is, shifting right 16 places).
The exponent of the result is determined as follows. [pi] has the exponent of S-14 and the diameter has an exponent of S-11. However, the right shift is the same as dividing by 65536S-16, so the exponent of the result is (S-14 + S-11 - S-16) = S-9.
To get an idea how accurate the calculations are, I compute the error for the minimum and maximum circumferences. The minimum circumference representation is found by applying Formula 1b:
mantissa = INT(2^-factor x x + 0.5) = INT(2¹¹ x 1.22 + 0.5) = 2499
Since the exponent is S-11, the minimum circumference is 2499S-11. The multiplication yields 128628528S-25. After the shift, the result is 1962S-9 (3.83203), which is within about 0.02 percent of the correct result of 3.832743. Substituting the maximum circumference into formula 1b gives 42598S-11 as the fixed-point representation of 20.8. The multiplication yields 2192604256S-25. After the shift, the result is 33456S-9 (65.343750) which is also within about 0.02 percent of the correct result 65.345127.

Temperature Conversion

You can use fixed-point arithmetic to convert temperatures from Celsius (C) to Fahrenheit (F). The equation for converting F to C is:
Click here for Picture
To implement the above conversion equation using fixed-point arithmetic, you need to know the range of temperatures you'll be dealing with. Suppose you are interested in temperatures from -40 F to 250 F. The range chosen forces you to use signed integer arithmetic. Also, you need eight bits to represent temperatures up to 250 F, so seven bits will be used to represent fractional degrees. The bias of 32 F is represented as 4096S-7, while the conversion multiplier 5/9 can be represented as 18204S-15. The code to perform the conversion is:

WORD FtoC(WORD f) { /* result is S-7 */ return (((LONG)(f - 4096) * 18204) >> 15); }

The temperature in C is scaled S-7. Performing the conversion from C to F is just as simple. The equation is:
Click here for Picture
Again, the 32F constant is 4096S-7, while the multiplier 9/5 is 29491S-14. The conversion code is:

WORD CtoF(WORD c) { WORD x; x = ((LONG)c * 29491) >> 14; return (x + 4096); /* result is S-7 */ }

Note that to obtain the S-7 result, I had to divide the result of the multiplication by 16384 instead of 32768.

Conclusion

To use fixed-point arithmetic, you need to know the range of values that the variables can take. Fixed-point operations will generally execute quickly because most microprocessors are good at performing integer operations. This performance comes at the expense of accuracy and complexity. To improve accuracy you must use more bits. Using fixed-point arithmetic produces large errors when using numbers with small magnitudes (that is, numbers with small scale factors) and decent results using large numbers. For large numbers, the improvement in accuracy is the result of using more bits. Fixed-point works very well when the dynamic range of the numbers is small. o
This article was excerpted from the book Embedded Systems Building Blocks, Complete and Ready-to-Use Modules in C, by Jean J. Labrosse (R&D Technical Books, 1995). ISBN 0-87930-440-5.

Jean Labrosse is a Senior Consulting Services Engineer at Microtec in San Jose, CA. He has a Master's degree in Electrical Engineering and has been designing embedded real-time systems for many years. Labrosse is the author of two books published by R&D Technical Books: uC/OS, The Real-Time Kernel (ISBN 0-87930-444-8) and Embedded Systems Building Blocks, Complete and Ready-to-Use Modules in C (ISBN 0-87930-440-5).