When you need lots of precision, there's no substitute for a set of math functions that does the job right.
Introduction
This is the second part of a two-part article describing how to implement the math functions of Standard C in "quad" precision. Available on several commercial computer architectures today, quad precision permits an implementation to provide type long double with 113 bits of precision. Last month's installment covered the functions asinl, acosl, atan2l, atanl, coshl, cosl, and sinl. I conclude this month with long double versions of the remaining functions declared in <math.h>.
(See Listing 1.)

The Function expl
The preferred algorithm is generally the reverse of the procedure for computing logl. The argument is broken down into the sum of a small number whose exponential is near 1, and a number whose exponential is a power of 2. Arguments that are far into the regions where overflow or underflow would occur are taken care of simply as special cases.
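In outline, the reduction might look like the sketch below; the helper name and the screening of huge arguments are mine, not the article's listing, and a production version would represent ln2 as a sum of two constants so that the subtraction stays accurate (see the logl discussion below):

/* Reduce x to n*ln2 + r with |r| <= ln2/2, so that
   expl(x) == 2^n * expl(r). Assumes the caller has already
   screened arguments far into the overflow/underflow regions. */
static long double reduce_exp_arg(long double x, int *n)
    {
    static const long double ln2 =
        0.69314718055994530941723212145817657L;
    *n = (int)(x / ln2 + (x < 0 ? -0.5L : 0.5L)); /* nearest integer */
    return x - *n * ln2;    /* exponential of this residue is near 1 */
    }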
A rational polynomial approximation is preferred, because it has a form equivalent to:
exp(x) = (q(x^2) + x * p(x^2)) / (q(x^2) - x * p(x^2))

in which (in effect) only half of the terms have distinct values. In order to control roundoff errors, this has to be rewritten as:
q(x) = x + x^3 * r(x^2)
exp(x) = 2 * (0.5 + q(x)) / ((x^2 * s(x^2) - q(x)) + 2)
Roundoff errors in the denominator are minimized by adding the constant 2 last. The constant 0.5 is not large enough to swamp the roundoff errors incurred while evaluating the numerator and denominator and their ratio, so all additions have to be done carefully. With these precautions, the effect of roundoff errors is less with a rational approximation than with the polynomial, because the series converges much faster. The final multiplication by 2 is to be combined with the exponent adjustment. This makes no difference to the result in IEEE-compliant arithmetic, although it could save a couple of bits on the IBM 370 architecture, where values between 0.5 and 1 carry the most precision.

The coefficients for a Chebyshev economized polynomial, which requires 21 terms, are easier to calculate than those for the rational approximation. double precision is sufficient for the five smallest terms, which keeps this choice from being unreasonable. The double section of the polynomial is written in pipelineable fashion. Writing out the leading terms as 1 + (x + x^2 * (...)) gains accuracy more often than it loses, in comparison to the results of 1 + x * (1 + x * (...)).
The binary exponent of the result of this curve fit is adjusted to the final value. If it approaches the point of underflow, part of the adjustment is performed on a multiplicative constant which is initialized to 1.
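A portable way to express that two-stage scaling uses ldexpl; the names here are illustrative, not the article's:

#include <float.h>
#include <math.h>

/* Apply the exponent adjustment n to frac, peeling off part of it
   as a multiplicative constant when the result would be subnormal,
   so the final multiply performs the single denormalizing rounding. */
static long double scale_exp_result(long double frac, int n)
    {
    long double fix = 1.0L;
    if (n < LDBL_MIN_EXP)
        {   /* near underflow: defer part of the scaling */
        fix = ldexpl(1.0L, -LDBL_MANT_DIG);
        n += LDBL_MANT_DIG;
        }
    return ldexpl(frac, n) * fix;
    }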
Elefunt tests of powl, sinhl, and tanhl are more sensitive indicators of accuracy of expl than the tests of expl itself.
The Function logl
First, subnormals are handled by scaling them up into the normalized range. A zero argument is unaffected but has the bias LDBL_MANT_DIG pre-added into its logarithm base 2. Separate treatment of the case log(0) (giving a result of minus infinity, according to IEEE) is optional.
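A sketch of that fixup, assuming IEEE subnormals (the helper name and the sign convention on the bias are mine):

#include <float.h>
#include <math.h>

/* Scale a subnormal argument up by 2^LDBL_MANT_DIG (an exact
   multiply) and record the bias to be added to the binary
   exponent of the result later. */
static long double normalize_log_arg(long double x, int *bias)
    {
    *bias = 0;
    if (0.0L < x && x < LDBL_MIN)
        {
        x = ldexpl(x, LDBL_MANT_DIG);
        *bias = -LDBL_MANT_DIG;     /* log2(x) = log2(x') + bias */
        }
    return x;
    }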
Next, the binary exponent is adjusted to translate the argument into the range [sqrt(0.5), sqrt(2)]. The code looks only at the exponent and the high-order 17 bits of the mantissa, so that the transformation can be done efficiently. In the old days, a more readable method was used in which the series was expanded in terms of (x - sqrt(.5)) / (x + sqrt(.5)), but this produces unacceptable roundoff errors. In the present scheme, inaccuracy of range reduction may cause the reduced range to extend slightly beyond the target range, but the accuracy of the argument is not polluted.
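With the portable frexpl standing in for the direct bit inspection, the reduction amounts to this (a sketch, not the article's code):

#include <math.h>

/* Split x into frac * 2^n with frac in [sqrt(0.5), sqrt(2)). */
static long double reduce_log_arg(long double x, int *n)
    {
    long double frac = frexpl(x, n);    /* frac in [0.5, 1) */
    if (frac < 0.70710678118654752440L)
        {   /* fold the low half up across 1 */
        frac += frac;
        --*n;
        }
    return frac;    /* logl(x) = *n * ln2 + logl(frac) */
    }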
A Chebyshev economized polynomial approximation for:

logl(x) / z, where z = (x - 1) / (x + 1)

is evaluated. Expansion in odd powers of z with this choice of reduced range corresponds with the identity log(x) = -log(1 / x). Any errors found with respect to this identity should be purely the result of the roundoff error in the inversion. The elefunt test suite provides tests that favor this type of expansion and also tests that favor a rational approximation in powers of x - 1. There does not appear to be any advantage in using a doubly rational scheme.
The series representation:

(x + x) + x^3 * (0.66666...)

improves accuracy in comparison to the expression:

x * (2 + x^2 * (0.66666...))

Separate sections of code are used to control the roundoff error incurred while adding the binary exponent to the function result in base e or 10.
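The preferred grouping looks like this in code; only the leading terms of the true series stand in for the full economized fit, and z is the reduced variable described above:

/* log(x) ~= 2z + (2/3)z^3 + (2/5)z^5 + ..., z = (x-1)/(x+1) */
static long double log_core(long double x)
    {
    long double z = (x - 1.0L) / (x + 1.0L);
    long double zz = z * z;
    return (z + z) + z * zz
        * (0.66666666666666666666666666666667L + zz * 0.4L);
    }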
Elefunt tests show that separate multiplication of the two terms of log10 saves a bit of accuracy in the worst case. log10l remains significantly less accurate than logl, so schemes such as comparing the result of log10 with a table of accurate powers of 10 remain useful in certain applications, such as formatted binary-to-decimal conversion.
An extra-precision scheme for adding in the exponent to logl does not improve the worst-case error, according to elefunt, but it eliminates most of the errors for large values of logl. This does not solve the problem of loss of information, where expl(logl(x)) may lose up to 15 bits of precision, although logl(expl(x)) is accurate.
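One common form of such a scheme splits ln2 into a part exact in its leading bits plus a small correction, and adds the pieces in increasing order of magnitude. The constants below are the double-precision split used in fdlibm, shown only to illustrate the idea; a long double version needs a correspondingly finer split:

/* Add n*ln2 to the fractional result with extra effective precision. */
static long double add_exponent(long double frac_log, int n)
    {
    static const long double ln2_hi = 6.93147180369123816490e-01L;
    static const long double ln2_lo = 1.90821492927058770002e-10L;
    return (n * ln2_lo + frac_log) + n * ln2_hi;
    }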
At one time, libraries commonly provided log functions that lost nearly all precision for arguments near 1. This problem has not been seen much lately. However, problems such as reporting logl(LDBL_EPSILON * LDBL_MIN) > 0 remain common, and the elefunt test of logl(x * x) - 2 * logl(x) often shows more errors than necessary.
The Function powl
The simple formula, which could be expressed as the macro:

#define powl(x, y) expl(logl(x) * y)

is used only when y is too large to be cast to int. In such cases, I resign myself to the degradation of accuracy, which may be more than the number of bits in the exponent field of either x or the returned value. Otherwise, x is raised to the integer power nearest y by a binary multiplication chain. This takes care of odd situations such as negative x. If y == 0 the result is 1, although that may be debatable in the cases where x is a NaN or Infinity.
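The multiplication chain itself is ordinary binary exponentiation. This sketch inverts at the end; as noted below under premature underflow, the article's code instead preinverts x for negative powers:

/* Raise x to the integer power n by repeated squaring. */
static long double int_pow(long double x, int n)
    {
    long double r = 1.0L;
    unsigned m = n < 0 ? -(unsigned)n : (unsigned)n;
    for (; m != 0; m >>= 1)
        {
        if (m & 1)
            r *= x;
        x *= x;
        }
    return n < 0 ? 1.0L / r : r;
    }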
Any fractional part of y is handled by a logarithmic formula, but (if x >= LDBL_MIN) the argument is first reduced to the range [0.5, 1] so that logl does not lose accuracy. The rest of the calculation proceeds by direct calculation of the integer part using log base 2, with remainders added into the logl intermediate result. Overflow is not possible in this part of the calculation because only fractional remainders are used.
Roundoff errors occur mostly in the multiplication chain, for large values of |y|. These errors are much smaller than the errors which would be incurred by using logl for this part of the calculation. Raising to a negative integral power is undesirable in expressions such as z * powl(x, -2), which produces (1 / x) * (1 / x) * z. This generates more roundoff error and takes more time than z / (x * x), which, unfortunately, is more vulnerable to premature overflow.
Premature abrupt underflow is a possibility when raising to a large negative non-integral power. This could occur only for results less than about LDBL_MIN / 8, with IEEE-compliant arithmetic. Preinversion avoids this, at some cost of accuracy, for integral powers. I confess to a slight remorse at doing things this way just so they pass elefunt, without completely eliminating possibilities for trouble.
The Function sinhl
Direct application of the formula:
sinhl(x) = 0.5 * (expl(x) - expl(-x))

is not satisfactory, because relative error would overwhelm the result for small arguments. A Chebyshev economized polynomial is used for small arguments, while large arguments are handled as described under the coshl function (last month).
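The shape of the function is sketched below; the crossover point is arbitrary, and only the leading Taylor terms stand in for the economized polynomial:

#include <math.h>

static long double sinhl_sketch(long double x)
    {
    if (fabsl(x) < 1.0L)
        {   /* small x: series form avoids the cancellation */
        long double xx = x * x;
        return x + x * xx * (1.0L / 6.0L + xx * (1.0L / 120.0L));
        }
    return 0.5L * (expl(x) - expl(-x));  /* safe once |x| is large */
    }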
The Function tanhl
Writing the formula (in macro form):
#define tanhl(x) (1 - 2 / (expl(x + x) + 1))

minimizes the effect of roundoff errors, except when x is small. Also, it produces correct results with large positive or negative arguments. Small x here means |tanhl(x)| < 0.5, in which case this formula involves cancellation of significant bits.
Either expl(x + x) or expl(x)^2 may be used, with the former slightly faster. With an accurate expl implementation, the difference in accuracy is not significant.
For small x (when the tanhl function has absolute value less than 0.5), a rational polynomial approximation of the form:
tanhl(x) = x + x^3 * p(x^2) / q(x^2)

is preferred. The coefficients may be scaled so that the highest-order coefficient in either p or q becomes 1, eliminating a multiplication. A Chebyshev economized polynomial will work, but it requires an excessive number of terms. While a rational approximation might be developed by splitting the odd and even terms of a Chebyshev economized fit to exp, this does not take advantage of the more rapid convergence of which the rational approximation is capable.
I am attracted to the idea that the same basic approximations could be used in tanhl as in expl, but elefunt tests show that the anti-cancellation formula must be used all the way up to |tanhl(x)| == 0.5.
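Putting the two tanhl branches together, with a deliberately low-order Pade fit standing in for the real coefficients (note that q's leading coefficient is already 1, as suggested above):

#include <math.h>

/* tanh(x) ~= x + x^3 * p(x^2) / q(x^2), here with the
   illustrative fit p(u) = -(35 + u), q(u) = 105 + 45u + u^2. */
static long double tanhl_sketch(long double x)
    {
    if (fabsl(x) < 0.54930614433405484570L)     /* atanh(0.5) */
        {
        long double u = x * x;
        return x + x * u * -(35.0L + u)
            / (105.0L + u * (45.0L + u));
        }
    return 1.0L - 2.0L / (expl(x + x) + 1.0L);
    }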
The Function tanl
This is a rational approximation, even if it is calculated as sinl(x) / cosl(x). A faster way is to fit a direct rational approximation to tan, allowing both the numerator and denominator to deviate from sin-like behavior. This would allow the series to converge almost twice as fast. However, even the use of the ratio of the Chebyshev economized approximations for sinl and cosl is more satisfactory than a direct Chebyshev economization of a series for tan.
If the rational approximation covers |x| <= pi/4, the division can be delayed to expedite reflection into adjacent intervals according to:
tan(x + pi/2) = -1 / tan(x)

In order to reduce rounding errors, the linear term is added last. There should never be any numerical infinity, as the range reduction is done with extra precision, so that the poles occur between two representable values of x.
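Delaying the division makes the reflection nearly free; sketched here with the numerator and denominator of the rational fit abstracted (the names and flag are mine):

/* If the reduced argument fell in the adjacent octant, the
   identity above just swaps and negates the ratio. */
static long double tan_from_parts(long double num, long double den,
    int adjacent_octant)
    {
    return adjacent_octant ? -den / num : num / den;
    }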
High-order terms that are small enough not to require long double precision are grouped to encourage compilers to take advantage of instruction-level parallelism.
sinl and cosl are calculated along the way, as components of a structure that includes the tanl return value. If two or more of these results are needed, they may be evaluated efficiently by taking advantage of this, as long as the series remain as approximations to these individual functions.
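A hypothetical shape for that interface (the structure and names are mine, not the article's):

#include <math.h>

struct trig_result {
    long double s, c, t;    /* sinl, cosl, tanl of the argument */
};

/* Evaluate all three at once; the library calls below stand in
   for the shared series evaluation described in the text. */
static struct trig_result eval_trig(long double x)
    {
    struct trig_result r;
    r.s = sinl(x);
    r.c = cosl(x);
    r.t = r.s / r.c;
    return r;
    }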
Testing
I ran the elefunt test suite (described last month), using Plauger's C translation, with constants extended to 35 decimal places. Reported errors were less than two bits, except for log10l and extreme cases of powl, where errors of almost three bits were encountered. In HP's own FTN_Q.. implementations of powl and sinhl, elefunt reported errors of more than 12 bits.
As the comparison values produced by elefunt are rounded themselves, it is quite possible to produce a reported error of one bit when, in reality, there is no error of more than half a bit. Elefunt seldom reports errors of less than 0.6 bit, unless it reports zero error, so errors may be exaggerated slightly.
Root-mean-square errors reported by elefunt were more than five bits (a factor of 32) below a relative error of LDBL_EPSILON. This indicates that errors of more than half a bit are infrequent, with a majority of tests showing no error. No error means that, as far as elefunt is able to discover, a correctly rounded result is produced.
According to the elefunt tests, for most of the functions it was possible to find a number of terms which minimize root-mean-square error. Additional terms produced a slight increase in root-mean-square error and, sometimes, a decrease in maximum error. Apparently, an analytically closer curve fit may be offset by an increase in numerical roundoff errors.
Quad-precision math functions are extremely slow when compiled to use software floating-point arithmetic. The lack of a good optimized rational approximation for expl, tanl, and tanhl, which requires considerable resources to obtain, also exacts a performance penalty. According to my tests, the sinl and cosl presented here are both faster and more accurate than those included in HP's Fortran Library. HP's speed advantage on the other functions is gained at some cost in accuracy.
In summary, I believe the code presented here demonstrates that satisfactory results for the 113-bit precision math library can be obtained with C code that is portable, except for the lower-level manipulations required for exponent-field adjustment in the expl, logl, and powl functions. Even here, the interface to these adjustment functions can be kept portable, as shown in [4].
References
[1] William J. Cody, Jr. and William Waite. Software Manual for the Elementary Functions (Prentice-Hall, 1980).
[2] D. Stevenson, et al. "Proposed Standard for Binary Floating-Point Arithmetic," IEEE Computer, 1981.
[3] S.P. Harbison and G.L. Steele, Jr. C, A Reference Manual, 3rd Edition (Prentice-Hall, 1991).
[4] P.J. Plauger, The Standard C Library (Prentice-Hall, 1992).
[5] Press, Flannery, et al. Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, 1986).
[6] J. Thomas and J.T. Coonen, "An Introduction to Floating-Point C Extensions," C/C++ Users Journal, January 1995, pp. 49-57.
Tim Prince has worked in turbomachinery aerodynamics for over 30 years. For the last five years, computational fluid dynamics has been a large part of his job. He still does some FORTRAN program maintenance and general UNIX work. In his spare time he tries to explain to his wife and son that they probably know more about the software on their PC than he does.