Extended Precision - X86 Extended Precision Format - Need For The 80-bit Format

Need For The 80-bit Format

A notable example of the need for a minimum of 64 bits of precision in the significand of the extended precision format is the need to avoid precision loss when performing exponentiation on double precision values. The x86 floating point units do not provide an instruction that directly performs exponentiation. Instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:

In order to avoid precision loss, the intermediate results "log2 x" and "y log2 x" must be computed with much higher precision because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently the significand field of the intermediate result is split between the exponent and significand fields of the final result when 2intermediate result is calculated. The following discussion describes this requirement in more detail.

An IEEE 754 double precision value can be represented as:

where s is the sign of the exponent (either 0 or 1), E is the unbiased exponent which is an integer that ranges from 0 to 1023, and M is the significand which is a 53-bit value that falls in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision because it is constrained to be greater than or equal to one i.e. the hidden bit does not count towards the precision (Note that in situations where M is less than 1, the value is actually a denormal and therefore may have already suffered precision loss. This situation is beyond the scope of this article).

Taking the log of this representation of a double precision number and simplifying results in the following:

begin{align} log_2(2^{(-1)^s,times,E},times,M) & = (-1)^s,times,E,times,log_2 2,+,log_2 M \
& = pm,E,+,log_2 M\ end{align}

This result demonstrates that when taking base-2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.

Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of log2 M will fall in the range 0 ≤ log2 M < 1 so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice values of M less than require 53 bits to the right of the radix point and values of M less than require 54 bits to the right of the radix point to avoid precision loss. Balancing this requirement for added precision to the right of the radix point, exponents less than 512 only require 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.

The final part of the exponentiation calculation is computing 2intermediate result. The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative then a slight adjustment is needed to get a positive fractional part because both "I" and "F" are negative numbers.

For positive intermediate results:

For negative intermediate results:

Thus the integer part of the intermediate result ("I" or "I-1") plus a bias becomes the exponent of the final result and transformed positive fractional part of the intermediate result: 2F or 21+F becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.

In summary, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double precision numbers.

The number of bits needed for the exponent of the extended precision format follows from the requirement that the product of two double precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double precision value is 1023 so the exponent of the largest possible product of two double precision numbers is 2047 (an 11-bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.

Combining these requirements: 1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand means that the extended precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80-bit format (in particular the IEEE 754 standard requires the exponent range of an extended precision format to match that of the next largest, quad, precision format which is 15 bits).

Read more about this topic:  Extended Precision, X86 Extended Precision Format