U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000
Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations
![]() |
This report is an archived publication and may contain dated technical, contact, and link information |
|
Publication Number: FHWA-RD-04-080
Date: September 2004 |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Software Reliability: A Federal Highway Administration Preliminary HandbookPDF Version (697 KB)
PDF files can be viewed with the Acrobat® Reader® Chapter 7: Numerical ReliabilityThis chapter presents a method for computing the maximum and expected error in a numerical computation. Framework ReferenceThe framework used here for analyzing errors in a floating point number system appears in Nicholas J. Higham's Accuracy and Stability of Numerical Algorithms.(9) Errors in a Single Arithmetic OperationFloating Point Number Systems Floating point numbers, broadly defined to include the various sizes (such as double precision) are represented in the computer as ±0d1d2...dt Be where B is the base of the number system. This is usually 2 for the computer's internal arithmetic and 10 for numbers printed or displayed for human beings. d1,...,dt are integers in the range 0 d1 The precision of the number system is t. Beis B to the eth power. eth is an integer in some finite interval around 0. Usually the range of e is defined as follows (for some positive integer m): For typical double precision numbers written to the base 10: The precision t = 15 The range of exponent for 10 is ±308. The Range of a Floating Point Number System Note that in contrast to numbers in pure mathematics, there are numbers that cannot be well approximated in a particular floating point number system. For a typical double precision number system, ±10±400 are outside the range of the number systems. Approximation of Real Numbers by Floating Point NumbersIn studying numerical approximations, an important quantity is the relative error of the floating point approximation to the pure mathematical number. Let fl(x) stand for the floating point approximation of x. Then the relative error is defined as For a number x inside the number system, the relative approximation error is a small number, while for numbers outside the number system this number may be very large. In fact, Higham shows that for numbers in the range of a floating point number system,(9) Where Where B is the base of number system, and t is the precision, i.e., the number of digits in the mantissa. Intuitively, the correctness of Higham's formula for u can be demonstrated in the following way: Let d0.d1...dt-1Ep be a number in scientific notation, where d0 is the digit in front of the decimal, d1 through dt-1 are the digits to the right of the decimal, and p is a power of 10. This number has t digits in the mantissa, just like Higham's floating point numbers. The error in the mantissa of the number in scientific notation is up to 5*10-t, which is For the typical double precision arithmetic in decimal notation (B == 10), t = 15, so: Errors in Floating Point OperationsFor theoretical analysis of errors, it is assumed that an arithmetic operation is implemented as well as it can be in floating point arithmetic, i.e., if op is one of the four arithmetic operations +, -, *, /, Where x and y are pure mathematical numbers. Errors as Random Variables The above formula for relative error is equivalent to a form more convenient in analyzing errors. To further simplify the representation of error, it is convenient to introduce a random variable, e, in the interval ±u. For a particular real number or operation, the precise value of e is not usually known, so using a random variable is reasonable. However, Higham presents examples in which errors are not random. For the most part, it is assumed that e is uniformly distributed over the interval ±u, but this general-purpose assumption may not hold for particular computations. Using the random variable, the last inequality can be written as Higham also proves another convenient error bound, For a uniform random variable, over an interval of length d, the variance is Therefore for e, the variance is Pushing Errors through an Arithmetic OperationBased on the above framework provided by Higham, it is possible to calculate, step by step, along with its value, the expected and maximum error of a computed number. This section shows that computer numbers, as defined above, are closed under the arithmetic operations. In addition, formulas are provided to compute the variances and maximum errors under arithmetic operations. A Number Object for Error AnalysisMost numbers in the computer contain errors, either through measurement, roundoff, or both. (Exceptions are integers and some rational numbers from pure mathematics.) To characterize a number for error analysis, assume that the following are known:
If x is a number, v(x) and me(x) denote the variance and maximum error of x. Also, fl(x) is the computed value of a number. While x denotes its theoretical value, t(x) also will be used to denote the theoretical value, where an explicit notation is needed. The computed and theoretical values differ by a random error, e(x). Random Expressions for Computational ErrorsAddition
This says that the computed value of the sum differs from the true value by a random expression Note that the coefficients of the random variables are known, because fl(a) and fl(b) are the computer values corresponding to a and b. Because Where fabs is the absolute value function, v(a) and v(b) are known, provided that the entire computation is carried out with the number objects defined above, containing a variance and maximum error in addition to a point value. v(e1) is also known, because e1 is a random variable on an interval of known size, namely 2*u: (Note that when adding near opposites, the size of u can change, as discussed for subtraction below.) Therefore, the variance of a sum can be computed from the variance of its arguments up to an error by throwing away small higher powers of small random variables, and any computational error in computing successive variances. The maximum error of the sum is similarly computable from quantities known at the time the sum is computed: Therefore, when an addition is performed with computer numbers, it is possible to calculate the computer number sum from known quantities. Subtraction Subtraction is similar to addition. It is tempting to substitute -b in the addition formula in place of b, but to derive it this way requires that fl(-b) = fl(b), which is not necessarily the case. Following is a complete derivation of the formula for subtraction:
Because Then And because Then Loss of Precision in Subtracting Nearly Equal NumbersSubtracting nearly equal quantities, or adding near opposites, causes the range of the random variable e(a) to expand. This is due to a loss of precision in a-b. This lack of precision is, in turn, due to the fact that only a small number of digits in a and b are different. Only these different digits appear in a-b. The range of the random error in doing an operation depends on the number of digits in the mantissa of the answer. When this is small, the range of the random variable increases. Let And likewise for b, for binary digits di, exponent e, and number system base B. If e and d1...dk are equal for both a and b, there are only t-k digits that differ in the two numbers. Therefore, there are only t-k digits in the mantissa of fl(a-b) that result from the subtraction of the mantissas of a and b. The range of e1 in this case is Instead of The number of agreeing digits k in the above formula can be found by bit fiddling on the computer representations of fl(a) and fl(b). Assuming as above that the mantissa bits are numbered from 1 to t, with bit 1 the most significant, let b(i,x) be the ith bit of the mantissa of x. Then We can also find the number of agreeing digits with good accuracy by computing the minimum, for a and b of integer part log(base B)(fabs(c)/fabs(a-b)), where c in {a, b} and B is the base of the computer number system. This computation of the range of the random variable e1 should be done for all subtraction operations where the arguments are nearly equal and all addition operations where they are nearly equal in magnitude but differ in sign. The Effect of Errors in Computing Integer Part log(base B)(fabs(c)/fabs(a-b)) Note that there is some error in computing log(base B)(fabs(c)/fabs(a-b)). However, this error only matters when it causes a value to cross an integer boundary erroneously. Because the error is a multiple of the precision of computer arithmetic, it is small number, so only logs in a narrow interval around integers are suceptiblesusceptible to this error. Therefore, an error in computing integer part log(base B)(fabs(c)/fabs(a-b)) is possible but rare. Because the total computational error is generally a sum of a large number of random variables, and because higher order terms already are thrown away, the usually small estimation error caused by errors in integer part log(base B)(fabs(c)/fabs(a-b)) will be ignored. Multiplication
Then and This describes the random expression for the error in multiplication in terms of coefficients and random variables, where the coefficients, means, and variances of the random variables are known when the multiplication is performed. Division
Then and Computing Errors for an Entire ComputationAs seen in the previous section, for each of the four arithmetic operations, the random error in a computed number is approximated by a linear sum of random variables. In this linear sum, the coefficients and the variances of the random variables are known at the time the arithmetic operation is performed. The linear sum approximates the true random expression up to random variable terms containing squares or higher powers of random variables. Because all the random variables have ranges of 1E-7 for floating point, or 1E-15 for double precision, these higher power terms are much smaller than the linear terms, and so can be safely discarded. Given this, the maximum error and the standard error for a computed number can be computed as follows: First, order the inputs, intermediate results, and final results so that the inputs to each arithmetic operation appear in the list before the result of the operation. Now the point estimate, variance, and maximum error of each number in the list will be computed in turn. For measurement inputs to a computation, begin with the measurement (as a point estimate), the measurement's variance, and the maximum error of the measurement. For a computed number in the list, let the number be computed by a op b Where the random expression for op is c1*e(a) + c2*e(b) * c3*e1, where
Then the variance of a op b is Where fabs is the absolute value function. The maximum error is Given that there are a finite number of steps in a computation, repeated application of these formulas produces the maximum error and variance of the final result. (See appendix B for an example of errors in large sums.) |