U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000


Skip to content
Facebook iconYouTube iconTwitter iconFlickr iconLinkedInInstagram

Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations

Report
This report is an archived publication and may contain dated technical, contact, and link information
Publication Number: FHWA-RD-04-080
Date: September 2004

Software Reliability: A Federal Highway Administration Preliminary Handbook

PDF Version (697 KB)

PDF files can be viewed with the Acrobat® Reader®

Chapter 7: Numerical Reliability

This chapter presents a method for computing the maximum and expected error in a numerical computation.

Framework Reference

The framework used here for analyzing errors in a floating point number system appears in Nicholas J. Higham's Accuracy and Stability of Numerical Algorithms.(9)

Errors in a Single Arithmetic Operation

Floating Point Number Systems

Floating point numbers, broadly defined to include the various sizes (such as double precision) are represented in the computer as ±0d1d2...dt Be where

B is the base of the number system. This is usually 2 for the computer's internal arithmetic and 10 for numbers printed or displayed for human beings.

d1,...,dt are integers in the range

0 Less than or equal to di < B

d1 Equal or not equal to 0

The precision of the number system is t.

Beis B to the eth power. eth is an integer in some finite interval around 0. Usually the range of e is defined as follows (for some positive integer m):

For typical double precision numbers written to the base 10:

The precision t = 15

The range of exponent for 10 is ±308.

The Range of a Floating Point Number System

Note that in contrast to numbers in pure mathematics, there are numbers that cannot be well approximated in a particular floating point number system. For a typical double precision number system,

±10±400

are outside the range of the number systems.

Approximation of Real Numbers by Floating Point Numbers

In studying numerical approximations, an important quantity is the relative error of the floating point approximation to the pure mathematical number. Let fl(x) stand for the floating point approximation of x. Then the relative error is defined as

Click for a text description

For a number x inside the number system, the relative approximation error

Click for a text description

is a small number, while for numbers outside the number system this number may be very large. In fact, Higham shows that for numbers in the range of a floating point number system,(9)

Click for a text description

Where

Click for a text description

Where B is the base of number system, and t is the precision, i.e., the number of digits in the mantissa. Intuitively, the correctness of Higham's formula for u can be demonstrated in the following way: Let d0.d1...dt-1Ep be a number in scientific notation, where d0 is the digit in front of the decimal, d1 through dt-1 are the digits to the right of the decimal, and p is a power of 10. This number has t digits in the mantissa, just like Higham's floating point numbers. The error in the mantissa of the number in scientific notation is up to 5*10-t, which is

Click for a text description

For the typical double precision arithmetic in decimal notation (B == 10), t = 15, so:

Errors in Floating Point Operations

For theoretical analysis of errors, it is assumed that an arithmetic operation is implemented as well as it can be in floating point arithmetic, i.e., if op is one of the four arithmetic operations +, -, *, /,

Click for a text description

Where x and y are pure mathematical numbers.

Errors as Random Variables

The above formula for relative error is equivalent to

Click for a text description

a form more convenient in analyzing errors. To further simplify the representation of error, it is convenient to introduce a random variable, e, in the interval ±u. For a particular real number or operation, the precise value of e is not usually known, so using a random variable is reasonable. However, Higham presents examples in which errors are not random. For the most part, it is assumed that e is uniformly distributed over the interval ±u, but this general-purpose assumption may not hold for particular computations. Using the random variable, the last inequality can be written as

Click for a text description

Higham also proves another convenient error bound,

Click for a text description

For a uniform random variable, over an interval of length d, the variance is

Click for a text description

Therefore for e, the variance is

Click for a text description

Pushing Errors through an Arithmetic Operation

Based on the above framework provided by Higham, it is possible to calculate, step by step, along with its value, the expected and maximum error of a computed number. This section shows that computer numbers, as defined above, are closed under the arithmetic operations. In addition, formulas are provided to compute the variances and maximum errors under arithmetic operations.

A Number Object for Error Analysis

Most numbers in the computer contain errors, either through measurement, roundoff, or both. (Exceptions are integers and some rational numbers from pure mathematics.) To characterize a number for error analysis, assume that the following are known:

  • A point estimate of the number.
  • The variance of the number.
  • The maximum error of the number.

If x is a number, v(x) and me(x) denote the variance and maximum error of x. Also, fl(x) is the computed value of a number. While x denotes its theoretical value, t(x) also will be used to denote the theoretical value, where an explicit notation is needed. The computed and theoretical values differ by a random error, e(x).

Random Expressions for Computational Errors

Addition

Table 1: Formula for Addition
What Why
fl(a+b) = (1+e1)(fl(a)+fl(b))
  1. e1 is a random error introduced by addition operation.
  2. The computation is performed with computer approximations of a and b.
fl(a+b) = fl(a)+fl(b) + e1*(fl(a)+fl(b)) Algebra.
fl(a)+fl(b) = t(a)(1+e(a)) + t(b)(1+e(b))
= t(a) + t(b) + t(a)*e(a)) + t(b)*e(b)
Substituting theoretical value + errors for computed value.
fl(a+b) = t(a) + t(b) + (fl(a)-e(a))*e(a)) + (fl(b)-e(b))*e(b) + e1*(fl(a)+fl(b)) Using definition of t(), fl() and e().
fl(a+b) = t(a) + t(b) + fl(a)*e(a) + fl(b)*e(b) + e1*(fl(a)+fl(b)) Throwing away higher powers of random variables.

This says that the computed value of the sum differs from the true value by a random expression

Click for a text description

Note that the coefficients of the random variables are known, because fl(a) and fl(b) are the computer values corresponding to a and b.

Because

Click for a text description

Where fabs is the absolute value function,

Click for a text description

v(a) and v(b) are known, provided that the entire computation is carried out with the number objects defined above, containing a variance and maximum error in addition to a point value. v(e1) is also known, because e1 is a random variable on an interval of known size, namely 2*u:

Click for a text description

(Note that when adding near opposites, the size of u can change, as discussed for subtraction below.) Therefore, the variance of a sum can be computed from the variance of its arguments up to an error by throwing away small higher powers of small random variables, and any computational error in computing successive variances.

The maximum error of the sum is similarly computable from quantities known at the time the sum is computed:

Click for a text description

Therefore, when an addition is performed with computer numbers, it is possible to calculate the computer number sum from known quantities.

Subtraction

Subtraction is similar to addition. It is tempting to substitute -b in the addition formula in place of b, but to derive it this way requires that fl(-b) = fl(b), which is not necessarily the case. Following is a complete derivation of the formula for subtraction:

Table 2: Formula for Subtraction
What Why
fl(a-b) = (1+e1)(fl(a)-fl(b))
  1. e1 is a random error introduced by addition operation.
  2. The computation is performed with computer approximations of a and b.
fl(a-b) = fl(a)-fl(b) + e1*(fl(a)-fl(b)) Algebra.
fl(a)-fl(b) = t(a)(1+e(a)) - t(b)(1+e(b))
= t(a) - t(b) + t(a)*e(a)) - t(b)*e(b)
Substituting theoretical value + errors for computed value.
fl(a-b) = t(a) - t(b) + (fl(a)-e(a))*e(a)) - (fl(b)-e(b))*e(b) + e1*(fl(a)-fl(b)) Using definition of t(), fl() and e().
fl(a-b) = t(a) - t(b) + fl(a)*e(a) - fl(b)*e(b) + e1*(fl(a)-fl(b)) Throwing away higher powers of random variables.

Because

Click for a text description

Then

Click for a text description

And because

Click for a text description

Then

Click for a text description

Loss of Precision in Subtracting Nearly Equal Numbers

Subtracting nearly equal quantities, or adding near opposites, causes the range of the random variable e(a) to expand. This is due to a loss of precision in a-b. This lack of precision is, in turn, due to the fact that only a small number of digits in a and b are different. Only these different digits appear in a-b. The range of the random error in doing an operation depends on the number of digits in the mantissa of the answer. When this is small, the range of the random variable increases.

Let

Click for a text description

And likewise for b, for binary digits di, exponent e, and number system base B. If e and d1...dk are equal for both a and b, there are only t-k digits that differ in the two numbers. Therefore, there are only t-k digits in the mantissa of fl(a-b) that result from the subtraction of the mantissas of a and b. The range of e1 in this case is

Click for a text description

Instead of

Click for a text description

The number of agreeing digits k in the above formula can be found by bit fiddling on the computer representations of fl(a) and fl(b). Assuming as above that the mantissa bits are numbered from 1 to t, with bit 1 the most significant, let b(i,x) be the ith bit of the mantissa of x.

Then

Click for a text description

We can also find the number of agreeing digits with good accuracy by computing the minimum, for a and b of integer part log(base B)(fabs(c)/fabs(a-b)), where c in {a, b} and B is the base of the computer number system.

This computation of the range of the random variable e1 should be done for all subtraction operations where the arguments are nearly equal and all addition operations where they are nearly equal in magnitude but differ in sign.

The Effect of Errors in Computing Integer Part log(base B)(fabs(c)/fabs(a-b))

Note that there is some error in computing log(base B)(fabs(c)/fabs(a-b)). However, this error only matters when it causes a value to cross an integer boundary erroneously. Because the error is a multiple of the precision of computer arithmetic, it is small number, so only logs in a narrow interval around integers are suceptiblesusceptible to this error. Therefore, an error in computing integer part log(base B)(fabs(c)/fabs(a-b)) is possible but rare. Because the total computational error is generally a sum of a large number of random variables, and because higher order terms already are thrown away, the usually small estimation error caused by errors in integer part log(base B)(fabs(c)/fabs(a-b)) will be ignored.

Multiplication

Table 3: Formula for Multiplication
What Why
fl(a*b) = (1+e1)*fl(a)*fl(b)
  1. e1 is a random error introduced by multiplication operation.
  2. The computation is performed with computer approximations of a and b.
fl(a*b) = fl(a)*fl(b) + e1*fl(a)*fl(b) Algebra.
fl(a)*fl(b) = t(a)*(1+e(a)) * t(b)*(1+e(b))
= t(a)*t(b)*(1+e(a)+e(b)+e(a)*e(b))
Substituting theoretical value + errors for computed value.
fl(a)*fl(b) = t(a)*(1+e(a)) * t(b)*(1+e(b))
= t(a)*t(b)*(1+e(a)+e(b))
= t(a)*t(b) +t(a)*t(b)*(e(a)+e(b))
Throwing away terms in higher powers of random variables, because these terms are much smaller than the others, given the typical precision for computer arithmetic.
t(a)*t(b) = (fl(a)-e(a))*(fl(b)-e(b))
= fl(a)*fl(b) - e(a)*fl(b) - e(b)*fl(a) + e(a)*e(b)
= fl(a)*fl(b) - e(a)*fl(b) - e(b)*fl(a)
To express the random expression in terms of computationally known quantities, rewrite t(a)*t(b) in terms of fl(a) and fl(b), and throw away small terms.

Then

Click for a text description

and

Click for a text description

This describes the random expression for the error in multiplication in terms of coefficients and random variables, where the coefficients, means, and variances of the random variables are known when the multiplication is performed.

Division

Table 4: Formula for Division
What Why
fl(a/b) = (1+e1)*fl(a)/fl(b)
  1. e1 is a random error introduced by multiplication operation.
  2. The computation is performed with computer approximations of a and b.
fl(a/b) = fl(a)/fl(b) + e1*fl(a)/fl(b) Algebra.
fl(a)/fl(b) = t(a)*(1+e(a)) / t(b)*(1+e(b)) = t(a)/t(b)*(1+e(a))/(1+e(b)) Substituting theoretical value + errors for computed value.
1/(1+e(b)) = 1 - e(b)
  1. Using a Taylor series approximation for f(x+h), with f(x) = 1/x, f'(x) = -1/x2, x = 1, h = e(b).
  2. Discarding terms in powers of e(b) of 2 or higher.
(1+e(a))/(1+e(b)) = (1+e(a))*(1-e(b)) = 1+e(a)-e(b) Substituting, simplifying, and discarding terms in powers of e(b) of 2 or higher.
fl(a)/fl(b) = t(a)/t(b)+ t(a)/t(b)*(e(a)-e(b)) Substituting.
t(a)/t(b) = (fl(a)-e(a))/(fl(b)-e(b)) Writing t(a)/t(b) in terms of computationally known quantities, as part of the computation of the random error t(a)/t(b)*(e(a)-e(b)), using the e definitions.
(fl(a)-e(a))/(fl(b)-e(b)) = (fl(a)-e(a))*(Taylor approximation of 1 /(fl(b)-e(b))) = (fl(a)-e(a))*( 1/fl(b) -e(b)/fl(b)) = fl(a)*( 1/fl(b) -e(b)/fl(b)) -e(a)*( 1/fl(b) -e(b)/fl(b)) = fl(a)/fl(b) -e(b)*fl(a)/fl(b)) -e(a)/fl(b) = fl(a)/fl(b)*(1 -e(b) -e(a)) Using Taylor approximation of 1/(fl(b)-e(b)), simplifying, and discarding higher powers of random variables.
fl(a)/fl(b) = t(a)/t(b)+ fl(a)/fl(b)*(1 -e(b) -e(a))*(e(a)-e(b)) Substituting.
(1 -e(b) -e(a))*(e(a)-e(b)) = e(a)-e(b) Throwing away higher power of random variable terms.
fl(a)/fl(b) = t(a)/t(b)+ fl(a)/fl(b)*(e(a)-e(b)) Substituting.
fl(a/b) = t(a)/t(b)+ fl(a)/fl(b)*(e1+e(a)-e(b)) Substituting.

Then

Click for a text description

and

Click for a text description

Computing Errors for an Entire Computation

As seen in the previous section, for each of the four arithmetic operations, the random error in a computed number is approximated by a linear sum of random variables. In this linear sum, the coefficients and the variances of the random variables are known at the time the arithmetic operation is performed. The linear sum approximates the true random expression up to random variable terms containing squares or higher powers of random variables. Because all the random variables have ranges of 1E-7 for floating point, or 1E-15 for double precision, these higher power terms are much smaller than the linear terms, and so can be safely discarded.

Given this, the maximum error and the standard error for a computed number can be computed as follows: First, order the inputs, intermediate results, and final results so that the inputs to each arithmetic operation appear in the list before the result of the operation. Now the point estimate, variance, and maximum error of each number in the list will be computed in turn. For measurement inputs to a computation, begin with the measurement (as a point estimate), the measurement's variance, and the maximum error of the measurement.

For a computed number in the list, let the number be computed by

     a op b

Where the random expression for op is c1*e(a) + c2*e(b) * c3*e1, where

  • The c values are known coefficients (from the formulas of the previous section).
  • e(a) and e(b) are the random errors of the inputs to the operation. The point estimates of a and b and their variances and maximum errors are assumed to be known.

Then the variance of a op b is

Click for a text description

Where fabs is the absolute value function. The maximum error is

Click for a text description

Given that there are a finite number of steps in a computation, repeated application of these formulas produces the maximum error and variance of the final result. (See appendix B for an example of errors in large sums.)

Previous | Table of Contents | Next

Federal Highway Administration | 1200 New Jersey Avenue, SE | Washington, DC 20590 | 202-366-4000
Turner-Fairbank Highway Research Center | 6300 Georgetown Pike | McLean, VA | 22101