U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590


Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations

This report is an archived publication and may contain dated technical, contact, and link information
Publication Number: N/A
Date: 1999

Producing Correct Software

Applying Experimental Results


This web page presents techniques for using a database of previously measured experimental input-output pairs to test the correctness of software.

The Problem

Experimental values can be used to wrap a computer program. Each data record in the set of experimental values should contain the following:

  • Measurements of the inputs used by the program.
  • Measurements of the program outputs that occurred for those inputs.

When the program is asked to compute the outputs at a new point, the known experimental values are used to predict the outputs via some interpolation technique. Only rarely is a record for the new input point already in the database; usually, known experimental results for the inputs closest to the new point must be combined to make an experimentally based prediction of the program output.

If the output of the program agrees with the prediction based on experimental data within an acceptable tolerance, the results of the program are accepted. Otherwise the results of the program are rejected, i.e., not used without further examination, perhaps by alerting an expert to the discrepancy between the experimentally predicted value and computed value.
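This wrapping scheme can be sketched in code. The sketch below uses hypothetical names throughout, and the interpolation is a simple inverse-distance weighting of the k nearest experimental records, only one of many possible techniques:

```python
import math

def predict_from_database(x, database, k=3):
    """Predict the output at input point x by inverse-distance-weighted
    interpolation over the k nearest experimental records.
    database is a list of (input_point, measured_output) pairs, where
    each input_point is a tuple of input measurements."""
    nearest = sorted(database, key=lambda rec: math.dist(rec[0], x))[:k]
    # Exact match: return the recorded experimental measurement directly.
    if math.dist(nearest[0][0], x) == 0.0:
        return nearest[0][1]
    weights = [1.0 / math.dist(p, x) for p, _ in nearest]
    return sum(w * y for (_, y), w in zip(nearest, weights)) / sum(weights)

def wrapped_run(program, x, database, tolerance):
    """Run the program, but reject its output when it disagrees with the
    experimentally based prediction by more than the tolerance."""
    computed = program(x)
    predicted = predict_from_database(x, database)
    if abs(computed - predicted) <= tolerance:
        return computed
    raise ValueError(
        f"computed {computed} differs from experimental prediction "
        f"{predicted} by more than {tolerance}; flag for expert review")
```

Raising an exception stands in for "alerting an expert": the computed result is withheld rather than silently used.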

To wrap a program with predictions based on observations, the following problems must be addressed:

  • What interpolation technique should be used to derive the experimentally based prediction?
  • How can one decide among possible interpolation techniques?

Some Interpolation Techniques
Note: This section is under development.

  • Local Sampling Techniques
  • Generalized Regression Neural Nets
  • Linear Regression
  • Polynomial Regression
  • Other Regression Techniques
  • Backpropagation/Feedforward Neural Nets
  • Projection Pursuit Regression

Deciding Between Interpolation Techniques

  • x, xi, xj, etc. are points in input space
  • ex(x) = experimental output at x
  • p(x), pi(x) etc. are predicted outputs at x
  • S = set of input points in experimental database. S excludes any point used in creating any interpolation function currently being studied
  • N = cardinality of S
  • err(x), erri(x) etc. are errors in predicted outputs, i.e., err(x) = ex(x)-p(x). If we need to distinguish errors of different prediction functions, we will use the notation err(p,x).
  • abs_err(x), abs_erri(x) etc. are the absolute values of the errors in predicted outputs, i.e., abs_err(x) = abs(ex(x)-p(x))
  • mean({f(x)|x in S}): This is the mean of the function f on the sample S.
  • var({f(x)|x in S}): This is the variance of f on the sample S.

Comparing Estimators

Given a prediction function, p(x), one measures the error in the prediction when compared to measured experimental values. This error function is abs_err(x), the absolute value of the error. [Taking the absolute value prevents positive and negative errors from canceling each other out in the mean and other statistics.]

For each sample test point x in S, one can obtain abs_err(pa,x) and abs_err(pb,x) from the measured experimental value ex(x) and the predictions pa(x) and pb(x). The sets {abs_err(pa,x)|x in S} and {abs_err(pb,x)|x in S} are matched pair data, i.e., measurements of two random variables are made at the same sample points. One can compare functions on a matched pair sample by looking at the difference of the functions on the sample, i.e.,

{err_diff(x) | x in S}, where

err_diff(x) = abs_err(pa,x) - abs_err(pb,x).

The mean of err_diff on S is a measure of how much better pb predicts than pa does. If this mean is positive, pb is a better predictor of ex than pa. If, on the other hand, mean(err_diff) is nonpositive, pb is not a better predictor than pa. Therefore, to test whether pb is the better predictor, one finds a confidence level for mean(err_diff) being positive.
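The matched-pair construction might be coded as follows (a sketch with hypothetical helper names; ex, pa, and pb stand for the experimental measurement and the two prediction functions):

```python
def err_diff_on_sample(ex, pa, pb, sample):
    """Difference of absolute errors at each matched sample point.
    Positive values mean pb predicted ex better than pa at that point."""
    return [abs(ex(x) - pa(x)) - abs(ex(x) - pb(x)) for x in sample]

def mean_err_diff(ex, pa, pb, sample):
    """Mean of err_diff on S; a positive mean suggests pb is the better
    predictor of ex."""
    diffs = err_diff_on_sample(ex, pa, pb, sample)
    return sum(diffs) / len(diffs)
```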

In the usual method of statistics, one finds the confidence for mean(err_diff) > 0 by determining the probability that the observed statistic would reach at least its observed value if the desired conclusion (i.e., pb is better than pa, or in statistical terms, mean(err_diff) > 0) were false. The sample mean follows the t distribution centered around the population mean. The integral of the t distribution with mean 0, taken from the value observed on S to infinity, is the probability of obtaining the observed difference of means if the supposedly worse predictor is actually no worse than the supposedly better one.

The t statistic for testing mean(err_diff) > 0 is given by the formula

t = mean(err_diff on S)/(sd(err_diff on S)/N^(1/2))

sd(err_diff on S) = (SUM((err_diff(xi) - mean(err_diff on S))^2)/(N-1))^(1/2)

for xi in S, where sd/N^(1/2) is the standard error of the sample mean. [Note that an equivalent one-pass computational formula is sd = ((N*SUM(err_diff(xi)^2) - (SUM(err_diff(xi)))^2)/(N*(N-1)))^(1/2).]
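The two forms of the standard deviation are algebraically equivalent, which a short sketch can confirm (hypothetical helper names):

```python
import math

def sd_definitional(d):
    """Sample standard deviation via the defining two-pass formula."""
    n = len(d)
    m = sum(d) / n
    return math.sqrt(sum((di - m) ** 2 for di in d) / (n - 1))

def sd_computational(d):
    """One-pass computational formula: avoids a second sweep over the data,
    but can lose precision when the mean is large relative to the spread."""
    n = len(d)
    s, s2 = sum(d), sum(di * di for di in d)
    return math.sqrt((n * s2 - s * s) / (n * (n - 1)))
```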

The confidence that the supposedly better predictor really is better is given by the probability that t takes a value less than or equal to the observed value. If t0 is the observed value of t, the confidence is the integral of the t density from minus infinity to t0. Standard tables of the t distribution in most statistics books give the values of t0 that guarantee standard confidence levels, e.g., 90 percent, 95 percent, 99 percent, etc. By comparing the value of t0 on S with the values in the table, one can determine which of the standard confidence levels has been reached. The tables are for the t distribution in standard form; dividing the sample mean by its standard error, sd/N^(1/2), converts it to this form.

Use of Absolute Values

From inspection of a table of the t distribution, it is apparent that the confidence level increases as the value of t increases. In turn, t increases as mean(err_diff) increases. Suppose that the error of a predictor were simply

predicted value - observed value

instead of the absolute value of that difference. If the worse predictor badly underpredicts a large positive observed value while the better predictor has only a small error, err_diff would be a large negative number at that point, dragging down the mean. The t test only yields useful confidence levels for t values of about 2 or more. Without taking absolute values of predictor errors, the t test can fail to detect significant differences in the magnitude of the errors of two predictors.



Measured  1.10  2.20  3.30  4.40  5.50  6.40  7.60  8.90
Better    1.11  2.18  3.33  4.41  5.59  6.62  7.67  8.81
Worse     1.11  1.95  3.03  4.44  4.80  6.31  8.01  9.11
Err_diff  0.00  0.23  0.24  0.03  0.61 -0.13  0.34  0.12


  • Numbers listed under "Measured" are the observed experimental values.
  • Numbers listed under "Better" are the presumed better predictions.
  • Numbers listed under "Worse" are the presumed worse predictions.
  • "Err_diff" is the difference of absolute errors, i.e., err_diff(x) = abs_err(worse,x) - abs_err(better,x) = abs(ex(x)-worse(x)) - abs(ex(x)-better(x)).

The statistics for err_diff on the 8 sample points are:

  • mean = 0.180000
  • variance = 0.053029
  • standard deviation (s_d) = 0.230279
  • standard error of the mean (s_d/N^(1/2)) = 0.081416
  • t = mean/(s_d/N^(1/2)) = 2.2109
  • degrees of freedom = 7
  • confidence level (that the presumed better predictor really is better) = better than 95 percent, since t exceeds the 95 percent critical value for 7 degrees of freedom (1.895) but not the 97.5 percent value (2.365)
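As a check, the example can be reproduced with a short sketch using the Python standard library; the t statistic here is computed in the conventional one-sample form, dividing the mean by its standard error s_d/sqrt(N):

```python
from math import sqrt
from statistics import mean, stdev

measured = [1.10, 2.20, 3.30, 4.40, 5.50, 6.40, 7.60, 8.90]
better   = [1.11, 2.18, 3.33, 4.41, 5.59, 6.62, 7.67, 8.81]
worse    = [1.11, 1.95, 3.03, 4.44, 4.80, 6.31, 8.01, 9.11]

# Difference of absolute errors; positive values favor the better predictor.
err_diff = [abs(m - w) - abs(m - b)
            for m, b, w in zip(measured, better, worse)]

n = len(err_diff)
m = mean(err_diff)
sd = stdev(err_diff)        # sample standard deviation, N-1 divisor
t = m / (sd / sqrt(n))      # one-sample t statistic, df = n - 1
```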

Properties of an Improved Approximation

Note that the level of confidence in whether pb improves on pa depends only on the t statistic: the mean of the differences of the absolute errors divided by the standard error of that difference. This means that uniformity in how a prediction function predicts, even beyond its intended domain, is a very desirable property, because it decreases the standard error of the predictor's error and thereby increases the value of the t statistic.

Using the Normal Approximation

As the sample size increases, the t distribution approaches the normal distribution; the center of the distribution converges faster than the tails. For 30 degrees of freedom, the value of t required for a 90 percent confidence level is about 2 percent greater than the corresponding value of the normal distribution (1.310 vs. 1.282), while the value of t for 99.5 percent confidence is about 7 percent greater (2.750 vs. 2.576).

