U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000


Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations

Report
This report is an archived publication and may contain dated technical, contact, and link information
Publication Number: FHWA-RD-04-080
Date: September 2004

Software Reliability: A Federal Highway Administration Preliminary Handbook

Chapter 3: Software Testing

This chapter discusses:

  • Definition of software testing.
  • Criteria for choosing test cases.
  • Practical limitations of testing.

Definition of Software Testing

Software testing is the process of experimentally verifying that a program operates correctly. It consists of:

  • Running a sample of input data through the target program.
  • Checking the output against the predicted output.
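
The two steps above can be sketched as a small test harness. This is a minimal illustration, not something taken from the handbook: the target function `braking_distance_m`, the test inputs, and the hand-computed predicted outputs are all assumptions chosen for the example.

```python
import math

def braking_distance_m(speed_mps: float, decel_mps2: float) -> float:
    """Hypothetical target program: distance needed to stop from a given speed."""
    return speed_mps ** 2 / (2.0 * decel_mps2)

# Each test case pairs an input with an output predicted independently
# of the program (here, by hand calculation).
TEST_CASES = [
    ((20.0, 4.0), 50.0),
    ((10.0, 5.0), 10.0),
    ((0.0, 4.0), 0.0),
]

def run_tests(program, cases, rel_tol=1e-9):
    """Run every input through the program and collect any mismatches."""
    failures = []
    for args, predicted in cases:
        actual = program(*args)
        if not math.isclose(actual, predicted, rel_tol=rel_tol, abs_tol=1e-12):
            failures.append((args, predicted, actual))
    return failures

failures = run_tests(braking_distance_m, TEST_CASES)
print(f"{len(TEST_CASES) - len(failures)} of {len(TEST_CASES)} cases passed")
```

Keeping the predicted outputs separate from the program is the point of the design: if the predictions were produced by the program itself, the check would only confirm that the program agrees with itself.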

Why Test?

Testing is a necessary part of establishing software correctness, even for software that has been proven correct. This is because testing catches:

  • Failures caused by details considered irrelevant during a correctness proof. For example, a proof might ignore Input/Output (I/O) operations because they do not affect the data of interest, but if an I/O operation crashes, the provably correct program still fails.
  • Failures caused by errors that were overlooked during a correctness proof.
  • Corruption of a computation by other parts of a large system.

Scope of Chapter

This handbook includes only some highlights of the extensive literature on software testing. Simple statistical procedures for analyzing test results are summarized in chapter 9 of Verification, Validation, and Evaluation of Expert Systems, an FHWA Handbook.(5)

What Constitutes Testing

From a statistical standpoint, testing is an experiment, and the conclusions one can draw are governed by the theorems of statistics. In particular, for the results of testing to be statistically valid:

  • A sample (in this case, the inputs for testing) must be randomly selected from the population.
  • Conclusions are only valid for the population from which a sample is drawn.

For a sample to cover a population, every property of the inputs that might be expected to affect correctness should be represented sufficiently in the sample. "Sufficiently" in this context means that if a particular property causes the program to fail, inputs with that property occur frequently enough so that the failure is detectable statistically.

What to Include in the Test Sample

Wallace et al. have cataloged some of the kinds of inputs that should go into a test set.(4) These inputs must be in the test set because they have properties that might affect whether a program functions correctly.

  • Boundary value analysis "detects and removes errors occurring at parameter limits or boundaries. The input domain of the program is divided into a number of input classes. The tests should cover the boundaries and extremes of the classes. The tests check that the boundaries of the input domain of the specification coincide with those in the program. The value zero, whether used directly or indirectly, should be used with special attention (e.g., division by zero, null matrix, zero table entry). Usually, boundary values of the input produce boundary values for the output. Test cases should also be designed to force the output to its extreme values. If possible, a test case, which causes output to exceed the specification boundary values, should be specified. If output is a sequence of data, special attention should be given to the first and last elements and to lists containing zero, one, and two elements."
  • "Error seeding determines whether a set of test cases is adequate by inserting (seeding) known error types into the program and executing it with the test cases. If only some of the seeded errors are found, the test case set is not adequate."
  • "Coverage analysis measures how much of the structure of a unit or system has been exercised by a given set of tests."
  • "Functional testing executes part or all of the system to validate that the user requirement is satisfied." For highway-related software, this means that highway engineers should determine the range of situations, from the engineering perspective, that the software covers, and ensure that every one of those situations is represented in the test set. Software developers may not understand the engineering significance of the software, and so might omit those test cases or fail to notice engineering anomalies in other test results.
  • "Regression analysis and testing is used to reevaluate software requirements and software design issues whenever any significant code change is made."
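
Two of the techniques listed above, boundary value analysis and error seeding, can be illustrated together. The following is a hedged sketch only: the unit under test (`speed_category`), its 0 to 130 km/h input range, and the seeded error are hypothetical and are not drawn from Wallace et al. or from the handbook.

```python
def speed_category(speed_kmh: int) -> str:
    """Hypothetical unit under test: classify a speed in the range 0-130 km/h."""
    if speed_kmh < 0 or speed_kmh > 130:
        raise ValueError("speed out of range")
    if speed_kmh <= 50:
        return "urban"
    return "highway"

def speed_category_seeded(speed_kmh: int) -> str:
    """Same unit with one seeded error: '<=' changed to '<' at the class boundary."""
    if speed_kmh < 0 or speed_kmh > 130:
        raise ValueError("speed out of range")
    if speed_kmh < 50:
        return "urban"
    return "highway"

def boundary_values(low: int, high: int) -> list:
    """Values at, just inside, and just outside each end of [low, high]."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]

def outcome(func, value):
    """Return the function's result, or ValueError if the call rejects the input."""
    try:
        return func(value)
    except ValueError:
        return ValueError

def detects_seeded_error(test_inputs) -> bool:
    """True if any test input makes the seeded version differ from the original."""
    return any(
        outcome(speed_category, x) != outcome(speed_category_seeded, x)
        for x in test_inputs
    )

# Boundaries of the two input classes, [0, 50] and [51, 130].
boundary_set = sorted(set(boundary_values(0, 50) + boundary_values(51, 130)))
# A test set that only uses values from the interior of each class.
interior_set = [10, 25, 90, 120]

print("boundary set detects seeded error:", detects_seeded_error(boundary_set))
print("interior set detects seeded error:", detects_seeded_error(interior_set))
```

In this sketch the interior-only test set misses the seeded error, which by the error seeding criterion marks it as inadequate, while the boundary set exposes the error at the input value 50.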

How to Select the Test Sample

The test sample must be random, yet contain enough of the special inputs listed above; statisticians call this a stratified sample. To ensure randomness, inputs in the test set must not be used during program development. Inputs used during development are no longer a random sample of the population, because the program may have been adjusted, deliberately or not, until it handles them correctly. There is no way to be sure that a program behaves the same on the inputs used during development as on truly random inputs. Neural nets, for example, can be over-trained to behave better on their training set than on a randomly chosen input set.

The best way to ensure that test inputs are not used during program development is to choose a test set before implementation begins, and lock that test set away from the development team.
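
One way to draw such a sample before implementation begins is sketched below. It is an illustration under stated assumptions: the strata names, the candidate inputs (traffic volumes in vehicles per hour), the per-stratum count, and the seed are all hypothetical.

```python
import random

def stratified_sample(strata: dict, per_stratum: int, seed: int) -> dict:
    """Draw up to `per_stratum` inputs at random from every stratum."""
    rng = random.Random(seed)  # private generator, reproducible from the seed
    sample = {}
    for name, candidates in sorted(strata.items()):
        k = min(per_stratum, len(candidates))
        sample[name] = rng.sample(candidates, k)
    return sample

# Hypothetical strata for a program whose input is a traffic volume.
strata = {
    "boundary": [0, 1, 1999, 2000],
    "typical": list(range(200, 1800)),
    "extreme": list(range(1800, 2000)),
}

test_set = stratified_sample(strata, per_stratum=3, seed=2004)
for name, inputs in test_set.items():
    print(name, inputs)
```

Recording the seed lets an independent party regenerate the identical test set later, so the set itself can be stored where the development team cannot see it.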

How Many Inputs Should Be Tested

The number of inputs to test depends on the purpose of the test. To certify correct performance at a given confidence level, the sample must be large enough to provide that confidence level. (See the FHWA handbook referenced above, or most elementary statistics texts, for more details.) If the program is designed to compute an estimate rather than an exact value of its output, a statistically significant sample must be used.
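
As an illustration of how sample size follows from a confidence level, the sketch below uses a standard zero-failure argument; the handbook does not give this formula, so it is an assumption of this example. If the true failure probability per input were at least p, the chance that n randomly selected inputs all pass would be at most (1 - p)^n. Requiring that chance to be no more than 1 - C gives n >= ln(1 - C) / ln(1 - p).

```python
import math

def zero_failure_sample_size(max_failure_rate: float, confidence: float) -> int:
    """Smallest n such that n passing random tests, with no failures, support
    the claim 'failure rate < max_failure_rate' at the given confidence."""
    if not (0.0 < max_failure_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate))

# Example: 95 percent confidence that fewer than 1 input in 100 fails.
print(zero_failure_sample_size(0.01, 0.95))  # → 299
```

The calculation only applies when the inputs are drawn at random from the population of interest and every one of them passes; a single observed failure calls for the fuller procedures in the statistics references.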

However, for programs that compute exact values of their outputs, an argument can be made for testing a single point in a region of the input space for which:

  • The program follows the same computational path for all the inputs in the region.
  • The input chosen has no special properties in the program (for example, it is not a boundary point of a branch in the program), unless all points in the region have this property.
  • All the inputs in the region represent similar inputs from the engineering perspective of the area of application.

If the output for a single input from the region is correct, that result is either a fluke or one that further tests in the region would replicate. For it to be a fluke, the program would have to apply some set of formulas other than the intended ones and still produce the intended output for a randomly chosen point. This is a very unlikely event.
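
Choosing one randomly placed representative per region can be sketched as follows. The regions here are hypothetical intervals of traffic volume, each assumed (not shown) to satisfy the conditions listed above; the helper name and the seed are likewise illustrative.

```python
import random

def interior_representative(low: float, high: float, rng: random.Random) -> float:
    """Pick a random point strictly inside (low, high), never on a boundary."""
    if not low < high:
        raise ValueError("region must have low < high")
    while True:
        x = rng.uniform(low, high)
        if low < x < high:  # reject the rare draw that lands on an endpoint
            return x

# Hypothetical regions over which the program is assumed to follow one path.
REGIONS = {
    "light": (0.0, 500.0),
    "moderate": (500.0, 1500.0),
    "heavy": (1500.0, 2000.0),
}

rng = random.Random(2004)
representatives = {
    name: interior_representative(low, high, rng)
    for name, (low, high) in REGIONS.items()
}
print(representatives)
```

Drawing the point at random, rather than always using a round number or the midpoint, reduces the chance that the chosen input has some special property in the program.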

Limitations of Testing

Rigorous testing of large software is not possible, because the number of test cases that must be run is impractically large. For example, the PAMEX pavement maintenance expert system contains approximately 325 rules, and there are millions of different possible execution paths through it. For this reason, testing should be supplemented with techniques such as informal proofs and/or wrapping.
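
The growth in execution paths can be illustrated with a deliberately simplified model. The handbook does not say how the PAMEX path count was obtained; the sketch below only assumes a program with k independent two-way decisions, which has 2^k distinct paths.

```python
def path_count(independent_binary_decisions: int) -> int:
    """Number of distinct paths through k independent two-way decisions."""
    if independent_binary_decisions < 0:
        raise ValueError("decision count cannot be negative")
    return 2 ** independent_binary_decisions

for k in (10, 20, 30):
    print(f"{k} decisions -> {path_count(k):,} paths")
```

Under this assumption, as few as 20 independent decisions already yield more than a million paths, so a rule base far smaller than 325 rules could exceed what path-by-path testing can cover.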

Turner-Fairbank Highway Research Center | 6300 Georgetown Pike | McLean, VA | 22101