U.S. Department of Transportation
Federal Highway Administration
This report is an archived publication and may contain dated technical, contact, and link information.
Publication Number: FHWA-HRT-11-035
Date: May 2011
Pedestrian and Bicyclist Traffic Control Device Evaluation Methods
APPENDIX B. EXPERIMENTAL DESIGN AND STATISTICAL ANALYSIS BASICS
As presented in Hauer's book Observational Before-After Studies in Road Safety, observational studies are one of the main sources of factual knowledge about the effect of highway and traffic engineering measures on safety.(13) The term "observational" emphasizes that the evaluation is not an experiment deliberately designed and carefully controlled in a laboratory. Rather, in the transportation environment, elements that cannot be controlled must be accounted for because of the dynamics of the location. The two basic types of observational evaluations are before-after and cross sectional.
Controlled experiments observe behavior under more controlled circumstances. In these types of evaluations, the participants know they are being studied, and the experimental designs carefully control extraneous factors. The basic types of controlled experiment designs discussed in this document are within subjects and between subjects.
In a before-after evaluation design, two measurements are taken: one before and one after the treatment is implemented. Effectiveness is defined as the difference between the two measurements. The before data must be collected at all sites prior to the installation of the countermeasure. After the device is installed, identical measurements are taken again at all sites.
Depending on the type of treatment, a learning period may be necessary to provide road users with time to fully understand the appropriate behavior for the device. An evaluation may also need to collect data after a long period to ensure that the device is creating a long-term behavior change rather than a change due to the novelty effect of the device. While the amount of time needed for learning and to avoid novelty is debatable, previous evaluations have used 2 months for learning and collected data at 6-month intervals to ensure that the benefit of the device is long term.
While the before-after evaluation design is straightforward and easily applied, it has shortcomings. It is vulnerable to changes that occur during the time it takes to complete the evaluation (e.g., traffic volumes or composition). The effects of such variables must be considered in the evaluation.
The goal in a before-after evaluation is to have only one component change at the site over time: the application of the treatment itself. Therefore, all other conditions at the treated and comparison sites must be monitored or controlled. The following list includes examples of variables that need to be considered:
Table 10 shows evaluation plan considerations for an evaluation of the effectiveness of placing a flashing beacon on a pedestrian crossing advance warning sign.
The evaluation design may also include a comparison (or control) group. A comparison group consists of sites that are similar to the treated sites in characteristics such as traffic conditions and roadway geometry but that do not have the treatment installed. The comparison group concept is strongly encouraged in before-after evaluations, especially for crash evaluations. It is also highly desirable when a before-after evaluation design uses behavioral and operational measures. A comparison group is similar to a placebo control group in medical research, in which the same measurements are taken before and after a placebo treatment.
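As a rough illustration of how a comparison group can be used with a behavioral measure, the sketch below subtracts the before-after change at comparison sites from the change at treated sites. The measure (driver yielding rate), the counts, and the function names are hypothetical and are not taken from the report. The calculation is a simple difference of changes, not a full statistical analysis.

```python
def proportion(successes, trials):
    """Share of observed events (e.g., drivers who yielded to pedestrians)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def comparison_adjusted_change(treated_before, treated_after,
                               comparison_before, comparison_after):
    """Change at treated sites minus the change at comparison sites.

    Each argument is a (successes, trials) tuple.
    """
    treated_change = proportion(*treated_after) - proportion(*treated_before)
    comparison_change = (proportion(*comparison_after)
                         - proportion(*comparison_before))
    return treated_change - comparison_change


# Hypothetical counts: (drivers who yielded, drivers observed).
adjusted = comparison_adjusted_change(
    treated_before=(120, 400), treated_after=(200, 400),
    comparison_before=(110, 400), comparison_after=(130, 400),
)
print(f"Comparison-adjusted change in yielding rate: {adjusted:.3f}")
```

If the comparison sites also show a change between periods, part of the change at the treated sites is likely due to factors other than the treatment, and the adjusted value is smaller than the raw before-after difference.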
For safety evaluations in which crashes are used as a measure of effectiveness (MOE), simply comparing results before and after installation is not sufficient. Crash-based evaluation requires advanced statistics; local agencies unfamiliar with these methods should consider using an expert. Use of a comparison group is mandatory with crash MOEs. A comparison group has sites that are similar in location, traffic conditions, and roadway geometry to the treated sites but do not have the treatment installed. The use of a comparison group improves the reliability of the results. The same data are collected at the comparison sites and the treated sites, and results are then compared not only between time periods but also among sites to determine the effectiveness of a particular treatment. Current practice is to use an empirical Bayes (EB) method in safety evaluations. The EB method derives a crash prediction for the after period assuming the treatment had not been applied and compares this predicted value to the observed crash frequency for the after period with the treatment installed. This method is used in the Highway Safety Manual and the FHWA SafetyAnalyst approach (see chapter 6).(6,14)
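The EB comparison described above can be sketched in simplified form. The sketch assumes that a crash prediction model supplies predicted crash frequencies for the before and after periods along with an overdispersion parameter; all input values and function names are hypothetical. It omits the bias correction and variance estimates that a complete EB analysis includes, so it illustrates the idea and does not replace the procedures in the cited references.

```python
def eb_expected_before(observed_before, predicted_before, overdispersion):
    """Weighted blend of the model prediction and the observed crash count."""
    weight = 1.0 / (1.0 + overdispersion * predicted_before)
    return weight * predicted_before + (1.0 - weight) * observed_before


def eb_effectiveness(observed_before, observed_after,
                     predicted_before, predicted_after, overdispersion):
    """Return (expected after-period crashes without treatment, ratio).

    A ratio below 1.0 suggests fewer crashes than expected without treatment.
    """
    expected_before = eb_expected_before(
        observed_before, predicted_before, overdispersion)
    # Scale to the after period using the ratio of model predictions,
    # which carries changes in traffic volume or period length.
    expected_after = expected_before * (predicted_after / predicted_before)
    return expected_after, observed_after / expected_after


# Hypothetical inputs for a single treated site.
expected_after, ratio = eb_effectiveness(
    observed_before=12, observed_after=8,
    predicted_before=8.0, predicted_after=8.0,
    overdispersion=0.25,
)
print(f"Expected after-period crashes without treatment: {expected_after:.2f}")
print(f"Observed / expected ratio: {ratio:.2f}")
```

Because the expected value blends the observed count with the model prediction, a site selected for treatment after an unusually high crash count is not credited with the reduction that would have occurred anyway through regression to the mean.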
An observational cross sectional evaluation is a research design in which a site only has one treatment. In general, an observational cross sectional evaluation estimates the safety or operational effect of an element that is different between two groups of sites. The sites should be similar except for the difference of interest—the treatment. For example, an agency may identify 10 intersections that have similar traffic volume, roadway geometry, land use, and lighting but differ in the pattern of crosswalk markings used (see figure 12). Cross sectional studies are different from controlled experiments in that the investigator cannot control the assignment of treatments to sites.
Cross sectional evaluation design has been used to estimate the safety effect of differences between the treatments. Hauer notes in Cause and Effect in Observational Cross-Sectional Studies on Road Safety that
… the question of whether causal interpretation of cross-sectional studies is at all possible is of central importance for road safety. The reason is that opportunities to do observational before-after studies about, say, the safety effect of change in horizontal curvature, road grade, lane width, median slope, etc. are few and imperfect. This is so, partly, because when a road is rebuilt usually several of its attributes are changed at once and it is difficult to assign the result to any single causal factor. In addition, the rebuilding of a road often changes it to such an extent that it may not be regarded the same unit after reconstruction. In contrast, opportunities for observational cross-sectional studies are plentiful.(15)
Hauer points out the dangers in interpreting the results of cross sectional and before-after designs.(15) Sites that are well matched for cross sectional designs are often hard to find. Likewise, in before-after studies, assuring that the only thing that changed at the site is the traffic control device under evaluation is often difficult.
The before-after evaluation identifies the change that results when a treatment is applied to a group. The cross sectional evaluation compares the safety or operations of one group of sites that share a feature with those of a different group that lacks the feature in order to assess the effect of that feature.
Hauer provides the following examples for these different types of studies:(13)
Table 11 provides definitions for different evaluation design types.
Hauer provides information on how to conduct and interpret observational before-after studies.(13) He notes that the comparison group method is deceptively similar to randomized experiments, which are popular in agriculture, medicine, and other fields of research. However, there is a crucial distinction. In a randomized experiment, the decision as to which entities get treated and which are left as control is made at random. Therefore, were the experiment repeated a very large number of times, each time with a random assignment of entities to treatment and control, the influence of the causal factors on both groups of entities would tend to be equal. As a result, when the assignment to treatment is made at random, it is legitimate to speak of a statistical experiment that involves a control group. In contrast, when the assignments of entities to a treatment group are not made at random, the groups may differ systematically with respect to some causal factors even if both groups are very large. Therefore, even with large groups of entities, there is no assurance that the expected number of crashes in the treatment group (had treatment not been administered) would have changed in the same manner as in the comparison group. For this reason, when entities are not assigned to treatment at random, the terms "experiment" and "control group" should not be used. To mark the distinction, it is prudent to use the terms "observational studies" (not experiments) and "comparison groups" (not control groups).
Because most evaluations of a proposed traffic control device involve the introduction of a new device, the before-after evaluation rather than the cross sectional evaluation is the typical approach. If the evaluation uses crashes or occurs over several months, the preferred approach would also use a comparison group.
While the before-after (with or without comparison group) and cross sectional evaluation designs generally apply to field installation evaluations, there are other experimental design considerations when performing controlled experiments such as surveys, laboratory studies, and test track studies in which candidate countermeasures are shown to drivers or pedestrians. The main experimental design feature to consider is whether all of the candidate treatments will be shown to all the participants (within subjects) or whether each subgroup of participants will see only a subgroup of treatments (between subjects). A within subjects design has statistical power advantages, the details of which are beyond the scope of this report. By having each person see each treatment, direct comparisons of the treatments can be made within each individual as well as across individuals. A within subjects design for studies of human behavior is comparable to a before-after evaluation where the presence of a treatment is varied within a single site.
There are also practical advantages to a within subjects design. Consider an evaluation that is assessing the visibility benefits of installing a flashing beacon on top of an advance warning sign for a pedestrian crossing. Research participants will stand one block away and rate how easy it is to see the sign on a scale of 1 to 5. Treatment A (the standard sign) is installed on the northbound approach, and treatment B (the sign plus the flashing beacon) is installed on the southbound approach to the crosswalk. A total of 40 people volunteer to participate and complete the ratings. They are randomly assigned to group 1 or group 2. A within subjects design would have all 40 people look at both treatments. In contrast, for a between subjects design, group 1 would look at treatment A and group 2 would look at treatment B. In the between subjects case, if treatment B receives higher ratings, the researcher could not be sure whether it was because treatment B is better or because the people in group 2 just happened to be people who tend to give high ratings or have good eyesight.
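The direct within-person comparison in this example can be sketched as a paired analysis, in which each participant's rating of treatment A is subtracted from that participant's rating of treatment B. The ratings below are hypothetical, and the paired t statistic is only one possible summary. Because the 1-to-5 ratings are ordinal, a nonparametric test may be more appropriate in practice.

```python
import math
import statistics


def paired_t_statistic(ratings_a, ratings_b):
    """Return (mean difference, t statistic) for per-person B minus A ratings."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("each participant must rate both treatments")
    differences = [b - a for a, b in zip(ratings_a, ratings_b)]
    mean_difference = statistics.mean(differences)
    # Standard error of the mean difference (sample standard deviation).
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_difference, mean_difference / standard_error


# Hypothetical visibility ratings (1 to 5) from eight participants.
ratings_a = [2, 3, 1, 4, 2, 3, 2, 4]  # standard sign
ratings_b = [3, 4, 2, 5, 3, 4, 4, 4]  # sign plus flashing beacon
mean_difference, t_statistic = paired_t_statistic(ratings_a, ratings_b)
print(f"Mean within-person difference (B - A): {mean_difference:.2f}")
print(f"Paired t statistic: {t_statistic:.2f}")
```

Because each person serves as his or her own baseline, a participant who tends to give high ratings raises both scores, and that tendency cancels in the difference.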
Problems with administering within subjects designs include treatment order effects and learning. In the example above, if everyone sees treatment A before treatment B, bias could be introduced in the ratings because people naturally compare B to A. One experimental control that can be introduced is counterbalancing in which the order of presentation is balanced across the two groups of people. Half the people would see treatment A first, and half would see treatment B first. This way, any order effects are spread out across the two groups.
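The counterbalancing described above can be sketched as a random split of the participants into two presentation orders. The participant labels, the seed, and the function name are hypothetical.

```python
import random


def counterbalance(participants, seed=None):
    """Randomly assign half the participants to order A-B and half to B-A."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for person in shuffled[:half]:
        orders[person] = ("A", "B")
    for person in shuffled[half:]:
        orders[person] = ("B", "A")
    return orders


# Hypothetical labels for the 40 volunteers in the example.
participants = [f"P{i:02d}" for i in range(1, 41)]
orders = counterbalance(participants, seed=2011)
a_first = sum(1 for order in orders.values() if order[0] == "A")
print(f"{a_first} participants see treatment A first; "
      f"{len(orders) - a_first} see treatment B first")
```

Shuffling before splitting keeps the order assignment independent of the sequence in which participants volunteered.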
It is often not feasible to do a within subjects design, especially when evaluating treatments in the field. Additionally, performing a within subjects evaluation would require having the participants return at a later time after a new treatment has been installed. In these cases, a between subjects evaluation is acceptable, but a higher total number of subjects may be required to ensure adequate statistical power. A between subjects design for human subject research is comparable to a cross sectional design for traffic studies where one site (or group of subjects) gets treatment A, and a different site (or group of subjects) gets treatment B.
Table 12 shows a selection of evaluation plan considerations for a within subjects evaluation on the visibility of a flashing beacon on a pedestrian crossing advance warning sign.
In the example provided above, the order in which subjects rated the treatments illustrates a factor that can affect the validity of the results. This is an example of confounding. Extraneous factors that vary consistently with a treatment are confounded with the treatment. In the example, the order was confounded with the treatment; treatment A always came first. Confounding makes it difficult to interpret results. Researchers may question whether treatment A received low ratings because it was hard to see or simply because it was the first treatment the people saw, so they were inclined to give midrange scores to anything that came first. Confounding is sometimes unavoidable, especially in field studies. However, with proper planning and consideration, confounding variables can often be controlled or eliminated.
Many factors can influence the results of an evaluation if not adequately considered during the evaluation design. The following is an overview of the factors that need to be accounted for or controlled in the evaluation design:
As discussed by Knoblauch and Crigler, appropriate statistical analyses are required to determine if any differences between the before and after data are due to the treatment or to chance.(7) In most cases, one of the following three types of data will be collected:
The actual statistical analysis performed will depend on the type of data collected. Table 13, adapted from Knoblauch and Crigler, presents combinations of types of data, recommended statistical tests, and comments regarding the output or use of the tests.(7)
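As one illustration of testing whether a before-after difference could be due to chance, the sketch below applies a chi-square test to a 2-by-2 table of counts. The choice of test, the counts, and the function name are assumptions made for illustration and are not taken from table 13. The sketch applies no continuity correction and assumes every expected cell count is large enough for the chi-square approximation to hold.

```python
import math


def chi_square_2x2(before_yes, before_no, after_yes, after_no):
    """Return (chi-square statistic, p-value) for a 2x2 table, 1 degree of freedom."""
    table = [[before_yes, before_no], [after_yes, after_no]]
    row_totals = [sum(row) for row in table]
    column_totals = [before_yes + after_yes, before_no + after_no]
    total = sum(row_totals)
    if 0 in row_totals or 0 in column_totals:
        raise ValueError("every row and column must contain observations")
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * column_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts of drivers who did and did not yield.
statistic, p_value = chi_square_2x2(
    before_yes=120, before_no=280, after_yes=200, after_no=200)
print(f"Chi-square statistic: {statistic:.2f}, p-value: {p_value:.2g}")
```

A small p-value indicates that a difference this large would be unlikely if the before and after proportions were actually equal. It does not show that the treatment caused the difference.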