U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000



Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations

Report
This report is an archived publication and may contain dated technical, contact, and link information
Publication Number: FHWA-HRT-11-035
Date: May 2011

Pedestrian and Bicyclist Traffic Control Device Evaluation Methods

APPENDIX B. EXPERIMENTAL DESIGN AND STATISTICAL ANALYSIS BASICS

EXPERIMENTAL DESIGN

As presented in Hauer's book on Observational Before-After Studies in Road Safety, one of the main sources of factual knowledge about the effect of highway and traffic engineering measures on safety is observational study.(13) The term "observational" is used to emphasize that the evaluation is not an experiment deliberately designed to answer a question that can be carefully controlled in a laboratory. Rather, in the transportation environment, the elements that cannot be controlled must be accounted for due to the dynamics of the location. The two basic types of observational evaluations are before-after and cross sectional.

Controlled experiments observe behavior under more controlled circumstances. In these types of evaluations, the participants know they are being studied, and the experimental designs carefully control extraneous factors. The basic types of controlled experiment designs discussed in this document are within subjects and between subjects.

Before-After Evaluations

In a before-after evaluation design, two sets of measurements are taken: one before and one after the treatment is implemented. Effectiveness is defined as the difference between the two measurements. The before data must be collected at all sites prior to installation of the countermeasure. After the device is installed, identical measurements are taken again at all sites.

Depending on the type of treatment, a learning period may be necessary to provide road users with time to fully understand the appropriate behavior for the device. An evaluation may also need to collect data after a long period to ensure that the device is creating a long-term behavior change rather than a change due to the novelty effect of the device. While the amount of time needed for learning and to avoid novelty is debatable, previous evaluations have used 2 months for learning and collected data at 6-month intervals to ensure that the benefit of the device is long term.

While the before-after evaluation design is straightforward and easily applied, it has shortcomings. It is vulnerable to changes that occur during the time it takes to complete the evaluation (e.g., traffic volumes or composition). The effects of such variables must be considered in the evaluation.

The goal in a before-after evaluation is to have only one component change at the site over time: the application of the treatment itself. Therefore, all other conditions at the treated and comparison sites must be monitored or controlled. The following list includes examples of variables that need to be considered:

  • Weather conditions.

  • Illumination level.

  • Traffic volume.

  • Traffic mix.

  • Calendar time, especially for school crossing treatments.

  • Geometric changes (e.g., addition of curbs).

  • Other traffic control devices (e.g., additional signs).

  • User familiarity or unfamiliarity.

  • Pedestrian age and gender.

  • Pedestrian volume.

Table 10 shows evaluation plan considerations for an evaluation of the effectiveness of placing a flashing beacon on a pedestrian crossing advance warning sign.

Table 10. Example of before-after evaluation design considerations.

| Design Consideration | Example |
| --- | --- |
| Evaluation design | Observational evaluation, before-after |
| Evaluation question | Will the addition of a flashing beacon improve vehicle yielding? |
| Research hypothesis | The number of vehicles yielding will be greater when there is a flashing beacon present compared to when there is no beacon present |
| Independent variable | Beacon presence; two levels of this variable: beacon absent/beacon present |
| MOE | Percentage of vehicles yielding at the crosswalk when a pedestrian is present |
| Other independent variables controlled | Time of day is same between periods, day of the week is same between periods, no changes to roadway geometry |
| Other variables to be considered in evaluation | Traffic volume (motor vehicles, bicyclists, and pedestrians) |

The evaluation design may also consider a comparison (or control) group. A comparison group consists of sites that are similar to the treated sites in characteristics, such as traffic conditions and roadway geometry, but do not have the treatment installed. The comparison group concept is strongly encouraged in before-after evaluations, especially for crash evaluations, and it is also highly desirable when behavioral and operational measures are used. A comparison group is similar to a placebo control group in medical research, in which the same measurements are taken before and after a placebo treatment.
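As a minimal sketch of how a comparison group can correct a behavioral MOE for changes over time, the example below subtracts the change observed at comparison sites from the change observed at treated sites. The function names and all counts are hypothetical and are not taken from this report; a real evaluation would also need a statistical test and attention to how well the sites are matched.

```python
def percent_yielding(vehicles_yielding, vehicles_observed):
    """Percentage of observed vehicles that yielded to a pedestrian."""
    return 100.0 * vehicles_yielding / vehicles_observed


def comparison_adjusted_change(treated_before, treated_after,
                               comparison_before, comparison_after):
    """Change at treated sites minus change at comparison sites.

    Both changes are in percentage points. The comparison-site change
    stands in for what would have happened without the treatment.
    """
    treated_change = treated_after - treated_before
    background_change = comparison_after - comparison_before
    return treated_change - background_change


# Hypothetical counts: (vehicles yielding, vehicles observed).
treated_before = percent_yielding(60, 200)
treated_after = percent_yielding(110, 200)
comparison_before = percent_yielding(64, 200)
comparison_after = percent_yielding(74, 200)

net_change = comparison_adjusted_change(
    treated_before, treated_after, comparison_before, comparison_after
)
print(f"Raw change at treated sites: {treated_after - treated_before:.1f} points")
print(f"Comparison-adjusted change:  {net_change:.1f} points")
```

In this made-up case, part of the raw improvement at the treated sites also appears at the comparison sites, so only the remainder is attributed to the treatment.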

For safety evaluations in which crashes are used as an MOE, simply comparing results before and after installation is not sufficient, and use of a comparison group is mandatory. A comparison group has sites that are similar in location, traffic conditions, and roadway geometry to the treated sites but do not have the treatment installed, which improves the reliability of the results. The same data are collected at the comparison sites and the treated sites, and results are then compared not only between time periods but also among sites to determine the effectiveness of a particular treatment. Current practice is to use an empirical Bayes (EB) method in safety evaluations. The EB method derives a crash prediction for the after period assuming the treatment had not been applied and compares this predicted value to the observed crash frequency for the after period with the treatment installed. This method is used in the Highway Safety Manual and the FHWA SafetyAnalyst approach (see chapter 6).(6,14) Because the method requires advanced statistics, local agencies unfamiliar with it should consider using an expert in this evaluation method.
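The comparison of predicted and observed after-period crashes can be illustrated with a simplified sketch. This report does not give the EB formulas; the weighting used below is the commonly published form, in which a model prediction for similar sites is blended with the site's own before-period count, and it is shown here as an assumption. The function name, the overdispersion value, and all crash numbers are hypothetical. The sketch omits the bias correction and variance estimates that a full analysis requires.

```python
def eb_expected_after(observed_before, predicted_before,
                      predicted_after, overdispersion):
    """Simplified EB estimate of after-period crashes without treatment.

    predicted_before and predicted_after are model predictions for
    similar sites (assumed to come from a crash prediction model).
    overdispersion is that model's overdispersion parameter.
    """
    # Weight on the model prediction; larger overdispersion shifts
    # weight toward the site's own observed count.
    weight = 1.0 / (1.0 + overdispersion * predicted_before)
    expected_before = (weight * predicted_before
                       + (1.0 - weight) * observed_before)
    # Carry the before-period estimate into the after period using the
    # ratio of model predictions (reflects traffic or duration changes).
    return expected_before * (predicted_after / predicted_before)


# Hypothetical site: 12 crashes observed before, 9 observed after.
expected_after = eb_expected_after(
    observed_before=12,
    predicted_before=8.0,
    predicted_after=9.0,
    overdispersion=0.25,
)
observed_after = 9
effectiveness_ratio = observed_after / expected_after
print(f"Expected after-period crashes without treatment: {expected_after:.2f}")
print(f"Observed / expected ratio: {effectiveness_ratio:.2f}")
```

A ratio below 1 suggests fewer crashes than expected without the treatment, but it says nothing about statistical significance on its own.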

Cross Sectional Evaluation

An observational cross sectional evaluation is a research design in which a site only has one treatment. In general, an observational cross sectional evaluation estimates the safety or operational effect of an element that is different between two groups of sites. The sites should be similar except for the difference of interest—the treatment. For example, an agency may identify 10 intersections that have similar traffic volume, roadway geometry, land use, and lighting but differ in the pattern of crosswalk markings used (see figure 12). Cross sectional studies are different from controlled experiments in that the investigator cannot control the assignment of treatments to sites.

Cross sectional evaluation design has been used to estimate the safety effect of differences between the treatments. Hauer notes in Cause and Effect in Observational Cross-Sectional Studies on Road Safety that

… the question of whether causal interpretation of cross-sectional studies is at all possible is of central importance for road safety. The reason is that opportunities to do observational before-after studies about, say, the safety effect of change in horizontal curvature, road grade, lane width, median slope, etc. are few and imperfect. This is so, partly, because when a road is rebuilt usually several of its attributes are changed at once and it is difficult to assign the result to any single causal factor. In addition, the rebuilding of a road often changes it to such an extent that it may not be regarded the same unit after reconstruction. In contrast, opportunities for observational cross-sectional studies are plentiful.(15)

Hauer points out the dangers in interpreting the results of cross sectional and before-after designs.(15) Sites that are well matched for cross sectional designs are often hard to find. Likewise, in before-after studies, assuring that the only thing that changed at the site is the traffic control device under evaluation is often difficult.

Comparison of Observational Before-After and Cross Sectional Evaluation Designs

The before-after evaluation identifies the change resulting when a treatment has been applied to a group. The cross sectional evaluation compares the safety or operations of one group of sites that share a feature with the safety or operations of a different group that lacks the feature in order to assess the effect of that feature.

Hauer provides the following examples for these different types of studies:(13)

  • Observational before-after evaluation: Circumstances where the entities that are changed by the treatment retain many of their original attributes. For example, replacement of a stop sign with a yield sign would leave the intersection geometry and setting unchanged. Additionally, the introduction of a seat belt law would not modify drivers' travel patterns, vehicle performance, or the road network.

  • Observational before-after with comparison group evaluation: Circumstances that are similar to the above observational before-after evaluation with the addition of a comparison group used to provide corrections for changes in conditions over time, such as weather, vehicle mix, driver behavior, crash reporting practices, or a citywide public relations campaign.

  • Observational cross sectional evaluation: Circumstances when the treatment substantially alters the entity. For example, a rural two-lane road is to be rebuilt into a four-lane divided road with a substantially modified alignment.

Table 11 provides definitions for different evaluation design types.

Table 11. Overview of evaluation designs.

| Evaluation Design | Time: Before Treatment Is Introduced | Time: Introduce Treatment | Time: After Treatment Is Introduced |
| --- | --- | --- | --- |
| Before-after | Evaluation sites | Treatment 1 | Evaluation sites |
| Before-after with comparison | Evaluation sites | Treatment 1 | Evaluation sites |
| Before-after with comparison | Comparison sites | None | Comparison sites |
| Cross sectional (controls) | None | Treatment 1 | Evaluation sites |
| Cross sectional (controls) | None | None | Comparison sites |
| Cross sectional | None | Treatment 1 | Evaluation sites subgroup A |
| Cross sectional | None | Treatment 2 | Evaluation sites subgroup B |

Hauer provides information on how to conduct and interpret observational before-after studies.(13) He notes that the use of the comparison group method is deceptively similar to that of randomized experiments, which are popular in agriculture, medicine, and other fields of research. However, there is a crucial distinction. In a randomized experiment, the decision as to which entities get treated and which are left as control is made at random. Therefore, were the experiment repeated a very large number of times, each time with a random assignment of entities to treatment and control, the influence of the causal factors on both groups of entities would tend to be equal. As a result, when the assignment to treatment is made at random, it is legitimate to speak of a statistical experiment that involves a control group. In contrast, when the assignments of entities to a treatment group are not made at random, even if both groups of entities are very large, they may differ systematically with respect to some causal factors. Therefore, even with large groups of entities, there is no assurance that the expected number of crashes in the treatment group (had treatment not been administered) would have changed in the same manner as in the comparison group. For this reason, when entities are not assigned to treatment at random, the terms "experiment" and "control group" should not be used. To mark the distinction, it is prudent to use the terms "observational studies" (not experiments) and "comparison groups" (not control groups).

Because most evaluations of a proposed traffic control device involve the introduction of a new device, the before-after evaluation rather than the cross sectional evaluation is the typical approach. If the evaluation uses crashes or occurs over several months, the preferred approach would also use a comparison group.

Within Subjects or Between Subjects

While the before-after (with or without comparison group) and cross sectional evaluation designs generally apply to field installation evaluations, there are other experimental design considerations when performing controlled experiments such as surveys, laboratory, and test track studies in which candidate countermeasures are shown to drivers or pedestrians. The main experimental design feature to consider is whether all of the different candidate treatments will be shown to all the participants (within subjects) or whether some subgroup of participants will see some subgroup of treatments (between subjects). There are statistical power advantages to a within subjects design, which are beyond the scope of this report. By having each person see each treatment, direct comparisons of the treatments can be made within each individual as well as across individuals. A within subjects design for studies of human behavior is comparable to a before-after evaluation where the presence of a treatment is varied within a single site.

There are also practical advantages to a within subjects design. Consider an evaluation that is assessing the visibility benefits of installing a flashing beacon on top of an advance warning sign for a pedestrian crossing. Research participants will stand one block away and rate how easy it is to see the sign on a scale of 1 to 5. Treatment A (the standard sign) is installed on the northbound approach, and treatment B (the sign plus the flashing beacon) is installed on the southbound approach to the crosswalk. A total of 40 people volunteer to participate and complete the ratings. They are randomly assigned to group 1 or group 2. A within subjects design would have all 40 people look at both treatments. In contrast, for a between subjects design, group 1 would look at treatment A and group 2 would look at treatment B. If treatment B receives higher ratings in the between subjects design, the researcher could not be sure if it was because treatment B is better or because the people in group 2 just happened to be people who tend to give high ratings or have good eyesight.

Problems with administering within subjects designs include treatment order effects and learning. In the example above, if everyone sees treatment A before treatment B, bias could be introduced in the ratings because people naturally compare B to A. One experimental control that can be introduced is counterbalancing in which the order of presentation is balanced across the two groups of people. Half the people would see treatment A first, and half would see treatment B first. This way, any order effects are spread out across the two groups.
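The counterbalancing described above can be sketched as a random split of the participants into two presentation orders. This is a minimal illustration, not a procedure from this report; the function name, participant labels, and seed are hypothetical, and an actual evaluation might also balance age and gender within each order.

```python
import random


def counterbalance(participants, seed=None):
    """Randomly assign half the participants to each presentation order.

    Returns a dict mapping each participant to a tuple giving the order
    in which treatments A and B are shown.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for person in shuffled[:half]:
        orders[person] = ("A", "B")  # sees the standard sign first
    for person in shuffled[half:]:
        orders[person] = ("B", "A")  # sees the sign with beacon first
    return orders


# Hypothetical labels for the 40 volunteers in the example above.
participants = [f"P{i:02d}" for i in range(1, 41)]
orders = counterbalance(participants, seed=2011)

a_first = sum(1 for order in orders.values() if order[0] == "A")
b_first = len(orders) - a_first
print(f"A first: {a_first}, B first: {b_first}")
```

Fixing the seed makes the assignment reproducible for documentation; leaving it as `None` gives a fresh random assignment each run.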

It is often not feasible to do a within subjects design, especially when evaluating treatments in the field, because it would require having the participants return at a later time after a new treatment has been installed. In these cases, a between subjects evaluation is acceptable, but a higher total number of subjects may be required to ensure adequate statistical power. A between subjects design for human subject research is comparable to a cross sectional design for traffic studies where one site (or group of subjects) gets treatment A, and a different site (or group of subjects) gets treatment B.

Table 12 shows a selection of evaluation plan considerations for a within subjects evaluation on the visibility of a flashing beacon on a pedestrian crossing advance warning sign.

Table 12. Example of within subjects evaluation design considerations.

| Design Consideration | Example |
| --- | --- |
| Evaluation design | Controlled experiment, within subjects |
| Evaluation question | Will the addition of a flashing beacon improve the visibility of a pedestrian crossing sign? |
| Null hypothesis | The visibility ratings will not be different when there is a flashing beacon present compared to when there is no beacon present |
| Independent variable | Beacon presence; two levels of this variable: beacon absent/beacon present |
| MOE | Visibility rating on a scale of 1 to 5 |
| Other independent variables controlled by experimenter | Time of day, illumination, with order counterbalanced |
| Other variables to be considered in evaluation | Equal split in gender and age groups of participants |

Factors Affecting the Validity of Results

In the example provided above, the order in which subjects rated the treatments was used to illustrate a factor that can affect the validity of the results, which is an example of confounding. Extraneous factors that vary consistently with a treatment are confounded with the treatment. In the example, the order was confounded with the treatment; treatment A always came first. Confounding makes it difficult to interpret results. Researchers may question whether treatment A received low ratings because it was hard to see or simply because it was the first treatment the people saw so they were inclined to give midrange scores to anything that came first. Confounding is sometimes unavoidable, especially in field studies. With proper planning and consideration, confounding variables can often be controlled or eliminated.

Many factors can influence the results of an evaluation if not adequately considered during the evaluation design. The following is an overview of the factors that need to be accounted or controlled for in the evaluation design:

  • Changes over time.

    • Consider any changes in volume between periods for both traffic and pedestrians.

    • Consider changes to the mix of users over time as land use changes (e.g., a new school opens) or use of a facility changes (e.g., lower pedestrian volumes in winter).

    • Identify and consider any other changes that may be occurring (e.g., changes in reporting thresholds for property damage only crashes over subsequent years).

  • Presence of recording equipment/observers.

    • Will road users be able to see the equipment or observer? If so, how will it affect their behavior? (If recording equipment or observers can be seen, drivers or pedestrians may be on their best behavior.)

  • Instrumentation/measurement procedures.

    • Is the measuring instrument calibrated?

    • Were the rating scales and scoring criteria used consistently?

  • Selection of comparison groups.

    • Identify the important factors to match between treatment and comparison groups. These factors must relate both to the treatment and the MOEs.

STATISTICAL ANALYSIS

As discussed by Knoblauch and Crigler, appropriate statistical analyses are required to determine if any differences between the before and after data are due to the treatment or to chance.(7) In most cases, one of the following three types of data will be collected:

  • Continuous: Data that have no distinct intervals between possible values are continuous. Examples include vehicle speed and lateral placement.

  • Dichotomous: Data that are identified by only two categories (i.e., the occurrence or nonoccurrence of a behavior) are dichotomous. Examples include pedestrian compliance and pedestrian-motor vehicle conflicts.

  • Counts of events: Data that record the number of occurrences are counts of events. Examples include crash counts and the number of vehicles yielding. Note that for EB before-after crash studies, additional measures are needed beyond simple counts of crashes.

The actual statistical analysis performed will depend on the type of data collected. Table 13, adapted from Knoblauch and Crigler, presents combinations of types of data, recommended statistical tests, and comments regarding the output or use of the tests.(7)

Table 13. Sample applications of statistical techniques.

| Data Type | Parameter(s) of Interest | Recommended Tests/Procedures | Comments |
| --- | --- | --- | --- |
| Continuous | Two means | t-test for difference in means | Assumes data are normally distributed and samples are independent |
| Continuous | Two means | z-test for difference in means | Sample sizes of 30 or more are required |
| Continuous | Two variances | F-test for difference in variances | Assumes data are normally distributed and samples are independent |
| Continuous | More than two means | Analysis of variance for testing equality of more than two means | Assumes data are normally distributed, variances are equal, and samples are independent |
| Dichotomous | Two proportions | z-test for difference in proportions | Assumes the sample sizes are large enough |
| Categorical data (more than two categories) | More than two proportions | Chi-square test of the equality of more than two proportions or of the independence of two categorical variables | Used when comparing more than two proportions, e.g., a two-by-two or larger contingency table; particularly used for testing cross-tabulated questionnaire data |
| Count data | Regression coefficient | Poisson regression or negative binomial regression | Used for assessing crash reduction |
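As one illustration of the dichotomous case in table 13, the sketch below applies a z-test for the difference in two proportions to before and after yielding counts. It uses the standard pooled-proportion form of the test and a two-sided p-value; a directional hypothesis such as the one in table 10 could instead use a one-sided p-value. The function name and the counts are hypothetical, and the test assumes independent observations and sample sizes large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """z-test for the difference between two independent proportions.

    Returns the z statistic (period 2 minus period 1) and the
    two-sided p-value based on the normal approximation.
    """
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    z = (p_2 - p_1) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 200 vehicles yielded before the beacon was
# installed, and 110 of 200 yielded after.
z, p_value = two_proportion_z_test(60, 200, 110, 200)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4g}")
```

A small p-value indicates that a difference this large would be unlikely to arise by chance alone; it does not rule out the time-related confounding factors discussed earlier in this appendix.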

Turner-Fairbank Highway Research Center | 6300 Georgetown Pike | McLean, VA | 22101