U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590


Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations

This report is an archived publication and may contain dated technical, contact, and link information
Publication Number:  FHWA-HRT-14-081    Date:  November 2014


Enhancing Statistical Methodologies For Highway Safety Research – Impetus From FHWA


This introductory section provides basic background material for statisticians who have not previously worked on these topics. Following a brief overview, it describes the main issues encountered, and the statistical tools applied, by researchers currently working in these areas. This provides context for the main section that follows on opportunities for advancing the methodologies for estimating crash modification factors (CMFs) and safety performance functions (SPFs).


A CMF is a multiplicative factor used to compute the number of crashes that would be expected after implementing a given countermeasure at an existing roadway site or after making a change to a roadway being designed. The CMF is multiplied by the expected crash frequency without the countermeasure. A CMF greater than 1.0 indicates an expected increase in crashes, while a value less than 1.0 indicates an expected reduction in crashes. For example, a CMF of 0.8 indicates a 20 percent expected reduction in crashes.
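As a minimal sketch of the arithmetic described above, the following Python snippet applies a CMF to an expected crash frequency. The function names and the baseline of 10 crashes per year are hypothetical values chosen only for illustration.

```python
def expected_crashes_with_treatment(expected_without: float, cmf: float) -> float:
    """Multiply the expected crash frequency without the countermeasure by the CMF."""
    if expected_without < 0 or cmf < 0:
        raise ValueError("expected crashes and CMF must be non-negative")
    return cmf * expected_without


def percent_change_in_crashes(cmf: float) -> float:
    """Negative values indicate an expected reduction; positive values an increase."""
    return (cmf - 1.0) * 100.0


# Hypothetical site: 10 expected crashes per year without the countermeasure.
baseline = 10.0
with_treatment = expected_crashes_with_treatment(baseline, cmf=0.8)
print(round(with_treatment, 2), round(percent_change_in_crashes(0.8), 1))
```

A CMF of exactly 1.0 leaves the expected crash frequency unchanged.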

A CMFunction is a formula used to compute the CMF for a specific site based on its characteristics. It is not always reasonable to assume a uniform safety effect for all sites with different characteristics (e.g., safety benefits may be greater for high traffic volumes). A countermeasure may also have several levels or potential values (e.g., improving intersection skew angle, or widening a shoulder). A crash modification function allows the CMF to change over the range of a variable or combination of variables. Where possible, it is preferable to develop CMFunctions as opposed to a single CMF value since safety effectiveness most likely varies based on site characteristics. In practice, however, this is often difficult since more data are required to detect such differences.


The CMFunction for improving intersection skew angle at a rural, four-legged, stop-controlled intersection is a function of the absolute value of intersection angle minus 90 degrees, where the intersection angle is in degrees, as shown in the equation in figure 1.

$$\mathrm{CMF} = \exp\left(0.0054 \times \left|\,\text{intersection angle} - 90\,\right|\right)$$

Figure 1. Equation. CMFunction for intersection skew angle.

The CMFunction allows the user to calculate the CMF for a specific intersection skew angle compared to a baseline of 90 degrees. For example, if the intersection angle is 120 degrees, the CMF is exp(0.0054*|120° - 90°|) = 1.18. Note that the CMF is the same if the other angle of the intersection is used: exp(0.0054*|60° - 90°|) = 1.18.

As the intersection angle approaches 90 degrees, the CMF approaches 1.0. For instance, if the intersection angle is 100 degrees, the CMF is computed as exp(0.0054*|100° - 90°|) = 1.06.
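The skew-angle CMFunction can be evaluated directly. The sketch below reproduces the worked values in the text; the function name is an assumption made for illustration, while the 0.0054 coefficient and the 90-degree baseline come from the equation in figure 1.

```python
import math


def skew_angle_cmf(intersection_angle_deg: float, coefficient: float = 0.0054) -> float:
    """CMF relative to a 90-degree baseline for a rural, four-legged,
    stop-controlled intersection, per the equation in figure 1."""
    return math.exp(coefficient * abs(intersection_angle_deg - 90.0))


for angle in (60, 90, 100, 120):
    print(angle, round(skew_angle_cmf(angle), 2))
# 60 1.18
# 90 1.0
# 100 1.06
# 120 1.18
```

Because the function depends only on the absolute deviation from 90 degrees, the two supplementary angles of an intersection (for example, 60 and 120 degrees) give the same CMF.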

SPFs are essentially mathematical equations that relate the expected number of crashes of different types to site characteristics. These models always include traffic volume as a form of exposure but may also include site characteristics such as lane width, shoulder width, radius/degree of horizontal curves, presence of turn lanes (at intersections), and traffic control (at intersections).

The following is an example of an SPF for a segment of road:

$$\text{Crashes/mile/year} = \alpha \times \mathrm{AADT}^{b_1} \times \exp\left(b_2 \times \text{lane width}\right)$$

Figure 2. Equation. Example SPF.

Where α, b1, and b2 are parameters estimated in the modeling process, AADT is the estimated average annual daily traffic volume on the roadway, and lane width is the width of the travel lanes measured in feet.
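The example SPF can be written as a small function. This is a sketch only: the parameter values used below are hypothetical placeholders, not estimates from any study, and the function name is an assumption.

```python
import math


def spf_crashes_per_mile_year(aadt: float, lane_width_ft: float,
                              alpha: float, b1: float, b2: float) -> float:
    """Example SPF from figure 2: alpha * AADT**b1 * exp(b2 * lane width)."""
    if aadt <= 0:
        raise ValueError("AADT must be positive")
    return alpha * aadt ** b1 * math.exp(b2 * lane_width_ft)


# Hypothetical parameter values, for illustration only.
params = {"alpha": 0.0002, "b1": 0.9, "b2": -0.04}
for aadt in (2000, 4000, 8000):
    rate = spf_crashes_per_mile_year(aadt, lane_width_ft=12, **params)
    print(aadt, round(rate, 3))
```

With this functional form, doubling AADT multiplies the predicted crashes by 2 raised to the power b1, so a value of b1 below 1 implies that crashes grow less than proportionally with traffic volume.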

Safety performance functions are used in the development of CMFs through before-after studies and in this context are crash prediction models. With caution, they can be used to develop CMFs through cross-sectional studies; in this context they are explanatory models since the variable coefficients are used to estimate the CMFs that reflect the effect on safety of changing the value of a variable.
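To illustrate the cross-sectional use, consider an SPF of the form crashes = α × AADT^b1 × exp(b2 × lane width). Taking the ratio of predictions at two lane widths, with AADT held fixed, cancels every term except the lane-width term, leaving exp(b2 × change in lane width). The sketch below computes that ratio; the coefficient value is hypothetical, and the result inherits all of the cautions attached to cross-sectional estimates.

```python
import math


def cmf_from_spf_coefficient(b2: float, old_value: float, new_value: float) -> float:
    """Ratio of SPF predictions after/before changing one variable that enters
    the model as exp(b2 * value), holding all other variables fixed."""
    return math.exp(b2 * (new_value - old_value))


# Hypothetical lane-width coefficient; not an estimate from any study.
b2 = -0.04
print(round(cmf_from_spf_coefficient(b2, old_value=11, new_value=12), 3))
```

A negative coefficient gives a CMF below 1.0 for an increase in the variable, and reversing the change gives the reciprocal CMF.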


In road safety research, experimental studies are extremely rare. Researchers instead rely on observational data, collected retrospectively by observing the performance of an existing road system in which the treatment has already been implemented at some sites. Those sites are usually chosen not through a planned experiment but on engineering considerations, including safety. Several important issues are typically considered in the estimation of SPFs and CMFs.

Regression to the Mean in CMF Estimation From Before-After Studies

Regression to the mean (RTM) is the natural tendency of observed crashes to regress (return) to the mean in the year following an unusually high or low crash count. RTM effects arise when sites with randomly high short-term crash counts are selected for treatment and experience a subsequent reduction in crashes when these counts regress toward their true long-term mean. Not accounting for this will exaggerate any safety benefits estimated for sites with randomly high counts and underestimate the benefit for sites with randomly low counts.
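A small simulation can make the RTM effect concrete. In the sketch below, every site has the same true mean and no treatment is applied, yet the sites selected for their high "before" counts show an apparent reduction afterward. The Poisson crash model, the number of sites, the true mean of 5, and the 10-percent selection rule are all assumptions made for illustration.

```python
import math
import random


def poisson_sample(rng: random.Random, mean: float) -> int:
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_rtm(n_sites=2000, true_mean=5.0, top_fraction=0.1, seed=1):
    """Return (mean before, mean after) for sites selected on high before counts.

    No treatment is applied: both periods are drawn from the same true mean.
    """
    rng = random.Random(seed)
    before = [poisson_sample(rng, true_mean) for _ in range(n_sites)]
    after = [poisson_sample(rng, true_mean) for _ in range(n_sites)]
    ranked = sorted(range(n_sites), key=lambda i: before[i], reverse=True)
    selected = ranked[: int(n_sites * top_fraction)]
    mean_before = sum(before[i] for i in selected) / len(selected)
    mean_after = sum(after[i] for i in selected) / len(selected)
    return mean_before, mean_after


mean_before, mean_after = simulate_rtm()
print("selected sites, before:", round(mean_before, 2))
print("selected sites, after: ", round(mean_after, 2))
print("naive 'CMF' with no treatment:", round(mean_after / mean_before, 2))
```

The naive ratio falls well below 1.0 even though nothing changed at the sites, which is the exaggeration of benefit that a before-after study must correct for.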

Changes in Exposure in CMF Estimation From Before-After Studies

The greatest predictor of crashes is the amount of exposure, measured by the amount of traffic. If exposure changes at a site over time it is important to account for the impact of these changes on the expected number of crashes. This is particularly important for treatments that may impact exposure. For example, if a stop-controlled intersection is converted to a roundabout and vehicle delays are reduced then traffic volumes may increase as traffic is attracted from nearby routes.

Time Trends in CMF Estimation From Before-After Studies

Another confounding factor is general time trends in expected crashes. Such trends may arise from changes that are often unmeasured, including demographic changes, weather, crash reporting practices, and levels of enforcement.

Endogeneity Between Variables in Estimating CMFs from SPFs

In road safety, some explanatory variables may themselves depend on the dependent variable (crash frequency). Bias due to this endogeneity can lead to incorrect conclusions from a model; e.g., a model may show that a treatment is associated with an increased number of crashes when in reality the treatment reduces crashes. This becomes a critical issue if the SPF is used to estimate the CMF associated with a particular treatment. For example, left-turn lanes at intersections are likely to be implemented at sites with large numbers of left-turn-related crashes. A prediction model that includes the presence of left-turn lanes as an independent variable is therefore likely to suffer from endogeneity bias. This has been observed where conventional cross-sectional models have indicated a higher expected crash frequency at sites with left-turn lanes than at those without.

Correlation Between Predictor Variables in Estimating CMFs from SPFs

A high degree of correlation among explanatory variables in the model makes it very difficult to determine a reliable estimate of the effects of particular variables. For example, if horizontal curvature is correlated with clear zone/roadside hazards, then it is difficult to isolate the safety effect of horizontal curvature. It may be tempting to remove one of the correlated variables, but this can lead to omitted variable bias.


This section presents an overview of the statistical tools currently in common use for developing SPFs and CMFs.

Generalized Linear Modeling

The most common approach in road safety research for the development of SPFs is to apply generalized linear modeling with a negative binomial error distribution and log link function. The negative binomial distribution has been adopted because it is appropriate for non-negative count data (crash frequencies) and reflects the observed overdispersion found in crash data.
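The sketch below illustrates the two ingredients named above: a log link, under which the mean is the exponential of a linear predictor, and a negative binomial error distribution, whose variance exceeds its mean. The text does not specify a parameterization, so the common form with variance mu + k × mu² (k being the overdispersion parameter) is assumed here; the coefficient values are hypothetical.

```python
import math


def log_link_mean(intercept: float, slope: float, aadt: float) -> float:
    """Mean crash frequency under a log link: ln(mu) = intercept + slope * ln(AADT)."""
    return math.exp(intercept + slope * math.log(aadt))


def nb_log_pmf(y: int, mu: float, k: float) -> float:
    """Log-probability of y crashes for a negative binomial with mean mu and
    overdispersion k (assumed parameterization: variance = mu + k * mu**2)."""
    r = 1.0 / k
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


# Hypothetical coefficients and overdispersion, for illustration only.
mu = log_link_mean(intercept=-7.0, slope=0.9, aadt=8000)
k = 0.5
probs = [math.exp(nb_log_pmf(y, mu, k)) for y in range(400)]
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print("mean:", round(mean, 3), "variance:", round(variance, 3))
print("Poisson variance would equal the mean:", round(mu, 3))
```

Summing these log-probabilities over observed sites gives the log-likelihood that a fitting routine would maximize; in practice a statistical package would be used for estimation rather than hand-written code.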

Recent advances have seen some researchers apply alternate model specifications. Among these, Full Bayes Markov chain Monte Carlo (MCMC) methods are particularly appealing because they allow complex model forms, can account for spatial correlation, and can make use of prior information about the estimated parameters.

Determining Functional Form of Models

There are few available tools applied in road safety research for determining the appropriate model form. Typical measures of goodness of fit include the t-statistic of estimated parameters, chi-square statistics, Akaike's information criterion and the Bayesian Information Criterion.
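For reference, the two information criteria can be computed from a fitted model's maximized log-likelihood using their standard definitions. The sketch below uses hypothetical log-likelihood values for two candidate model forms; the function names are assumptions.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's information criterion: 2p - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: p ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits of two candidate model forms to the same 500 sites.
candidates = {
    "AADT only": {"log_likelihood": -812.4, "n_params": 3},
    "AADT + lane width": {"log_likelihood": -809.9, "n_params": 4},
}
for name, fit in candidates.items():
    print(name,
          round(aic(fit["log_likelihood"], fit["n_params"]), 1),
          round(bic(fit["log_likelihood"], fit["n_params"], n_obs=500), 1))
```

Lower values are preferred for both criteria. The Bayesian Information Criterion penalizes additional parameters more heavily once the sample exceeds about eight observations, so the two criteria can rank candidate models differently.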

Testing of variables for inclusion is sometimes done through a forward or backward stepwise regression. Some methods for determining the functional form are described below.

Integrate-Differentiate Method

The method is based on the Empirical Integral Function. To illustrate, consider the traffic volume variable AADT for road segments of equal length. The data are divided into groups (bins) by AADT, for example, 0-1,000, 1,001-2,000, etc. For each bin the average crash rate is determined, and the area of the bin is equal to this average crash rate multiplied by the bin width (1,000 in this case). The value of the Empirical Integral Function at a given bin boundary is then the sum of the bin areas from the lowest AADT group up to that boundary. Plotting these values against AADT reveals some order, whereas in a simple scatterplot of crashes versus a variable of interest it is very difficult to perceive any pattern.

The essence here is that there exists some function linking crashes to AADT. There then exists an Integral Function as well. We can use the Empirical Integral Function to make an informed judgment about what the true Integral Function is. If this is successful then the function linking crashes to the variable of interest is the derivative of the Integral Function.
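The construction of the Empirical Integral Function can be sketched as follows. The input layout (one AADT and crash count per equal-length segment), the half-open bin edges, and the treatment of empty bins as contributing zero area are assumptions made for this illustration.

```python
from collections import defaultdict


def empirical_integral_function(sites, bin_width=1000.0):
    """Return [(upper bin boundary, cumulative bin area), ...].

    sites: iterable of (aadt, crash_count) for road segments of equal length.
    Each bin's area is its average crash count multiplied by the bin width.
    Bins are half-open, [0, width), [width, 2*width), ... (an assumption);
    empty bins contribute zero area (also an assumption).
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin index -> [crash sum, site count]
    for aadt, crashes in sites:
        index = int(aadt // bin_width)
        totals[index][0] += crashes
        totals[index][1] += 1
    if not totals:
        return []
    points, cumulative = [], 0.0
    for index in range(max(totals) + 1):
        crash_sum, n_sites = totals.get(index, (0.0, 0))
        average = crash_sum / n_sites if n_sites else 0.0
        cumulative += average * bin_width
        points.append(((index + 1) * bin_width, cumulative))
    return points


# Hypothetical data: (AADT, crashes) for five equal-length segments.
data = [(500, 1), (700, 3), (1500, 4), (2200, 5), (2900, 7)]
for boundary, value in empirical_integral_function(data):
    print(boundary, value)
```

Differencing successive values and dividing by the bin width recovers the bin averages, which is the discrete counterpart of taking the derivative of the Integral Function.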

Analysis of Over Versus Under Prediction

In this method a model without the variable of interest is applied to the data. Then using the variable of interest, the data are divided into groups (e.g., 10-ft lanes, 11-ft lanes etc.). The ratio of observed/predicted for each group is then determined and plotted versus the value of the variable defining the group. The plot is used to infer an appropriate relationship between the dependent variable and the variable of interest.
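The grouping step can be sketched as below. The record layout and the example numbers are hypothetical; the predictions are assumed to come from a model fitted without the variable of interest (lane width here).

```python
from collections import defaultdict


def observed_to_predicted_by_group(records):
    """Return {group value: sum(observed) / sum(predicted)}.

    records: iterable of (group_value, observed_crashes, predicted_crashes),
    where the predictions come from a model that omits the grouping variable.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # group -> [observed sum, predicted sum]
    for group, observed, predicted in records:
        sums[group][0] += observed
        sums[group][1] += predicted
    return {group: obs / pred
            for group, (obs, pred) in sorted(sums.items()) if pred > 0}


# Hypothetical sites: (lane width in ft, observed crashes, predicted crashes).
records = [(10, 6, 4.0), (10, 4, 4.0), (11, 3, 4.0), (12, 2, 4.0)]
for lane_width, ratio in observed_to_predicted_by_group(records).items():
    print(lane_width, ratio)
# 10 1.25
# 11 0.75
# 12 0.5
```

A ratio that trends steadily with the grouping variable, as in this made-up example, suggests that the variable belongs in the model, and the shape of the trend hints at the functional form.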

Cumulative Residual (CURE) Plots

In the CURE method the cumulative residuals (the difference between the observed and predicted values for each observation) are plotted in increasing order for each covariate separately. Also plotted are graphs of the 95-percent confidence limits. If there is no bias in the model, the plot of cumulative residuals should oscillate around the x-axis without systematic over or under-prediction, and stay inside of these confidence limits. In the context of CURE plots, it is important to recognize that the plot is not only a reflection of the functional form of the particular explanatory variable, but also whether other relevant explanatory factors have been included in the model in an appropriate form (i.e., the extent to which there is omitted variable bias).
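The quantities plotted in a CURE plot can be computed as in the sketch below. The text does not give a formula for the confidence limits, so a commonly used approximation is assumed: the variance of the cumulative residual at observation i is S(i) × (1 − S(i)/S(n)), where S(i) is the running sum of squared residuals, and the limits are ±1.96 standard deviations.

```python
import math


def cure_plot_data(covariate, observed, predicted, z=1.96):
    """Return [(covariate value, cumulative residual, lower limit, upper limit)],
    with observations sorted by increasing covariate value.

    The limits use an assumed approximation: variance at i is
    S(i) * (1 - S(i) / S(n)), where S is the running sum of squared residuals.
    """
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    residuals = [observed[i] - predicted[i] for i in order]
    total_sq = sum(r * r for r in residuals)
    rows, cumulative, cumulative_sq = [], 0.0, 0.0
    for i, residual in zip(order, residuals):
        cumulative += residual
        cumulative_sq += residual * residual
        variance = cumulative_sq * (1.0 - cumulative_sq / total_sq) if total_sq > 0 else 0.0
        limit = z * math.sqrt(max(variance, 0.0))
        rows.append((covariate[i], cumulative, -limit, limit))
    return rows


# Hypothetical data: AADT, observed crashes, and model-predicted crashes.
aadt = [3000, 1000, 2000, 4000]
observed = [2, 0, 1, 3]
predicted = [1.5, 0.8, 1.2, 2.5]
for row in cure_plot_data(aadt, observed, predicted):
    print(row[0], round(row[1], 2), round(row[2], 2), round(row[3], 2))
```

Under this approximation the limits close to zero at the last observation. Long runs of the cumulative residual outside the limits, or a steady drift in one direction, point to a poor functional form or an omitted variable.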



Turner-Fairbank Highway Research Center | 6300 Georgetown Pike | McLean, VA | 22101