U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000
Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations
This report is an archived publication and may contain dated technical, contact, and link information 

Publication Number: FHWA-HRT-14-081 Date: November 2014
The intent of this introductory section is to provide basic background material for statisticians who have not previously worked on these topics. Following a brief overview, it describes the main issues encountered, and the statistical tools applied, by researchers currently working in these areas. This section provides context for the main section that follows on opportunities for advancing the methodologies for crash modification factor (CMF) and safety performance function (SPF) estimation.
A CMF is a multiplicative factor used to compute the number of crashes that would be expected after implementing a given countermeasure at an existing roadway site or after making a change to a roadway being designed. The CMF is multiplied by the expected crash frequency without the countermeasure. A CMF greater than 1.0 indicates an expected increase in crashes, while a value less than 1.0 indicates an expected reduction in crashes. For example, a CMF of 0.8 indicates a 20 percent expected reduction in crashes.
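The arithmetic described above can be sketched in a few lines. This is a minimal illustration only; the function names and the baseline of 10 crashes per year are hypothetical and do not come from the report.

```python
def expected_crashes_with_countermeasure(expected_without: float, cmf: float) -> float:
    """Multiply the expected crash frequency without the countermeasure by the CMF."""
    if expected_without < 0 or cmf < 0:
        raise ValueError("crash frequency and CMF must be non-negative")
    return cmf * expected_without


def percent_change(cmf: float) -> float:
    """Expected percent change in crashes implied by a CMF (negative = reduction)."""
    return (cmf - 1.0) * 100.0


# Hypothetical site: 10 expected crashes per year without the countermeasure.
baseline = 10.0
with_treatment = expected_crashes_with_countermeasure(baseline, 0.8)
print(f"expected crashes with countermeasure: {with_treatment:.1f}")
print(f"expected change: {percent_change(0.8):.0f} percent")
```

A CMF of 1.0 leaves the expected crash frequency unchanged, which is why values are interpreted relative to 1.0.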
A CMFunction is a formula used to compute the CMF for a specific site based on its characteristics. It is not always reasonable to assume a uniform safety effect for all sites with different characteristics (e.g., safety benefits may be greater for high traffic volumes). A countermeasure may also have several levels or potential values (e.g., improving intersection skew angle, or widening a shoulder). A crash modification function allows the CMF to change over the range of a variable or combination of variables. Where possible, it is preferable to develop CMFunctions as opposed to a single CMF value since safety effectiveness most likely varies based on site characteristics. In practice, however, this is often difficult since more data are required to detect such differences.
The CMFunction for improving intersection skew angle at a rural, four-legged, stop-controlled intersection is a function of the absolute value of intersection angle minus 90 degrees, where the intersection angle is in degrees, as shown in the equation in figure 1.
CMF = exp(0.0054*|intersection angle - 90°|)
Figure 1. Equation. CMFunction for intersection skew angle.
The CMFunction allows the user to calculate the CMF for a specific intersection skew angle compared to a baseline of 90 degrees. For example, if the intersection angle is 120 degrees, the CMF is exp(0.0054*|120° - 90°|) = 1.18. Note that the CMF is the same if the other angle of the intersection is used: exp(0.0054*|60° - 90°|) = 1.18.
As the intersection angle approaches 90 degrees, the CMF approaches 1.0. For instance, if the intersection angle is 100 degrees, the CMF is computed as exp(0.0054*|100° - 90°|) = 1.06.
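The worked examples above can be reproduced with a short sketch. The coefficient 0.0054 is taken from the examples in the text; the function and constant names are illustrative assumptions.

```python
import math

# Coefficient taken from the worked examples in the text.
SKEW_COEFFICIENT = 0.0054


def skew_angle_cmf(intersection_angle_deg: float) -> float:
    """CMF relative to a 90-degree baseline, based on the absolute skew."""
    skew = abs(intersection_angle_deg - 90.0)
    return math.exp(SKEW_COEFFICIENT * skew)


for angle in (60, 90, 100, 120):
    print(f"{angle} degrees: CMF = {skew_angle_cmf(angle):.2f}")
```

Because the function depends only on the absolute difference from 90 degrees, the two supplementary angles of the same intersection (for example, 60 and 120 degrees) give the same CMF.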
SPFs are essentially mathematical equations that relate the expected number of crashes of different types to site characteristics. These models always include traffic volume as a form of exposure but may also include site characteristics such as lane width, shoulder width, radius/degree of horizontal curves, presence of turn lanes (at intersections), and traffic control (at intersections).
The following is an example of an SPF for a segment of road:
Figure 2. Equation. Example SPF.
where α, b_{1}, and b_{2} are parameters estimated in the modeling process, AADT is the estimated average annual daily traffic volume on the roadway, and lane width is the width of the travel lanes measured in feet.
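The equation in figure 2 is not reproduced here, so the sketch below assumes one commonly used multiplicative form, crashes per mile per year = α × AADT^b_{1} × exp(b_{2} × lane width). Both this functional form and the parameter values are assumptions chosen for illustration, not the report's fitted model.

```python
import math


def example_spf(aadt: float, lane_width_ft: float,
                alpha: float, b1: float, b2: float) -> float:
    """Assumed SPF form: alpha * AADT**b1 * exp(b2 * lane width)."""
    if aadt <= 0:
        raise ValueError("AADT must be positive")
    return alpha * aadt ** b1 * math.exp(b2 * lane_width_ft)


# Hypothetical parameter values, for illustration only.
ALPHA, B1, B2 = 0.001, 0.8, -0.05

for aadt in (2000, 5000, 10000):
    predicted = example_spf(aadt, lane_width_ft=12.0, alpha=ALPHA, b1=B1, b2=B2)
    print(f"AADT {aadt}: {predicted:.2f} expected crashes per mile per year")
```

With a positive b_{1} the prediction rises with traffic volume, and with a negative b_{2} it falls as lanes widen; the signs and magnitudes in a real SPF come from the fitted model.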
Safety performance functions are used in the development of CMFs through before-after studies, and in this context they are crash prediction models. With caution, they can also be used to develop CMFs through cross-sectional studies; in this context they are explanatory models, since the variable coefficients are used to estimate CMFs that reflect the effect on safety of changing the value of a variable.
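To illustrate how a coefficient can be turned into a CMF in the cross-sectional case, the sketch below assumes that a variable enters the SPF through a term of the form exp(b × value). Under that assumption, changing the variable from one value to another multiplies the prediction by exp(b × change). The functional form and the coefficient value are hypothetical, and the cautions noted above about cross-sectional inference still apply.

```python
import math


def cross_sectional_cmf(coefficient: float, value_before: float, value_after: float) -> float:
    """CMF implied by a log-linear term exp(coefficient * value) in an SPF."""
    return math.exp(coefficient * (value_after - value_before))


# Hypothetical lane-width coefficient (per foot), for illustration only.
B_LANE_WIDTH = -0.05

cmf = cross_sectional_cmf(B_LANE_WIDTH, value_before=10.0, value_after=12.0)
print(f"implied CMF for widening lanes from 10 ft to 12 ft: {cmf:.3f}")
```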
In road safety research, experimental studies are extremely rare. There is a reliance on observational data, meaning that data are collected retrospectively by observing the performance of an existing road system, where the treatment has already been implemented at some sites, usually not on the basis of a planned experiment, but on engineering considerations, including safety. There are several important issues that are typically considered in the estimation of SPFs and CMFs.
Regression to the mean (RTM) is the natural tendency of observed crashes to regress (return) to the mean in the year following an unusually high or low crash count. RTM effects arise when sites with randomly high short-term crash counts are selected for treatment and experience a subsequent reduction in crashes when these counts regress toward their true long-term mean. Not accounting for this will exaggerate any safety benefits estimated for sites with randomly high counts and underestimate the benefit for sites with randomly low counts.
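A small simulation can make the RTM effect concrete. The sketch below assumes, purely for illustration, that every site has the same true mean of 5 crashes per period, that counts are Poisson distributed, and that sites are "selected" when the before-period count is 9 or more. No treatment is applied, yet the selected sites show an apparent reduction.

```python
import math
import random


def poisson_draw(rng: random.Random, mean: float) -> int:
    """Draw a Poisson count using Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_rtm(n_sites: int = 20000, true_mean: float = 5.0,
                 selection_threshold: int = 9, seed: int = 1):
    """Return (mean before, mean after, number of sites) for the selected sites."""
    rng = random.Random(seed)
    before_counts, after_counts = [], []
    for _ in range(n_sites):
        before = poisson_draw(rng, true_mean)
        after = poisson_draw(rng, true_mean)  # same true mean: no treatment effect
        if before >= selection_threshold:
            before_counts.append(before)
            after_counts.append(after)
    n_selected = len(before_counts)
    if n_selected == 0:
        return 0.0, 0.0, 0
    return sum(before_counts) / n_selected, sum(after_counts) / n_selected, n_selected


mean_before, mean_after, n_selected = simulate_rtm()
print(f"{n_selected} sites selected for high before-period counts")
print(f"mean before: {mean_before:.2f}, mean after: {mean_after:.2f}")
```

A naive before-after comparison on the selected sites would attribute the drop to a treatment, even though the after-period mean simply reflects the unchanged true mean.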
The greatest predictor of crashes is the amount of exposure, measured by the amount of traffic. If exposure changes at a site over time, it is important to account for the impact of these changes on the expected number of crashes. This is particularly important for treatments that may impact exposure. For example, if a stop-controlled intersection is converted to a roundabout and vehicle delays are reduced, then traffic volumes may increase as traffic is attracted from nearby routes.
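One way to account for a change in exposure is to scale the expected crash frequency by the ratio of SPF predictions at the two traffic volumes. The sketch below assumes the SPF is proportional to AADT raised to a power, so the ratio reduces to (AADT after / AADT before) raised to that power. The exponent of 0.8 and the traffic volumes are hypothetical.

```python
def exposure_adjustment(aadt_before: float, aadt_after: float, aadt_exponent: float) -> float:
    """Ratio of expected crashes after/before due only to the change in AADT,
    assuming expected crashes are proportional to AADT**aadt_exponent."""
    if aadt_before <= 0 or aadt_after <= 0:
        raise ValueError("AADT values must be positive")
    return (aadt_after / aadt_before) ** aadt_exponent


# Hypothetical example: traffic grows 20 percent after a roundabout conversion.
factor = exposure_adjustment(aadt_before=10000, aadt_after=12000, aadt_exponent=0.8)
print(f"expected crashes change by a factor of {factor:.3f} from exposure alone")
```

With an exponent below 1.0, expected crashes grow less than proportionally with traffic, which is why a simple crash-rate comparison can mislead.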
Another confounding factor is general time trends in expected crashes. Time trends may arise from unmeasured changes in, for example, demographics, weather, crash reporting practices, and levels of enforcement.
Endogeneity arises when some of the explanatory variables themselves depend on the dependent variable (frequency of crashes), a situation that is common in road safety. Bias due to endogeneity can lead to incorrect conclusions from a model; e.g., a model may show that a treatment is associated with an increased number of crashes when in reality the treatment reduces crashes. This becomes a critical issue if the SPF is used to estimate the CMF associated with a particular treatment. For example, left-turn lanes at intersections are likely to be implemented at sites with large numbers of left-turn-related crashes. Therefore, a prediction model that includes the presence of left-turn lanes as an independent variable is likely to suffer from endogeneity bias. This has been found where conventional cross-sectional models have indicated a higher expected crash frequency at sites with left-turn lanes than at those without.
A high degree of correlation among explanatory variables in the model makes it very difficult to determine a reliable estimate of the effects of particular variables. For example, if horizontal curvature is correlated with clear zone/roadside hazards, then it is difficult to isolate the safety effect of horizontal curvature. It may be tempting to remove one of the correlated variables, but this can lead to omitted variable bias.
This section presents an overview of commonly applied statistical tools in the development of SPFs and CMFs at present.
The most common approach in road safety research for the development of SPFs is to apply generalized linear modeling with a negative binomial error distribution and log link function. The negative binomial distribution has been adopted because it is appropriate for nonnegative count data (crash frequencies) and reflects the observed overdispersion found in crash data.
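Overdispersion can be illustrated by simulating counts as a gamma mixture of Poisson distributions, which yields a negative binomial distribution. The sketch assumes the common parameterization in which the variance equals mean + dispersion × mean²; that parameterization and the values used (mean 3, dispersion 0.5) are assumptions for illustration and are not taken from the report.

```python
import math
import random
import statistics


def poisson_draw(rng: random.Random, mean: float) -> int:
    """Draw a Poisson count using Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial_draw(rng: random.Random, mean: float, dispersion: float) -> int:
    """Gamma-Poisson mixture: the site's Poisson rate is gamma distributed with
    shape 1/dispersion and scale mean*dispersion, so its expected value is `mean`."""
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson_draw(rng, rate)


rng = random.Random(42)
counts = [negative_binomial_draw(rng, mean=3.0, dispersion=0.5) for _ in range(20000)]
sample_mean = statistics.fmean(counts)
sample_variance = statistics.pvariance(counts)
print(f"sample mean: {sample_mean:.2f}, sample variance: {sample_variance:.2f}")
```

For a pure Poisson process the variance would be close to the mean; here the variance is clearly larger, which is the pattern the negative binomial error distribution is meant to accommodate.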
Recent advances have seen some researchers apply alternate model specifications including the following:
The Full Bayes MCMC methods are particularly appealing in that they have the capability of allowing complex model forms, accounting for spatial correlation and the use of prior information about estimated parameters.
There are few available tools applied in road safety research for determining the appropriate model form. Typical measures of goodness of fit include the t-statistic of estimated parameters, chi-square statistics, Akaike's information criterion and the Bayesian Information Criterion.
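The two information criteria mentioned above have standard definitions: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n is the number of observations, and L is the maximized likelihood. The sketch below applies them to two hypothetical candidate models; the log-likelihood values are invented for illustration.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's information criterion; lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; penalizes parameters more as n_obs grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two SPF specifications to the same 500 sites.
candidates = {
    "AADT only": {"log_likelihood": -812.4, "n_params": 3},
    "AADT + lane width": {"log_likelihood": -809.9, "n_params": 4},
}
for name, fit in candidates.items():
    print(name,
          f"AIC = {aic(fit['log_likelihood'], fit['n_params']):.1f}",
          f"BIC = {bic(fit['log_likelihood'], fit['n_params'], n_obs=500):.1f}")
```

Both criteria are only comparable across models fitted to the same data set.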
Testing of variables for inclusion is sometimes done through a forward or backward stepwise regression. Some methods for determining the functional form are described below.
Integrate-Differentiate Method
The method is based on the Empirical Integral Function. To illustrate, we will use the traffic volume variable AADT for road segments of equal length. The data are divided into groups, for example, 0-1,000, 1,001-2,000, etc. For each group the average crash rate is determined, and the area of the bin is equal to this average crash rate multiplied by the bin width (1,000 in this case). The value of the Empirical Integral Function at a bin boundary is then the sum of bin areas from the lowest AADT group up to that boundary. In such a plot some order can be seen, whereas in a simple scatterplot of crashes versus a variable of interest it is very difficult to perceive any pattern.
The essence here is that there exists some function linking crashes to AADT. There then exists an Integral Function as well. We can use the Empirical Integral Function to make an informed judgment about what the true Integral Function is. If this is successful then the function linking crashes to the variable of interest is the derivative of the Integral Function.
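The construction of the Empirical Integral Function described above can be sketched as follows. The site data are hypothetical, AADT values are assumed to be integers, and treating an empty bin as contributing zero area is a simplifying assumption of this sketch rather than part of the method as described.

```python
def empirical_integral_function(sites, bin_width: int = 1000):
    """Return (upper bin boundary, cumulative area) pairs.

    `sites` is an iterable of (aadt, crashes) for equal-length segments.
    Bins follow the text: 0-1,000, 1,001-2,000, and so on.
    """
    totals = {}  # bin index -> (number of sites, total crashes)
    for aadt, crashes in sites:
        index = (aadt - 1) // bin_width if aadt > 0 else 0
        n_sites, crash_sum = totals.get(index, (0, 0.0))
        totals[index] = (n_sites + 1, crash_sum + crashes)
    if not totals:
        return []

    points = []
    cumulative_area = 0.0
    for index in range(max(totals) + 1):
        if index in totals:
            n_sites, crash_sum = totals[index]
            average_crashes = crash_sum / n_sites
            cumulative_area += average_crashes * bin_width
        # Empty bins add no area (simplifying assumption).
        points.append(((index + 1) * bin_width, cumulative_area))
    return points


# Hypothetical (AADT, crash count) pairs for equal-length segments.
example_sites = [(500, 1), (800, 3), (1500, 4), (2500, 6)]
for boundary, area in empirical_integral_function(example_sites):
    print(f"AADT up to {boundary}: cumulative area = {area:.0f}")
```

Plotting the returned points against the bin boundaries gives the curve from which a candidate Integral Function can be judged; differentiating that candidate then gives the function linking crashes to AADT.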
Analysis of Over Versus Under Prediction
In this method a model without the variable of interest is applied to the data. Then, using the variable of interest, the data are divided into groups (e.g., 10-ft lanes, 11-ft lanes, etc.). The ratio of observed/predicted for each group is then determined and plotted versus the value of the variable defining the group. The plot is used to infer an appropriate relationship between the dependent variable and the variable of interest.
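The grouping step can be sketched as below. The records are hypothetical, and computing each group's ratio as total observed crashes divided by total predicted crashes is one reasonable reading of "observed/predicted for each group", stated here as an assumption.

```python
from collections import defaultdict


def observed_to_predicted_by_group(records):
    """Return {group value: sum(observed) / sum(predicted)}.

    `records` is an iterable of (group_value, observed, predicted), where the
    predictions come from a model that omits the variable of interest.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for group_value, observed, predicted in records:
        sums[group_value][0] += observed
        sums[group_value][1] += predicted
    return {
        group_value: observed_total / predicted_total
        for group_value, (observed_total, predicted_total) in sorted(sums.items())
        if predicted_total > 0
    }


# Hypothetical (lane width in ft, observed crashes, predicted crashes) records.
example_records = [(10, 6, 4.0), (10, 6, 4.0), (11, 5, 5.0), (12, 3, 4.0)]
for lane_width, ratio in observed_to_predicted_by_group(example_records).items():
    print(f"{lane_width}-ft lanes: observed/predicted = {ratio:.2f}")
```

In this made-up example the ratio falls as lane width increases, which would suggest adding lane width to the model with a decreasing relationship.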
Cumulative Residual (CURE) Plots
In the CURE method the cumulative residuals (running sums of the residuals, i.e., the differences between the observed and predicted values for each observation) are plotted for each covariate separately, with the observations sorted in increasing order of that covariate. Also plotted are graphs of the 95-percent confidence limits. If there is no bias in the model, the plot of cumulative residuals should oscillate around the x-axis without systematic over- or under-prediction, and stay inside these confidence limits. In the context of CURE plots, it is important to recognize that the plot is not only a reflection of the functional form of the particular explanatory variable, but also of whether other relevant explanatory factors have been included in the model in an appropriate form (i.e., the extent to which there is omitted variable bias).
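The quantities behind a CURE plot can be computed as sketched below. The text does not give a formula for the confidence limits, so this sketch assumes an approximation commonly used in the CURE literature: ±1.96 × sqrt(S(i) × (1 - S(i)/S(N))), where S(i) is the running sum of squared residuals and S(N) is its total. The data are hypothetical.

```python
import math


def cure_points(records):
    """Return (covariate, cumulative residual, 95-percent limit) triples.

    `records` is an iterable of (covariate, observed, predicted). The limit is
    symmetric, so the plotted bounds are +limit and -limit.
    """
    ordered = sorted(records, key=lambda record: record[0])
    residuals = [observed - predicted for _, observed, predicted in ordered]
    total_squared = sum(r * r for r in residuals)

    points = []
    cumulative_residual = 0.0
    cumulative_squared = 0.0
    for (covariate, _, _), residual in zip(ordered, residuals):
        cumulative_residual += residual
        cumulative_squared += residual * residual
        if total_squared > 0:
            variance = cumulative_squared * (1.0 - cumulative_squared / total_squared)
        else:
            variance = 0.0
        limit = 1.96 * math.sqrt(max(variance, 0.0))  # assumed limit formula
        points.append((covariate, cumulative_residual, limit))
    return points


# Hypothetical (AADT, observed crashes, predicted crashes) records.
example_data = [(3000, 2, 1.5), (1000, 0, 0.5), (2000, 1, 1.0)]
for aadt, cumulative, limit in cure_points(example_data):
    print(f"AADT {aadt}: cumulative residual = {cumulative:+.2f}, limits = +/-{limit:.2f}")
```

Under this formula the limits shrink to zero at the last observation, so a well-fitting model produces a curve that wanders within the limits and ends near zero.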