U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
This report is an archived publication and may contain dated technical, contact, and link information.
Publication Number: FHWA-HRT-10-024
Date: April 2010
Development of a Speeding-Related Crash Typology
OVERVIEW OF ANALYSIS METHODOLOGY
The goal of this study is to determine which crash–, vehicle–, and driver–related factors are more likely to be found in SR crashes. The list of possible variables (e.g., crash type) and combinations of variables (e.g., crash type by urban versus rural roadway class by speed limit) is almost limitless. Identifying the most important variables is difficult because importance depends in part on the interest of the user or the type of treatment program being considered. For example, roadway–based treatments (e.g., traffic–calming measures) might be better identified or targeted by location–type analyses, while enforcement or educational treatments would be more related to driver variables. It was also difficult to combine vehicle– and driver–related factors in the same analyses as the broader crash factors because the decision of how to code each driver in a crash was complex. For example, while a crash can be classified as SR if one or more vehicles is SR, not all drivers in that crash should be thought of as SR. Indeed, in many multivehicle collisions, only one of the drivers would be speeding; thus, a comparison of driver age for speeding and nonspeeding drivers must be done on a vehicle basis rather than a crash basis.
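The distinction between a crash basis and a vehicle basis can be illustrated with a minimal Python sketch. The records, field layout, and function names below are hypothetical and are not taken from the study databases; the sketch only shows that a crash is flagged as SR when any involved vehicle is speeding, while driver age is compared per vehicle.

```python
from collections import defaultdict

# Hypothetical vehicle-level records: (crash_id, vehicle_id, driver_age, speeding)
vehicles = [
    (1, "A", 19, True),
    (1, "B", 45, False),
    (2, "A", 33, False),
    (2, "B", 61, False),
    (3, "A", 22, True),
]

def crash_level_sr(records):
    """A crash is SR if one or more involved vehicles is speeding."""
    sr = defaultdict(bool)
    for crash_id, _vehicle_id, _age, speeding in records:
        sr[crash_id] = sr[crash_id] or speeding
    return dict(sr)

def ages_by_speeding(records):
    """Group driver ages on a vehicle basis, not a crash basis."""
    groups = {True: [], False: []}
    for _crash_id, _vehicle_id, age, speeding in records:
        groups[speeding].append(age)
    return groups

crash_sr = crash_level_sr(vehicles)
age_groups = ages_by_speeding(vehicles)
```

In this made-up example, crash 1 is an SR crash, yet the 45-year-old driver in it falls in the nonspeeding group, which is why the driver comparison cannot be made from crash-level flags.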
Given the nearly unlimited number of possible factors of interest, a decision was made to conduct a two–part analysis. In the first part, a series of single–variable tables was produced for key crash, vehicle, and driver variables. Each variable–specific table was examined to determine which categories of that variable had the highest SR frequency and the highest SR percentage. Such single–variable tables provide valuable information on SR crashes; however, they do not indicate which variables are most important in terms of speeding, nor do they provide information on combinations of variables or on the interactions between variables. The second part of the analysis attempted to do this through the use of classification trees as produced by the classification and regression tree (CART) software that is available in SAS®.(10)
SINGLE–VARIABLE TABLE ANALYSES
As indicated previously, single–variable tables were created from each dataset/definition for a large number of variables. The choice of variables to be examined was based to some extent on the results of past studies of SR issues, particularly on the earlier study by Bowie and Walz.(4) The factors describing the overall nature of each crash (e.g., crash type, crash location, etc.) were examined using a crash–based file in which a crash was counted as SR if any involved vehicle was speeding, and the vehicle– and driver–based factors were examined in a vehicle–based file in which each vehicle was classified as speeding or not. In the results section, three tables are presented for each variable: the first contains GES and FARS results if both are available, and the other two contain results for both definitions for each of the two States. In general, a category is defined as over–represented if it is characterized by a high percentage of SR crashes, drivers, or vehicles. Whether this is the most helpful way to characterize these findings if they are to be used in treatment development or targeting is discussed below in the interpretation of results section. A brief discussion describing the consistency of findings across the databases and definitions is included below each table.
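As an illustration of what one single–variable table contains, the following Python sketch tabulates the total frequency, SR frequency, and SR percentage for each category of one variable. The crash records and category names are invented for the example, and the function name is an assumption; it is not the procedure used to produce the tables in this report.

```python
from collections import Counter

# Hypothetical crash-level records: (category of one variable, SR flag)
crashes = [
    ("run-off-road", True), ("run-off-road", True), ("run-off-road", False),
    ("rear-end", True), ("rear-end", False), ("rear-end", False), ("rear-end", False),
    ("angle", False), ("angle", False),
]

def single_variable_table(records):
    """Return total count, SR count, and SR percentage for each category."""
    totals = Counter(cat for cat, _ in records)
    sr_counts = Counter(cat for cat, sr in records if sr)
    table = {}
    for cat, total in totals.items():
        sr = sr_counts[cat]
        table[cat] = {"total": total, "sr": sr, "sr_pct": round(100.0 * sr / total, 1)}
    return table

table = single_variable_table(crashes)

# List categories from the highest SR percentage to the lowest.
for cat, row in sorted(table.items(), key=lambda kv: -kv[1]["sr_pct"]):
    print(f"{cat:15s} {row['total']:5d} {row['sr']:5d} {row['sr_pct']:6.1f}")
```

A category with a high SR percentage may still contain few crashes, so the frequency and the percentage are kept side by side in each row.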
IDENTIFICATION OF CRITICAL FACTORS USING CLASSIFICATION TREES
Although the analyses of single–variable tables provide useful information about SR crashes and vehicles/drivers in crashes, they do not automatically indicate which factors/variables are the most critical with regard to SR crashes or speeding drivers. They also do not indicate which combinations of variables are the most important. One way to identify the critical roadway, vehicle, and driver factors associated with an increased likelihood of an SR crash is to estimate a logistic regression model with the roadway, vehicle, and driver factors as independent variables and then to identify the statistically significant factors. Logistic regression is a parametric approach that is based on assumptions about error distributions. The CART methodology is nonparametric and does not require any such assumptions. In addition, CART is able to include a relatively large number of independent variables and to identify complex interactions between these variables more efficiently than logistic regression. For example, CART is able to determine not only the most important variable and categories within that variable in terms of the risk of an SR crash, but also the most important second–level variable within the most important categories of the first–level variable, etc. That is, given the most important variable with respect to the proportion of SR crashes (e.g., manner of collision) and the subgroup of categories within that variable with the highest proportion of SR crashes (e.g., run–off–road crashes), CART is able to determine the next most important variable within these high–risk categories (e.g., road surface condition) and the categories of that variable that are most important (e.g., snow and sleet). It is hoped that these variables and categories are helpful in determining needed treatments. For these reasons, it was decided that classification trees would be used as the second type of analysis in this project.
Thus, the goals of the CART analysis are as follows: (1) to determine which variables available for examination are most important in terms of predicting SR crashes, (2) to determine which categories within that variable predict the highest risk/proportion of SR crashes, (3) to determine the second most important variable and subset of categories in terms of predicting SR crashes within this highest risk subset of categories of the first variable, and (4) to repeat the process to determine the third, fourth, and subsequent variables. This produces a tree with multiple branches that can be traced down to determine the most important combinations (or subsets) of variable categories in terms of predicting SR crashes. In the simplest terms, the CART procedure splits the categories of each variable in the database into all possible binary (two–category) combinations (nodes), calculates the SR risk within each part (node) of each pair, and determines which pair (i.e., which two sets of categories) produces the largest difference in SR risk within that variable. By repeating this process for each variable in the database, CART determines the two sets of categories producing the largest difference in risk of SR crash within each variable. This largest difference in risk is then compared across all variables to determine the one variable (and the sets of categories) that produces the largest of all differences. This variable is the top of the tree, and the two sets of categories within that variable are the first two branches of the tree. This process is then repeated within each of the two branches of the first variable to identify the second, third, and subsequent variables.
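The simplified procedure described above can be sketched in Python as follows. This is only an illustration of the plain–language description (largest difference in SR proportion between two nodes); it is not the SAS® implementation, which may use a different splitting criterion, and the records, variable names, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical records: a dict of variable values plus an SR flag.
crashes = [
    ({"collision": "run-off-road", "surface": "snow"}, True),
    ({"collision": "run-off-road", "surface": "snow"}, True),
    ({"collision": "run-off-road", "surface": "dry"}, True),
    ({"collision": "run-off-road", "surface": "dry"}, False),
    ({"collision": "rear-end", "surface": "dry"}, False),
    ({"collision": "rear-end", "surface": "snow"}, False),
    ({"collision": "angle", "surface": "dry"}, False),
    ({"collision": "angle", "surface": "dry"}, False),
]

def sr_rate(records):
    """Proportion of records flagged as SR."""
    return sum(sr for _, sr in records) / len(records)

def best_split_for_variable(records, var):
    """Return (difference, left_categories) for the best binary split of var."""
    cats = sorted({row[var] for row, _ in records})
    best = (0.0, None)
    # Keeping cats[0] on the left avoids counting mirror-image splits twice.
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {cats[0], *extra}
            if len(left) == len(cats):
                continue  # the right node would be empty
            left_rows = [rec for rec in records if rec[0][var] in left]
            right_rows = [rec for rec in records if rec[0][var] not in left]
            diff = abs(sr_rate(left_rows) - sr_rate(right_rows))
            if diff > best[0]:
                best = (diff, frozenset(left))
    return best

def best_split(records, variables):
    """Compare each variable's best split and keep the largest difference."""
    results = {var: best_split_for_variable(records, var) for var in variables}
    top_var = max(results, key=lambda v: results[v][0])
    return top_var, results[top_var]

top_var, (top_diff, top_left) = best_split(crashes, ["collision", "surface"])
```

With these made-up records, the collision variable forms the top of the tree, separating run-off-road crashes from the other two categories. Calling `best_split` again on the records within one branch would give the second-level variable for that branch.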
For a categorical variable (e.g., manner of collision, month of crash, etc.), all possible binary combinations of categories are compared (e.g., category 1 versus categories 2–5, categories 1 and 2 versus categories 3–5, categories 1 and 3 versus categories 2, 4, and 5, etc.). For ordinal variables (e.g., speed limit), all cases with a value smaller than or equal to a certain value go to one node, and all other cases go to the other node (e.g., speed limit ≤ 30 mi/h versus speed limit ≥ 35 mi/h; speed limit ≤ 35 mi/h versus speed limit ≥ 40 mi/h, etc.).
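The two kinds of candidate splits can be enumerated with a short Python sketch. The function names and the example speed limit values are assumptions made for illustration. For a categorical variable with k categories there are 2^(k-1) - 1 distinct binary splits, while an ordinal variable with k distinct values has only k - 1 threshold splits.

```python
from itertools import combinations

def categorical_splits(categories):
    """All distinct two-way partitions of an unordered set of categories."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Stop before r == len(rest) so that the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [c for c in cats if c not in left]
            splits.append((left, right))
    return splits

def ordinal_splits(values):
    """Threshold splits: values <= cut go left, all larger values go right."""
    ordered = sorted(set(values))
    return [(ordered[: i + 1], ordered[i + 1:]) for i in range(len(ordered) - 1)]

cat_splits = categorical_splits([1, 2, 3, 4, 5])       # 15 candidate splits
ord_splits = ordinal_splits([25, 30, 35, 40, 45])      # 4 candidate splits
```

The gap between 15 and 4 candidates for five values shows why treating a variable as ordinal greatly reduces the number of splits that must be evaluated.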
CART then outputs a tree showing all branches (i.e., both high and low SR branches). This report shows the section of the tree illustrating up to the first four levels of branches with the highest percentage of SR crashes. Note that CART divides the database being analyzed into a training subset and a validation subset to refine the final output. The results of the training subset are presented in this report, meaning that the total frequency at the top of each tree shows only approximately 2/3 of the total case count shown in the single–variable tables. A description of the results of the CART analysis is provided in the results of CART analyses section.
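The effect of the partition on the case counts can be illustrated with a small Python sketch. The report does not describe how the software assigns cases to the two subsets, so the random shuffle, the seed, the exact 2/3 share, and the function name below are all assumptions.

```python
import random

def split_training_validation(records, training_share=2 / 3, seed=0):
    """Randomly assign about training_share of the cases to the training subset."""
    rng = random.Random(seed)   # fixed seed so the example is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_share)
    return shuffled[:cut], shuffled[cut:]

case_ids = list(range(900))     # hypothetical case identifiers
training, validation = split_training_validation(case_ids)
```

Under these assumptions, a database of 900 cases would show 600 cases at the top of the tree, which is why tree totals are smaller than the totals in the single–variable tables.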
Further information about CART is available in Breiman et al.(11) For applications of these trees in road safety research, see Stewart and also Yan and Radwan.(12,13) Additional statistical details are provided in the appendix of this report.
VARIABLES USED IN THE ANALYSES
To examine the question of what occurs in an SR crash, the following crash characteristics are of interest:
To answer the question concerning where SR crashes mostly occur, the following variables are of interest:
The variables which might be helpful in deciding when SR crashes mostly occur are as follows:
The question concerning who is most likely to be involved in SR crashes is related to the following variables:
Since the "who" question could also involve the vehicle being driven, vehicle characteristics that might be of interest include the following: