
Publication Number: FHWA-HRT-10-024
Date: April 2010

Development of a Speeding-Related Crash Typology


APPENDIX: IDENTIFICATION OF CRITICAL FACTORS USING CLASSIFICATION TREES

Analyses of single-variable tables do not automatically indicate which factors/variables are the most critical with regard to SR crashes or speeding drivers, nor do they indicate which combinations of variables are the most important. One way to identify the critical roadway, vehicle, and driver factors associated with an increased likelihood of an SR crash is to estimate a logistic regression model with the roadway, vehicle, and driver factors as independent variables and then identify the statistically significant factors. Logistic regression is a parametric approach that rests on assumptions about error distributions. The CART methodology is nonparametric and requires no such assumptions. In addition, CART can include a relatively large number of independent variables and identify complex interactions between these variables more efficiently than logistic regression. For example, CART can determine not only the most important variable and the categories within that variable in terms of the risk of an SR crash but also the most important second-level variable within the most important categories of the first-level variable, and so on. That is, given the most important variable with respect to the proportion of SR crashes (e.g., manner of collision) and the subgroup of categories within that variable with the highest proportion of SR crashes (e.g., run-off-road crashes), CART can determine the next most important variable within these high-risk categories (e.g., road surface condition) and the categories of that variable that matter most (e.g., snow and sleet). It is hoped that these variables and categories will help determine needed treatments. For these reasons, classification trees were used as the second type of analysis in this project. A classification tree divides the dataset into smaller sets that are more homogeneous in the values of certain variables.

The following provides a limited overview of classification trees. Further information about CART is available in a study conducted by Breiman et al.(11) For applications of these trees in road safety research, see Stewart and Yan and Radwan.(12,13)

Overview of Classification Trees

The goals of CART are as follows: (1) to determine which of the many possible crash, driver, or vehicle variables available for examination is most important in predicting SR crashes, (2) to determine which categories within that variable predict the highest risk/proportion of SR crashes, (3) to determine the second most important variable and subset of categories within this highest-risk subset of categories of the first variable, and (4) to repeat the process to determine the third, fourth, and subsequent variables. This produces a tree with multiple branches that can be traced down to determine the most important combinations (or subsets) of variable categories in predicting SR crashes.

The CART procedure splits the categories of each variable in the database into all possible binary (two-category) combinations (nodes), calculates the SR risk within each part (node) of each pair, and determines which pair (i.e., which two sets of categories) produces the largest difference in SR risk within that variable. By repeating this process for each variable in the database, CART determines the two sets of categories producing the largest difference in risk of an SR crash within each variable. This largest difference in risk is then compared across all variables to determine the one variable (and set of categories) producing the largest of all differences. This is the top of the tree, and the two sets of categories within that variable are the first two branches of the tree. The process is then repeated within each of the two branches of the first variable to identify the second, third, and subsequent most important variables. For a categorical variable, all possible binary combinations of categories are compared (e.g., category 1 versus categories 2-5; categories 1 and 2 versus categories 3-5; categories 1 and 3 versus categories 2, 4, and 5; etc.). For an ordinal variable (e.g., speed limit), all cases with a value of that variable less than or equal to a certain threshold go to one node, and all other cases go to the other (e.g., ≤ 30 mi/h versus ≥ 35 mi/h; ≤ 35 mi/h versus ≥ 40 mi/h; etc.). The following describes this process in statistical terms.
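To make the split enumeration concrete, the following Python sketch (not part of the original report, whose analysis was performed in SAS® Enterprise Miner) lists the candidate binary splits for a categorical and an ordinal variable. The category names and speed-limit values are illustrative only.

```python
from itertools import combinations

def candidate_splits_categorical(categories):
    """Enumerate all binary partitions of a set of categories.

    For k categories there are 2**(k - 1) - 1 distinct two-way splits;
    fixing the first category on the left side avoids mirror duplicates.
    """
    cats = list(categories)
    splits = []
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {cats[0], *combo}
            right = set(cats) - left
            if right:  # skip the trivial split that leaves one node empty
                splits.append((left, right))
    return splits

def candidate_splits_ordinal(values):
    """Enumerate threshold splits: x <= v goes to one node, x > v to the other."""
    vs = sorted(set(values))
    return vs[:-1]  # splitting at the largest value would leave a node empty

print(candidate_splits_categorical(["angle", "head-on", "rear-end", "run-off-road"]))
print(candidate_splits_ordinal([25, 30, 35, 40, 45]))  # thresholds 25..40
```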

In general, assume there is a dataset D with N cases; in this study, the cases are crashes or vehicles/drivers involved in crashes. To develop a classification tree, a target variable is needed: the variable the tree will use to classify a case in some optimal way. A target variable should always be categorical; the other variables may be ordinal or categorical. Here, the target variable describes whether a crash was SR or whether a vehicle involved in a crash was speeding. This is a categorical variable because it has only two categories (yes and no). The following overview of classification trees likewise assumes a target variable with two categories. Because the objective is to estimate the probability that a crash is SR, the procedure for developing a class probability tree is discussed. This means a class probability estimator d should be constructed as follows:

\[
d(x) = \bigl(d(1 \mid x),\; d(2 \mid x)\bigr), \qquad d(1 \mid x),\, d(2 \mid x) \ge 0, \qquad d(1 \mid x) + d(2 \mid x) = 1, \qquad \text{for every } x \in D \tag{1}
\]

Building a classification tree starts with dividing the dataset D into two parts. One part, D1, is the training sample, which is used to build the complete tree. The other part, D2, is called the validation set and is used to prune the complete tree back to a simpler tree with better predictive ability. The numbers of cases in the training and validation sets are denoted N1 and N2, respectively. In this study, 70 percent of the cases were included in the training sample.
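A minimal sketch of the 70/30 partition, assuming simple random assignment of cases (the report does not state the sampling mechanism used):

```python
import numpy as np

def train_validation_split(n_cases, train_frac=0.70, seed=0):
    """Randomly assign case indices to a training set (D1) and a validation set (D2)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_cases)
    n1 = int(round(train_frac * n_cases))
    return idx[:n1], idx[n1:]

d1_idx, d2_idx = train_validation_split(10_000)
print(len(d1_idx), len(d2_idx))  # 7000 3000
```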

All the cases in the training sample form the root node, t0, of the tree. Let p(j | t0), j = 1, 2, denote the proportion of this set for which the target variable falls in category j. The impurity i(t0) of the root node describes how mixed the node is with respect to the target variable. This means the impurity should equal zero when all the cases in the root node belong to the same category, and it should reach its maximum when half of the cases fall in one category and half in the other. Several functions satisfy these conditions, but Breiman et al. suggest the use of the Gini index seen in equation 2.(11)

\[
i(t_0) = p(1 \mid t_0)\, p(2 \mid t_0) \tag{2}
\]
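As a quick check of these properties, here is the two-class Gini index of equation 2 written as a Python function (an illustration, not from the report):

```python
def gini_impurity(p1):
    """Two-class Gini index i(t) = p(1|t) * p(2|t) from equation 2."""
    return p1 * (1.0 - p1)

assert gini_impurity(0.0) == 0.0   # pure node: impurity is zero
assert gini_impurity(1.0) == 0.0   # pure node: impurity is zero
assert gini_impurity(0.5) == 0.25  # evenly mixed node: maximum impurity
```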

The classification tree algorithm then computes, for every independent variable, the impurity of the nodes that would result from splitting the root node into two new nodes. If an independent variable is ordinal, all splits of the following type are considered: all cases with a value of that variable less than or equal to a certain value go to one node, and all other cases go to the other. For a categorical variable, all splits of the following type are considered: if J = {j1,…,jk} is the set of possible values of that variable, then all cases for which the variable takes a value in a subset Jsub ⊂ J go to one node, and all other cases go to the other. Let S be the set of all possible splits, and assume that a split s in S results in two new nodes denoted t1,s and t2,s. The impurity of node ti,s, i = 1, 2, is given by equation 3 as follows:

\[
i(t_{i,s}) = p(1 \mid t_{i,s})\, p(2 \mid t_{i,s}), \qquad i = 1, 2 \tag{3}
\]

where p is defined analogously to the root node. The decrease in impurity due to a split s is then equal to the following:

\[
\Delta i(s, t_0) = i(t_0) - i(t_{1,s}) - i(t_{2,s}) \tag{4}
\]

The split maximizing Δi(s, t0) is chosen. This splitting procedure is then repeated for each of the two new nodes. Splitting stops when splitting an end node would no longer decrease impurity (all observations in that node are then either SR or not SR) or when an end node contains fewer than a preset minimum number of observations.
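Putting equations 2 through 4 together, the following sketch selects the best threshold split on an ordinal variable. It follows equation 4 exactly as printed here; note that Breiman et al.'s original criterion additionally weights each child's impurity by the proportion of cases sent to that child. The speed-limit data are made up for illustration.

```python
def gini(labels):
    """Two-class Gini index of a node from its 0/1 labels (equations 2 and 3)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return p1 * (1.0 - p1)

def impurity_decrease(parent, left, right):
    """Delta i(s, t0) per equation 4 as printed (unweighted child impurities)."""
    return gini(parent) - gini(left) - gini(right)

def best_ordinal_split(x, y):
    """Return the threshold on ordinal variable x maximizing the decrease."""
    best_value, best_decrease = None, float("-inf")
    for v in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= v]
        right = [yi for xi, yi in zip(x, y) if xi > v]
        dec = impurity_decrease(y, left, right)
        if dec > best_decrease:
            best_value, best_decrease = v, dec
    return best_value, best_decrease

# Made-up data: speed limits and a 0/1 speeding-related (SR) flag.
speed_limit = [25, 30, 35, 40, 45, 55, 55, 65]
sr_flag = [0, 0, 0, 1, 1, 1, 0, 1]
print(best_ordinal_split(speed_limit, sr_flag))
```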

In most cases, this splitting algorithm results in a very large tree, Tm, where m denotes the number of terminal nodes (also called leaves) that are not split further. Each node gives the proportions of cases in that node falling in each category of the target variable. These proportions can be interpreted as the probabilities that a case, run through the tree and ending up in a certain node, falls in each category. This tree predicts the proportions well for the cases in the training sample because that set was used to build the tree. When the tree is applied to the other 30 percent of the original dataset, it may predict the proportions poorly, especially for nodes resulting from a high number of splits. That is why the classification tree procedure continues by pruning back the tree resulting from the previous step.

Before the actual pruning phase, a sequence of subtrees of Tm is constructed using the training sample. This sequence consists of m – 1 subtrees of Tm. Subtree Tk, k = 1,…,m – 1, satisfies the following conditions:

  • It has exactly k leaves.

  • It is a subtree of Tk+1.

  • It is, in a certain sense, the optimal subtree of Tm with k leaves.

A detailed explanation of this last condition is beyond the scope of this report; the optimality involves the subtree's total Gini index (the sum of the Gini indices of all its terminal nodes) and its number of terminal nodes.
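For readers who want this condition made precise, one standard formulation (Breiman et al.'s minimal cost-complexity pruning, which this report does not reproduce) trades the total Gini index off against tree size through a penalty parameter α ≥ 0:

\[
R_\alpha(T) = \sum_{t \in \widetilde{T}} i(t) + \alpha\, \lvert \widetilde{T} \rvert
\]

where \(\widetilde{T}\) denotes the set of terminal nodes of T. As α increases from zero, the subtrees minimizing R_α(T) form a nested sequence, which yields the subtrees Tk described above.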

In the final step of the classification tree building process, the best subtree is determined using the validation sample. Before explaining what the best subtree is, some notation is introduced. Let D2,j be the part of the validation set D2 consisting of all cases falling in category j. The number of cases in D2,j is denoted N2,j. Further, d is the class probability estimator, which means that if x is a case in D2 that ends up in node t when run through the tree, then the probabilities that x is of category 1 and of category 2 are given by d as follows:

\[
d(x) = \bigl(p(1 \mid t),\; p(2 \mid t)\bigr) \tag{5}
\]

where p(j | t) is the proportion of cases of the training set D1 in node t that are in category j. Finally, the map zi is defined as follows:

\[
z_i(x) =
\begin{cases}
1, & \text{if } i = j(x) \\
0, & \text{otherwise}
\end{cases} \tag{6}
\]

where j(x) is the true category of x.

To determine the best subtree, all the cases in D2,1 are first run down all the subtrees, followed by the cases in D2,2. For each subtree, the following values are computed:

\[
R_j(T_k) = \frac{1}{N_{2,j}} \sum_{x \in D_{2,j}} \sum_{i=1}^{2} \bigl(z_i(x) - d(i \mid x)\bigr)^2, \qquad j = 1, 2 \tag{7}
\]

The subtree for which R1(Tk)π(1) + R2(Tk)π(2) (an estimate of the mean square error of d, sometimes called the total leaf impurity) is minimal is considered the best subtree and becomes the final classification tree. In this expression, π(j) is the proportion of cases in the entire dataset (D1 and D2 combined) belonging to category j.
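As a sketch of this selection step (illustrative, with made-up numbers): equation 7 is evaluated per true category on the validation cases, and the weighted sum below picks the subtree.

```python
def leaf_risk(z, d):
    """R_j(T_k) from equation 7: mean squared error between the one-hot
    true-category indicators z = (z1, z2) and the leaf proportions
    d = (d(1|x), d(2|x)), over validation cases of true category j."""
    n = len(z)
    return sum((z1 - p1) ** 2 + (z2 - p2) ** 2
               for (z1, z2), (p1, p2) in zip(z, d)) / n

def total_leaf_impurity(r1, r2, pi1, pi2):
    """R_1(T_k)*pi(1) + R_2(T_k)*pi(2): estimated mean square error of d."""
    return r1 * pi1 + r2 * pi2

# Three hypothetical category-1 validation cases ending in two leaves:
z = [(1, 0), (1, 0), (1, 0)]
d = [(0.8, 0.2), (0.8, 0.2), (0.6, 0.4)]
print(leaf_risk(z, d))  # 0.16

# Hypothetical (R_1, R_2) values for two candidate subtrees, with overall
# category proportions pi(1) = 0.3 (SR) and pi(2) = 0.7 (not SR):
subtrees = {"T_5": (0.42, 0.18), "T_9": (0.45, 0.15)}
scores = {name: total_leaf_impurity(r1, r2, 0.3, 0.7)
          for name, (r1, r2) in subtrees.items()}
print(min(scores, key=scores.get))  # subtree with minimal total leaf impurity
```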

Dealing with Missing Values

If the target variable is missing, the case is not used for growing or pruning the tree; these observations are therefore removed from the datasets before the trees are developed. Classification trees can, however, handle missing values of the independent variables. If the value of the variable on which a node is split is missing for a case, a surrogate split is used: the value of another variable decides to which of the new nodes the case goes. There are many variables and possible splits to choose from; the one giving results most similar to the actual split is chosen.

In GES and FARS, unknown values of variables are coded as 9, 99, or similar. Because these are not automatically recognized by SAS® Enterprise Miner as missing values, these codes are recoded as blank.(10)
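As an illustration of this recoding step, the following pandas sketch does the equivalent in Python; the column names and unknown-value codes are hypothetical stand-ins for the actual GES/FARS fields, and the report itself performed this step in SAS® Enterprise Miner.

```python
import numpy as np
import pandas as pd

# Hypothetical GES/FARS-style extract; column names and codes are illustrative.
df = pd.DataFrame({
    "SPEED_LIMIT": [30, 55, 99, 45],  # 99 = unknown
    "SURF_COND": [1, 9, 2, 1],        # 9 = unknown
})

# Recode the "unknown" codes as true missing values so downstream tools
# treat them as missing rather than as valid numeric levels.
unknown_codes = {"SPEED_LIMIT": [99], "SURF_COND": [9]}
for col, codes in unknown_codes.items():
    df[col] = df[col].replace(codes, np.nan)

print(df)
```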

 
