U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000
Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations
This report is an archived publication and may contain dated technical, contact, and link information.

Publication Number: FHWA-HRT-10-024
Date: April 2010
Development of a Speeding-Related Crash Typology
PDF Version (1.53 MB)
APPENDIX: IDENTIFICATION OF CRITICAL FACTORS USING CLASSIFICATION TREES

Analyses of single-variable tables do not automatically indicate which factors/variables are the most critical with regard to SR crashes or speeding drivers. They also do not indicate which combinations of variables are the most important. One way to identify the critical roadway, vehicle, and driver factors associated with an increased likelihood of an SR crash is to estimate a logistic regression model with the roadway, vehicle, and driver factors as independent variables and then to identify the statistically significant factors. Logistic regression is a parametric approach based on assumptions about error distributions. The CART methodology is nonparametric and does not require any such assumptions. In addition, CART can include a relatively large number of independent variables and identify complex interactions between these variables more efficiently than logistic regression. For example, CART is able to determine not only the most important variable and the categories within that variable in terms of the risk of an SR crash but also the most important second-level variable within the most important categories of the first-level variable, and so on. That is, given the most important variable with respect to the proportion of SR crashes (e.g., manner of collision) and the subgroup of categories within that variable with the highest proportion of SR crashes (e.g., run-off-road crashes), CART is able to determine the next most important variable within these high-risk categories (e.g., road surface condition) and the categories of that variable that are most important (e.g., snow and sleet). The hope is that these variables and categories will be helpful in determining needed treatments. For these reasons, classification trees were chosen as the second type of analysis in this project.
A classification tree divides the data set into smaller sets that are more homogeneous in the values of certain variables. The following provides a limited overview of classification trees. Further information about CART is available in Breiman et al.^{(11)} For applications of these trees in road safety research, see Stewart and Yan and Radwan.^{(12,13)}
Overview of Classification Trees

The goals of CART are as follows: (1) to determine which of the many possible crash, driver, or vehicle variables available for examination is most important in terms of predicting SR crashes, (2) to determine which categories within that variable predict the highest risk/proportion of SR crashes, (3) to determine the second most important variable and subset of categories within this highest-risk subset of categories of the first variable, and (4) to repeat the process to determine the third, fourth, and subsequent variables. This produces a tree with multiple branches that can be traced down to determine the most important combinations (or subsets) of variable categories in terms of predicting SR crashes. The CART procedure basically splits the categories of each variable in the database into all possible binary (two-category) combinations (nodes), calculates the SR risk within each part (node) of each pair, and determines which pair (i.e., which two sets of categories) produces the largest difference in SR risk within that variable. By repeating this process for each variable in the database, CART determines the two sets of categories producing the largest difference in risk of SR crash within each variable. This largest difference in risk is then compared across all variables to determine the one variable (and the set of categories) producing the largest of all differences. This is the top of the tree, and the two categories within that variable are the first two branches of the tree. This process is then repeated within each of the two categories (branches) of the first variable to identify the second, third, and subsequent most important variables. For a categorical variable, all possible binary combinations of categories are compared (e.g., category 1 versus categories 2-5, categories 1 and 2 versus categories 3-5, categories 1 and 3 versus categories 2, 4, and 5, etc.).
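As a concrete illustration of this splitting step, the following sketch enumerates every binary partition of a categorical variable's categories and returns the one with the largest difference in SR proportion between the two candidate nodes. It is written in Python rather than the SAS® Enterprise Miner software used in the report, and the field names (`manner`, `sr`) are hypothetical:

```python
from itertools import combinations

def best_binary_split(cases, var, target="sr"):
    """Find the binary split of `var`'s categories that maximizes the
    difference in the proportion of SR cases (target == 1) between the
    two resulting nodes. `cases` is a list of dicts.
    Illustrative sketch only; field names are hypothetical."""
    categories = sorted({c[var] for c in cases})

    def risk(node):
        # Proportion of SR cases in a node.
        return sum(c[target] for c in node) / len(node)

    best = None
    # Fixing the first category on the left side enumerates each
    # subset/complement pair exactly once.
    rest = categories[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left_cats = {categories[0], *combo}
            left = [c for c in cases if c[var] in left_cats]
            right = [c for c in cases if c[var] not in left_cats]
            if not left or not right:
                continue  # skip the degenerate all-vs-nothing split
            diff = abs(risk(left) - risk(right))
            if best is None or diff > best[0]:
                best = (diff, left_cats, set(categories) - left_cats)
    return best  # (risk difference, node-1 categories, node-2 categories)
```

Running this on a toy data set in which category C crashes are all SR and categories A and B are not would isolate {C} from {A, B} as the highest-contrast split.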
For ordinal variables (e.g., speed limit), all cases with a value of that variable smaller than or equal to a certain value go to one node, and all other cases go to the other (e.g., ≤ 30 mi/h versus ≥ 35 mi/h; ≤ 35 mi/h versus ≥ 40 mi/h, etc.). The following describes this process in statistical terms. In general, assume there is a data set D with N cases (in this case, crashes or vehicles/drivers involved in crashes). To develop a classification tree, a target variable is needed: the variable the tree will use to classify a case in some optimal way. A target variable should always be categorical; the other variables may be ordinal or categorical. Here, the target variable describes whether a crash was SR or whether a vehicle involved in a crash was speeding. This is indeed a categorical variable because it has only two categories (yes and no). In the following overview of classification trees, there is also a target variable with two categories. Since the objective is to estimate the probability that a crash is SR, the procedure for developing a class probability tree will be discussed. This means a class probability estimator d should be constructed as follows:
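The equation for d did not survive in this archived copy; based on the description later in this appendix (a case is assigned the class proportions of the terminal node it lands in), it is presumably:

```latex
% Class probability estimator: a case x that ends up in terminal node
% t(x) is assigned that node's observed class proportions.
d(j \mid x) = p\bigl(j \mid t(x)\bigr), \qquad j = 1, 2
```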
Building a classification tree starts with dividing the data set D into two parts. One part, D_{1}, is the training sample, which is used to build the complete tree. The other part, D_{2}, is called the validation set and is used to prune the complete tree back to a simpler tree with better predictive ability. The numbers of cases in the training and validation sets are denoted N_{1} and N_{2}, respectively. In this case, 70 percent of the cases were included in the training sample. All the cases in the training sample form the root node, t_{0}, of the tree. Let p(j | t_{0}), j = 1, 2, denote the proportion of cases in this set for which the target variable falls in category j. The impurity i(t_{0}) of the root node then describes how mixed the node is according to the target variable. This means the impurity should equal zero when all the cases in the root node belong to the same category and should reach its maximum when half of the cases fall in one category and half in the other. Several functions satisfy these conditions, but Breiman et al. suggest the use of the Gini index seen in equation 2.^{(11)}
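Equation 2 is missing from this archived copy; for a two-category target, the Gini index of Breiman et al. takes the form:

```latex
% Gini index of the root node (equation 2). It is 0 when the node is
% pure and maximal (1/2) when the two categories each hold half the cases.
i(t_0) = \sum_{j \ne k} p(j \mid t_0)\, p(k \mid t_0)
       = 2\, p(1 \mid t_0)\, p(2 \mid t_0)
```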
The classification tree algorithm now computes, for each independent variable, the impurity of the nodes that would result from splitting the root node into two new nodes. If an independent variable is ordinal, all splits of the following type are considered: all cases with a value of that variable smaller than or equal to a certain value go to one node, and all other cases go to the other. For a categorical variable, all splits of the following type are considered: if J = {j_{1},…,j_{k}} is the set of possible values of that variable, then all cases for which the variable has a value in a given subset of J go to one node, and all other cases go to the other. Let S be the set of all possible splits, and assume that a split s in this set S results in two new nodes, denoted t_{1,s} and t_{2,s}. The impurity of node t_{i,s}, i = 1, 2, is given by equation 3 as follows:
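Equation 3 did not survive extraction; applying the same Gini index to a child node, it is presumably:

```latex
% Impurity of child node t_{i,s} under the Gini criterion (equation 3):
i(t_{i,s}) = 2\, p(1 \mid t_{i,s})\, p(2 \mid t_{i,s}), \qquad i = 1, 2
```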
where p is defined analogously to the root node. The decrease in impurity due to a certain split s is then equal to the following:
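The equation image is missing here; the standard CART decrease-in-impurity, consistent with the surrounding text, is:

```latex
% Decrease in impurity from split s of node t_0, where p_{i,s} denotes
% the proportion of the node's cases sent to child node t_{i,s}:
\Delta i(s, t_0) = i(t_0) - p_{1,s}\, i(t_{1,s}) - p_{2,s}\, i(t_{2,s})
```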
The split maximizing Δi(s, t) is chosen. For each of the two new nodes, this splitting procedure is repeated. Splitting stops when further splitting of an end node no longer decreases impurity (all observations in that node are then either SR or not SR) or when the end node contains fewer than a preset number of observations. In most cases, this splitting algorithm results in a very large tree, T_{m}, where m denotes the number of terminal nodes (also called leaves) that are not split any further. Each node gives the proportions of cases in that node falling in each of the categories of the target variable. These proportions can be interpreted as the probabilities that a case run through the tree and ending up in a certain node falls in each of the categories. This tree predicts the proportions well for the cases in the training sample because that set was used to build the tree. When the tree is applied to the other 30 percent of the original data set, it may not predict the proportions well, especially for nodes resulting from a high number of splits. That is why the classification tree procedure continues by pruning back the tree resulting from the previous step. Before the actual pruning phase, a sequence of subtrees of T_{m} is constructed using the training sample. This sequence consists of m – 1 subtrees of T_{m}. Subtree T_{k}, k = 1,...,m – 1, satisfies the following conditions:
It goes too far to explain this last condition in detail; its optimality has to do with its total Gini index (the sum of the Gini indices of all its terminal nodes) and its number of terminal nodes. In the final step of the classification tree building process, the best subtree is determined using the validation sample. Before explaining what the best subtree is, some notation is introduced. Let D_{2,j} be the part of the validation set D_{2} consisting of all cases falling in category j. The number of cases in D_{2,j} will be denoted by N_{2,j}. Further, d is the class probability estimator, which means that if x is a case in D_{2} that ends up in node t when run through the tree, then the probabilities that x is of category 1 and that x is of category 2 are given by d as follows:
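The missing equation, reconstructed from the definition given in the next sentence, is presumably:

```latex
% Probabilities assigned by d to a validation case x ending in node t:
d(1 \mid x) = p(1 \mid t), \qquad d(2 \mid x) = p(2 \mid t)
```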
where p(j | t) is the proportion of cases of the training set D_{1} in node t that are in category j. Finally, the map z_{i} is defined as follows:
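The missing equation is presumably the indicator of the true category, consistent with the definition of j(x) that follows:

```latex
% Indicator map for the true category of case x:
z_i(x) = \begin{cases} 1 & \text{if } j(x) = i \\ 0 & \text{otherwise} \end{cases}
```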
where j(x) is the true category of x. To determine the best subtree, all the cases in D_{2,1} are first run down all the subtrees, followed by the cases in D_{2,2}. For each subtree, the following values are computed:
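The equation images are missing here; one reconstruction consistent with the surrounding text (a per-category mean squared error of d over the validation cases) is:

```latex
% Mean squared error of d over the validation cases of category j:
R_j(T_k) = \frac{1}{N_{2,j}} \sum_{x \in D_{2,j}} \sum_{i=1}^{2}
           \bigl(z_i(x) - d(i \mid x)\bigr)^2, \qquad j = 1, 2
```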
The subtree for which R_{1}(T_{k})π(1) + R_{2}(T_{k})π(2) (an estimate of the mean square error of d, sometimes called the total leaf impurity) is minimal is considered the best subtree and becomes the final classification tree. In this expression, π(j) is the proportion of cases in the entire data set (all N_{1} + N_{2} cases) belonging to category j.

Dealing with Missing Values

If the target variable is missing, the crash is not used for growing or pruning the tree, so these observations are removed from the data sets before developing the trees. Classification trees are able to handle missing values of the independent variables. If the value of the variable on which a node is split is missing for a case, a surrogate split is used: based on the value of another variable, it is decided to which of the new nodes the case goes. There are many variables and possible splits to choose from, but the one giving the most similar results to the real split is chosen. In GES and FARS, unknown values of variables are coded as 9, 99, or similar. Because these are not automatically recognized by SAS^{®} Enterprise Miner as missing values, these codes are recoded as blank.^{(10)}
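A minimal sketch of that recoding step, in Python rather than the SAS® Enterprise Miner workflow the report used (the code set and field names below are hypothetical; the actual unknown-value codes vary by GES/FARS variable):

```python
# Assumed set of unknown-value codes; real GES/FARS codes differ per field.
UNKNOWN_CODES = {9, 99, 999}

def recode_unknown(record, unknown=UNKNOWN_CODES):
    """Return a copy of `record` with unknown-value codes replaced by None,
    so that a tree-building tool treats them as missing."""
    return {k: (None if v in unknown else v) for k, v in record.items()}
```

For example, a record coded with speed limit 99 (unknown) would come back with that field set to None while valid codes are left untouched.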
