Publication Number: FHWA-HRT-10-024
Date: April 2010
Development of a Speeding-Related Crash Typology
APPENDIX: IDENTIFICATION OF CRITICAL FACTORS USING CLASSIFICATION TREES

Analyses of single-variable tables do not automatically indicate which factors/variables are the most critical with regard to SR crashes or speeding drivers, nor do they indicate which combinations of variables are the most important. One way to identify the critical roadway, vehicle, and driver factors associated with an increased likelihood of an SR crash is to estimate a logistic regression model with the roadway, vehicle, and driver factors as independent variables and then to identify the statistically significant factors. Logistic regression is a parametric approach that rests on assumptions about error distributions. The CART methodology is nonparametric and requires no such assumptions. In addition, CART can include a relatively large number of independent variables and identify complex interactions among these variables more efficiently than logistic regression. For example, CART is able to determine not only the most important variable and the categories within that variable in terms of the risk of an SR crash but also the most important second-level variable within the most important categories of the first-level variable, and so on. That is, given the most important variable with respect to the proportion of SR crashes (e.g., manner of collision) and the subgroup of categories within that variable with the highest proportion of SR crashes (e.g., run-off-road crashes), CART is able to determine the next most important variable within these high-risk categories (e.g., road surface condition) and the categories of that variable that are most important (e.g., snow and sleet). It is hoped that these variables and categories will be helpful in determining needed treatments. For these reasons, classification trees were chosen as the second type of analysis in this project.

A classification tree divides the data set into smaller sets that are more homogeneous in the values of certain variables. The following provides a limited overview of classification trees. Further information about CART is available in a study conducted by Breiman et al.(11) For applications of these trees in road safety research, see Stewart and Yan and Radwan.(12,13)
Overview of Classification Trees

The goals of CART are as follows: (1) to determine which of the many possible crash, driver, or vehicle variables available for examination is most important in terms of predicting SR crashes, (2) to determine which categories within that variable predict the highest risk/proportion of SR crashes, (3) to determine the second most important variable and the subset of categories within this highest-risk subset of categories of the first variable, and (4) to repeat the process to determine the third, fourth, and subsequent variables. This produces a tree with multiple branches that can be traced down to determine the most important combinations (or subsets) of variable categories in terms of predicting SR crashes.

The CART procedure splits the categories of each variable in the database into all possible binary (two-category) combinations (nodes), calculates the SR risk within each part (node) of each pair, and determines which pair (i.e., which two sets of categories) produces the largest difference in SR risk within that variable. By repeating this process for each variable in the database, CART determines the two sets of categories producing the largest difference in risk of SR crash within each variable. This largest difference in risk is then compared across all variables to determine the one variable (and the set of categories) producing the largest of all differences. This is the top of the tree, and the two categories within that variable are the first two branches of the tree. This process is then repeated within each of the two categories (branches) of the first variable to identify the second, third, and subsequent most important variables. For a categorical variable, all possible binary combinations of categories are compared (e.g., category 1 versus categories 2-5, categories 1 and 2 versus categories 3-5, categories 1 and 3 versus categories 2, 4, and 5, etc.). For an ordinal variable (e.g., speed limit), all cases with a value of that variable smaller than or equal to a certain value go to one node, and all other cases go to the other (e.g., ≤ 30 mi/h versus ≥ 35 mi/h, ≤ 35 mi/h versus ≥ 40 mi/h, etc.). A code sketch of this split search is given at the end of this overview.

The following describes this process in statistical terms. In general, assume that there is a data set D with N cases, in this case, crashes or vehicles/drivers involved in crashes. To develop a classification tree, a target variable is needed, which is the variable the tree will use to classify a case in some optimal way. A target variable should always be categorical; the other variables may be ordinal or categorical. Here, the target variable describes whether a crash was SR or whether a vehicle involved in a crash was speeding. This is indeed a categorical variable because it has only two categories (yes and no). In the following overview of classification trees, there is also a target variable with two categories. Because the objective is to estimate the probability that a crash is SR, the procedure for developing a class probability tree is discussed. This means that a class probability estimator, d, should be constructed as follows:

d(x) = (d(1 | x), d(2 | x)), with d(1 | x) + d(2 | x) = 1

where d(j | x) estimates the probability that case x falls in category j of the target variable.
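Illustratively, the exhaustive binary-split search described above might be sketched as follows in Python. This is not the study's SAS® Enterprise Miner implementation; the function names and toy data are hypothetical, and the split score here is simply the difference in SR proportion between the two candidate nodes.

```python
from itertools import combinations

def sr_proportion(labels):
    """Proportion of speeding-related (SR) cases; labels are 1 (SR) or 0 (not SR)."""
    return sum(labels) / len(labels) if labels else 0.0

def candidate_splits(categories, ordinal=False):
    """Enumerate binary splits of one variable's categories.
    Ordinal variables allow only cuts of the form (<= v) versus (> v);
    nominal variables allow every subset versus its complement."""
    cats = sorted(categories)
    if ordinal:
        return [set(cats[:i]) for i in range(1, len(cats))]
    return [set(c) for r in range(1, len(cats))
            for c in combinations(cats, r)]

def best_split(values, labels, ordinal=False):
    """Return the binary split with the largest difference in SR risk."""
    best, best_gap = None, -1.0
    for left in candidate_splits(set(values), ordinal):
        left_y = [y for v, y in zip(values, labels) if v in left]
        right_y = [y for v, y in zip(values, labels) if v not in left]
        if not left_y or not right_y:
            continue
        gap = abs(sr_proportion(left_y) - sr_proportion(right_y))
        if gap > best_gap:
            best, best_gap = left, gap
    return best, best_gap

# Toy example: speed limit (ordinal) versus an SR indicator.
speed_limits = [30, 30, 35, 40, 40, 55, 55, 55]
sr_flags = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_split(speed_limits, sr_flags, ordinal=True))
```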
Building a classification tree starts with dividing the data set D into two parts. One part, D1, is the training sample, which is used to build the complete tree. The other part, D2, is called the validation set and is used to prune the complete tree back to a simpler tree so that it has better predictive ability. The numbers of cases in the training and validation sets are denoted by N1 and N2, respectively. In this case, 70 percent of the cases were included in the training sample.
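A minimal sketch of this division, assuming a simple random 70/30 split (the seed and names are illustrative):

```python
import random

def split_cases(cases, train_share=0.7, seed=1):
    """Randomly divide the data set D into a training sample D1, used to
    grow the complete tree, and a validation sample D2, used to prune it."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n1 = int(round(train_share * len(shuffled)))
    return shuffled[:n1], shuffled[n1:]  # D1 (N1 cases), D2 (N2 cases)
```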
All the cases in the training sample form the root node, t0, of the tree. Assume that p(j | t0) is the proportion of cases in the root node that fall in category j of the target variable. The impurity of the root node is then measured by its Gini index as follows:

i(t0) = 1 − Σj p(j | t0)²
The classification tree algorithm now computes, for all independent variables, the impurity of the nodes that would result from splitting the root node into two new nodes. If an independent variable is ordinal, then all splits of the following type are considered: all cases with a value of that variable smaller than or equal to a certain value go to one node, and all other cases go to the other. For a categorical variable, all splits of the following type are considered: if J = {j1,…,jk} is the set of possible values of that variable, then all cases for which the variable has a value in some subset of J go to one node, and all other cases go to the other. The impurity of each resulting node t is computed as follows:

i(t) = 1 − Σj p(j | t)²
where p is defined analogously to the root node. The decrease of impurity due to a certain split s of node t is then equal to the following:

Δi(s, t) = i(t) − pL i(tL) − pR i(tR)

where tL and tR are the two new nodes and pL and pR are the proportions of the cases in t that go to tL and tR, respectively.
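Written in code, these impurity calculations take only a few lines; the sketch below assumes the two-category 0/1 target used throughout this appendix, and the function names are illustrative.

```python
def gini(node_labels):
    """Gini index i(t) = 1 - sum_j p(j | t)^2 for a two-category target."""
    n = len(node_labels)
    if n == 0:
        return 0.0
    p1 = sum(node_labels) / n  # p(SR | t); p(not SR | t) = 1 - p1
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def impurity_decrease(parent, left, right):
    """Delta i(s, t) = i(t) - pL * i(tL) - pR * i(tR), where pL and pR are
    the shares of the parent node's cases sent to each new node."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)
```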
The split maximizing Δi(s, t) is chosen, and the same procedure is then applied to each of the resulting nodes. In most cases, this splitting algorithm results in a very large tree, Tm, where m denotes the number of terminal nodes (also called leaves), which are not split any further. Each node gives the proportions of the cases in that node falling in each of the categories of the target variable. These proportions can be interpreted as the probabilities that a case, which is run through the tree and ends up in a certain node, falls in each of the categories. This tree predicts these proportions well for the cases in each node of the training sample because this set was used to build the tree. When the tree is applied to the other 30 percent of the original data set, it may not do well in predicting the proportions, especially for the nodes resulting from a high number of splits. That is why the classification tree procedure continues by pruning back the tree resulting from the previous step. Before the actual pruning phase, a sequence of subtrees of Tm is constructed using the training sample. This sequence consists of m − 1 subtrees of Tm. Subtree Tk, k = 1,…,m − 1, satisfies the following conditions: (1) Tk is a pruned subtree of Tk+1, so the sequence runs from the complete tree Tm down toward the root node, and (2) Tk is optimal among the pruned subtrees of Tm.
It would go too far to explain this last condition in detail; its optimality has to do with its total Gini index (the sum of the Gini indices of all its terminal nodes) and its number of terminal nodes. In the final step of the classification tree building process, the best subtree is determined using the validation sample. Before explaining what the best subtree is, some notation is introduced. Let D2,j be the part of the validation set D2 consisting of all cases falling in category j. The number of crashes in D2,j is denoted by N2,j. Further, d is the class probability estimator, which means that if x is a crash in D2 that ends up in node t when run through the tree, then the probabilities that x is of category 1 and that x is of category 2 are given by d as follows:

d(1 | x) = p(1 | t) and d(2 | x) = p(2 | t)
where p(j | t) is the proportion of cases of the training set D1 in node t that are in category j. Finally, the map zi is defined as follows:

zi(x) = 1 if j(x) = i, and zi(x) = 0 otherwise
where j(x) is the real category of x. To determine the best subtree, all the cases in D2,1 are first run down all the subtrees, followed by the cases in D2,2. For each subtree, the following value is computed, collecting the contributions of the cases in D2,1 and D2,2:

R(Tk) = (1/N2) Σx in D2 Σi (zi(x) − d(i | x))²
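As a sketch, this error could be computed as below; predict_node_proportions stands in for whatever routine returns the training proportions p(j | t) of the node a case lands in, and is hypothetical.

```python
def subtree_error(subtree, validation_cases, predict_node_proportions):
    """Mean squared class-probability error R(Tk) over the validation set:
    the average over cases x of sum_i (z_i(x) - d(i | x))^2."""
    total = 0.0
    for x, true_category in validation_cases:
        d = predict_node_proportions(subtree, x)  # {category i: p(i | t)}
        for i, p in d.items():
            z = 1.0 if i == true_category else 0.0
            total += (z - p) ** 2
    return total / len(validation_cases)

# The subtree retained is the one with the smallest subtree_error value.
```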
The subtree for which R(Tk) is smallest is selected as the final tree.

Dealing with Missing Values

If the target variable is missing, the crash is not used for growing or pruning the tree; therefore, these observations are removed from the data sets before the trees are developed. Classification trees are able to handle missing values of the independent variables. If the value of the variable on which a node is split is missing for a case, then a surrogate split is used. This means that the node to which the case goes is decided based on the value of another variable. There are many variables and possible splits to choose from; the one that gives the most similar results to the actual split is chosen. In GES and FARS, unknown values of variables are coded as 9, 99, or similar. Because these codes are not automatically recognized by SAS® Enterprise Miner as missing values, they are recoded as blank.(10) A sketch of this recoding is given below.
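For illustration, recoding the coded unknowns to true missing values could be done as follows with pandas; the column names and code lists are hypothetical, since each GES/FARS variable has its own unknown codes.

```python
import numpy as np
import pandas as pd

# Hypothetical per-variable unknown codes (9, 99, or similar in GES/FARS).
UNKNOWN_CODES = {
    "road_surface_condition": [9],
    "speed_limit": [99],
    "manner_of_collision": [99],
}

def recode_unknowns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace coded unknowns with NaN so that tree software treats them
    as missing values rather than as valid categories."""
    out = df.copy()
    for column, codes in UNKNOWN_CODES.items():
        if column in out.columns:
            out[column] = out[column].replace(codes, np.nan)
    return out
```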