Federal Highway Administration
Publication Number: FHWA-RD-98-096
Date: September 1997
Modeling Intersection Crash Counts and Traffic Volume - Final Report
1. MODELING INTERSECTION CRASH COUNTS IN RELATION TO EXPOSURE
1.1 The concept of exposure
Comparing the annual number of crashes in California with that in Rhode Island simply shows the obvious consequence of California being much larger than Rhode Island, with many more motor vehicles that together travel many more miles. A comparison that reveals something more has to account for the difference in scale. "Exposure" is such a scale factor. Dividing crash counts by an exposure measure gives an indication of crash risk relative to that exposure measure.
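The scaling step described above can be sketched in a few lines. The two regions and all figures below are hypothetical, chosen only to illustrate the arithmetic; they are not data from this study.

```python
# Hypothetical annual figures for two regions of very different size.
# None of these numbers are real data; they only illustrate the arithmetic.
regions = {
    "large_region": {"crashes": 400_000, "vmt_millions": 300_000},
    "small_region": {"crashes": 12_000, "vmt_millions": 8_000},
}


def crash_rate(crashes, exposure):
    """Crashes per unit of exposure (here: per million vehicle miles)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return crashes / exposure


rates = {
    name: crash_rate(r["crashes"], r["vmt_millions"])
    for name, r in regions.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} crashes per million VMT")
```

In this made-up example the larger region has far more crashes in total, yet the smaller region has the higher rate per unit of exposure, which is the kind of difference a raw count cannot reveal.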
Sometimes registered vehicle-years or insured vehicle-years are used as exposure measures. Such measures reduce the effect of large discrepancies in "size" between states, or between various populations of vehicles, such as vehicle types, or vehicle makes and models. However, they are not fully satisfactory measures of exposure because they do not control for differences in the annual miles traveled by different vehicles. Therefore, vehicle miles of travel (VMT) is a preferred exposure measure.
VMT is a plausible exposure measure for studies of some crash types, such as running off the road or collisions with a roadside object. However, for other types of crashes, such as head-on collisions, VMT is not a good exposure measure. In the case of head-on crashes, an "exposure" to a collision is present only when a vehicle encounters an oncoming vehicle. The number of such encounters on a segment of highway varies in proportion to the square of VMT, not to VMT itself. On the other hand, if we compare several states with the same ratio of VMT to highway miles, then the number of such encounters will be proportional to each state's VMT.
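The two proportionality statements above can be made explicit with a simplified sketch. The setup is an assumption made here for illustration only: a two-way segment of length L observed over a period T, a uniform two-way flow q split evenly between the directions, and a common constant speed v. With V denoting VMT and E the number of meetings between opposing vehicles:

```latex
\begin{align*}
V &= q\,T\,L \\
E &= \underbrace{\tfrac{q}{2}\,T}_{\text{vehicles in one direction}}
     \times
     \underbrace{2\cdot\tfrac{q}{2}\cdot\tfrac{L}{v}}_{\text{oncoming vehicles met by each}}
   = \frac{q^{2}\,T\,L}{2v}
   = \frac{V^{2}}{2\,v\,T\,L} \\
E &= \frac{\rho}{2\,v\,T}\,V
   \qquad \text{if } V/L = \rho \text{ is the same for all areas compared}
\end{align*}
```

For a fixed segment (L, T and v constant), E grows with the square of V. When the ratio of VMT to highway miles is held constant across the areas compared, E is proportional to V.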
The situation is different if specific locations on highways, such as intersections rather than aggregations of highways or larger areas, are considered. In principle, VMT within the intersection could be defined as an exposure measure. This would imply that the expected number of crashes should increase proportionately to the width of the intersecting highways. An alternative to this assumption is to use the number of vehicles entering an intersection as the exposure measure, and to include the width (or the number of lanes as an indicator of width) among the other factors to be studied. Thus, for an n-leg intersection, the exposure measure would have n components.
First consider only intersections with crossing traffic streams and no turning maneuvers. In this case, the two traffic volumes on the crossing roads would suffice as exposure measures. If the intersection is uncontrolled, then the number of "encounters" between vehicles, where a crash could occur, is proportional to the product of the two volumes.
However, in reality, vehicles turn at intersections and the numbers of the different types of turns determine the encounters in which certain types of crashes can occur. The situation is even more complicated when traffic at the intersection is controlled by signs or a traffic signal. In that case the number of encounters within the intersection is reduced, but the number of different types of encounters during which rear-end collisions can occur just outside the intersection is increased. The magnitude of this shift depends on the lengths of the phases of the traffic signal. Even in the simplest situation, it cannot be expected that the number of encounters, which represent exposure to the possibility of a crash, can be represented by relatively simple mathematical functions of the entering traffic volumes.
Thus, ideally, exposure measures should be related to the potential conflicts between vehicles in the intersection, i.e., situations where more than one vehicle can occupy the same space at the same time. In this study, we explore what can be achieved in modeling intersection crash counts using the readily available "exposure" measures of traffic volumes on the crossing roads.
1.2 The purposes of modeling intersection crash counts
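Models of intersection crash counts as functions of traffic volumes and intersection characteristics serve two major purposes. The first is to estimate how intersection characteristics, especially those that can be modified, relate to crash risk. The second is to identify individual intersections with an unusually bad crash experience.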
For the first purpose, a population of intersections is selected and each intersection is treated as an "observation." The dependent variable is the crash count for some period of time, usually a year, and the independent variables consist of intersection characteristics, including traffic volumes on the approaches to the intersection. Of special interest are those characteristics that can be modified to reduce the crash risk at the intersection. Statistical methods are used to fit a model approximating the crash counts by a function of the independent variables. The form of the function is usually assumed on the basis of mathematical convenience, and typically is not empirically determined.
For the second purpose, a model is applied to an individual intersection and the expected number of crashes is predicted and compared against the actual number of crashes at the intersection. The difference between them is examined to determine if it is greater than that expected from random variations. If so, the intersection is studied to identify factors responsible for the elevated risk.
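One way to judge whether a difference is "greater than that expected from random variations" is to compute a tail probability. The sketch below assumes that the crash count at an intersection follows a Poisson distribution around the modeled value. That distribution is an assumption made here for illustration and is not stated in the text; the intersection, its predicted value, and the screening threshold are likewise hypothetical.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 1.0
    # Sum P(X = k) for k = 0 .. observed-1, then take the complement.
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical intersection: the model predicts 4.0 crashes per year,
# and 11 crashes were recorded.
p_value = poisson_upper_tail(observed=11, expected=4.0)
flagged = p_value < 0.01  # hypothetical screening threshold
print(f"P(X >= 11 | mean 4.0) = {p_value:.4f}, flagged = {flagged}")
```

If real counts vary more than a Poisson distribution allows, this calculation would flag too many intersections, so the choice of distribution deserves checking before such a screen is used.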
When applying the model to determine if the intersection has an "unusually" bad crash experience, a distinction has to be made as to whether this intersection was or was not used to develop the model.
If the intersection was used in the development of the model and "all" relevant factors were included in the model, and the mathematical form of the model is correct, then the crash counts at any individual intersection should not differ from the modeled value by more than random variation. In this case any differences between the modeled and the actual values cannot be attributed to the factors included in the model. Thus, if an intersection's crash count differs from the modeled value by more than random variation, either the model is incomplete or it is mathematically incorrect. If this occurs the model should be revised. Factors other than those included in the original model must be sought to explain the discrepancy, or a better mathematical form of the model must be found.
Unfortunately, there can be situations where influential factors are not included in the model or where the mathematical form is not correct, yet they are accepted as correct. For example, the data may be configured in a way that allows the model fitting process to include data points that should be outliers. This will bias, possibly dramatically, other coefficients of the model.
These problems do not arise if the model is applied to an intersection that was not in the population used to develop the model. However, different problems can arise. Consider the situation where the assumed mathematical form of the model, while not correct, may still be good enough to represent the data over the range of the independent variables in the data set. If the "new" intersection is outside, or possibly just within, the range of the independent variables used in the development of the model, the actual crash count can deviate substantially from the modeled value. This occurs because model errors tend to increase toward the limits of the range over which the model was calibrated, and may "explode" beyond it. Trying to explain such differences between new cases and an apparently satisfactory model can give completely wrong results.
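The growth of model error outside the calibration range can be illustrated with a small numerical sketch. Everything in it is hypothetical: the "true" relationship is an arbitrary saturating curve, the fitted form is a power function chosen for convenience, and the volumes are invented.

```python
import math


def true_crashes(volume):
    """Hypothetical 'true' relationship (saturating); an assumption for illustration."""
    return 8.0 * (1.0 - math.exp(-volume / 10_000.0))


def fit_power_law(xs, zs):
    """Least-squares fit of ln z = ln a + b ln x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    lz = [math.log(z) for z in zs]
    mx = sum(lx) / len(lx)
    mz = sum(lz) / len(lz)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxz = sum((u - mx) * (w - mz) for u, w in zip(lx, lz))
    b = sxz / sxx
    a = math.exp(mz - b * mx)
    return a, b


def relative_error(volume, a, b):
    """Relative difference between the fitted power law and the 'true' curve."""
    predicted = a * volume ** b
    actual = true_crashes(volume)
    return abs(predicted - actual) / actual


# Calibrate on volumes between 2,000 and 12,000 vehicles per day (hypothetical).
calibration_volumes = list(range(2_000, 12_001, 1_000))
a, b = fit_power_law(calibration_volumes,
                     [true_crashes(v) for v in calibration_volumes])

max_in_range_error = max(relative_error(v, a, b) for v in calibration_volumes)
extrapolation_error = relative_error(40_000, a, b)  # well outside the range

print(f"largest relative error inside the range: {max_in_range_error:.1%}")
print(f"relative error at 40,000 vehicles/day:   {extrapolation_error:.1%}")
```

In this sketch the incorrect power form fits acceptably inside the calibration range, with its largest errors at the ends of that range, and is badly wrong at the out-of-range volume.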
These examples should serve as a warning that it may not be possible to achieve the goal of modeling intersection crashes.
1.3 Some critical assumptions
The difficulties discussed in the previous section can arise when applying models. However, technical problems also can arise when basic assumptions of the modeling process are not satisfied in the development of models.
One important point to remember is that safety features often are installed in response to a perceived "crash problem" that has been identified by high crash counts or by large deviations of crash counts from a model. Such safety features should be included among the variables describing intersection characteristics. Depending on the overall quality of the model, in terms of completeness and realism of its mathematical form, this can distort the model coefficients, and can sometimes show a crash increase effect from the safety feature.
Other problems result from aggregation of data. The mathematical relationship between crashes within the intersection and traffic volumes and other intersection characteristics may be very different from the mathematical relationship between crashes on the approaches to the intersection and traffic volumes and intersection characteristics. However, crashes within the intersection and on its approaches are usually aggregated together as intersection crash counts because both are "intersection related." Even if the intersection and intersection approach crashes followed simple relationships such as
where x and y are the volumes on the intersecting roads, their sum will usually not follow such a relationship. The same holds for aggregation across different crash types.
The daily (and possibly weekly and seasonal) variation of traffic volumes causes another problem. For instance, let x1, ..., xn and y1, ..., yn be the traffic volumes during n time intervals, e.g., of the day, and let x and y be their totals. Assume that for each time interval a relationship
holds. Generally, a corresponding relationship
will not hold for the sums over the time intervals. Rather, the relationship for the total is
where R is the correlation coefficient between the xi and yi. Usually, since traffic tends to be high, or low, on all approaches at the same time, R will be high, and the second term will not vanish. Thus, even if a relationship
or a more complicated, similar relationship holds at any time, a similar one will not hold for aggregated traffic volumes if they vary and are correlated.
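The aggregation effect can be checked numerically. The sketch below assumes the simplest case, in which crashes in each interval are proportional to the product of the two volumes, z_i = c * x_i * y_i. That product form, the constant c, and the hourly volumes are assumptions made here for illustration. Under this assumption the total over n intervals satisfies the algebraic identity z = c * (x*y/n + n*R*s_x*s_y), where s_x and s_y are the population standard deviations of the interval volumes and R is their correlation coefficient.

```python
import math

# Hypothetical hourly volumes on the two crossing roads (not data from the study).
xs = [120, 90, 60, 80, 300, 650, 900, 700, 500, 480, 520, 610]
ys = [40, 30, 20, 25, 110, 260, 400, 310, 200, 190, 210, 240]
c = 1e-6  # hypothetical proportionality constant

n = len(xs)
x_total, y_total = sum(xs), sum(ys)
mean_x, mean_y = x_total / n, y_total / n

# Population standard deviations and correlation coefficient R.
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
R = cov / (s_x * s_y)

# Total obtained by summing the per-interval relationship.
z_sum_of_intervals = c * sum(x * y for x, y in zip(xs, ys))
# Same total written in terms of the aggregated volumes x and y.
z_from_identity = c * (x_total * y_total / n + n * R * s_x * s_y)
# What a pure product of the totals would give if the 1/n factor is absorbed.
z_product_only = c * x_total * y_total / n

print(f"R = {R:.3f}")
print(f"sum over intervals:      {z_sum_of_intervals:.4f}")
print(f"identity with R term:    {z_from_identity:.4f}")
print(f"product of totals only:  {z_product_only:.4f}")
```

Because the volumes on the two roads rise and fall together, R is close to 1 and the second term is far from negligible, so a product of the daily totals alone understates the summed value in this example.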
1.4 The conventional statistical approach
The conventional approach uses statistical techniques to fit an analytical model to the data. Examples of such models are:
where x1, x2 and x3 are variables describing intersection characteristics, including traffic volumes on the approaches and possibly volumes of turning traffic. For qualitative characteristics, "dummy" variables with values of 0 or 1 are used. Statistical techniques to fit such models include regression on transformed variables and maximum likelihood estimates.
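As a minimal sketch of "regression on transformed variables" with a dummy variable, the example below assumes a multiplicative power-function model, z = a * x1^b1 * x2^b2 * exp(d * D), where D is a 0/1 dummy. This functional form, the coefficient values, and the intersections are all hypothetical. Taking logarithms makes the model linear in its coefficients, so ordinary least squares can be applied. The data are generated without random noise so that the fitted coefficients can be compared directly with the values used to generate them.

```python
import math


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) w = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical intersections: (major volume, minor volume, signalized dummy 0/1).
sites = [(12000, 3000, 1), (8000, 1500, 0), (20000, 6000, 1), (5000, 900, 0),
         (15000, 2500, 0), (9000, 4000, 1), (25000, 5000, 1), (7000, 2000, 0)]

# Hypothetical generating model: z = a * x1^b1 * x2^b2 * exp(d * dummy).
a, b1, b2, d = 0.0005, 0.6, 0.4, -0.3
crashes = [a * x1 ** b1 * x2 ** b2 * math.exp(d * s) for x1, x2, s in sites]

# The log transform turns the multiplicative model into a linear one.
rows = [[1.0, math.log(x1), math.log(x2), float(s)] for x1, x2, s in sites]
targets = [math.log(z) for z in crashes]
ln_a_hat, b1_hat, b2_hat, d_hat = ols(rows, targets)

print(f"a = {math.exp(ln_a_hat):.6f}, b1 = {b1_hat:.3f}, "
      f"b2 = {b2_hat:.3f}, d = {d_hat:.3f}")
```

With real crash counts, which are small integers and often zero, the logarithm is not always defined and the error assumptions of least squares are doubtful, which is one reason maximum likelihood methods are used instead.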
If all assumptions underlying these statistical techniques are satisfied, valid estimates of the model parameters can be obtained and the effects of intersection characteristics on crash risk can be determined. However, some of the assumptions are often, if not always, violated. Basic assumptions are that the deviations between the model and the crash counts have expected values of zero and that they are independent. Both assumptions are violated if the model does not reflect the relationship correctly, which is likely, as discussed above. The second assumption is violated if the traffic passing through several intersections in the population is largely the same. Factors such as driver-age distribution, vehicle mix, and possibly trip purpose cause deviations from the average crash risk that are treated as random variations by the model. These deviations will be correlated over the intersections that are passed by largely the same traffic.
Systematic deviations between the assumed mathematical forms of the model and the actual relationship between crash counts and intersection parameters, especially volumes, can have a serious biasing effect. This can occur if the majority of intersections have medium volumes, but a few intersections have high volumes. A similar effect also can occur if there are a few rare intersections with low volumes. In such cases, the model parameters for the volumes will be determined primarily by the majority of the intersections and will provide a good fit in the area covered by them. If the assumed model is not correct, the predicted crash counts for high-volume intersections would show large, systematic deviations. However, if there are characteristics that are mainly present at high-volume intersections, then the statistical algorithms may use the coefficients of these characteristics to reduce the systematic deviation. In extreme cases, a very significant parameter coefficient may appear that depends on a single intersection. To what extent this actually occurs has to be established in each case by a very detailed analysis.
Standard statistical techniques assume that the independent variables are not subject to error. This does not hold for traffic volumes. Traffic volume data can be subject to large errors if special counts are not available. Often, values of average daily traffic (ADT) are carried over long distances of roadway and used for several adjacent intersections. In that case, the errors of the independent variables also will be correlated, complicating an already complex situation even further. For linear regression, the problem has been studied and suitable approaches have been developed. However, this still has to be done for the nonlinear models used for intersection modeling.
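For the linear case mentioned above, the effect of errors in an independent variable can be shown with a small simulation. The sketch assumes independent, normally distributed measurement errors and uses arbitrary units and parameter values, all of which are hypothetical. It does not address the correlated errors or the nonlinear models discussed in the text.

```python
import random

random.seed(12345)


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


n = 5000
true_slope = 2.0

# True volumes (arbitrary units) and a response that depends on them linearly.
true_x = [random.gauss(10.0, 2.0) for _ in range(n)]
y = [true_slope * x + random.gauss(0.0, 1.0) for x in true_x]

# Recorded volumes: true value plus an independent measurement error
# whose standard deviation equals that of the true volumes.
measured_x = [x + random.gauss(0.0, 2.0) for x in true_x]

slope_true_x = ols_slope(true_x, y)
slope_measured_x = ols_slope(measured_x, y)

print(f"slope using true volumes:     {slope_true_x:.2f}")
print(f"slope using measured volumes: {slope_measured_x:.2f}")
```

With error variance equal to the variance of the true volumes, the fitted slope is pulled toward roughly half of its true value, so the estimated effect of volume is understated even though the fitting procedure itself is unchanged.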
The simplest way of assessing the model fit is to test correlation coefficients or similar aggregate measures. This is wholly inadequate. First, the explicit or implicit null hypothesis of such tests is that there is no relationship between the parameters and crash counts. However, even if only one of the included factors has a relationship with crash counts, the test will show a significant relationship, even though many terms of the model may contribute nothing but noise. To recognize this, more sophisticated tests, which separate the contributions of the independent variables, have to be applied. A second problem is that, even if a test shows a high significance level, the modeled relationship may have non-negligible systematic errors.
Many of these problems can be reduced and sometimes even avoided by using better than run-of-the-mill statistical techniques. Serious difficulties, however, can remain if the actual relationship between crash counts and intersection characteristics cannot be expressed by manageable mathematical functions.
1.5 Smoothing techniques
The difficulties arising from relationships that cannot be described by simple mathematical functions can be avoided by using smoothing methods. Smoothing techniques are based on the mapping of the data onto an n+1 dimensional space (where n is the number of independent variables), selecting a grid of appropriate spacing and fitting relatively simple "local" functions to the data points "near" each grid point. The values of these local functions at the grid point or at actual data points are the smoothed values. Smoothing techniques do not provide a relationship in mathematical form and work only for continuous relationships, such as those between crash counts and traffic volumes. Categorical variables must be treated as additive or multiplicative terms, or by splitting the data into subsets.
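A minimal sketch of such a smoother for a single continuous independent variable is given below. It fits a weighted straight line near each grid point and reports the value of that line at the grid point. The Gaussian weighting, the bandwidth, and the data are assumptions chosen for illustration; they are not the specific method used in this study.

```python
import math


def local_linear_smooth(xs, ys, grid, bandwidth):
    """Kernel-weighted local linear fit evaluated at each grid point."""
    smoothed = []
    for g in grid:
        # Gaussian weights: points near the grid point count most.
        w = [math.exp(-0.5 * ((x - g) / bandwidth) ** 2) for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local line at the grid point is the smoothed value.
        smoothed.append(my + slope * (g - mx))
    return smoothed


# Hypothetical data: daily volume (thousands of vehicles) and annual crash count.
volumes = [2, 4, 5, 7, 9, 11, 12, 14, 16, 18]
crash_counts = [0, 1, 1, 2, 2, 3, 2, 4, 3, 5]
grid = [5.0, 10.0, 15.0]

smoothed = local_linear_smooth(volumes, crash_counts, grid, bandwidth=3.0)
for g, s in zip(grid, smoothed):
    print(f"volume {g:5.1f}: smoothed crash count {s:.2f}")
```

The result is a set of smoothed values at the grid points rather than a formula. The bandwidth controls how "near" a data point must be to influence a grid point, and choosing it involves a trade-off between smoothness and fidelity to local detail.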
Smoothing techniques are simple and the results can be easily interpreted if there are one or two continuous independent variables. Fitting the model is only slightly more difficult with more variables, but presenting and interpreting the results becomes complex. The results can be presented in a simple form only if the relations are additive or multiplicative with respect to the continuous variables.
Estimating errors for a smoothed model is more laborious than for traditional analytical models. One approach is to use the error estimate obtained when calculating each smoothed value. Another approach is to split the data set and derive separate models for each part, with the differences between them providing error estimates. Bootstrapping is a similar technique, with an additional feature that allows the incorporation of the effect of "influential" observations into the errors. These approaches give error estimates for each grid point, or each data point. This makes these error estimates more realistic, but more cumbersome, than those obtained from analytical models. It also is possible to define overall error estimates for a smoothed model. Significance testing, however, if at all possible, is very complex.
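The bootstrap approach can be sketched as follows. To keep the example short it uses a kernel-weighted average as the local fit, which is a simpler smoother than a local line and is an assumption made here. The data, including one isolated high-volume intersection, the bandwidth, and the number of resamples are hypothetical.

```python
import math
import random


def kernel_smooth(xs, ys, g, bandwidth):
    """Kernel-weighted average of ys near grid point g (a simple local fit)."""
    w = [math.exp(-0.5 * ((x - g) / bandwidth) ** 2) for x in xs]
    return sum(wi * y for wi, y in zip(w, ys)) / sum(w)


def bootstrap_standard_errors(xs, ys, grid, bandwidth, n_boot=500, seed=0):
    """Resample observations with replacement and re-smooth each time."""
    rng = random.Random(seed)
    n = len(xs)
    replicates = [[] for _ in grid]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        for j, g in enumerate(grid):
            replicates[j].append(kernel_smooth(bx, by, g, bandwidth))
    errors = []
    for values in replicates:
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        errors.append(math.sqrt(var))
    return errors


# Hypothetical data: daily volume (thousands) and annual crash count.
# The last observation is a single isolated high-volume intersection.
volumes = [2, 4, 5, 7, 9, 11, 12, 14, 16, 18, 40]
crash_counts = [0, 1, 1, 2, 2, 3, 2, 4, 3, 5, 15]
grid = [10.0, 40.0]

errors = bootstrap_standard_errors(volumes, crash_counts, grid, bandwidth=3.0)
for g, e in zip(grid, errors):
    print(f"volume {g:5.1f}: bootstrap standard error {e:.2f}")
```

The error estimate at the grid point supported by a single observation is much larger than the one in the well-populated part of the range, because the smoothed value there changes sharply whenever that observation is left out of a resample. This is how the bootstrap carries the effect of an "influential" observation into the error estimate.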