U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000
Federal Highway Administration Research and Technology
Coordinating, Developing, and Delivering Highway Transportation Innovations
REPORT 
This report is an archived publication and may contain dated technical, contact, and link information 

Publication Number: FHWA-HRT-17-104 Date: June 2018 
A comprehensive framework was devised for comparison of the multiobjective calibration results to the AASHTO recommended singleobjective calibration. This comparison framework is based on quantitative and qualitative evaluation of the calibrated models.^{(63)} Figure 17 shows the different steps within the comparison framework.
The measured rutting databases for both LTPP and APT were randomly divided into 80 percent for calibration and 20 percent for validation. Following the calibration of the models using the calibration data, the calibrated prediction models were tested using the validation dataset.
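The random split described above can be sketched as follows. This is a minimal illustration; the function name and fixed seed are assumptions, and the report's own randomization produced the 128/26 and 189/39 subsets listed in the tables rather than a strict 80/20 rounding.

```python
import numpy as np

def split_calibration_validation(n_records, calibration_fraction=0.8, seed=0):
    """Randomly divide record indices into calibration and validation subsets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_records)            # shuffle 0..n_records-1
    n_cal = int(round(calibration_fraction * n_records))
    return indices[:n_cal], indices[n_cal:]

# Example with the 154 combined SPS1 rutting records
cal_idx, val_idx = split_calibration_validation(154)
print(len(cal_idx), len(val_idx))  # 123 31
```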
Source: FHWA.
Figure 17. Flowchart. Framework for comparison of the calibrated performance models.
For quantitative evaluation, accuracy, precision, and generalization capability of the models are assessed using the calibration and validation datasets. To evaluate the accuracy of the multiobjective approach compared to the conventional singleobjective calibration, statistical significance testing is required to determine whether the bias between measured and predicted performance values is statistically significant before and after each calibration process.
The bias (average error) is an estimate of the model accuracy, and the STE represents the precision of the performance model. According to the AASHTO recommended calibration guidelines, the STE of the calibrated model is compared to the STE from the national global calibration to determine its significance.
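That significance check can be sketched with a paired t-test on measured versus predicted values; the rut-depth arrays below are invented illustration numbers, not report data.

```python
import numpy as np
from scipy import stats

# Illustrative: test whether the mean prediction error (bias) differs
# significantly from zero, as in the AASHTO verification step.
measured  = np.array([4.0, 5.5, 6.1, 7.2, 8.0, 9.3, 10.1, 11.0])       # mm
predicted = np.array([12.1, 13.0, 14.2, 15.5, 16.0, 17.4, 18.2, 19.1])  # mm

bias = np.mean(predicted - measured)          # average error (accuracy)
ste = np.std(predicted - measured, ddof=1)    # std. dev. of error (precision)
t_stat, p_value = stats.ttest_rel(predicted, measured)

print(f"bias = {bias:.2f} mm, STE = {ste:.2f} mm, p = {p_value:.3g}")
# p < 0.05 here, so the bias would be flagged as significant
```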
Once the models are calibrated, they are validated on a validation dataset to check for the reasonableness of performance predictions. This step also provides a measure of the generalization capability (repeatability) of the calibrated models. The model that generalizes better will be the model that matches measured pavement performance equally well on both the validation and the calibration datasets.
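The generalization capability reported in the tables that follow appears to be 100 minus the difference in bias between the two datasets, normalized by the calibration-set bias; that normalization is an inference on our part, but it reproduces the values reported for the global-model verification:

```python
def generalization_capability(bias_cal, bias_val):
    """100 minus the normalized difference in bias between the calibration
    and validation datasets (normalizing by the calibration bias is an
    assumed interpretation of the table footnotes)."""
    return 100.0 * (1.0 - abs(bias_cal - bias_val) / abs(bias_cal))

# Table 26 biases: +8.05 mm (calibration) and +8.34 mm (validation)
print(round(generalization_capability(8.05, 8.34), 1))    # 96.4
# Table 27 biases: +16.39 mm and +15.71 mm
print(round(generalization_capability(16.39, 15.71), 2))  # 95.85
```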
For qualitative evaluation of the calibrated models, scatterplots of measured versus predicted performance on the calibration and validation datasets were used. Another qualitative assessment could be a sensitivity analysis of the calibrated models to the changes in input variables, which is a substantial task and beyond the scope of the current project. The reasonableness of this model behavior can be compared between the singleobjective and multiobjective calibration methods. An additional qualitative evaluation was conducted by comparing the shape of the distribution of measured versus predicted pavement performance. Statistical distribution shape descriptors such as nonparametric skewness and kurtosis can be used for this purpose.
In addition, the predicted rate of change in performance indices was examined against measured deterioration trends. Overall, a combination of quantitative and qualitative evaluations was implemented to compare the different calibrated models.
This comparison of the multiobjective approach to the conventional singleobjective method is demonstrated in calibrating rutting models for new AC pavements using LTPP SPS1 Florida site data, new AC pavements using FDOT APT data, and overlaid AC pavements using LTPP SPS5 Florida site data.
A singleobjective calibration according to the AASHTO guidelines was conducted, the results of which serve as a comparison baseline. The AASHTO recommended approach includes an 11-step procedure for “verification,” “calibration,” and “validation” of the MEPDG models for local conditions. The verification involves an examination of the accuracy and precision of the global (nationally calibrated) model on the local dataset.
Table 26 and table 27 list the important statistics regarding this verification of the rutting models for new pavements using Florida SPS1 site data and overlaid pavements using Florida SPS5 site data, respectively. As expected, the results show a significant positive bias and a high standard deviation of error because the global model was calibrated based on all LTPP data from across North America. The next steps will demonstrate the AASHTO recommended procedure for “eliminating” the bias and potentially reducing the STE.
Table 26. “Verification” of the global rutting model for new pavements on Florida SPS1.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  128  26  154 
SSE  11,135.13 mm^{2} (17.26 inch^{2})  2,396.75 mm^{2} (3.72 inch^{2})  13,531.88 mm^{2} (20.97 inch^{2}) 
RMSE  9.33 mm (0.367 inch)  9.60 mm (0.379 inch)  9.37 mm (0.370 inch) 
Bias  +8.05 mm (0.317 inch)  +8.34 mm (0.328 inch)  +8.10 mm (0.319 inch) 
p value (paired t-test) for bias  5.11E-39 < 0.05; significant bias  1.28E-08 < 0.05; significant bias  5.51E-47 < 0.05; significant bias 
STE  4.73 mm (0.186 inch)  4.85 mm (0.191 inch)  4.73 mm (0.186 inch) 
Generalization capability  N/A  N/A  96.4% 
R^{2} (goodness of fit)  0.0391  0.123  0.0487 
RMSE = root-mean-squared error; generalization capability = 100 – normalized difference in bias between the calibration and validation datasets. N/A = not applicable. 
Table 27. “Verification” of the global rutting model for overlaid pavements on Florida SPS5.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  189  39  228 
SSE  62,207.04 mm^{2} (96.42 inch^{2})  11,607.67 mm^{2} (17.99 inch^{2})  73,814.71 mm^{2} (114.41 inch^{2}) 
RMSE  18.14 mm (0.714 inch)  17.25 mm (0.680 inch)  17.99 mm (0.709 inch) 
Bias  +16.39 mm (0.645 inch)  +15.71 mm (0.618 inch)  +16.27 mm (0.640 inch) 
p value (paired t-test) for bias  7.40E-71 < 0.05; significant bias  1.18E-15 < 0.05; significant bias  9.78E-86 < 0.05; significant bias 
STE  7.80 mm (0.311 inch)  7.22 mm (0.284 inch)  7.70 mm (0.304 inch) 
Generalization capability  N/A  N/A  95.85% 
R^{2} (goodness of fit)  0.0609  0.0018  0.0504 
N/A = not applicable; RMSE = root-mean-squared error. 
At the seventh step of the 11-step AASHTO recommended calibration procedure, the significance of the bias (the average difference between predicted and measured performance) is tested. If there is a significant bias in prediction of pavement performance measures, the first round of calibration is conducted at the eighth step to eliminate bias. During this step, the SSE is minimized by adjusting the β_{r1}, β_{GB}, and β_{SG} calibration factors.
At the ninth step, the STE (standard deviation of error among the calibration dataset) is evaluated by comparing it to the STE from the national global calibration, which was about 0.1 inch. If there is a significant STE, the second round of calibration at the tenth step tries to reduce the STE by adjusting the β_{r2} and β_{r3} calibration factors. A final validation step checks for the reasonableness of performance predictions on the validation dataset that has not been used for model calibration.
Global heuristic optimization methods such as EAs can potentially identify a better set of calibration coefficients than local exhaustive search methods. For this reason, a GA was used in this project for singleobjective optimization in calibrating rutting models for new AC pavements using the LTPP SPS1 Florida site data and overlaid AC pavements using the LTPP SPS5 Florida site data.
Figure 18 shows a dynamic plot of SSE over successive optimization iterations for singleobjective calibration on Florida SPS1 data. The optimization is stopped when the SSE for the best member of the population changes by no more than 1 mm^{2} for at least 10 consecutive generations (in this case, after 34 iterations).
Source: FHWA.
Figure 18. Chart. Dynamic plot of SSE in singleobjective optimization on Florida SPS1 data.
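The stopping rule described for figure 18 (best-member SSE changing by no more than 1 mm^{2} for 10 consecutive generations) can be sketched as a simple convergence check; the function name and example history are hypothetical.

```python
def has_converged(sse_history, tol=1.0, patience=10):
    """Stop when the best SSE (mm^2) changes by no more than `tol` for at
    least `patience` consecutive generations. `sse_history` holds the
    best-member SSE recorded at each generation."""
    if len(sse_history) < patience + 1:
        return False
    recent = sse_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) <= tol
               for i in range(patience))

# Example: SSE improves early, then plateaus for 10+ generations
history = [5000, 2100, 900, 600, 500] + [450.0] * 11
print(has_converged(history))  # True
```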
Table 28 lists the important statistics regarding these singleobjective calibration results of rutting models for new pavements on Florida SPS1. The final results of the singleobjective minimization of SSE demonstrate a p value (0.096) greater than 0.05 for a paired t-test of measured versus predicted total rut depth on calibration data. Therefore, there is not enough evidence to reject the null hypothesis of the bias being insignificant. It also appears that continuing this optimization process would not significantly improve the results (figure 18). Moreover, the p value is much higher (0.47) for the validation dataset, so the bias appears even less significant on the validation dataset, which was not used for model calibration. The negative bias indicates that the calibrated model underpredicts rut depth values on average by 0.25 mm, which is insignificant compared to the 1-mm accuracy of manual rut depth measurements in the LTPP program.
There seems to be a lack of precision of the calibrated model demonstrated by the high amount of scatter in figure 19. However, since the overall STE is lower than the national global calibration results (0.068 inch < 0.129 inch), according to AASHTO calibration guidelines, there is no need for the second round of optimization to reduce STE. Therefore, the following are the selected calibration factors:
β_{r1} = 0.522, β_{r2} = 1.0, β_{r3} = 1.0, β_{GB} = 0.011, and β_{SG} = 0.171.
Table 28. Singleobjective calibration results of rutting models for new pavements on Florida SPS1.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  128  26  154 
SSE  409.44 mm^{2} (0.64 inch^{2})  59.70 mm^{2} (0.09 inch^{2})  469.14 mm^{2} (0.73 inch^{2}) 
RMSE  1.79 mm (0.071 inch)  1.52 mm (0.060 inch)  1.75 mm (0.069 inch) 
Bias  –0.26 mm (–0.010 inch)  –0.20 mm (–0.008 inch)  –0.25 mm (–0.010 inch) 
p value (paired t-test) for bias  0.096 > 0.05; insignificant bias  0.47 > 0.05; insignificant bias  0.072 > 0.05; insignificant bias 
STE  1.78 mm (0.070 inch)  1.53 mm (0.060 inch)  1.73 mm (0.068 inch) 
Generalization capability  N/A  N/A  70% 
R^{2} (goodness of fit)  0.1519  0.2553  0.165 
N/A = not applicable; RMSE = root-mean-squared error. 
Figure 19 and figure 20 show scatterplots of the measured versus predicted total rut depth on calibration (128 records) and validation (26 records) datasets, respectively, for Florida SPS1. The measured rut depth data are discrete values at every 1 mm (the plots show 0.5-mm increments because the rut depth values are averaged between the left and right wheelpaths), while the predicted rut depth data are continuous values. Even though the bias and the STE are relatively low on both the calibration and validation datasets, there is a significant amount of scatter in these plots, and the goodness-of-fit indicator (R^{2}) is poor. This scatter has two likely causes. First and foremost, the calibrated model does not exhibit adequate precision, which could be traced back to the lack of precision of the global model. Second, there is a high degree of variation (perhaps due to construction quality and environmental variability) in measured rut depths along each pavement section (and between the two wheelpaths) at each distress survey, while the model can only produce one value (at 50 percent reliability) for each pavement section at each specified time, which adds to the scatter observed in these plots.
Source: FHWA.
Figure 19. Scatterplot. Measured versus predicted singleobjective calibration results of rutting models for new pavements on calibration dataset for Florida SPS1.
Source: FHWA.
Figure 20. Scatterplot. Measured versus predicted singleobjective calibration results of rutting models for new pavements on validation dataset for Florida SPS1.
Figure 21 shows a dynamic plot of SSE over successive optimization iterations for singleobjective calibration on Florida SPS5 data. The optimization is stopped when the SSE for the best member of the population changes by no more than 1 mm^{2} for at least 10 consecutive generations (in this case, after 36 iterations).
Source: FHWA.
Figure 21. Chart. Dynamic plot of SSE in singleobjective optimization on Florida SPS5 data.
Table 29 lists the important statistics regarding these singleobjective calibration results of rutting models for overlaid pavements on Florida SPS5. The final results of the singleobjective minimization of SSE demonstrate a low p value (0.0005) for a paired t-test of measured versus predicted total rut depth on calibration data. Therefore, the null hypothesis of the bias being insignificant has been rejected. However, it seems that the continuation of this optimization process will not significantly improve the results (figure 21). The p value is higher (0.013) for the validation dataset, but the bias is still significant on the validation dataset as well.
Table 29. Singleobjective calibration results of rutting models for overlaid pavements on Florida SPS5.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  189  39  228 
SSE  820.24 mm^{2} (1.27 inch^{2})  128.37 mm^{2} (0.20 inch^{2})  948.61 mm^{2} (1.47 inch^{2}) 
RMSE  2.08 mm (0.082 inch)  1.81 mm (0.071 inch)  2.04 mm (0.080 inch) 
Bias  –0.53 mm (–0.021 inch)  –0.72 mm (–0.028 inch)  –0.56 mm (–0.022 inch) 
p value (paired t-test) for bias  0.0005 < 0.05; significant bias  0.013 < 0.05; significant bias  3.18E-05 < 0.05; significant bias 
STE  2.02 mm (0.080 inch)  1.69 mm (0.066 inch)  1.97 mm (0.078 inch) 
Generalization capability  N/A  N/A  64.15% 
R^{2} (goodness of fit)  0.1196  0.0213  0.1073 
N/A = not applicable; RMSE = root-mean-squared error. 
Since the overall STE is lower than the national global calibration results (0.078 inch < 0.129 inch), according to AASHTO calibration guidelines, there is no need for the second round of optimization to reduce STE. Therefore, the following are the selected calibration factors:
β_{r1} = 0.5004, β_{r2} = 1.0, β_{r3} = 1.0, β_{GB} = 0.0738, and β_{SG} = 0.1554.
Figure 22 and figure 23 show scatterplots of the measured versus predicted total rut depth on calibration (189 records) and validation (39 records) datasets, respectively, for Florida SPS5. Similar to the results on SPS1 data, there is a significant amount of scatter in these plots, and the goodness-of-fit indicator (R^{2}) is poor. The negative bias indicates that the calibrated model underpredicts rut depth values on average by 0.56 mm. While this bias is statistically significant, it is low compared to the 1-mm accuracy of manual rut depth measurements in the LTPP program.
Source: FHWA.
Figure 22. Scatterplot. Measured versus predicted singleobjective calibration results of rutting models for overlaid pavements on calibration dataset for Florida SPS5.
Source: FHWA.
Figure 23. Scatterplot. Measured versus predicted singleobjective calibration results of rutting models for overlaid pavements on validation dataset for Florida SPS5.
As discussed in chapter 4, several scenarios can be devised for multiobjective formulation of calibration, all of which could overcome cognitive challenges and add to our knowledge of this problem. The following sections demonstrate the results of the two considered scenarios—the first for simultaneous utilization of multiple statistical data in the calibration process, and the second for objective incorporation of data from multiple disparate sources. Note that in the multiobjective calibration approaches, unlike the singleobjective calibration, all of the involved calibration factors are evolved to determine the suitable factors.
The general goal of any multiobjective optimization is to identify the Paretooptimal tradeoffs between multiple objectives. A Paretooptimal tradeoff represents the best compromise among the multiple conflicting objectives. The solutions presented on the Paretooptimal front are nondominated solutions. A solution X1 is called dominated if, and only if, another feasible solution X2 performs at least as well as X1 in terms of every objective and strictly better in terms of at least one. A set of nondominated solutions is called the nondominated or Paretooptimal front. This nondominated solution set might contain information that advances knowledge of the problem at hand.
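This dominance definition translates directly into a nondominated filter. The sketch below assumes minimization of two objectives, labeled (SSE, STE) with invented values:

```python
def dominates(a, b):
    """True if solution a dominates b (minimization): a is at least as good
    in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Return the nondominated subset of a list of objective vectors
    (duplicate vectors would all be kept by this simple check)."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions if other != s)]

# Objective vectors as (SSE, STE); purely illustrative numbers
points = [(500, 1.7), (620, 1.6), (550, 1.9), (480, 1.8), (700, 1.5)]
print(sorted(pareto_front(points)))  # (550, 1.9) is dominated and dropped
```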
The final solution can be selected from the set using engineering judgment and/or other qualitative criteria. Statistical distribution shape descriptors (skewness and kurtosis) can be used in conjunction with engineering judgment to better match the distribution of measured and calculated rutting data on LTPP test sections. Difference in nonparametric skew (as calculated using equation 46) between the distributions of measured and calculated rutting values could be one of the criteria used to select the final solution among the nondominated pool of solutions. Skewness quantifies how symmetrical the distribution is.
NPS = (μ − ν)/σ (46)
Where:
NPS = nonparametric skewness.
μ = mean.
ν = median.
σ = standard deviation.
Another criterion is the difference in kurtosis (as calculated using equation 47) between the distributions of measured and calculated rutting values. Kurtosis quantifies whether the shape of the data distribution matches the Gaussian distribution. A flatter distribution has a negative kurtosis, and a distribution more peaked than a Gaussian distribution has a positive kurtosis.
Kurtosis = (1/n) Σ [(x_{i} − μ)/σ]^{4} − 3 (47)
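Equations 46 and 47 can be evaluated directly. In this sketch the helper names are illustrative, and equation 47 is implemented as excess kurtosis with population moments, an assumption consistent with the text's convention that a Gaussian-shaped distribution scores zero:

```python
import statistics

def nonparametric_skewness(data):
    """Equation 46: (mean - median) / standard deviation."""
    mu = statistics.mean(data)
    nu = statistics.median(data)
    sigma = statistics.pstdev(data)
    return (mu - nu) / sigma

def excess_kurtosis(data):
    """Equation 47 as excess kurtosis: standardized fourth moment minus 3,
    so a Gaussian scores 0, flatter distributions are negative, and more
    peaked distributions are positive."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    n = len(data)
    return sum(((x - mu) / sigma) ** 4 for x in data) / n - 3.0

print(nonparametric_skewness([1, 2, 3]))  # 0.0 for symmetric data
print(excess_kurtosis([-1, 1, -1, 1]))    # -2.0 (flat two-point distribution)
```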
In the first alternative scenario, mean and standard deviation of prediction error were simultaneously minimized to reduce the bias and STE (increase model accuracy and precision) at the same time. In this manner, the information from a single calibration run was fully implemented, and an additional round of computationally intensive calibration was avoided.
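Under this scenario, each candidate set of calibration factors maps to a two-element objective vector. A minimal sketch of that evaluation follows (the helper name is ours; per figure 24, the two objectives are SSE and STE):

```python
import numpy as np

def two_objective_values(measured, predicted):
    """Scenario 1 objective vector: SSE (mm^2), which drives down the
    average error (bias), and STE (mm), the standard deviation of error."""
    error = np.asarray(predicted, dtype=float) - np.asarray(measured, dtype=float)
    sse = float(np.sum(error ** 2))
    ste = float(np.std(error, ddof=1))
    return sse, ste

# Invented rut depths (mm) for illustration
sse, ste = two_objective_values([3.0, 4.0, 5.0, 6.0], [3.5, 3.8, 5.4, 6.1])
print(round(sse, 2), round(ste, 3))  # 0.46 0.316
```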
Figure 24 presents the final Paretooptimal front showing the nondominated solutions in terms of SSE and STE for the SPS1 data. As explained in chapter 4, it cannot be proven whether the selected objective functions are in conflict. In other words, decreasing SSE might not necessarily decrease STE. As shown in this graph, the results of this specific optimization indicate that decreasing one objective might result in increasing the other. Since the two objectives of minimizing bias (increasing accuracy) and minimizing STE (increasing precision) seem to be conflicting objectives in this case, the application of a multiobjective optimization algorithm is justified. This means that a final calibrated model that exhibits higher accuracy might not necessarily have higher precision as well. In this specific case, the difference in STE is not significant among the different solutions.
Source: FHWA.
Figure 24. Scatterplot. The final nondominated solution set for twoobjective calibration of rutting models for new pavements on Florida LTPP SPS1 data.
All the solutions on this nondominated front are valid solutions in terms of the mathematical optimization problem at hand. This is because no solution is better than another solution in terms of all objective functions. A solution might perform better than another in terms of one objective function (e.g., higher accuracy), but it will be performing worse in terms of the other objective function (e.g., lower precision). However, qualitative criteria and engineering judgment can be applied when selecting the most reasonable solution from this front.
Table 30 shows two of the candidate solutions on the final nondominated front for SPS1 data (figure 24). These solutions have the minimum difference in skewness and kurtosis between the predicted and measured rutting values. Based on these results, the solution with minimum skewness difference seems to be the suitable solution, as its difference in kurtosis is not much higher than the solution with minimum kurtosis difference:
β_{r1} = 0.54, β_{r2} = 0.79, β_{r3} = 1.16, β_{GB} = 0.01, and β_{SG} = 0.09.
Table 30. Candidate solutions from the twoobjective nondominated front for SPS1, with minimum difference in skewness and kurtosis between predicted and measured distributions.
Candidate Solutions  β_{r1}  β_{r2}  β_{r3}  β_{GB}  β_{SG}  SSE (mm^{2})  STE (mm)  Skewness Difference (%)  Kurtosis Difference (%) 

Minimum difference in skewness  0.5357  0.7945  1.1652  0.0102  0.0870  494.19  1.66  43.27  34.48 
Minimum difference in kurtosis  0.5224  0.8152  1.1548  0.0101  0.0118  617.18  1.79  55.40  33.18 
It should be noted that this solution (highlighted in figure 24) also provides a more reasonable (farther from zero) calibration factor for the subgrade rutting, compared to the singleobjective calibration results. The calibration factor for rutting in the base layer seems insignificant, but all the solutions on the nondominated front shared this issue. Since trench data were not available for this study, the models could not be calibrated accurately for the rutting in unbound layers, as they are often overwhelmed by bound layers with higher stiffness.
The final results in table 31 demonstrate a low p value for a paired t-test of measured versus predicted total rut depth on the calibration dataset. Therefore, the null hypothesis of the bias being insignificant has been rejected. The p value is higher for the validation dataset (0.001), but the bias is still significant on the validation dataset as well. The negative bias indicates that the calibrated model underpredicts rut depth values on average by 1.05 mm, which is on the order of the 1-mm accuracy of manual rut depth measurements in the LTPP program.
Table 31. Twoobjective calibration results of rutting models for new pavements on Florida SPS1.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  128  26  154 
SSE  494.19 mm^{2} (0.77 inch^{2})  72.41 mm^{2} (0.11 inch^{2})  566.60 mm^{2} (0.88 inch^{2}) 
RMSE  1.96 mm (0.077 inch)  1.67 mm (0.066 inch)  1.92 mm (0.076 inch) 
Bias  –1.06 mm (–0.042 inch)  –1.01 mm (–0.040 inch)  –1.05 mm (–0.041 inch) 
p value (paired t-test) for bias  4.56E-11 < 0.05; significant bias  0.001 < 0.05; significant bias  1.68E-13 < 0.05; significant bias 
STE  1.66 mm (0.065 inch)  1.36 mm (0.054 inch)  1.61 mm (0.063 inch) 
Generalization capability  N/A  N/A  95.28% 
R^{2} (goodness of fit)  0.1778  0.3067  0.194 
N/A = not applicable; RMSE = root-mean-squared error. 
In comparison to the singleobjective calibration results, the twoobjective calibration has resulted in lower accuracy (a higher bias), but higher precision (a lower STE), of the final model on both the calibration and validation datasets. This is because the standard deviation of error was also minimized simultaneously with the SSE. In addition to an increased precision, the other improvement is the generalization capability of the calibrated model. The model that was calibrated using twoobjective optimization had more similar bias values on calibration and validation data, compared to the singleobjective results.
Figure 25 and figure 26 show scatterplots of the measured versus predicted total rut depth on calibration (128 records) and validation (26 records) datasets, respectively, for Florida SPS1. Similar to the singleobjective calibration results, there is a significant amount of scatter in these plots, and the goodness-of-fit indicator (R^{2}) is poor. As explained before, this scatter indicates the lack of precision of the rutting model. However, the overall STE is lower than the national global calibration results (0.063 inch < 0.129 inch).
Overall, it seems that the twoobjective optimization has increased the precision and generalization capability of the final calibrated model at the cost of decreasing accuracy. Since the changes in the results are not significant, conducting scenario 1 of the multiobjective optimization might not be worth the computational cost.
Source: FHWA.
Figure 25. Scatterplot. Measured versus predicted twoobjective calibration results of rutting models for new pavements on calibration dataset for Florida SPS1.
Source: FHWA.
Figure 26. Scatterplot. Measured versus predicted twoobjective calibration results of rutting models for new pavements on validation dataset for Florida SPS1.
From the twoobjective Paretooptimal front on SPS1 (figure 24), it seems that using an epsilon parameter (MOEA precision factor) of 0.1 was too conservative and has produced too many solutions with similar objective function values. A higher epsilon value (equal to 1.0 instead of 0.1) was used for the twoobjective calibration on SPS5 data, and that has increased the speed and efficiency of the multiobjective optimization drastically (by about 70 percent). Figure 27 shows the final Paretooptimal front showing the nondominated solutions in terms of SSE and STE for the SPS5 data.
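The epsilon effect can be illustrated with a box-thinning sketch: objective space is partitioned into boxes of side epsilon, and at most one solution is retained per box, so a coarser epsilon keeps fewer near-duplicate solutions. This is a simplified stand-in for true epsilon-dominance archiving, and the numbers are invented:

```python
import math

def epsilon_thin(solutions, epsilon):
    """Keep at most one solution per epsilon-sized box in objective space.
    A larger epsilon retains fewer, more distinct solutions."""
    boxes = {}
    for sol in solutions:
        key = tuple(math.floor(f / epsilon) for f in sol)
        boxes.setdefault(key, sol)  # keep the first solution in each box
    return list(boxes.values())

# Objective vectors as (SSE, STE); two are near-duplicates
points = [(500.05, 1.61), (500.08, 1.62), (500.9, 1.69), (620.3, 1.55)]
print(len(epsilon_thin(points, 0.1)), len(epsilon_thin(points, 1.0)))  # 3 2
```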
Again, the skewness and kurtosis of the predicted rutting distribution with every solution on the nondominated front were calculated and compared to the measured values (table 32). From these calculations, it is evident that one of the solutions on the final front produces rutting predictions that have the lowest difference in distribution skewness and kurtosis from the distribution of measured rut depth values. However, there is another similar solution (in bold font) that has resulted in the most reasonable calibration factor for the subgrade rutting. The final selected solution is also highlighted in figure 27.
Source: FHWA.
Figure 27. Scatterplot. The final nondominated solution set for twoobjective calibration of rutting models for overlaid pavements on Florida LTPP SPS5 data.
Table 32. Solutions from the twoobjective nondominated front for SPS5, with difference in skewness and kurtosis between predicted and measured data distributions.
Candidate Solutions  β_{r1}  β_{r2}  β_{r3}  β_{GB}  β_{SG}  SSE (mm^{2})  STE (mm)  Skewness Difference (%)  Kurtosis Difference (%) 

Other viable solution  1.8778  0.7588  0.5071  0.4438  0.0226  1,188.72  1.49  52.33  19.48 
Other viable solution  2.1152  0.9602  0.5071  0.4405  0.0246  1,070.47  1.51  58.65  29.54 
Other viable solution  2.1152  0.9602  0.5071  0.5877  0.0246  744.75  1.63  61.78  25.72 
Other viable solution  2.1185  0.9939  0.5069  0.4370  0.0256  1,039.45  1.52  51.96  33.24 
Other viable solution  2.1185  0.9939  0.5069  0.4746  0.0225  961.70  1.54  59.60  31.58 
Minimum skewness and kurtosis difference  2.9712  0.7589  0.5071  0.5877  0.0253  787.72  1.61  51.89  19.28 
Other viable solution  2.9838  0.7589  0.5071  0.5877  0.0540  671.14  1.66  56.40  22.83 
Other viable solution  2.9844  0.7589  0.5071  0.5877  0.0887  541.80  1.64  53.49  25.64 
Other viable solution  2.9844  0.9599  0.5071  0.5877  0.0500  643.39  1.70  56.25  31.35 
The final selected solution seems to have reasonable calibration factors for base and subgrade rutting: β_{r1} = 2.98, β_{r2} = 0.76, β_{r3} = 0.51, β_{GB} = 0.59, and β_{SG} = 0.09. Table 33 shows the prediction results of this final solution on SPS5 calibration and validation datasets. While the low bias value of –0.44 mm is much less than the LTPP measurement precision of 1 mm, the bias is statistically significant on the calibration dataset. However, the bias is statistically insignificant on the validation dataset that was not used in the calibration process.
Table 33. Twoobjective calibration results of rutting models for overlaid pavements on Florida SPS5.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  189  39  228 
SSE  541.80 mm^{2} (0.84 inch^{2})  91.65 mm^{2} (0.14 inch^{2})  633.45 mm^{2} (0.98 inch^{2}) 
RMSE  1.69 mm (0.066 inch)  1.53 mm (0.060 inch)  1.67 mm (0.066 inch) 
Bias  –0.44 mm (–0.017 inch)  –0.47 mm (–0.018 inch)  –0.45 mm (–0.018 inch) 
p value (paired t-test) for bias  0.0003 < 0.05; significant bias  0.051 > 0.05; insignificant bias  4.16E-05 < 0.05; significant bias 
STE  1.64 mm (0.065 inch)  1.48 mm (0.058 inch)  1.61 mm (0.063 inch) 
Generalization capability  N/A  N/A  93.18% 
R^{2} (goodness of fit)  0.0427  0.0013  0.0348 
N/A = not applicable; RMSE = root-mean-squared error. 
Figure 28 and figure 29 show scatterplots of the measured versus predicted total rut depth on calibration (189 records) and validation (39 records) datasets, respectively, for Florida SPS5. Similar to the singleobjective calibration, there is significant scatter in these plots. However, the standard deviation of error is lower for the model that was calibrated using the twoobjective approach compared to the singleobjective approach. The overall standard deviation of error is much lower than the STE from the national global calibration (0.063 inch < 0.129 inch).
Compared to the singleobjective calibration, the rutting model for overlaid pavements (SPS5) that was calibrated using the twoobjective approach shows both a higher accuracy and higher precision. In addition, the twoobjective calibration has resulted in a model with much greater generalization capability.
Source: FHWA.
Figure 28. Scatterplot. Measured versus predicted twoobjective calibration results of rutting models for overlaid pavements on calibration dataset for Florida SPS5.
Source: FHWA.
Figure 29. Scatterplot. Measured versus predicted twoobjective calibration results of rutting models for overlaid pavements on validation dataset for Florida SPS5.
In the second scenario, the SSE and STE in predicting permanent deformation of pavements within different performance data sources were used as separate objective functions to be minimized simultaneously. This fourobjective optimization was used in calibrating rutting models for new AC pavements using the LTPP SPS1 Florida site data and the FDOT APT data at the same time. In this manner, information from pavement performance near the end of its service life can be incorporated in the calibration process. This scenario comprised an objective approach to incorporate different sources of data. This scenario was not applied to overlaid pavements because the available APT data were only for new pavement structures.
Figure 30 shows the Paretooptimal front for the fourobjective calibration of the rutting models for new pavements using SPS1 and APT data, simultaneously. In this figure, F1 and F2 are SSE and STE on Florida LTPP SPS1 data, and F3 and F4 are SSE and STE on FDOT APT data. While F1, F2, and F3 are indicated on the three dimensions, the values for F4 are indicated using a color chart, where lower values are closer to lighter gray. Figure 31 shows another approach to visualizing the final nondominated front for this fourobjective optimization. In this figure, rootmeansquared error (RMSE) is shown instead of SSE so that the values of the average error objective functions (F1 and F3) are relatively in the same scale and comparable with standard deviation functions (F2 and F4). Unlike figure 30, where each solution (set of calibration factors) was indicated using a circle, in figure 31, each solution is represented by a line that crosses the objective function axes at different locations.
Figure 32 and figure 33 exhibit twodimensional representations of the final Paretooptimal front using pairwise comparison of SSE and STE, respectively. Each figure compares the average error or standard deviation of error on SPS1 data to that on APT data. Note that some solutions on these fronts might seem suboptimal; however, that is because these figures are showing a twodimensional shadow of the final fourdimensional nondominated front on one plane. Also note that there is a crowd of solutions on these fronts because an epsilon value of 0.1 was used; an epsilon value of 1.0 would result in a more reasonable density of solutions in the final front. Similar to the first multiobjective calibration scenario, these figures show a conflict between F1 and F3 and between F2 and F4, which advocates for the application of multiobjective optimization. The advantage of using the multiobjective approach in this case is that solutions closer to the performance measured on SPS1 sections can be preferred over other solutions closer to the performance observed in APT data. This is because the LTPP sections are actual inservice highway pavements exposed to actual climate and traffic, as opposed to APT sections, which have been exposed to simulated accelerated loading under a constant temperature.
Source: FHWA.
Figure 30. Scatterplot. The final nondominated solution set for four-objective calibration of rutting models for new pavements: F1 and F2 are SSE and STE on Florida LTPP SPS1 data, and F3 and F4 are SSE and STE on FDOT APT data.
Source: FHWA.
Figure 31. Chart. The final nondominated solution set for four-objective calibration of rutting models for new pavements: RMSE and STE on Florida SPS1 and FDOT APT data.
Source: FHWA.
Figure 32. Scatterplot. Two-dimensional representation of the final nondominated solution set for four-objective calibration: SSE on Florida LTPP SPS1 versus SSE on FDOT APT data.
Source: FHWA.
Figure 33. Scatterplot. Two-dimensional representation of the final nondominated solution set for four-objective calibration: STE on Florida LTPP SPS1 versus STE on FDOT APT data.
Table 34 shows several viable candidate solutions from the final nondominated front. The solution with the minimum difference in kurtosis between the predicted and measured SPS1 data exhibits a high difference in skewness, so other candidates with a better balance were sought. Some solutions have very low values for the calibration factors of either the base or the subgrade layer, and those solutions were avoided. On balance, the solution with the minimum difference in kurtosis remains the best candidate, as it exhibits reasonable calibration factors together with a relatively good distribution of predicted values (in addition to lying on the final nondominated front of the multiobjective optimization). Note that this solution offers more reasonable calibration factors for the base and subgrade layers than the solutions from the single-objective and two-objective calibrations (where all of the solutions had insignificant values for the calibration coefficient of the base layer):
β_{r1} = 3.84, β_{r2} = 0.96, β_{r3} = 0.56, β_{GB} = 0.14, and β_{SG} = 0.79.
Table 34. Candidate solutions from the four-objective nondominated front with differences in skewness and kurtosis between the predicted and measured data distributions.
Candidate Solutions  β_{r1}  β_{r2}  β_{r3}  β_{GB}  β_{SG}  SSE on LTPP Data (mm^{2})  STE on LTPP Data (mm)  Skewness Difference (%)  Kurtosis Difference (%) 

Minimum difference in kurtosis  3.8397  0.9549  0.5625  0.1377  0.7861  539.53  2.05  105.03  29.52 
Minimum difference in skewness  3.8476  0.6471  0.8332  1.4298  0.4571  1,288.64  3.03  11.41  53.79 
Other viable solution  1.2356  0.6223  0.7848  0.7718  0.7178  893.73  2.64  30.82  44.60 
Other viable solution  6.2883  0.5961  0.9930  0.0104  0.5231  736.54  2.27  139.10  35.43 
Other viable solution  1.0952  1.4996  0.5051  1.3954  0.0140  2,354.96  3.99  58.50  52.62 
Other viable solution  1.0952  1.4559  0.5016  0.0560  0.0189  823.28  2.28  151.96  41.42 
Other viable solution  0.9257  0.8089  1.0421  0.0206  0.1975  1,090.185  1.87  139.77  38.17 
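The skewness and kurtosis differences reported in table 34 can be computed from the measured and predicted rut depth samples. The sketch below assumes population moment definitions (Pearson kurtosis, not excess kurtosis) and a simple percent difference; the report does not state which conventions were used.

```python
def moments(x):
    """Population central moments m2, m3, m4 of a sample."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m3 = sum((v - mu) ** 3 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m2, m3, m4

def skewness(x):
    """Moment-based skewness: 0 for a symmetric sample."""
    m2, m3, _ = moments(x)
    return m3 / m2 ** 1.5

def kurtosis(x):
    """Pearson kurtosis (a normal distribution gives about 3.0)."""
    m2, _, m4 = moments(x)
    return m4 / m2 ** 2

def pct_diff(predicted_stat, measured_stat):
    """Percent difference used to screen candidate solutions."""
    return abs(predicted_stat - measured_stat) / abs(measured_stat) * 100.0
```

A candidate's "Skewness Difference (%)" column would then be `pct_diff(skewness(predicted), skewness(measured))`, and likewise for kurtosis.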
The final results in table 35 show a high p value for a paired t-test of measured versus predicted total rut depth on both the calibration and validation datasets (0.18 and 0.51, respectively). Therefore, there is not enough evidence to reject the null hypothesis that the bias is insignificant. The negative bias indicates that the calibrated model underpredicts rut depth on average by 0.24 mm, which is negligible compared with the 1-mm accuracy of manual rut depth measurements in the LTPP program.
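The paired t-test for bias can be sketched as follows. An exact p value requires the Student t distribution (e.g., scipy.stats.ttest_rel); this stdlib-only sketch substitutes a normal approximation, which is reasonable for sample sizes like the 128 calibration records used here. The input values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_t(measured, predicted):
    """Paired t statistic for the null hypothesis of zero mean bias,
    with a large-sample normal approximation for the two-sided p value."""
    d = [m - p for m, p in zip(measured, predicted)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_value

# Symmetric differences -> zero mean bias, so t = 0 and p = 1
t, p = paired_t([0.0, 2.0, 1.0, 4.0], [1.0, 1.0, 3.0, 2.0])
```

A p value above 0.05, as in table 35, means the bias cannot be distinguished from zero at the 95-percent confidence level.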
Based on the final selected solution, figure 34 and figure 35 show the measured versus predicted total rut depth on the Florida SPS1 calibration and validation datasets, respectively. Similar to the single-objective and two-objective calibration results, there is a significant amount of scatter in these plots, and the goodness-of-fit indicator (R^{2}) is poor. As explained before, this scatter reflects the lack of precision of the rutting model. However, the overall STE is lower than in the national global calibration results (0.078 inch < 0.129 inch).
Table 35. Four-objective calibration results of rutting models for new pavements on Florida SPS1 data.
Statistic  Calibration Data  Validation Data  Combined Data 

Data records  128  26  154 
SSE  539.53 mm^{2} (0.84 inch^{2})  71.13 mm^{2} (0.11 inch^{2})  610.66 mm^{2} (0.95 inch^{2}) 
RMSE  2.05 mm (0.081 inch)  1.65 mm (0.065 inch)  1.99 mm (0.078 inch) 
Bias  –0.24 mm (–0.009 inch)  –0.24 mm (–0.009 inch)  –0.24 mm (–0.009 inch) 
p value (paired ttest) for bias  0.18 > 0.05; insignificant bias  0.51 > 0.05; insignificant bias  0.125 > 0.05; insignificant bias 
STE  2.05 mm (0.081 inch)  1.67 mm (0.066 inch)  1.98 mm (0.078 inch) 
Generalization capability  N/A  N/A  100% 
R^{2} (goodness of fit)  0.0263  0.1478  0.0382 
N/A = not applicable. 
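The RMSE values in table 35 follow directly from the SSE values and record counts via RMSE = sqrt(SSE/n):

```python
import math

def rmse_from_sse(sse, n):
    """RMSE = sqrt(SSE / n) for n data records."""
    return math.sqrt(sse / n)

# Reproducing the RMSE column of table 35 (SSE in mm^2, RMSE in mm):
print(round(rmse_from_sse(539.53, 128), 2))  # calibration: 2.05
print(round(rmse_from_sse(71.13, 26), 2))    # validation: 1.65
print(round(rmse_from_sse(610.66, 154), 2))  # combined: 1.99
```

This is also why figure 31 plots RMSE rather than SSE: taking the square root of the per-record error puts the average-error objectives on the same scale as the STE objectives.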
In comparison to the single-objective calibration results, the four-objective calibration resulted in increased accuracy (lower bias and an even higher p value) but lower precision (a higher STE) of the final model on both the calibration and validation datasets. This is because the four-objective scenario included APT data in addition to the LTPP data, and the two are different in nature. The main improvement of the four-objective calibration over the single-objective calibration is the increased generalization capability of the calibrated model: the model calibrated using four-objective optimization had more similar bias values on the calibration and validation data than the single-objective results.
While the four-objective calibration results are not superior to the single-objective results in terms of precision, the simultaneous minimization of error on both LTPP and APT data resulted in higher accuracy and generalization capability of the calibrated model.
Source: FHWA.
Figure 34. Scatterplot. Measured versus predicted four-objective calibration results of rutting models for new pavements on the calibration dataset for Florida SPS1.
Source: FHWA.
Figure 35. Scatterplot. Measured versus predicted four-objective calibration results of rutting models for new pavements on the validation dataset for Florida SPS1.
To investigate the advantages of the novel calibration approach recommended in this study, the final single-objective and multiobjective calibrated models were compared through a comprehensive framework. This framework (figure 17) includes quantitative measures of accuracy, precision, and generalization capability. A model with high accuracy and precision does not necessarily have adequate generalization capability, meaning it may not reproduce the same accuracy when predicting performance on other, similar test sections. The qualitative measures employed include the goodness of fit and the similarity of the distribution shapes of measured and predicted performance. The quality of the solutions also needs to be examined in terms of the engineering reasonableness of the selected calibration factors. Finally, a combination of quantitative and qualitative criteria is used by comparing predicted performance trends. The capability of the final calibrated model to predict future performance trends is an essential feature for pavement design and therefore a significant indicator of the quality of the developed model.
Figure 36 shows a comparison of the quantitative criteria among the single-objective, two-objective, and four-objective calibrated models for rutting in new pavements on the combined (calibration and validation) SPS1 data.
Source: FHWA.
Figure 36. Bar chart. Comparison of the quantitative criteria for the calibrated rutting models on SPS1.
This figure shows that incorporating the APT data in the calibration of rutting models on LTPP SPS1 data led to the model with the lowest (and least significant) bias and the highest generalization capability. On the other hand, the two-objective minimization of SSE and STE resulted in the lowest STE (highest precision) but a significant bias (low accuracy).
Figure 37 shows a comparison of the quantitative criteria between the single-objective and two-objective calibrated models for rutting in overlaid pavements on the combined (calibration and validation) SPS5 data. No APT data were available for the calibration of the rutting models for overlaid pavements. Unlike the calibration on SPS1 data, the two-objective calibration on SPS5 data resulted in an increase in accuracy (a lower, though still significant, bias), an increase in precision (a lower STE), and an increase in generalization capability of the rutting model compared with the single-objective calibration.
Source: FHWA.
Figure 37. Bar chart. Comparison of the quantitative criteria for the calibrated rutting models on SPS5.
A model that exhibits the best quantitative results does not necessarily show the most desirable qualitative performance. Figure 38 shows a comparison of the qualitative criteria among the single-objective, two-objective, and four-objective calibrated models for rutting in new pavements on the combined (calibration and validation) SPS1 data. The two-objective results show the best goodness of fit, but, as explained before, the mechanistic rutting model lacks precision, and all of the calibration approaches resulted in significant scatter. The single-objective results show the smallest difference in skewness between the distributions of measured and predicted rutting, and the four-objective results show the smallest difference in kurtosis (flatness).
Source: FHWA.
Figure 38. Bar chart. Comparison of the qualitative criteria for the calibrated rutting models on SPS1.
Figure 39 shows a comparison of the qualitative criteria between the single-objective and two-objective calibrated models for rutting in overlaid pavements on the combined (calibration and validation) SPS5 data. The single-objective results show the best goodness of fit and the smallest skewness difference between the measured and predicted data. The two-objective results show the smallest difference in kurtosis (flatness) between the distributions of measured and predicted rutting.
Source: FHWA.
Figure 39. Bar chart. Comparison of the qualitative criteria for the calibrated rutting models on SPS5.
In all cases, using the multiobjective approach has resulted in predicted rutting distributions that are more similar in flatness to the measured rutting distributions.
Table 36 shows the final selected calibration factors. The calibration factors selected for rutting in the base and subgrade layers were insignificant in the single-objective calibration, whereas the multiobjective calibration produced more reasonable calibration factors for rutting in the unbound layers.
Table 36. Final selected calibration factors.
Model  β_{r1}  β_{r2}  β_{r3}  β_{GB}  β_{SG} 

New pavement (SPS1) global  1.0  1.0  1.0  1.0  1.0 
New pavement (SPS1) single-objective  0.522  1.0  1.0  0.011  0.171 
New pavement (SPS1) two-objective  0.536  0.795  1.165  0.010  0.087 
New pavement (SPS1) four-objective  3.840  0.955  0.563  0.138  0.786 
Overlaid pavement (SPS5) global  1.0  1.0  1.0  1.0  1.0 
Overlaid pavement (SPS5) single-objective  0.500  1.0  1.0  0.074  0.155 
Overlaid pavement (SPS5) two-objective  2.984  0.759  0.507  0.588  0.089 
To combine the quantitative and qualitative success metrics of a performance prediction model, the average absolute error (AAE) of the calibrated models in predicting the rate of change in pavement rutting was calculated. The rate of change in rut depth was estimated from the measured data by dividing the change in rutting by the elapsed time in months; only positive rates were considered in this investigation. Table 37 shows the AAE of the calibrated models in predicting the rate of change in pavement rutting. While the two-objective calibration on SPS5 data significantly improved the prediction of rutting deterioration rates compared with the single-objective calibration, the multiobjective calibration results on SPS1 do not exhibit the same quality. This investigation reveals that evaluating only the bias and STE is not adequate for a comprehensive evaluation of performance prediction models.
Table 37. AAE of calibrated models in predicting the rutting deterioration rates.
Model  AAE in Predicting the Rate of Change in Rutting (%) 

New pavement (SPS1) single-objective  51.93 
New pavement (SPS1) two-objective  60.53 
New pavement (SPS1) four-objective  79.47 
Overlaid pavement (SPS5) single-objective  100.81 
Overlaid pavement (SPS5) two-objective  66.26 
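The deterioration-rate AAE in table 37 can be computed in outline. The exact formula is not stated in the text, so this sketch assumes the AAE is the mean, over intervals with a positive measured rate, of the absolute relative error between the predicted and measured rates; the input series below are hypothetical.

```python
def rates(months, ruts):
    """Rate of change in rut depth (mm/month) between consecutive surveys."""
    return [(r2 - r1) / (t2 - t1)
            for (t1, r1), (t2, r2) in zip(zip(months, ruts),
                                          zip(months[1:], ruts[1:]))]

def aae_of_rates(months, measured, predicted):
    """AAE (%) of predicted versus measured deterioration rates,
    keeping only intervals where the measured rate is positive."""
    errors = [abs(pr - mr) / mr * 100.0
              for mr, pr in zip(rates(months, measured), rates(months, predicted))
              if mr > 0]
    return sum(errors) / len(errors)

# Hypothetical surveys at months 0, 10, and 20:
aae = aae_of_rates([0, 10, 20], [0.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

Filtering to positive measured rates mirrors the report's exclusion of intervals where measured rutting appears to decrease, which is physically implausible and typically reflects measurement noise.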
To visualize these deterioration trends, figure 40 and figure 41 show the measured and predicted rut depth trends on sample test sections of the Florida SPS1 and SPS5 sites, respectively. On SPS1 test section 120108, the two-objective and four-objective calibrations predicted deterioration trends closer to the monitored performance than the single-objective calibration did. Note that the predicted rutting data come from the simulated calculations (see chapter 4 for their description), which is why a decrease in rutting is observed in the two-objective predictions around the 100th month. The trends on SPS5 show that the two-objective calibration predicted deterioration trends that better match the measured values than the single-objective calibration results.
Source: FHWA.
Figure 40. Chart. Predicted and measured rutting deterioration on FL SPS1 section 120108.
Source: FHWA.
Figure 41. Chart. Predicted and measured rutting deterioration on FL SPS5 section 120509.