Long-Term Pavement Performance Compliance With Department of Transportation Information Dissemination Quality Guidelines
CHAPTER 3. LTPP DATA PROCESSING
LTPP's data processing operations started in 1988, at a time when modern database software was still evolving. In the development of the data quality approach to LTPP data, all of the features covered in the IDQG were addressed. The data QC process started with procedures to calibrate and check the functioning of field data collection equipment. Data forms and data collection procedures always received an independent review prior to use. Extensive data editing checks were developed to automate the process of identifying and correcting erroneous data. Methods were developed to identify and address missing data. A codified procedure was developed to address the issue of computed parameters containing estimates, projections, and imputations. The data analysis plan and analyses were subjected to scrutiny from an expert panel operating outside the program.
Data Editing and Coding
Efforts are made to reduce errors at the source, i.e., during data collection. An extensive set of methods has been developed to identify and, where possible, correct erroneous data. Data checks are made before and after data are entered into the database. A primary objective of the data checks made prior to entry is to prevent "bad" data from being entered into the database. Some of the data editing and coding methods used by the LTPP program include the following:
- Pre-data-entry processor programs developed by LTPP to prevent "bad" data from being entered into the database include the following (an illustrative sketch of the range and time-history checks follows this group of programs):
- AWSCHECK: This program is used for Automated Weather Stations (AWS) operated by the program at some test section sites. In addition to range and integrity checks, the program plots the climate data, allowing time-history consistency checks to be performed. This preprocessor program allows deletion of "bad" measurement data and adjustments to data fields, such as correcting time stamps for daylight saving time shifts. The output of this program is an input file for loading data into the LTPP database.
- SMPCHECK: The Seasonal Monitoring Program (SMP) includes instrumentation that measures air temperature, subsurface pavement temperature gradients, subsurface electrical resistivity (frost indicator), and subsurface dielectric constant (moisture indicator) on a subset of test sections. In addition to automated range and integrity checks on the data, time-history plots of these temporal data are used to identify inconsistencies. This preprocessor program allows deletion of "bad" data and time-based adjustments. The output of this program is an input file for loading into the LTPP database.
- FWDSCAN: This program scans electronic falling weight deflectometer (FWD) data files to identify violations of data collection rules, data file format integrity problems, and range violations.
- P46CHECK: This preprocessor program automates checks on the results of laboratory resilient modulus tests on unbound materials. In addition to all of the routine checks on data values, advanced statistically based checks are performed on details such as conformance of load-pulse shape and duration to protocol requirements.
- P07CHECK: While the primary function of this preprocessor program is to check the integrity of the results of resilient modulus measurements in indirect tension on asphalt concrete cores, it also uses a documented algorithm to calculate the test results from the raw measurement data; these calculated results are stored in the database.
- PROQUAL: Like P07CHECK, this program is both a preupload data check processor and a computed parameter generator. It processes, evaluates, and generates computed parameters for both longitudinal and transverse profile measurements. In the LTPP series of preupload programs, this software is unique in that it is the primary data-entry point for manually collected profile data. It automates detection of spatially based measurement anomalies and computes parameters such as the International Roughness Index (IRI). The output files from this program are used as input files to load data into the LTPP database.
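The following is a minimal sketch of the style of range and time-history consistency checks these preprocessors perform. The field names, range limits, and jump threshold are illustrative assumptions, not LTPP's actual parameters.

```python
from datetime import datetime

AIR_TEMP_RANGE_C = (-60.0, 60.0)  # assumed plausible-value limits
MAX_JUMP_C_PER_HOUR = 15.0        # assumed time-history consistency limit

def check_air_temp(records):
    """records: list of (timestamp, temperature_C) tuples sorted by time.
    Returns (timestamp, reason) flags for human review."""
    flags = []
    prev_time, prev_temp = None, None
    for stamp, temp in records:
        lo, hi = AIR_TEMP_RANGE_C
        if not lo <= temp <= hi:
            flags.append((stamp, f"range check failed: {temp} C"))
        if prev_time is not None:
            hours = (stamp - prev_time).total_seconds() / 3600.0
            if hours > 0 and abs(temp - prev_temp) / hours > MAX_JUMP_C_PER_HOUR:
                flags.append((stamp, "time-history jump exceeds limit"))
        prev_time, prev_temp = stamp, temp
    return flags

readings = [
    (datetime(2002, 7, 1, 0), 21.5),
    (datetime(2002, 7, 1, 1), 22.0),
    (datetime(2002, 7, 1, 2), 75.0),  # fails both checks
]
print(check_air_temp(readings))
```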
- Traffic database: Since traffic load data are an input to the pavement performance database, separate data storage and processing functions were developed for traffic load, classification, and volume data. Graphical, automated-range, and statistically based checks are employed to identify suspect, invalid, duplicate, and erroneous data. Since the bulk of traffic data for the LTPP program are supplied by participating highway agencies in the United States, Canada, and Puerto Rico, data identified as suspect are returned to the agency for review and comment. "Bad" data are purged from the system prior to generation of annual estimates.
- While calculated measures have been used to reduce the reliance on subjectivity in detecting data anomalies, there are still many errors that cannot be detected using automated methods. Given the complex data structures collected by the LTPP program, simple time-based, distance-based, or binned statistical distribution plots have in some cases proven more effective than automated checks at detecting data problems.
- To make automated data checks effective, LTPP had to develop a system of human review of records failing the checks to weed out false positives. For many automated checks, such as range checks, to be effective, a percentage of valid results must be flagged for further investigation. Another example is the check LTPP uses to determine whether the ratio of the standard deviation to the mean is less than 0.5. While valid data sets can fail this check, its purpose is to flag suspect data for further review (see the sketch below).
- The best way to avoid deletion of valid outliers that violate a range check is subjective human review. It is LTPP's policy not to delete a value merely because it is an outlier.
- Entry of duplicate data into the database is primarily controlled by judicious use of key fields in the relational database software, which restricts entry of duplicate data sets. The key fields are specified such that only one logical record can exist for a given measurement or data set. To detect duplicate data sets in which one of the key fields was changed (for example, the date of the measurement), data studies are performed using Structured Query Language (SQL) queries of the type sketched below.
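Below is a hedged sketch of two of the checks just described: the coefficient-of-variation flag and an SQL data study for duplicates in which a key field was changed. The 0.5 threshold follows the ratio noted above, but the table and column names are hypothetical, not the actual LTPP schema.

```python
import statistics

def flag_high_cv(values, limit=0.5):
    """Flag a data set whose standard-deviation-to-mean ratio is at or
    above the limit. Flagged sets are routed to human review, not deleted."""
    mean = statistics.fmean(values)
    if mean == 0:
        return True  # ratio undefined; route to review
    return statistics.stdev(values) / abs(mean) >= limit

# Data study for duplicates in which the date key was changed: group on the
# remaining keys and the measured value, and look for repeats.
DUPLICATE_STUDY_SQL = """
SELECT state_code, section_id, measured_value, COUNT(*) AS n
FROM monitoring_table
GROUP BY state_code, section_id, measured_value
HAVING COUNT(*) > 1;
"""
```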
- The size of LTPP's database makes it impractical and cost-prohibitive to track and report all data edits. Tracking changes in the more than 7,000 data fields and more than 125 million records would be of little value to the data user. Some of the measures used by LTPP to address changes to measured data and data released to the public include:
- If corrections are required to a raw electronic data file, when possible, the corrections are made in the file prior to upload into the database.
- If corrections are required to data submitted on a paper data collection form, the corrected values are written onto the form with the previous values crossed out in a fashion that makes them still legible.
- Data releases are numbered and dated.
- To track important changes between data releases, a public data feedback report process was implemented. Problems identified in the data and their resolutions are posted on the LTPP Web page (https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/).
- The size of the LTPP database also makes it impractical to comment on every piece of missing information. The complication is related to tables containing multiple data attributes, some of which may not be applicable in a specific situation. On average, each table contains 17 fields, and some tables contain up to 256 data fields. The missing data approach taken by LTPP includes the following:
- For data collected via paper form, nonapplicable data fields are identified on the form using a not-applicable code. For many fields, a nonapplicable code is also used in the database.
- For electronically measured data, the use of null values is rigidly enforced. The objective is to differentiate between a zero and a null, where null represents a value that is not present, was not measured, or was removed. Modern database tools no longer translate null values into zeros when performing mathematical functions on a data set.
- For data that are measured electronically, recorded manually on a data form, and manually input into the database, a variety of data checks are used to detect missing data.
- Automated required-data checks in the database identify data that should be present but are not.
- For groups of related data, referential data integrity checks are coded directly into the database software, and external relational data checks are also used.
- Data transcription error checks are made to identify improper data entries.
- Data checks are used to identify widows and orphans. A widow occurs when a parent record exists but the related child record(s) containing some part of the data do not; an orphan occurs when a child record exists but the parent table no longer contains a matching record. These types of checks are necessary when using relational databases to store a linked data set in multiple tables (see the sketch following this list).
- There is an extensive set of comment fields included in the database where a controlled vocabulary is enforced using codes. A code table that describes all of the code fields is included in all data releases. Since some comment fields do not lend themselves to codes and a controlled vocabulary, these fields are regularly checked for spelling, grammar, and consistency prior to each data release.
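The widow/orphan and null-handling checks described above can be illustrated with a small self-contained sketch. The parent/child tables and field names below are hypothetical, not the actual LTPP schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (section_id TEXT PRIMARY KEY);
CREATE TABLE child  (section_id TEXT, reading REAL);
INSERT INTO parent VALUES ('A1'), ('B2');            -- B2 has no child: widow
INSERT INTO child  VALUES ('A1', 3.5), ('C3', NULL); -- C3 has no parent: orphan
""")

# Widows: parent records with no matching child records.
widows = conn.execute("""
    SELECT p.section_id FROM parent p
    LEFT JOIN child c ON c.section_id = p.section_id
    WHERE c.section_id IS NULL;
""").fetchall()

# Orphans: child records whose parent record no longer exists.
orphans = conn.execute("""
    SELECT c.section_id FROM child c
    LEFT JOIN parent p ON p.section_id = c.section_id
    WHERE p.section_id IS NULL;
""").fetchall()

# Null semantics: AVG ignores NULLs rather than treating them as zeros.
avg = conn.execute("SELECT AVG(reading) FROM child;").fetchone()[0]
print(widows, orphans, avg)  # [('B2',)] [('C3',)] 3.5
```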
- LTPP has produced extensive documentation on its data editing and processing procedures. Distributed with each standard data release is a Reference Library disk which contains copies of all important documents on LTPP data editing and processing procedures. Some of the measures used by LTPP to enhance the transparency and understanding of its data include the following:
- Inclusion of a database user reference guide with each data release. The user reference guide contains a listing of all operational documents for experimental designs, data collection methods, data checks, QC, data editing procedures prior to upload into the database, and missing data identification.
- A separate document is contained in the Reference Library that contains all of the data checks performed on the data after upload into the database. This document currently exceeds 700 pages.
- A data quality flag indicator is included in every record in the database containing "data." At this time, a single flag field is used to indicate a relative level of data completeness, data range, and relational data integrity. A comment table explains the actions taken on records failing the automated checks. This explanatory table is no longer distributed with the data because, with the passage of time and changes in the data checks, some of the comments are no longer applicable; due to funding cuts, a review of these comments has not been possible.
- A plan was devised to separate automated data quality flag fields into three distinct flags on the data quality attributes of completeness, logical range value, and data structure integrity. These flags would record each failure of a data check in a database table and indicate to a data user the action taken. Due to program budget cuts, this enhancement to the data quality system has not been implemented.
Handling Missing Data, Estimates, and Projections
The approach to handling missing data, production of estimates, and projections for the LTPP program is similar enough to be classified under one topic.
- A "truth in data" concept was developed in the early days of LTPP, since it was intended that users of the data would perform direct manipulations of the data. The truth in data concept requires that measured values be separated from imputed or estimated values. Information on the statistical nature and basis of values obtained from samples is stored in the database.
- Traffic volumes and loads over a test section are the only data for which LTPP provides a data user with cumulative annual estimates computed from a monitoring data sample. While LTPP has developed various statistically based traffic data sampling schemes based on analysis of "real life" traffic monitoring data, in many cases the best that participating highway agencies could do was provide LTPP with unstructured sample data. Thus, LTPP had to develop a wide range of estimation methods, including basic time-expansion algorithms that use day-of-week and month-of-year weighting factors to arrive at the best annual estimate from the data provided (a sketch of this type of expansion follows this list). The database contains information on the size of the sample used in the estimate, which indirectly indicates the amount of missing data.
- Although some attempts have been made to impute, estimate, forecast, backcast, or otherwise compute missing data through various data analysis studies, to date, none of these data have been added to the database. If these data are added, they will follow LTPP's policy on computed parameters, which requires that the algorithms and procedures used be documented and available from the LTPP Web site (https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/).
- Since most of the applications of LTPP data involve development, evaluation, calibration, or validation of complex models, missing critical data generally result in exclusion of a test section from the analysis. Thus, missing data rates and weights are not used in the bulk of LTPP analysis projects.
- Due to the diversity of uses of LTPP data in engineering pavement performance research, as well as the dynamic changing nature of the database, data analysts and users are charged with responsibility for their use and interpretation of the database relative to the data study design, imputations, statistical methodology, etc.
- Analyses of LTPP data have been used by universities as part of their engineering curricula. The motto for use of LTPP in the classroom is, "LTPP...an endless source of unique problems." In this context, problems are questions assigned to students to solve as part of the curriculum.
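As a rough illustration of the time-expansion estimation mentioned above, the sketch below expands an unstructured daily truck-count sample to an annual estimate using day-of-week and month-of-year cells. The cell structure, fallback rule, and counts are illustrative assumptions, not LTPP's actual algorithms.

```python
import calendar
from datetime import date
from statistics import fmean

def annual_estimate(daily_counts, year):
    """daily_counts: {date: truck_count} for the days actually monitored.
    Average within month/day-of-week cells, then sum an estimated count
    for every day of the year."""
    cells = {}
    for day, count in daily_counts.items():
        cells.setdefault((day.month, day.weekday()), []).append(count)
    cell_means = {key: fmean(v) for key, v in cells.items()}
    overall = fmean(daily_counts.values())  # fallback for unmonitored cells
    total = 0.0
    for month in range(1, 13):
        for dom in range(1, calendar.monthrange(year, month)[1] + 1):
            weekday = calendar.weekday(year, month, dom)
            total += cell_means.get((month, weekday), overall)
    return total

sample = {date(2002, 3, 4): 812, date(2002, 3, 5): 798, date(2002, 3, 6): 841}
print(round(annual_estimate(sample, 2002)))  # annual truck-count estimate
```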
Production of Estimates and Projections
Due to the research nature of the LTPP program, there is a division between the LTPP database and LTPP data analysis results. The discussion of LTPP's actions relative to this part of the IDQG focuses on database contents. Production of estimates and projections from LTPP-sponsored analysis of the data is discussed in the next portion of this document.
The LTPP database contains a vast array of derived data to enhance the data set and reduce user and data supplier burden. Since a primary purpose of the database is to provide raw data to researchers and analysts, only a limited number of estimates are contained in the database. Virtually no projections are contained in the database. These quantities are contained in published analysis reports distributed independently of the database.
Examples of derived data contained in the database include the following:
- Derived climate data are provided as virtual weather stations using weighting factors based on the distance from the test section site to each weather station. A gravity model is used in which the weighting factors are inversely proportional to the square of the distance (see the sketch following these climate items).
- Derived data are computed from climate data obtained from the National Climatic Data Center (NCDC) and the Canadian Climate Center (CCC) after passing LTPP data quality checks. Derived climate data in the LTPP database include the following:
- Monthly and annual mean, maximum, minimum, and standard deviation of air temperature. The number of days of data included in the time period is also reported.
- Monthly and annual freeze index, freeze-thaw index, days above 32 °C, and days below 0 °C.
- Derived data similar to those computed from climate data obtained from other sources are also computed for weather stations operated by LTPP.
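A minimal sketch of the gravity-model weighting used to create virtual weather stations follows. The inverse-square weighting matches the description above; the station distances and values are invented for illustration.

```python
def virtual_station_value(stations):
    """stations: list of (distance_km, observed_value) pairs for the weather
    stations near a test section. Weights are inversely proportional to the
    square of the distance (a gravity model)."""
    weights = [1.0 / d**2 for d, _ in stations]  # assumes nonzero distances
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, stations)) / total

# Example: three stations at 10, 25, and 40 km reporting mean temperatures.
print(virtual_station_value([(10.0, 21.0), (25.0, 19.5), (40.0, 18.0)]))
```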
- The LTPP database contains estimates of annual traffic loading statistics from traffic monitoring data supplied by participating highway agencies. A separate traffic database was established that contains the raw and processed data used to produce the annual estimates from monitored traffic data. The objective of this separation was to provide the derived traffic estimates most commonly used by pavement researchers while still maintaining a comprehensive traffic data resource that can be used by other researchers. Examples of derived estimates of traffic data contained in the LTPP database include the following:
- Axle weight distributions by vehicle class and axle configuration.
- Annual volume estimates by axle class.
- Weighting factors used to expand volume and weight measurements to annual estimates.
- LTPP measures the longitudinal profile of its test sections. Along with the raw profile measurements, ride statistics such as the International Roughness Index (IRI) are computed from the profile measurements and stored in the database.
- LTPP measures test section transverse profiles at varying distance intervals. Information such as rut depth, rut location, rut width, and other transverse profile distortion indices are computed from the raw profile measurements (a simplified sketch follows).
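As a simplified illustration of deriving rut depth from a transverse profile, the sketch below applies a straightedge analogy: the rut depth is taken as the largest gap between the profile and a line laid across two bounding high points. This is a common simplified approach and is not necessarily the algorithm LTPP uses; the profile values are invented.

```python
def rut_depth(offsets, elevations):
    """offsets/elevations: transverse positions (m) and surface elevations
    (mm) across one lane. Returns the maximum depth under a straightedge
    laid between every pair of profile points."""
    n = len(offsets)
    deepest = 0.0
    for i in range(n):
        for j in range(i + 2, n):        # straightedge from point i to j
            for k in range(i + 1, j):    # interior points under the edge
                t = (offsets[k] - offsets[i]) / (offsets[j] - offsets[i])
                edge = elevations[i] + t * (elevations[j] - elevations[i])
                deepest = max(deepest, edge - elevations[k])
    return deepest

# A 3.6-m lane sampled every 0.3 m with a ~6-mm rut near one wheelpath.
x = [i * 0.3 for i in range(13)]
z = [0, 0, -2, -5, -6, -3, 0, 0, -2, -4, -3, -1, 0]
print(rut_depth(x, z))  # 6.0 mm
```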
- The most important part of the measured material properties database module is the data derived from laboratory measurements. Basic engineering properties computed from measurements of load, displacement, weight, volume, and time are provided in terms such as stress, strain, elastic modulus, resilient modulus, creep compliance, thermal coefficient of expansion, specific gravity, air voids, density, moisture content, etc.
- Most of the estimates of standard error contained in the database are based on simple descriptive statistics, i.e., the standard deviation derived from repeat measurements (see the sketch following this list). While research studies have been performed to investigate higher order levels of error and uncertainty, these data have not been included in the database.
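The sketch below illustrates, under assumed specimen geometry and invented readings, how a basic engineering property (resilient modulus) is derived from raw load and strain measurements and how a simple standard error is computed from repeats, in the spirit of the two items above.

```python
import math
from statistics import fmean, stdev

def resilient_modulus(load_N, recoverable_strain, area_m2):
    """Resilient modulus = repeated deviator stress / recoverable strain."""
    return (load_N / area_m2) / recoverable_strain  # Pa

area = math.pi * (0.075 / 2) ** 2  # assumed 75-mm-diameter specimen
repeats = [(1200.0, 1.05e-4), (1200.0, 9.8e-5), (1200.0, 1.01e-4)]
moduli = [resilient_modulus(p, eps, area) for p, eps in repeats]

mean_mr = fmean(moduli)
se = stdev(moduli) / math.sqrt(len(moduli))  # standard error of the mean
print(f"Mr = {mean_mr / 1e6:.0f} MPa +/- {se / 1e6:.0f} MPa (SE)")
```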
Data Analysis and Interpretation
A multiproject approach by topic area is used for analysis of LTPP data. To date, approximately 55 analysis projects have been performed under LTPP management and 21 LTPP analysis projects under the auspices and funding of the NCHRP. Many other NCHRP- and State-sponsored research projects have also used LTPP data. With this volume of research, it is easy to understand that no single approach is used for LTPP data analysis, and different approaches have been used for analysis of the same set of data.
The bulk of LTPP-sponsored analyses are performed under contracts with consultants, experts, academicians, and university researchers. Thus, a formal contractual approach to analysis is used. Topics are selected from the LTPP analysis plan; formal statements of work are developed; requests for proposals are issued; proposals are evaluated; a contractor is selected; on some projects, a project panel is used to review work in progress; and all results are reviewed prior to publication.
Highlights of LTPP conformance to the IDQG relative to data analysis and interpretation are as follows:
- The LTPP program was designed to serve a broad range of pavement management needs that cut across traditional engineering disciplines. Because the objectives contained in the initial project plans were written on a topical basis, LTPP was required to develop detailed plans for specific analysis topics covering a variety of interrelated subjects. The LTPP data analysis plan was developed with input from program staff, highway agency personnel, industrial stakeholders, and academicians using an outreach process. The process was based upon solicitation of candidate research needs statements, combination of statements into projects, review and assessment of projects relative to data availability, and classification of projects into a unified plan. The unified plan was developed in concert with the LTPP Expert Task Group (ETG) on Data Analysis. The analysis plan is periodically updated using the ETG peer review process. In addition to deriving new knowledge from higher order analytical investigations, the plan includes exploratory analyses, data studies, and development of derived data for input into the database. The analysis plan is publicly available on the LTPP Web site (https://www.fhwa.dot.gov/research/tfhrc/programs/infrastructure/pavements/ltpp/).
- A variety of statistical approaches have been used in the various LTPP-sponsored data analyses. All of the details contained in the IDQG have been addressed in one or more of the analyses. The degree to which statistical assumptions are tested, deviations are examined, and statistical sensitivities are evaluated depends on the nature of the analysis effort. LTPP relies on a formal peer review process consisting of a panel of statistical experts.
- The issue of replication in a field study of pavement test sections is an ongoing source of debate within the LTPP program. It is the nature of field pavement performance studies that variance and errors from uncontrolled and, in some instances, unmeasured covariates can overshadow the significance of the main effects of experimental design constructs and make significant higher order interactions difficult to detect. In some cases, Bayesian modeling approaches have been used to deal with these issues.
- In many of the LTPP data analysis projects, modeling approaches are used to include related variables when the relationship between two or more primary variables is being assessed.
- The wording of results contained in LTPP-sponsored data analysis documents is peer reviewed by an expert panel before dissemination. One of the concepts used in the LTPP program concerning evaluation of statistical significance tests is the engineering or physical significance of a difference or similarity. For example, an analysis of variance may show a highly significant effect due to a very low error term even when the physical reality of the difference has no impact in engineering terms. The opposite is also true: due to a very large variance in a data set, items with a significant physical or engineering difference can be found to have no statistical significance. This is why LTPP has adopted the use of both statistical and physical tests of inference as indicators of the possible significance of an effect (see the sketch at the end of this list).
- LTPP data are not always based on 100 percent samples. However, there are instances when external data sources on commonly expected variability are needed to assess confidence intervals. One example is the assessment of variability in material test results from Specific Pavement Studies (SPS) projects, where multiple test sections are located at the same site. Variance estimates from external industrial sources were used to assess the robustness of the variability in the material test results at these sites.
- Since higher order analysis of LTPP data is basically a time series problem, stability of interim findings is addressed by requiring that the time stamp of the data set used for the analysis be documented, that new analysis projects document findings from previous efforts, and that recommendations be included regarding data collection changes that would improve future analysis of the topic.
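The following sketch illustrates the pairing of statistical and engineering (physical) tests of inference described above. The IRI values, the 0.2 m/km engineering threshold, and the simple t-like statistic are invented for illustration; an actual analysis would use a formal t-test.

```python
import math
from statistics import fmean, stdev

ENGINEERING_THRESHOLD = 0.2  # assumed m/km IRI difference deemed meaningful

def compare_sections(a, b, alpha=2.0):
    """Compare two sets of IRI measurements. 'Statistically significant'
    here uses a simple t-like statistic (|t| > alpha) for illustration."""
    diff = fmean(a) - fmean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    statistically = abs(diff / se) > alpha
    physically = abs(diff) >= ENGINEERING_THRESHOLD
    return diff, statistically, physically

# A tiny error term makes a 0.05 m/km difference statistically significant,
# yet it falls below the assumed engineering threshold for ride quality.
print(compare_sections([1.20, 1.21, 1.19, 1.20], [1.25, 1.26, 1.24, 1.25]))
```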