State of the Practice on Data Access, Sharing, and Integration
CHAPTER 5. BUSINESS RULES FOR INTEGRATING AND SHARING
The challenges to accomplishing successful data integration are plentiful but generally fall into two categories—technical and institutional. The first key dimension centers on the technical challenges associated with data systems, including development and maintenance of hardware and software and the specifications for data collection, analysis, and archiving. These are discussed in chapter 3.
Institutional challenges may include centralized policymaking and decentralized execution of those policies; limited appreciation by decisionmakers of the role of data systems in supporting business operations; and lack of formal policies and standards to guide the collection, processing, and use of data within the organization. This chapter describes best practices that may be applied to the development and management of the VDA Framework.
Sources for this chapter include recent documents from the USDOT Real-Time Data Capture and Management State of the Practice Innovations Scan project; NCHRP Report 666, Target-Setting Methods and Data Management to Support Performance-Based Resource Allocation by Transportation Agencies, which addresses the importance of the data management and data governance function within the State transportation departments; NCHRP Report 754, Improving Management of Transportation Information; the USDOT Roadway Transportation Data Business Plan; and private industry.(72,66,73)
DATA MANAGEMENT
Real-Time Data Capture and Management State of the Practice Assessment and Innovations Scan
The Real-Time Data Capture and Management State of the Practice Assessment and Innovations Scan addresses issues related to data capture, data management, archiving, and sharing collected data to encourage collaboration, research, and operational development and improvement.(74) The scan covers five industries: aviation, freight logistics, Internet search engines, rail transit systems, and transportation management systems.
The report includes the following recommendations identifying data management practices and considerations from the five aforementioned industries that are pertinent to the Saxton Data Sharing Framework (pages 2–4):(74)
- Manage the amount of data traffic: Too much or too little information can inhibit the ability to notice critical messages.
- Take a gradual approach: Start addressing the most critical needs first, making sure core competencies are developed as an initial step rather than attempting to implement everything at the same time.
- Devise a design that can be sized: A system should perform well in the real world rather than only in the context of a test setup. An effective way to achieve this is to bring in technologies such as virtual warehousing and servers, clustered databases, and others.
- Control which data should be saved and which should not: For instance, if the purpose of the data collection is to analyze the performance of a particular intersection, there is no need to keep data about vehicles once they have passed through such an intersection.
- Give thoughtful consideration to letting a third party manage your data storage needs: While the cost of storing all traffic data may be high for a public-sector entity, it may be profitable for a specialized third party. The pros and cons of using a third party are as follows:
- Pros: It may significantly reduce the costs of data storage and make the data constantly available through a fast, user-friendly interface.
- Cons: Control of the organization’s data and its protection will be outside of the hands of the organization, including addressing the potential risk of hacking. If it is cloud-based, data access will depend on Internet access. Lastly, the organization’s data might be mined by others for usage patterns.
- Make certain there is a clear understanding regarding data ownership and usage rights from the onset: Specifically, address the following issues:
- Determine the responsible party or parties who pay for the collection, storage, and dissemination of the data. Determine who can sell the data and whether it is free for anyone to use if a public agency pays for and owns the data.
- Address privacy concerns. Identifying and tracking vehicles using various types of technology have already faced resistance in the public arena.
- Determine who can have access to the data and to what extent. Would, for instance, a subset of the data be enough for the user to accomplish his or her task? Having a third party manage the data could help address privacy concerns, because it could be in charge of collecting, aggregating, and anonymizing the data to guarantee that the data user does not infringe upon the privacy of citizens.
- Determine whether the data can be used for purposes other than those delineated in the original justification for collecting the data. An important privacy concern is that observed data could be used for law enforcement rather than simply for monitoring mobility. For instance, GPS data could reveal which vehicles exceed the speed limit or which trucks have been on the road without a break for the driver.
- Aim for real time: Even if further processing is needed, providing a real-time, or near-real-time feed of collected data is an appealing feature for current and potential users of the system.
- Optimize your process flow through a set of filtering mechanisms: If certain algorithms are crucial to generate a desired output but have the risk of being overloaded with too much data, consider having intermediate algorithms that aggregate or prioritize the data flow into the critical system. Thus, as soon as the critical system begins to be overloaded, these intermediate algorithms can help reduce its load.
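The filtering recommendation above can be illustrated with a minimal sketch. It assumes a hypothetical feed of (segment, speed) readings and a hypothetical capacity limit, max_items, for the critical system; neither comes from the scan. When the batch is small the readings pass through unchanged, and when it exceeds the limit an intermediate step collapses them to one averaged reading per segment.

```python
from collections import defaultdict
from statistics import mean


def aggregate_speeds(readings, max_items):
    """Pass readings through when the load is small; otherwise collapse
    them to one averaged reading per road segment (hypothetical rule)."""
    if len(readings) <= max_items:
        return list(readings)
    by_segment = defaultdict(list)
    for segment_id, speed in readings:
        by_segment[segment_id].append(speed)
    # Sort by segment so the reduced batch has a stable order.
    return [(segment_id, mean(speeds))
            for segment_id, speeds in sorted(by_segment.items())]


# Illustrative data only: segment labels and speeds are made up.
readings = [("A", 50.0), ("A", 54.0), ("B", 30.0), ("B", 34.0), ("B", 32.0)]
light = aggregate_speeds(readings[:2], max_items=3)  # under the limit
heavy = aggregate_speeds(readings, max_items=3)      # over the limit
```

Averaging is only one possible reduction; a prioritizing filter that forwards the most urgent messages first would fit the same position in the flow.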
The scan documented the following best practices for access, security, and privacy (page 5):(74)
- Generally, the holder of the data controls access to them. Within the transportation and logistics community, this access is carefully controlled.
- There are systems in place that ensure that data can be accessed only by the intended people and only to the degree that they need them. The type of data used by the transportation and logistics industry makes them extremely sensitive, with disastrous consequences for business if accessed by persons with malicious intentions.
- Usually, data access is password-protected, and the following is true:
- Because data generated within the logistics systems are often financial, strong encryption is placed on such data when they are sent.
- However, several applications can retrieve aircraft and vessel tracking data, often with other identifying information. The security clearance or password protection to access data through these applications is often minimal.
- The protection of data sources is extremely important. In the search engine industry, it is so heavily protected that there is not even disclosure of how exactly it is protected.
The scan documented the following best practices for data storage and backup (page 6):(74)
- Frequent backups and off-site storage are typical.
- Preventative maintenance should be performed regularly.
- Careful consideration should be devoted to determining how much and for how long data should be stored. In aviation, for instance, data are kept for a relatively short timeframe because the need is for real-time rather than historical information. At the same time, data can be available for revision if there is an incident to investigate.
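The retention consideration above (keep data for a limited time, but leave records available when an incident must be investigated) could be sketched as follows. The record layout, the two-day window, and the flagged_ids set are assumptions made for illustration, not values from the scan.

```python
from datetime import datetime, timedelta


def apply_retention(records, now, keep_for, flagged_ids=frozenset()):
    """Keep records inside the retention window, plus any record that
    has been flagged for incident review (hypothetical rule)."""
    cutoff = now - keep_for
    return [r for r in records
            if r["time"] >= cutoff or r["id"] in flagged_ids]


# Illustrative records only.
now = datetime(2024, 1, 10, 12, 0)
records = [
    {"id": 1, "time": datetime(2024, 1, 1, 8, 0)},  # old, not flagged
    {"id": 2, "time": datetime(2024, 1, 9, 8, 0)},  # inside the window
    {"id": 3, "time": datetime(2024, 1, 2, 8, 0)},  # old, but flagged
]
kept = apply_retention(records, now, keep_for=timedelta(days=2),
                       flagged_ids={3})
kept_ids = [r["id"] for r in kept]
```

In practice the window length would be set per dataset by the data owner rather than hard-coded.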
The scan documented the following best practices for operations and maintenance (page 6):(74)
- Deployment should be started on a reasonable scale, such as implementing in a small geographical area or using easily manageable data.
- Multiple servers should be used to distribute real-time loads. Several technologies enable this load distribution.
- It is important to give thoughtful consideration to determining the needed resolution or granularity of the data. This may vary depending on the context and use of the data. Specific examples include the following:
- In the logistics and retail industries, inventory data are refreshed every minute in several stores. Not only is this used to support restocking but also to monitor trends.
- In the search engine industry, data generally go through a 24-h refresh cycle, staying fixed between cycles.
- In the aviation field, data are mostly retrieved as fast as possible to enable incident prevention.
- It is necessary to determine what is critical to communicate and what is not. For instance, railroad and airline alert systems only collect the necessary data that can alert an operator of a particular problem.
The scan documented the following best practice for critical failures (page 7):(74)
- A common issue is that correcting a problem is often dependent on a single person, meaning its solution depends on the given person’s availability. It is therefore important to have staff available around the clock to solve potentially catastrophic failures. The higher labor cost is a necessary expense if the system needs to be highly available at all times.
Applicability to VDA Framework
All of the data management practices and considerations described above are applicable to the VDA Framework.
Oak Ridge National Laboratory: Best Practices for Preparing Environmental Datasets to Share and Archive
The Oak Ridge report discusses best data management practices that data collectors and providers should follow to improve the usability of their datasets.(75) The report focuses on preparing data for sharing, preservation, and archiving. It identifies the following seven best practices for preparing environmental datasets to share:
- Define the contents of data files: Before making data accessible, it is important to make the data fully understandable by specifying units of measurement, definitions of codes or acronyms, and other descriptors.
- Use consistent data organization: Whether the data are provided in a matrix format or not, it is vital that there is consistency in the way all data are provided.
- Use consistent file structure and stable file formats for tabular and image data: Data collectors and/or disseminators should use a format that can be read in the future, regardless of any change of data usage or application.
- Assign descriptive file names: As a rule of thumb, file names should be reflective of their contents and be able to uniquely identify the file.
- Perform basic quality assurance: Before sharing the data, it is a good idea to conduct basic and scientific quality assurance of the data.
- Assign descriptive dataset titles: Titles of datasets should be as descriptive as possible, seeking to make them available for future use by users who may not have any familiarity with their context.
- Provide documentation: Like other best practices, providing user-friendly documentation is crucial to ensure that future users can access, understand, and use the data.
Each of these practices should be included in any comprehensive data management program.
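Two of these practices, assigning descriptive file names and defining the contents of data files, lend themselves to a short sketch. The naming pattern (project, site, variable, date range, version), the example values, and the three-column data dictionary are assumptions for illustration; the Oak Ridge report does not prescribe this exact layout.

```python
import csv
import io


def descriptive_file_name(project, site, variable, start, end, version):
    """Build a file name that reflects the contents and identifies the
    file uniquely (hypothetical naming pattern)."""
    parts = [project, site, variable, f"{start}-{end}", f"v{version}"]
    return "_".join(p.lower().replace(" ", "-") for p in parts) + ".csv"


def data_dictionary_csv(columns):
    """Render a data dictionary: one row per column with its unit and a
    plain-language description."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["column", "unit", "description"])
    for name, (unit, description) in columns.items():
        writer.writerow([name, unit, description])
    return buffer.getvalue()


# Illustrative values only.
name = descriptive_file_name("VDA Pilot", "I-95 NB", "speed",
                             "20240101", "20240131", 1)
dictionary = data_dictionary_csv(
    {"speed_mph": ("mi/h", "Mean vehicle speed over the interval")}
)
```

Shipping the dictionary alongside the data file lets a future user interpret units and codes without contacting the original collector.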
Applicability to VDA Framework
Each of the seven best practices listed in the previous section (or some form of several of these) could be included in a data catalog for the VDA Framework.
Massachusetts Institute of Technology Libraries
The following Data Management Checklist was designed as a data planning checklist by Massachusetts Institute of Technology (MIT) Libraries for data used in research projects. The checklist is part of a guide, Data Management and Publishing, available from MIT Libraries.(76)
The guide provides the following examples of the types of questions that should be addressed in developing the VDA Framework:(76)
- What types of data will be generated? Can they be reproduced?
- What is the target audience of the data both now and in the future?
- How long should the data be stored?
- What is needed to generate, analyze, and visualize the data?
- Are there security or privacy procedures one should follow?
- Are there sharing requirements to which one is bound?
- Has the appropriate documentation been provided for its usage?
- What naming convention will be used for directories and files?
- What file formats are needed? Will they remain easily accessible in the future?
- What strategies are put in place for storage and backup?
- Are there standards for sharing or integrating the data?
- Is there someone in place to be in charge of managing the data?
Applicability to VDA Framework
The questions listed in the previous section will help in developing a comprehensive, well-designed data management plan for the VDA Framework by documenting the following important components of a successful data management project:
- Policies and procedures for sharing data from the roadway travel mobility data programs.
- Roles/responsibilities of data collectors, data managers (data business owners, data stewards, and data custodians), and data users (data stakeholders and communities of interest).
- Data standards for collection and reporting of roadway travel mobility data programs.
- Technology (hardware/software) needed to sustain the VDA Framework.
Policy Analysis and Recommendations for the Data Capture and Management Program: Implementation of Open Data Policies and System Policies for the Research Data Exchange and Data Environments (77)
This report argues that development of the Internet has resulted in a significant global trend to adopt open data and open source policies. Several governmental bodies and not-for-profit organizations around the world are developing initiatives to harness the benefits of this trend. The document provides recommendations concerning emerging open data policies within a connected vehicle research program. It argues that within the transportation industry, an open data policy allows a transformation of the state of the practice by supporting the reuse of data in a collaborative and dynamic framework. Some of the main benefits are the following:
- Provides greater access to information from public-sector systems.
- Increases data sharing among organizations.
- Enables the emergence of high-fidelity real-time data sources that spur novel applications and greater efficiency.
To fulfill the promise of an open data policy, the data must be readily accessible and cost effective to obtain, while security, privacy, liability, and quality concerns are addressed at the same time.
Beyond its open data policy recommendations, this report also outlines the following Research Data Exchange (RDE) system policies:
- Governance needs and options: RDE governance acts at the following two levels:
- Program-level governance should define the roles of the RDE stakeholders. At the same time this stakeholder group will establish policies and guidelines for RDE operations and the needed resources associated with them.
- RDE-level governance deals with user satisfaction, system performance, and risk mitigation. Its functions are concerned with implementation, management, and monitoring RDE operations. A cloud-based framework can leave privacy, security, and other risk mitigation to a managing third party while requiring the governmental body to maintain responsibility.
- Access policy options: These may range from very restrictive—such as allowing access through a secure vetting and authentication process each time a user attempts to access the portal—to fairly open to anyone. The following considerations can help in determining the appropriate level of access:
- What information is needed from users to grant them access? How can the organization mitigate the risks posed under those settings?
- Are there different levels of access for different types of users?
- How will the RDE store and control the user data input?
- Data management policy options: The governance team should strive to ensure that the appropriate legal language and elements are part of a standard data usage agreement. This should be done in close coordination with the ITS legal team. Regarding the storage and archiving of data, this document recommends limiting storage within the RDE, given the sheer magnitude of the expected data. At the same time, however, each dataset is unique, and the program-level governance team and its legal advisers should collaborate to set the specific policies for each dataset.
- System policy options: This topic addresses the codes of conduct, their enforcement, accessibility and language, system availability and recovery, and policies on upgrades and maintenance. In the case of the RDE, for instance, the fact that it is a Federal project will require it to provide RDE datasets and supporting materials in a way that complies with the Americans with Disabilities Act.(78)
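The access policy options above raise the question of different levels of access for different types of users. A minimal sketch of a tiered check follows; the three tier names and the dataset names are hypothetical and are not taken from the RDE report.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Hypothetical tiers, ordered from least to most trusted."""
    PUBLIC = 0
    REGISTERED = 1
    VETTED = 2


# Hypothetical mapping from dataset to the minimum tier required.
DATASET_REQUIREMENTS = {
    "aggregated_travel_times": AccessLevel.PUBLIC,
    "raw_probe_traces": AccessLevel.VETTED,
}


def can_access(user_level, dataset):
    """Allow access only when the user's tier meets the dataset's
    requirement; unknown datasets default to the strictest tier."""
    required = DATASET_REQUIREMENTS.get(dataset, AccessLevel.VETTED)
    return user_level >= required


open_ok = can_access(AccessLevel.REGISTERED, "aggregated_travel_times")
raw_ok = can_access(AccessLevel.REGISTERED, "raw_probe_traces")
unknown_ok = can_access(AccessLevel.PUBLIC, "unlisted_dataset")
```

Defaulting unknown datasets to the most restrictive tier is a design choice in this sketch: a dataset added without a policy decision stays closed until someone classifies it.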
Applicability to VDA Framework
The following conclusions are worth noting.
The VDA Framework should be implemented based on an open data policy. An open data policy is a viable option and is encouraged by the U.S. Government in general and is emerging as a trend with other governments around the Nation and around the world. The level of “openness” is highly dependent on some of the technical inputs—the accessibility of the RDE to public users; the critical and minimum characteristics of the data that will be captured, used, stored, and archived; and the risks/tradeoffs associated with the technical definition of what it means to be open. This report and other related mobility policy reports attempt to apply some definition to these open questions. The whole set of reports and definitions should be vetted by the technical team and stakeholders to ensure that the basis for recommending policies is solid.
The RDE system policies can be based on proven solutions; however, the federation policies require further analysis and development. The alternatives regarding the RDE architecture and set of technologies that are proposed for use in the construction and operation of the RDE appear similar to other portals in use with the Federal and State governments, academia, and industry. As a result, most of the RDE system policy can draw from existing models. The key differences, though, from a policy perspective include the wide-scale federation and the monitoring and enforcement of policies throughout such a dispersed system. Once decisions are made about the architecture and technologies, developing a set of alternative models of operation with supporting policies (also referred to as “scenarios”) is a useful next step to determine how the technical, policy, and institutional recommendations align.
State-of-the-Practice and Lessons Learned on Implementing Open Data and Open Source Policies (79)
The document State-of-the-Practice and Lessons Learned on Implementing Open Data and Open Source Policies recommends policies for the Data Capture and Management (DCM) and dynamic mobility applications (DMA) programs. The recommended policies can be summarized as follows:(79)
- Data security: Security risks are generally well understood, and there is a set of regulations, policies, and standards in place that enables the development of robust security approaches. These must ensure security of datasets, data environments, and their corresponding hardware and software.
- Data privacy: There are two particularly sensitive privacy elements in the transportation industry, namely confidentiality and locational privacy. For these, there are not only policies but also technologies and technical security measures that mitigate the risks of exposure and protect the user data. The Fair Information Practice Principles guide data privacy policies, which are enforced in the public sector through Federal law and in the private sector by the Federal Trade Commission’s Bureau of Consumer Protection. State-of-the-art privacy technologies generally seek to anonymize the identity of individuals through various mechanisms to protect those individuals from exposure of sensitive and private information about them.
- Intellectual property: Under U.S. law, any software development is considered intellectual property. It is important to acquire and maintain the right to put developed applications under open-source terms and to develop a comprehensive and effectively communicated intellectual property policy framework that sets the rules that must be followed regarding all aspects of intellectual property.
- Liability: Agencies involved in the DCM and DMA programs should make sure that clear limits to liability are set in place. Specifically, they should assess, determine, and clearly communicate what type of liability protection they are willing to offer users, as well as what circumstances fall outside of this protection.
- Governance: A thorough, well-documented policy for implementing data governance at the FHWA Office of Operations needs to clearly state the reasons for implementing data governance, which include ensuring that existing DCM and DMA programs are managed in a way that provides continued support to the Office of Operations in meeting business needs. The policy also needs to identify the offices/persons responsible for overseeing data governance for the Office of Operations.
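The data privacy item above refers to technologies that obscure identity and protect locational privacy. The sketch below shows two common building blocks under stated assumptions: a keyed hash that replaces a vehicle identifier with a stable token, and rounding of coordinates to reduce locational precision. The key, the identifiers, and the two-decimal precision are hypothetical. Keyed hashing is pseudonymization rather than full anonymization, and detailed location traces may still allow re-identification, so this is an illustration and not a complete privacy solution.

```python
import hashlib
import hmac


def pseudonymize(vehicle_id, secret_key):
    """Replace an identifier with a stable token derived from a keyed
    hash; the same ID and key always give the same token."""
    digest = hmac.new(secret_key, vehicle_id.encode("utf-8"),
                      hashlib.sha256)
    return digest.hexdigest()[:16]


def coarsen(lat, lon, decimals=2):
    """Reduce locational precision by rounding coordinates."""
    return (round(lat, decimals), round(lon, decimals))


# Illustrative values only; a real key would be stored securely.
key = b"example-key-not-for-production"
token_a = pseudonymize("VEH-0001", key)
token_a_again = pseudonymize("VEH-0001", key)
token_b = pseudonymize("VEH-0002", key)
cell = coarsen(38.897663, -77.036574)
```

Because the token is stable, records for one vehicle can still be linked for mobility analysis without exposing the original identifier to the data user.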
Applicability to VDA Framework
It is recommended that the policies for metadata, data security, data privacy, intellectual property, liability, and governance be considered in the implementation of the VDA Framework.
Railinc
Railinc processes the Railroad Carload Waybill Sample data each year for the Association of American Railroads as required in statute by the Surface Transportation Board (STB) (which regulates freight railroads). For example, the States all receive annual updates to the Waybill, but it is very strictly controlled by the STB. To use the data for studies, one needs to go through a formal approval process to gain access to the data. The data are proprietary and contain origin-destination, tonnage, miles, carrier, commodity (very detailed), and equipment (carload, intermodal). Each year, Railinc also publishes a public waybill sample. Vendors use the waybill sample (also with strict confidentiality) to produce Transearch and other datasets. FHWA uses it to produce the rail elements of the Freight Analysis Framework.
Applicability to VDA Framework
The strict regulation of data may be a useful concept for the VDA Framework.
DATA GOVERNANCE
NCHRP 666, Target-Setting Methods and Data Management to Support Performance-Based Resource Allocation by Transportation Agencies
According to NCHRP 666, “Data governance is defined as the execution and enforcement of authority over the management of data assets and the performance of data functions.” (page II-31)(72)
From a practical standpoint, using a data governance model enables the development of standards, policies, and procedures at an enterprise level. A governance model can thus become a focal point where data collection, storage, and use for a particular project or initiative can be set and identified.
From a technical perspective, a data governance framework makes the system more efficient by reducing the number of duplicate data systems, improving data quality, and offering better coordinated data management tools.
The following issues should be addressed in considering a data governance program:
- What rules are the program setting in place? This can include policies, requirements, standards, controls, and mechanisms to ensure accountability.
- What are the rules of engagement among the various stakeholders and how will these be enforced? Who will be in charge of enforcing said rules?
- What is the best process to follow to ensure that data governance creates value, restrains costs and complexity, and remains compliant with the established rules?
Several models are discussed in the report.
Applicability to VDA Framework
A data governance approach needs to be applied to the VDA Framework.
National Information Exchange Model
The National Information Exchange Model (NIEM) is described as follows:(80)
- [An] approach to driving standardized connections among and between governmental entities as well as with private sector and international partners which enable disparate systems to share, exchange, accept, and translate information….In Fiscal Year (FY) 2010, the Office of Management and Budget (OMB) provided guidance to all Federal Agencies to evaluate the adoption and use of NIEM as the basis for developing reference information exchanges to support specification and implementation of reusable cross-boundary services. (page 1)
The NIEM governance framework includes several entities that are similar to the recommended participants in the data governance framework for FHWA Office of Operations. These include an NIEM Executive Steering Council, NIEM Program Management Office, NIEM Communications and Outreach Committee (this would be similar to the Communities of Interest), NIEM Technical Architecture Committee, and the NIEM Business Architecture Committee.
Some Federal agencies are addressing the challenge of implementing centralized governance, developing and implementing information exchange guidelines, creating collaborative sharing agreements, and developing enterprise data management maturity, all of which are identified as challenges in the Agency Information Exchange Functional Standards Evaluation report of June 11, 2010.(80) These agencies are committed to using the NIEM framework to facilitate the sharing and exchange of information across stakeholder groups (communities of interest). The following example is excerpted from the report:(80)
- The U.S. Department of Transportation is committed to using NIEM to support a department-wide capability to manage and share Suspicious Activity Reporting (SAR) information. The value expected is DOT’s full participation in the Nationwide SAR Initiative (NSI), and ultimately to contribute in preventing another terrorist-type surprise attack on the nation. The information exchange at DOT is considered to be of high-value. Currently, DOT creates SAR information, and stores this information in five different databases. Participation in the NSI is a priority of the National Security Staff and as such is seen as a high impact exchange. (page 10)
Applicability to VDA Framework
The use of NIEM or a similar framework supports the exchange of information across nonintegrated databases. It is worth considering for the VDA Framework.
Identification of Critical Policy Issues for the DCM and DMA Programs
This document identifies the following critical policy issues related to governance of open data and open data environments that need to be addressed throughout the DCM Technical Program phases:
- Structure and authority: What form of governance structure(s) supports DCM data environments? Who will fulfill what roles and responsibilities in decisionmaking and dispute resolution? Who will make decisions for upgrading and maintaining the data environments? Who will make decisions about enforcement? What are the options regarding the level of ongoing Federal involvement and, for each option, what are the roles and responsibilities of Federal participants, and what are the associated costs? Can governance be implemented by the private sector or a hybrid of public- and private-sector stakeholders? Who currently has authority or is new authority needed?
- RDE Data Manager: What is the role of the RDE Data Manager, and are the appropriate policies defined to guide the Data Manager in operations, maintenance, and enforcement?
- Federation of data environments: What criteria should determine the appropriateness and eligibility of connecting external data environments with either the RDE or operational data environment (ODE)? What are the associated costs and responsibilities of establishing and maintaining a relationship—for both the RDE/ODE and the external environment? What policies/rules are needed for adding or removing these external environments? For removing datasets? How does federation support data ownership? Revenue generation from the data ownership? How does federation affect liability or raise security risks? What are mitigation strategies?
- Data sharing agreements: What are standard components of data sharing agreement documents? What are important considerations and lessons learned from other agencies in implementing data sharing agreements?
- Policy for maintenance: Who makes decisions about technology upgrades and flexibility of adding new technologies? Who manages the DCM system configuration?
Applicability to VDA Framework
The policy issues identified and explored will be relevant to the VDA Framework.
Oregon Department of Transportation
The Oregon Department of Transportation (ODOT) developed a charter to establish the Transportation Community of Interest Data Council in 2006. The purpose of this council is to identify policy, standards, and processes that support the proper use, management, and maintenance of data assets.
ODOT has recognized the need for strong data governance to create and enforce data management standards. It has established a data management policy, which states the following:(73)
- The Oregon Data Governance model also includes well-defined roles and responsibilities for Data Stewards, Data Custodians, and the various Transportation Communities of Interest. The ODOT Data Governance structure also includes work groups, which were formed to provide work products used by the governance program including enterprise data management and reporting tools. (page 75)
Applicability to VDA Framework
The ODOT data governance model provides an excellent template for defining the roles and responsibilities of all participants in a data governance framework, including the oversight council, communities of interest, data stewards, data business owners, etc. Many of these roles may also need to be defined for the VDA Framework.
Virginia Department of Transportation
In 2008, the Virginia Department of Transportation (VDOT) implemented a Data Business Plan for the System Operations Directorate to “provide a framework for making decisions about what data to acquire, how to get it, and how to make sure it is providing value commensurate with its cost.”(73) This plan defines a framework of stakeholders and their responsibilities to safely and efficiently manage the data system. This includes data stewards, coordinators, architects, and custodians, as well as business owners and communities of interest. It also defines the roles and interaction within and among data services, data products, applications, business processes, business areas, and business objectives.
Applicability to VDA Framework
The principles associated with a data business plan are directly applicable to the VDA Framework, and the VDOT data business plan framework provides an example of a comprehensive governance structure.
DATA SHARING AGREEMENTS
Data sharing can be crucial in two ways. First, it reduces the need (and associated cost) to collect and manage the same data several times at several offices. Second, it minimizes the risk, inherent when different offices hold several versions of a dataset, of giving different responses to the same question. Formal data sharing agreements are helpful to define how the data will be exchanged across different organizations. Examples can include agreements between Federal and local law enforcement organizations, or between a State transportation department and a department of highway safety and motor vehicles.
One common way to establish a formal data sharing agreement is through memoranda of agreement (MOA) or memoranda of understanding agreed upon by the sharing parties. An example of a MOA is one reached between the Metropolitan Washington Council of Governments and the GIS authorities in various Federal, State, regional, and local organizations with a stake or ownership of data around the Washington, DC, metropolitan area. This MOA, whose purpose was to allow sharing geospatial data among all these parties, included stakeholder responsibilities; the purpose, use, and distribution of the gathered data; liability and other legal agreements; and the terms and conditions of the agreement.(82)
Another example of a data sharing agreement—this time geared for safety matters—can be found in Alaska. The State’s Multi-Agency Justice Integration Consortium includes 20 different agencies, such as the Department of Law and Criminal Division, Association of Police Chiefs, Division of Motor Vehicles, Health and Social Services, Department of Transportation, and Department of Public Safety—all of which were signatories of a MOA “to help agencies more efficiently share complete, accurate, timely information in order to enhance the performance of the criminal justice system as a whole.” Using an automated data collection system called the Traffic and Criminal Software System, this stakeholder group has streamlined the safety and law-enforcement process by making collision, arrest, incident, inspection, and GPS data available to the relevant authorities.(72)
The type of agreement may range from a voluntary collaboration with no binding obligations to one that has enforcement mechanisms. An example of the latter is the data sharing requirement present in the Metropolitan Transportation Commission of the Bay Area in California, which asks local jurisdictions to provide it with regular updates on pavement condition or face the consequence of not receiving Federal grant funds.
Just as external data sharing agreements are extremely valuable in optimizing processes and reducing costs, so is internal data sharing within a given organization. Internal offices should therefore strive to make data collection and management as unified and streamlined as possible. For instance, within a State transportation department, the data needs for the office of transportation statistics and those for the office of transportation safety might be very similar, and the two could thus agree to unify their efforts.
Applicability to VDA Framework
Formal data sharing agreements will be necessary in establishing relationships for data sharing for the VDA Framework.