Data Integration Primer - Challenges to Data Integration
One of the most fundamental challenges in the process of data integration is setting realistic expectations. The term data integration conjures a perfect coordination of diversified databases, software, equipment, and personnel into a smoothly functioning alliance, free of the persistent headaches that mark less comprehensive systems of information management. Think again.
The requirements analysis stage offers one of the best opportunities in the process to recognize and digest the full scope of complexity of the data integration task. Thorough attention to this analysis is possibly the most important ingredient in creating a system that will live to see adoption and maximum use.
As the field of data integration matures, other common impediments, and the solutions that compensate for them, will become easier to identify. Current integration practices have already highlighted a few familiar challenges, along with strategies to address them, as outlined below.
Heterogeneous Data
For most transportation agencies, data integration involves synchronizing huge quantities of variable, heterogeneous data produced by internal legacy systems that differ in data format. Legacy systems may have been built around flat-file, network, or hierarchical databases, unlike newer generations of databases, which use relational data. Data in different formats from external sources continue to be added to the legacy databases to increase the value of the information. Each generation, product, and home-grown system imposes its own requirements for storing or extracting data, so data integration can involve a range of strategies for coping with heterogeneity. In some cases, the effort becomes a major exercise in data homogenization, which may not enhance the quality of the data offered.
- A detailed analysis of the characteristics and uses of the data is necessary to mitigate issues with heterogeneous data. First, a model is chosen (either a federated or data warehouse environment) that serves the requirements of the business applications and other uses of the data. Then the database developer will need to ensure that the various applications can use this format or, alternatively, that standard operating procedures are adopted to convert the data to another format.
- Bringing disparate data together in a database system, or migrating and fusing highly incompatible databases, is painstaking work that can sometimes feel like an overwhelming challenge. Thankfully, software technology has advanced to minimize these obstacles through data access routines that allow structured query languages to access nearly all DBMSs and data file systems, relational or non-relational.
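To make the homogenization strategy above concrete, the sketch below converts records from a hypothetical fixed-width legacy flat file into a common relational format. The field names, column offsets, and sample records are invented for illustration; a real legacy layout would come from the system's documentation or its original data manager.

```python
# A minimal sketch of one homogenization step: parsing a hypothetical
# fixed-width legacy flat file and loading it into a relational table.
import sqlite3

# Hypothetical layout of one legacy inventory record:
# route id (cols 0-9), surface type code (cols 10-12), length in miles (cols 13-20)
LAYOUT = [("route_id", 0, 10), ("surface_code", 10, 13), ("miles", 13, 21)]

def parse_flat_record(line):
    """Slice one fixed-width line into a dict of typed fields."""
    rec = {name: line[start:end].strip() for name, start, end in LAYOUT}
    rec["miles"] = float(rec["miles"])  # convert text to the target type
    return rec

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (route_id TEXT, surface_code TEXT, miles REAL)")

legacy_lines = ["I-0064    ACP 0012.40", "SR-0610   PCC 0003.75"]
rows = [parse_flat_record(line) for line in legacy_lines]
conn.executemany(
    "INSERT INTO inventory VALUES (:route_id, :surface_code, :miles)", rows
)

total = conn.execute("SELECT SUM(miles) FROM inventory").fetchone()[0]
```

Once the legacy records are in a relational table, standard query tools can reach them alongside newer data sources, which is the point of the data access routines described above.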
Data Quality
Data quality is a top concern in any data integration strategy. Legacy data must be cleaned up before conversion and integration, or an agency will almost certainly face serious data problems later. Legacy data impurities have a compounding effect; by nature, they tend to concentrate around high-volume data users.
If this information is corrupt, so, too, will be the decisions made from it. It is not unusual for undiscovered data quality problems to emerge in the process of cleaning information for use by the integrated system. The issue of bad data leads to procedures for regularly auditing the quality of information used. But who holds the ultimate responsibility for this job is not always clear.
- The issue of data quality exists throughout the life of any data integration system. So it is best to establish both practices and responsibilities right from the start, and make provisions for each to continue in perpetuity.
- The best processes result when developers and users work together to determine the quality controls that will be put in place in both the development phase and the ongoing use of the system.
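The regular quality audits described above can start very simply. The sketch below checks a batch of legacy records for missing key fields, impossible values, and duplicates before conversion; the field names and validity rules are invented for illustration, and a real audit would encode the quality controls that developers and users agree on.

```python
# A minimal sketch of a routine pre-conversion data quality audit.
# Field names and validity rules here are illustrative assumptions.

def audit(records):
    """Return the indices of records exhibiting each kind of quality issue."""
    issues = {"missing_route": [], "bad_miles": [], "duplicates": []}
    seen = set()
    for i, rec in enumerate(records):
        if not rec.get("route_id"):          # missing key field
            issues["missing_route"].append(i)
        miles = rec.get("miles")
        if miles is None or miles < 0:       # impossible value
            issues["bad_miles"].append(i)
        key = (rec.get("route_id"), rec.get("miles"))
        if key in seen:                      # exact duplicate of an earlier record
            issues["duplicates"].append(i)
        seen.add(key)
    return issues

records = [
    {"route_id": "I-0064", "miles": 12.4},
    {"route_id": "", "miles": 3.5},         # missing key field
    {"route_id": "SR-0610", "miles": -1},   # impossible value
    {"route_id": "I-0064", "miles": 12.4},  # duplicate entry
]
report = audit(records)
```

Running such a check on a schedule, and routing its report to a named owner, is one way to make the ongoing audit responsibility explicit rather than leaving it unassigned.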
Lack of Storage Capacity
The unanticipated need for additional performance and capacity is one of the most common challenges in data integration, particularly in data warehousing. Two storage-related requirements generally come into play: extensibility and scalability. Once a system is initiated, storage needs can grow exponentially, and the difficulty of anticipating that growth feeds fears that storage costs will exceed the benefits of data integration. Such massive quantities of data can push the limits of hardware and software, forcing developers into costly fixes if an architecture capable of processing much larger amounts of data must be retrofitted into the planned system.
- Alternative storage is becoming routine for data warehouses that are likely to grow in size. Planning for such options helps keep expanding databases affordable.
- The cost per gigabyte of storage on disk drives continues to decline as technology improves. From 2000 to 2004, for instance, the cost of data storage declined ten-fold. High-performance storage disks are expected to follow the downward pricing spiral.
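The capacity planning implied above can be approached with a simple back-of-envelope projection: compound the expected growth rate forward and see where the size and cost land. The starting size, growth rate, and cost-per-gigabyte figures below are illustrative assumptions, not benchmarks.

```python
# A back-of-envelope sketch of projecting warehouse storage growth so that
# capacity and cost surprises surface during planning rather than operation.
# All input figures are illustrative assumptions.

def project_storage(initial_gb, annual_growth, years, cost_per_gb):
    """Return (year, size in GB, annual storage cost) for each projected year."""
    size = initial_gb
    projection = []
    for year in range(1, years + 1):
        size *= 1 + annual_growth                      # compound growth
        projection.append((year, round(size, 1), round(size * cost_per_gb, 2)))
    return projection

# e.g. a 500 GB warehouse growing 60% per year, at an assumed $0.50/GB
plan = project_storage(500, 0.60, 5, 0.50)
```

Even this crude model makes the exponential character of the problem visible: at 60% annual growth, the hypothetical warehouse is more than ten times its initial size within five years, which is exactly the situation in which planned-for alternative storage pays off.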
Underestimating Costs
Data integration costs are fueled largely by items that are difficult for the uninitiated to quantify, and thus to predict. These might include:
- Labor costs for initial planning, evaluation, programming and additional data acquisition
- Software and hardware purchases
- Unanticipated technology changes/advances
- Both labor and the direct costs of data storage and maintenance
It is important to note that, regardless of efforts to streamline maintenance, the realities of a fully functioning data integration system may demand a great deal more maintenance than could be anticipated.
Unrealistic estimating can be driven by an overly optimistic budget, particularly in these times of budget shortfall and doing more with less. More users, more analysis needs and more complex requirements may drive performance and capacity problems. Limited resources may cause project timelines to be extended, without commensurate funding. Unanticipated issues, or new issues, may call for expensive consulting help. And the dynamic atmosphere of today's transportation agency must be taken into account, in which lack of staff, changes in business processes, problems with hardware and software, and shifting leadership can drive additional expense.
The investment in time and labor required to extract, clean, load, and maintain data can creep if the quality of the data presented is weak. It is not unusual for this to produce unanticipated labor costs that are rather alarmingly out of proportion to the total project budget.
- The approach to estimating project costs must be both far-sighted and realistic. This requires an investment in experienced analysts, as well as cooperation, where possible, among sister agencies on lessons learned.
- Special effort should be made to identify items that may seem unlikely but could dramatically impact total project cost.
- Extraordinary care in planning, investing in expertise, obtaining stakeholder buy-in and participation, and managing the process will each help ensure that cost overruns are minimized and, when encountered, can be most effectively resolved. Data integration is a fluid process in which such overruns may occur at each step along the way, so trained personnel with vigilant oversight are likely to return dividends instead of adding to cost.
- A viable data integration approach must recognize that the better data integration works for users, the more fundamental it will become to business processes. This level of use must be supported by consistent maintenance. It might be tempting to think that a well designed system will, by nature, function without much upkeep or tweaking. In fact, the best systems and processes tend to thrive on the routine care and support of well-trained personnel, a fact that wise managers generously anticipate in the data integration plan and budget.
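One way to put the far-sighted estimating recommended above into practice is to separate predictable line items from historically underestimated ones and apply an explicit contingency to the latter. The line items, dollar amounts, and contingency percentage below are illustrative assumptions only.

```python
# A minimal sketch of an itemized estimate with explicit contingency on the
# hard-to-quantify items. All figures here are illustrative assumptions.

BASE_ITEMS = {              # relatively predictable costs
    "software_hardware": 250_000,
    "initial_labor": 400_000,
}
UNCERTAIN_ITEMS = {         # items experience shows are easy to underestimate
    "data_cleanup_labor": 150_000,
    "ongoing_maintenance": 120_000,
    "consulting": 80_000,
}

def estimate_total(contingency=0.40):
    """Total project estimate, padding only the uncertain line items."""
    base = sum(BASE_ITEMS.values())
    uncertain = sum(UNCERTAIN_ITEMS.values()) * (1 + contingency)
    return base + uncertain

total = estimate_total()
```

Keeping the contingency visible as its own factor, rather than burying padding in each line item, makes it easier to defend the estimate and to revisit it as phases complete.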
Lack of Cooperation from Staff
User groups within an agency may have developed databases on their own, sometimes independently from information systems staff, that are highly responsive to the users' particular needs. It is natural that owners of these functioning standalone units might be skeptical that the new system would support their needs as effectively.
Other proprietary interests may come into play. For example, division staff may not want the data they collect and track to be visible to headquarters staff at all times without an opportunity to address the nuances of what the data appear to show. Owners or users may fear that higher-ups who do not appreciate the peculiarities of a given method of operation will gain more control over how data is collected and accessed organization-wide.
In some agencies, the personnel, consultants, and financial support provided by the highest echelons of management may be insufficient to dispel these fears and gain cooperation. Top management must be fully invested in the project; otherwise, the strategic data integration plan and the resources associated with it are less likely to be approved. The additional support required to engage everyone in the agency and convey the need for, and benefits of, data integration is unlikely to flow from leaders who lack awareness of or commitment to those benefits.
- Any large-scale data integration project, regardless of model, demands that executive management be fully on board. Without it, the initiative is, quite simply, likely to fail.
- Informing and involving the diversity of players during the crucial requirements analysis stage, and then in each subsequent phase and step, is probably the single most effective way to gain buy-in, trust, and cooperation. Collecting and addressing each user's concerns may be a daunting proposition, particularly for knowledgeable information professionals who prefer to "cut to the chase." However, without a personal stake in the process and a sense of ownership of the final product, the long-term health of this major investment is likely to be compromised by users who feel that change has been enforced upon them rather than designed to advance their interests.
- Incremental education, another benefit of stakeholder involvement, is easier to impart than after-the-fact training, particularly since it addresses both the capabilities and limitations of the system, helping to calibrate appropriate expectations along the way.
- Since so much of the project's success is dependent upon understanding and conveying both human and technical issues, skilled communicators are a logical component of any data integration team. Whether staff or consultants, professional communications personnel are most effective as core participants, rather than occasional or outside contributors. They are trained to recognize and ameliorate gaps in understanding and motivation. Their skills also help maximize the conditions for cooperation and enthusiastic adoption. In many transportation agencies, public information personnel actually focus a significant amount of their time and budget on internal audiences rather than external customers. This makes them well attuned to the operational realities of a variety of internal stakeholders.
At least three conditions were required for the success of Virginia DOT's development effort:
- Upper management had to support the business objectives of the project and the creation of a new system to meet the objectives
- Project managers had to receive the budget, staff, and IT resources necessary to initiate and complete the process
- All stakeholders and eventual system users from the agency's districts and headquarters had to cooperate with the project team throughout the process(22)
Lack of Data Management Expertise
As more transportation agencies nationwide undertake the integration of data, the availability of experienced personnel increases. However, since data integration is a multi-year, highly complex proposition, even these leaders may not have the kind of expertise that evolves over a full project life-cycle. Common problems develop at different stages of the process and these can better be anticipated and addressed when key personnel have managed the typical variables of each project phase.
Also, the process of transferring historical data from its independent source to the integrated system may benefit from the knowledge of the manager who originally captured and stored the information. High turnover in such positions, along with early retirements and other personnel shifts driven by a historically tight budget environment, may complicate the mining and preparation of this data for convergence with the new system.
- A seasoned, highly knowledgeable data integration project leader and a data manager with state-of-the-practice experience are the minimum required to design a viable approach to integration. Choosing this expertise very carefully can help ensure that the resulting architecture is sufficiently modular, maintainable, and robust enough to support a wide range of owner and user needs, while remaining flexible enough to accommodate changing transportation decision-support requirements over a period of years.
Perception of Data Integration as an Overwhelming Effort
When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies describe this perceived major upfront overhaul as "unachievable" and "disruptive." In addition, uncertainties about funding priorities and potential shortfalls can further complicate efforts to move forward.
- Methodical planning is essential in data integration. Setting incremental (or phased) goals helps ensure that each phase can be understood, achieved, and funded adequately. This approach also allows the integration process to be flexible and agile, minimizing risks associated with funding and other resource uncertainties and priority shifts. In addition, the smaller, more accurate goals will help sustain the integration effort and make it less disruptive to those using and providing data.
- Source: Transportation Asset Management Case Studies/Data Integration, The Virginia Experience, USDOT FHWA