U.S. Department of Transportation
Federal Highway Administration
1200 New Jersey Avenue, SE
Washington, DC 20590
202-366-4000



Policy and Governmental Affairs
Office of Highway Policy Information


Post-event Connected Vehicle Data Exploration - Lessons Learned

Table of Contents

4. Platform to Conduct Post-CV Data Analysis

Data platforms, including tools for storing, accessing, and analyzing data, are critical to efficient and productive data analysis. This is especially true for post-event CV data because of its size, complex data structures, and heterogeneous data formats.

4.1: Data Storage Concept

Figure 1: Databricks architecture illustration

This figure illustrates Databricks architecture. At the center of the figure is the Data Lakehouse, which comprises the Data Warehouse and Data Lake. Microsoft Azure, Amazon Web Services, and Google Cloud communicate with the Lakehouse.

Source: FHWA Office of Highway Policy Information.

Currently, complex data storage is commonly handled through the so-called Data Lakehouse architecture. As shown in Figure 1, a Data Lakehouse has two critical parts, a Data Warehouse and a Data Lake, reflecting the long evolution of data storage technologies. A Data Warehouse handles traditional structured relational data from all sources in a single common format. A Data Lake, on the other hand, provides the capability to store structured, semi-structured, and unstructured data arriving in many formats, such as JSON, CSV, and Parquet.
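To make the Warehouse/Lake distinction concrete, the sketch below shows one logical record arriving in two of the file formats named above, JSON and CSV, and being normalized into a single in-memory representation. The field names (`vehicle_id`, `speed_mph`) are hypothetical, and plain Python is used here for simplicity; a production Lakehouse would use engine-level readers instead.

```python
import csv
import io
import json

# The same logical record can land in a Data Lake in different formats.
json_payload = '{"vehicle_id": "V1", "speed_mph": 42.0}'
csv_payload = "vehicle_id,speed_mph\nV1,42.0\n"

def load_json_record(text):
    """Parse a single JSON object into a plain dict."""
    return json.loads(text)

def load_csv_records(text):
    """Parse CSV text into dicts, casting the speed field to float."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"vehicle_id": r["vehicle_id"], "speed_mph": float(r["speed_mph"])}
        for r in rows
    ]

json_rec = load_json_record(json_payload)
csv_rec = load_csv_records(csv_payload)[0]
assert json_rec == csv_rec  # one logical record, two on-disk formats
```

The point of the Lake layer is precisely this: storage accepts whatever format the source emits, and a normalization step downstream reconciles the records.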

4.2: Data Analytics

Based on the Data Lakehouse architecture, the big data industry has adopted the open analytics platform known as the Databricks Lakehouse for building, deploying, and sharing enterprise-level data, analytics, and AI solutions at scale. It integrates many data processing components, such as Apache Spark, a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters, and MLflow, an open-source platform for managing the machine learning lifecycle. The Databricks Lakehouse has been offered as commercial deployments in several cloud environments, including Microsoft Azure (Azure Databricks), Amazon Web Services, and Google Cloud. In addition to the Databricks Lakehouse, alternative platforms such as Snowflake and Cloudera provide data platforms as Software as a Service (SaaS), each with strengths in different aspects of data storage, access, and processing.

The big data platforms mentioned above offer broadly similar capabilities and compete on their unique strengths. When deciding on a platform, the programming languages it supports should be a pivotal factor. Primary programming languages include Python, SQL, R, Scala, and Java. Agency users' familiarity with any supported programming language is critical to utilizing a platform's capacity and achieving the information extraction goal.

Proficiency in a platform's primary programming languages (e.g., Python, SQL,
R, Scala, and Java) is one of the factors in platform selection.

The OEM CV data provider uses the Databricks Lakehouse platform on AWS. The authors analyzed the CV data through an evaluation Databricks account, using self-developed Databricks code in Python and SQL.

The JPO Pilot post-event CV data are stored in the AWS cloud in an S3 bucket. The authors analyzed the Pilot CV data on the AWS Databricks platform of the FHWA Turner-Fairbank Highway Research Center's Path to Advancing Novel Data Analytics (PANDA) laboratory. Unlike the OEM CV data, which are structured and stored in relational database tables, the JPO Pilot post-event CV data are unstructured and saved as many JSON files in the S3 bucket. For each roadside unit, each hour's data is saved as one JSON file, which results in a tremendous number of individual files. The authors developed a process in Databricks to automatically enumerate all daily and weekly files and load each JSON file into an Apache Spark DataFrame for further processing. Another issue with the Pilot CV data is format inconsistency: data types and even measurement units differ from one file to another. A set of Python scripts was developed to resolve these format issues.
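The enumerate-then-normalize process described above can be sketched in plain Python. This is a simplified stand-in for the authors' Databricks/Spark workflow, not the actual code: the directory layout (one `rsu_*` folder per roadside unit, one JSON file per hour), the field names, and the mph-versus-m/s unit mismatch are all hypothetical illustrations of the kinds of inconsistencies the paragraph describes.

```python
import json
from pathlib import Path

# Hypothetical layout: <root>/rsu_<id>/<hour>.json, one file per RSU per hour.
# Some files report speed in mph, others in m/s; normalize everything to m/s.
MPH_TO_MPS = 0.44704

def normalize_record(rec):
    """Return a record with speed in m/s regardless of the source unit."""
    if "speed_mph" in rec:  # files using U.S. customary units
        return {"rsu_id": rec["rsu_id"],
                "speed_mps": rec["speed_mph"] * MPH_TO_MPS}
    return {"rsu_id": rec["rsu_id"], "speed_mps": rec["speed_mps"]}

def load_hourly_files(root):
    """Enumerate every hourly JSON file under root, yielding clean records."""
    for path in sorted(Path(root).glob("rsu_*/*.json")):
        for rec in json.loads(path.read_text()):
            yield normalize_record(rec)
```

In the actual workflow, the enumeration would target S3 object keys and each file would be loaded into a Spark DataFrame rather than a generator, but the logical steps (discover files, parse each one, reconcile units and types) are the same.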

Having access to more platforms does not necessarily mean more productivity
or higher efficiency for an organization.

 


Page last modified on May 8, 2024