2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring) | 978-1-7281-8964-2/20/$31.00 ©2021 IEEE | DOI: 10.1109/VTC2021-Spring51267.2021.9448697

Data Provenance in Vehicle Data Chains

Daniel Wilms, Research Engineer, BMW Technology Office Israel Ltd, daniel.wilms@bmwtechoffice.co.il
Dr. Carsten Stoecker, Founder and CEO, Spherity GmbH, carsten.stoecker@spherity.com
Dr. Juan Caballero, Research, Spherity GmbH, Decentralized Identity Fdn, juan.caballero@spherity.com

Abstract—With almost every new vehicle being connected, the importance of vehicle data is growing rapidly. Many mobility applications rely on the fusion of data coming from heterogeneous data sources, like vehicle and "smart-city" data or process data generated by systems outside their control. This external data determines much about the behaviour of the relying applications: it impacts the reliability, security and overall quality of the application's input data and ultimately of the application itself. Hence, knowledge about the provenance of that data is a critical component in any data-driven system. The secure traceability of the data handling along the entire processing chain, which passes through various distinct systems, is critical for the detection and avoidance of misuse and manipulation. In this paper, we introduce a mechanism for establishing secure data provenance in real time, demonstrating an exemplary use-case based on a machine learning model that detects dangerous driving situations. We show with our approach, based on W3C decentralized identity standards, that data provenance in closed data systems can be effectively achieved using technical standards designed for an open data approach.

I. INTRODUCTION

Driven by technological innovation and organic ecosystem growth, mobility value chains are changing significantly from monolithic and closed systems to distributed, open ones [1], [2].
Data flows are increasingly defined dynamically and stretch across multiple organizational boundaries and even legal jurisdictions with diverse rules and regulations [3]. The trustworthiness and accuracy of output data generated along distributed digital mobility value chains (such as that from in-car sensors) is of increasing importance for safety and reliability; this importance can only increase as Machine Learning (ML) systems grow more central to mobility systems [4]. The growing use of data-producing and/or Internet of Things (IoT) devices across every industry, including logistics, manufacturing, and mobility, is ushering in an era of data abundance. The trend towards processing open data in the mobility ecosystem in particular makes urgent the question of how to ensure the validity of the source data, and of how to ensure the quality and accuracy of the output data as well. Furthermore, retroactively demoting or forgetting data from sources proven to be unreliable remains an elusive capability. Mobility systems have a very low tolerance for fraud and abuse, as fraudulent data in safety-critical features can have immediate real-world impact [5]. Upcoming regulation around automotive cyber-security underlines this importance [6]. Besides these security considerations, ensuring the privacy of individuals when processing their data is also crucial; understanding the data flow and providing proof that data is treated carefully and ethically is essential. Another important aspect is quality assurance of the data: if customers have the option to forward data to third-party applications, how can they forward along the metadata and trust ratings necessary for safety and benchmarking in those new contexts? We believe that the required quality can only be provided through full transparency and a deeper understanding of the data and the way it is processed, by detecting and mitigating flaws in the data pipeline.
Businesses must know the origin and risks of data from different sources before using it, and they need highly automated and verifiable instruments for assessing the provenance of data vital for ML applications [7].

This paper is organized as follows: the remainder of Section I further details the requirements for data provenance in an automotive context. Section II discusses related work across the automotive, ML, and data-processing spheres. We introduce our approach in Section III and the details and results of our implementation in Section IV. Finally, we provide our conclusion in Section V.

Fig. 1. Architecture overview: data flow from the Data Producer to the Data Consumer and the provenance flow in the opposite direction.

Data provenance in automotive

An effective data provenance solution has to be applicable to multiple layers of a typical data processing flow in automotive contexts. A typical high-level architecture of such a flow is shown in Figure 1. The data processing flow starts with a Data Producer, typically an Electronic Control Unit (ECU), which transports a signal over a system bus. The signal is received and "edge-processed" within the vehicle by a Data Collector. The first pre-processing of the signal (e.g., adding meta information, doing some pre-calculation) is done here, before the signal is transported to a Digital Representation, which defines the incoming data of the physical vehicle through a digital model and offers it to applications as a digital shadow or even as a digital twin of the vehicle [8]. From there, we assume an optional data fusion with External Data Sources (e.g., weather, traffic information, etc.). This data is then consumed by Data Processors, which in turn create new data of interest for an external Data Consumer.
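The stages just described can be expressed as a minimal pipeline sketch. All class, function, and field names below are illustrative assumptions for exposition, not taken from the paper's implementation:

```python
from dataclasses import dataclass, field

# Minimal sketch of the data flow in Fig. 1: Data Producer -> Data
# Collector -> Digital Representation -> fusion -> Data Consumer.
# All names and example values here are illustrative assumptions.

@dataclass
class Signal:
    name: str        # e.g. "lateral_acceleration"
    value: float
    timestamp: float

@dataclass
class CollectedSignal:
    signal: Signal
    meta: dict = field(default_factory=dict)  # meta information added at the edge

def data_collector(raw: Signal) -> CollectedSignal:
    """Edge pre-processing inside the vehicle: attach meta information."""
    return CollectedSignal(signal=raw, meta={"source_ecu": "ECU-42", "unit": "m/s^2"})

def digital_representation(collected: list) -> dict:
    """Digital shadow of the vehicle: latest value per signal name."""
    shadow = {}
    for c in collected:
        shadow[c.signal.name] = c.signal.value
    return shadow

def data_processor(shadow: dict, external: dict) -> dict:
    """Fuse the vehicle shadow with external data for a Data Consumer."""
    return {**shadow, **external}
```

The point of the sketch is the directionality: data always moves producer-to-consumer, which is why the provenance trail proposed later has to point the opposite way, back through each of these functions.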
Data Processors perform at least one data transformation; in the case of ML, at least two transformations have to be performed: pre-processing and applying the actual model [9]. In this paper, we propose and detail one mechanism for establishing data provenance in real time along a data chain that sources driving event data and processes it into an ML label. When the data provenance of a given dangerous-driving machine-learning label is known, a scoring model can be applied to it that calculates the risks of consuming this data label for system control or responsible decision-making. This example, chosen for clarity and simplicity, is nevertheless applicable much more widely to any digital data chain. We feel that, with time, machine-learning labels will benefit from some degree of rating and scoring to be used safely, legally, and/or responsibly, balancing privacy and security requirements appropriately. We support standardization on decentralized identity primitives that make these kinds of rating and scoring systems more portable and interoperable, particularly in architectures where cryptographic agility can be incorporated to maximize forward compatibility.

II. RELATED WORK

At least since ML went mainstream, the provenance issues of big data have been almost universally acknowledged, although the problem is less a solvable one than a definitional impasse: how can one standardize the capture of data from improvised sources, which will be structured post facto by the ML process? Some would say that the writing has been on the wall since at least 2009, when the authors of "Provenance: a Future History" pointed out that the problem would undermine ML research until, in retrospect, it would look like a glaring problem overlooked all along [10].
Our intention in this section is to position our methods in a wide field of complementary approaches, all of which support the broader aim of improving the monitoring, refinement, and reputation capabilities of ML pipelines. Decentralized identity and decentralized PKI offer scaffolding and anchor points for accountable and traceable provenance. At the same time, a significant amount of standardization of the data exchange and integration of flexible and reflexive data capture is crucial. The complexity of the semantic and data-capture work needed to take full advantage of that scaffolding should not be understated. In fact, pride of place is given in the aforementioned 2009 provenance manifesto to semantics and expressive data, which proved influential on the development of many big-data-oriented initiatives within the W3C and the semantic-web community. The system built on top of ML and PROV schemata, elaborated on by Souza et al. in 2019, is a good example of how such cross-silo semantic capture could be tracked and accounted for in an ML context [7]. How to quantify and benchmark risk scoring is never a simple matter in ML, and some recent work has tried to address the quantification of provenance scoring in contexts analogous to those described here. Barclay et al. (2019), for example, specifically address shifting ethical standards and transparency vis-à-vis regulatory and ethical scrutiny. It is particularly important to recognize that provenance must always be linked to dynamic rather than static valuations and rubrics, as regulatory and liability frameworks will likely take decades to stabilize internationally [11]. Other related work seeks to incorporate risk scoring throughout the training process [12] [13], expanding the scope of the preceding efforts.
Adjusting the ML training methodology to be more reflexive throughout on the basis of such scoring has been a major focus of efforts at Amazon's ML design division, and will presumably be a feature of production-grade ML in the future, if not a core feature of off-the-shelf product offerings for ML training and for the lifecycle management thereof [14]. The contribution of this paper is an architecture proposal that achieves data transparency and data provenance by enriching the meta information of the data itself, right where the creation or transformation of the data takes place. We focus in this paper on the automotive sector, but we believe that this contribution can have an impact on achieving reliable risk scoring of ML algorithms in general.

III. PROPOSED METHODS

To realize data provenance in an automotive data processing chain, the methods we propose in this paper consist of two interdependent elements:

1) Identification of entities which create data or perform data transformations in the data processing chain.
2) Introduction of encrypted data structures for representing Distributed Automotive Data (DAD).

To render the data provenance meaningful, it is crucial to know in which way the data has been transformed at each step from the Data Producer to the Data Consumer. In particular, it has to be transparent which exact digital entity or algorithm has carried out each transformation on the data, and when. The transformation can take place in one or many centralized or decentralized mobility systems, so the identification of all the entities is greatly simplified if the identities are maximally portable and not administered within their respective closed systems. Therefore, we propose to adopt the decentralized identifier (DID) standard
as an open, interoperable addressing scheme and to establish mechanisms for resolving DIDs across multiple centralized and/or decentralized mobility systems [15]. Even if these identifier resolution schemes are not used for discovery or communications within a given system, they are invaluable for reconstructing or tracing data trails that cross many systems, whether in real time or forensically. Figure 2 shows which identities in the data processing flow introduced in Figure 1 get provisioned with DIDs in our architecture, and thus get cryptographic data-signing capabilities.

Fig. 2. Architecture with signing identities marked. Data provenance flow on the left with scoring on top, realized by the provenance of the data the algorithm relies on.

Fig. 3. Data structure as chained transformation. Each DAD includes the DID of the transforming entity and points to the previous DAD(s), which were used during the transformation.

DIDs were originally designed to function as identifiers for individual people, but can readily be extended to any entity or resource. They are derived from public/private key pairs and registered in an immutable registry for discovery purposes. Spherity and other companies pioneering the decentralized identity technology sector use innovative cryptographic solutions for secure key management, such as private key sharding (multi-party computation), on-device/secure-enclave biometrics, and HSMs, to make signatures and data trails more secure and non-repudiable [16]. The field is fast-moving, and significant progress is being made in expanding the options for high-security use-cases and in building forward-secure cryptographic agility into systems to accommodate these new options. Each domain or namespace for DIDs corresponds to a method of encoding and decoding, making DIDs resolvable like domain names relative to a method-specific but interoperable resolution infrastructure [15]. For this project, Spherity registered its DIDs on the Ethereum blockchain via the standard W3C DID method ethr [17]. In this method, the public key of any valid Ethereum keypair (i.e., an "Ethereum address") can be used as the identifier string within the namespace defined by the ethr prefix. Thus, our DIDs look like this: did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35. As you can see in the 'iss' and 'payload'['inputDataDids'] parameters of the sample data label (Figure 5), all identities are expressed as DID references. Essentially, each transformation appends a new link in a chain of linked and signed versions: each data point can be updated, and each updated data point is both signed by the transformer and linked back to its previous state. Applying this to each transformation from the Data Producer to the Data Consumer yields the data chain shown in Figure 3, which enables the system to trace back how and where the data has been transformed, which algorithm was responsible for the transformation, and how the incoming data at a transformation step influenced the outcome. In this way, data provenance is ensured throughout the entire data processing chain: the outcome of the algorithm can be traced back to the Data Producer, ensuring greater confidence in the algorithm's outputs.

IV. RESULTS

In this paper we present the implementation of a verifiable data chain for a supervised learning scenario with an RNN algorithm detecting dangerous driving scenarios, as shown in [18]. The referenced scenario detects situations of dangerous driving on an incoming stream of vehicle data and classifies the maneuver (e.g., left turn, right turn, acceleration, etc.). The algorithm predicts the result for every timestamp, based on the previous ten frames or data points. The input data consists of categorical (e.g., gear, brakes pressed) and continuous signals (e.g., position, lateral and longitudinal acceleration, etc.).
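The chained DAD structure proposed in Section III can be sketched in a few lines. The 'iss' and 'inputDataDids' field names follow the sample label described in the paper; using a SHA-256 content hash as the DAD identifier is an illustrative assumption of this sketch, and the signing and encryption of each record are omitted:

```python
import hashlib
import json

# Sketch of the chained DAD structure (Fig. 3): each record carries the
# DID of the transforming entity ('iss') and pointers to the DADs it
# consumed ('inputDataDids'). The SHA-256 content-hash identifier is an
# illustrative assumption; the paper's records are additionally signed
# and encrypted, which this sketch omits.

def make_dad(transformer_did, payload, input_dads):
    body = {
        "iss": transformer_did,                          # transforming entity's DID
        "payload": payload,
        "inputDataDids": [d["id"] for d in input_dads],  # links to previous DADs
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"id": "dad:sha256:" + digest, **body}

def verify_dad(dad):
    """Recompute the content hash to detect tampering with a record."""
    body = {k: v for k, v in dad.items() if k != "id"}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return dad["id"] == "dad:sha256:" + digest

# Producer -> transformation -> ML label, forming a verifiable back-chain
# (the DIDs here are placeholders, not real registered identifiers).
raw = make_dad("did:ethr:0x" + "aa" * 20, {"speed": 97.3}, [])
label = make_dad("did:ethr:0x" + "bb" * 20,
                 {"maneuver": "left turn", "dangerous": True}, [raw])
```

Because each record's identifier commits to its issuer, payload, and input links, a consumer holding only the final label can walk `inputDataDids` backwards and re-verify every link in the chain.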
For the solution presented here, we used a cloud environment and the historic dangerous-driving event data sets that had been used to train the RNN model. As shown in Figure 4, we used the historical data to simulate a live vehicle data stream. The historical data set contains, for each point in time, an array of data points, which are the relevant features for the RNN model.

Fig. 4. Overview of each step taken in the prototype implementation. The application is based on a data stream of historic data. The data is processed and consumed by an algorithm. The proposed provenance flow is applied.

Each array is sent to a component responsible for the data handling and processing. It creates a DAD for every data point. Then, a feature vector of 10 entries is prepared for the RNN model. The outcome of the RNN model (the classification of the situation as dangerous, the type of maneuver, and the confidence) is then stored as another DAD, which refers to the DIDs of the entries included in the DAD output of the final result (see Figure 5). To show how the application orders a stream of source events over time, we also present the incoming data on a map, linking backwards to the source DIDs and forward to the resulting DADs (see Figure 6). In this way, the cryptographic data structure provides instruments for end-to-end verifiability that enabled us to prove the integrity of the data chain, identify all the entities involved in the creation of a specific machine learning label, and request, in turn, life-cycle credentials from these entities to feed a scoring model for the respective machine learning label. The end-to-end verifiability of entity attributes and quality data about the entities involved in cyber-physical value chains would allow us to build algorithms that accurately score machine learning output data.
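As a toy illustration of how such a scoring model might consume verified provenance (this rubric is our own hypothetical example, not the model used in the paper): weight each entity along the verified chain by a trust rating and discount the label's confidence accordingly, so that one weak link degrades the score of the whole chain.

```python
# Hypothetical provenance-based scoring rubric (an illustrative sketch,
# not the paper's scoring model): multiply the label's model confidence
# by the trust placed in every entity along the verified provenance
# chain, so a single weak link degrades the whole chain's score.

def provenance_score(entity_trust, label_confidence):
    """entity_trust: trust values in [0, 1], one per entity in the chain."""
    chain_trust = 1.0
    for t in entity_trust:
        chain_trust *= t
    return label_confidence * chain_trust

# A label with 0.9 model confidence produced via three chain entities:
score = provenance_score([0.99, 0.95, 0.80], 0.9)
```

The multiplicative form is one deliberate choice among many: it encodes the intuition that provenance assurance is only as strong as the least trustworthy transformer in the chain, whereas an additive rubric would let strong links mask weak ones.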
Any consumer of these output data could then assess their trustworthiness prior to processing them in the consumer's application. In a second iteration of the project, we would focus on using the proposed methods in a real-world scenario, e.g., using live data streams from a fleet of real vehicles integrated with this validated data-chain infrastructure.

Fig. 5. Output VC with provenance pointing back to DIDs of simulated telemetry devices (encrypted data and cryptographic signature cropped out for clarity).

Fig. 6. Web interface of the application presenting the results. On top there is the visualization of the algorithm output, in the middle the raw values with corresponding DIDs, and below the algorithm DID, the raw value DIDs, the algorithm output and the signature.

V. CONCLUSION AND OUTLOOK

The growing complexity of data chains and the increasing number of actors processing their data go beyond what can be expected of manual quality assurance and maintenance processes. A transparent and reliable evaluation of the risk and quality of data inference methodologies is essential, and this requires scaffolding for such accounting. Data provenance about the entities involved in a data processing chain and the resulting machine learning labels (using DIDs, VCs, and DLT to ensure uniform metadata) provides a foundation for reliable risk scoring. Harmonized and standardized data (and, more importantly, metadata) is the key to AI explainability, whether managed in traditional top-down ways, by new forms of reputation, or by new forms of actuarial accounting and trustworthiness ratings.
In any of these cases, verifiable credentials about identity subjects (i.e., vehicles, pre-processing and ML algorithms) could be consumed by algorithms at the heart of scoring models that both assess risk and refine the labeling process and its outputs. Overall, we believe many different costs can be reduced significantly: those incurred by poor data quality, those resulting from poorly-understood data flows from which inferences are drawn by the customer or the vehicle context, and those arising from the risk of data manipulation in any data system, whether open or closed. New economic opportunities internal to the data marketplace will open up, we believe, as the minimum and average level of data quality in marketplaces rises. As shown in Alvarez-Coello et al. [19], it is beneficial for the industry to move towards a data-centric architecture, which can drive major gains in the stability and reliability of data-processing flows. This stability and reliability is necessary to maintain safety and innovation, and the solution introduced here contributes directly to these ends. The approach also has significant indirect benefits for the quality-assurance and legal aspects of these systems. The kinds of discovery and forensic audits required by both routine regulatory compliance and dispute resolution could be executed much more efficiently once entire data processing pipelines become verifiable to any auditor with the right consents or credentials. This also fosters innovation and business-process agility, as individual actors (even non-human ones!) would be better able to assess the risks of relying on data sets, data sources, and algorithms dynamically. We have shown how such data provenance can be applied to data streams in an automotive context. The scenario was based on historical data and simulated a typical live data chain.
Next steps would include applying the solution to a situation in which a real vehicle serves as the actual Data Producer, extending the concept from the restricted environment shown in this work to a real-world application with all the different layers of the data chain shown in Figure 1.

REFERENCES

[1] P. Yadav, S. Hassan, A. Ojo, and E. Curry, "The role of open data in driving sustainable mobility in nine smart cities," Jun. 2017, pp. 1248–1263.
[2] "Driving Positive Outcomes through Open Data Solutions for Mobility," Dell, Lero, Forum For the Future, Open DataSoft, City of Palo Alto, Tech. Rep., Feb. 2018. [Online]. Available: https://www.dell.com/learn/pa/en/pacorp1/corporate~corp-comm~en/documents~mobility-open-data.pdf
[3] N. Gruschka, V. Mavroeidis, K. Vishi, and M. Jensen, "Privacy Issues and Data Protection in Big Data: A Case Study Analysis under GDPR," arXiv:1811.08531 [cs], Nov. 2018. [Online]. Available: http://arxiv.org/abs/1811.08531
[4] B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and C. Patz, "Challenges in applying the ISO 26262 for driver assistance systems," Tagung Fahrerassistenz, p. 23, 2012.
[5] P. Koopman and M. Wagner, "Challenges in Autonomous Vehicle Testing and Validation," SAE International Journal of Transportation Safety, vol. 4, no. 1, pp. 15–24, Apr. 2016. [Online]. Available: https://www.sae.org/content/2016-01-0128/
[6] O. Burkacky, J. Deichmann, B. Klein, K. Pototzky, and G. Scherf, "Cybersecurity in automotive," McKinsey, Mar. 2020. [Online]. Available: https://www.gsaglobal.org/wp-content/uploads/2020/03/Cybersecurity-in-automotive-Mastering-the-challenge.pdf
[7] R. Souza et al., "Provenance data in the machine learning lifecycle in computational science and engineering," p. 10, Oct. 2019. [Online]. Available: https://www.researchgate.net/publication/336410355
[8] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn, "Digital Twin in manufacturing: A categorical literature review and classification," IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022, 2018. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S2405896318316021
[9] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data Preprocessing for Supervised Leaning," vol. 1, no. 12, p. 6, 2007.
[10] J. Cheney, S. Chong, N. Foster, M. Seltzer, and S. Vansummeren, "Provenance: A future history," Oct. 2009, pp. 957–964. [Online]. Available: https://www.researchgate.net/publication/221320910
[11] I. Barclay, A. D. Preece, I. J. Taylor, and D. C. Verma, "Quantifying transparency of machine learning systems through analysis of contributions," CoRR, vol. abs/1907.03483, 2019. [Online]. Available: http://arxiv.org/abs/1907.03483
[12] R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, R. Cerqueira, and M. A. S. Netto, "Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering," arXiv:1910.04223 [cs], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.04223
[13] H. Miao, A. Li, L. S. Davis, and A. Deshpande, "Towards Unified Data and Lifecycle Management for Deep Learning," in 2017 IEEE 33rd International Conference on Data Engineering (ICDE). San Diego, CA, USA: IEEE, Apr. 2017, pp. 571–582. [Online]. Available: http://ieeexplore.ieee.org/document/7930008/
[14] S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert, "Automatically tracking metadata and provenance of machine learning experiments," 2017. [Online]. Available: http://learningsys.org/nips17/assets/papers/paper 13.pdf
[15] "Decentralized Identifiers (DIDs) v1.0," W3C Working Draft, 22 June 2020. [Online]. Available: https://www.w3.org/TR/did-core/
[16] C. Allen, A. Brock, V. Buterin, J. Callas, D. Dorje, C. Lundkvist, P. Kravchenko, J. Nelson, D. Reed, M. Sabadello, G. Slepak, N. Thorp, and H. T. Wood, "Decentralized public key infrastructure," Dec. 2015. [Online]. Available: https://github.com/WebOfTrustInfo/rwot1-sf/blob/master/final-documents/dpki.pdf
[17] ConsenSys, "DID method ethr specification v3.0," 2020. [Online]. Available: https://github.com/decentralized-identity/ethr-did-resolver/tree/3.0.0
[18] D. Alvarez-Coello, B. Klotz, D. Wilms, S. Fejji, J. M. Gomez, and R. Troncy, "Modeling dangerous driving events based on in-vehicle data using Random Forest and Recurrent Neural Network," in 2019 IEEE Intelligent Vehicles Symposium (IV). Paris, France: IEEE, Jun. 2019, pp. 165–170. [Online]. Available: https://ieeexplore.ieee.org/document/8814069/
[19] D. Alvarez-Coello, D. Wilms, A. Bekan, and J. Marx Gomez, "Towards a Data-Centric Architecture in the Automotive Industry," in International Conference on ENTERprise Information Systems (CENTERIS). Algarve, Portugal: Elsevier, Oct. 2020, accepted.