2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring) | 978-1-7281-8964-2/20/$31.00 ©2021 IEEE | DOI: 10.1109/VTC2021-Spring51267.2021.9448697

Data Provenance in Vehicle Data Chains
Daniel Wilms
Research Engineer, BMW Technology Office Israel Ltd
daniel.wilms@bmwtechoffice.co.il

Dr. Carsten Stoecker
Founder and CEO, Spherity GmbH
carsten.stoecker@spherity.com

Dr. Juan Caballero
Research, Spherity GmbH; Decentralized Identity Fdn
juan.caballero@spherity.com
Abstract—With almost every new vehicle being connected, the
importance of vehicle data is growing rapidly. Many mobility
applications rely on the fusion of data coming from heterogeneous
data sources, like vehicle and “smart-city” data or process data
generated by systems out of their control. This external data
determines much about the behaviour of the relying applications:
it impacts the reliability, security and overall quality of the
application’s input data and ultimately of the application itself.
Hence, knowledge about the provenance of that data is a critical
component in any data-driven system. The secure traceability of
the data handling along the entire processing chain, which passes
through various distinct systems, is critical for the detection
and avoidance of misuse and manipulation. In this paper, we
introduce a mechanism for establishing secure data provenance
in real time, demonstrating an exemplary use-case based on a
machine learning model that detects dangerous driving situations.
We show with our approach based on W3C decentralized identity
standards that data provenance in closed data systems can be
effectively achieved using technical standards designed for an
open data approach.
I. INTRODUCTION
Driven by technological innovation and organic ecosystem growth, mobility value chains are significantly changing from monolithic and closed systems to distributed, open
ones [1] [2]. Data flows are increasingly defined dynamically
and stretch across multiple organizational boundaries and even
legal jurisdictions with diverse rules and regulations [3]. The
trustworthiness and accuracy of output data (such as that of in-car
sensors) generated along distributed digital mobility value
chains is of increasing importance for safety and reliability;
this importance can only increase as Machine Learning (ML)
systems grow more central to mobility systems [4].
The growing use of data-producing and/or Internet of Things
(IoT) devices across every industry – including logistics,
manufacturing, and mobility – is ushering in an era of data
abundance. The trend towards processing open data in the
mobility ecosystem in particular makes urgent the question of
how to ensure the validity of the source data, and how to
ensure the quality and the accuracy of the output data as well.
Furthermore, retroactively demoting or forgetting data from
sources proven to be unreliable remains an elusive capability.
Mobility systems have a very low tolerance for fraud and
abuse, as the impact of fraudulent data in safety-critical
features can have immediate real-world impact [5]. Upcoming
regulation around automotive cyber-security underlines this
importance [6].
Besides these security considerations, ensuring the privacy
of the individuals when processing their data is also crucial.
Understanding the data flow and providing proof that it is
treated carefully and ethically is essential. Another important
aspect is the quality assurance of the data: if customers have
the option to forward on data to third party applications, how
can they forward along metadata and trust ratings necessary for
safety and benchmarking in those new contexts? We believe
that the required quality can only be achieved through full transparency
and a deeper understanding of the data and the way it is processed,
so that flaws in the data pipeline can be detected and
mitigated. Businesses must know the origin and risks
of data from different sources before using them. Businesses
need highly automated and verifiable instruments for assessing
the provenance of data vital for ML applications [7].
This paper is organized as follows: the remainder of
section I will further detail the requirements for data provenance in an automotive context. section II will discuss related
work across automotive, ML, and data-processing spheres. We
introduce our approach in section III and the details and results
of our implementation in section IV. Finally, we provide our
conclusion in section V.
Fig. 1. Architecture overview: data flow from the Data Producer to the Data
Consumer and the provenance flow in the opposite direction.
Data provenance in automotive
An effective data provenance solution has to be applicable to
multiple layers of a typical data processing flow in automotive
Authorized licensed use limited to: UNIVERSITY OF SOUTHAMPTON. Downloaded on January 04,2022 at 13:45:13 UTC from IEEE Xplore. Restrictions apply.
contexts. A typical high-level architecture of such a flow is
shown in Figure 1. The data processing flow starts with a
Data Producer, typically an Electronic Control Unit (ECU),
which transports a signal over a system bus. The signal is
received and “edge-processed” within the vehicle by a Data
Collector. The first pre-processing of the signal (e.g., adding
meta information, doing some pre-calculation) is done
here, before the signal is transported to a Digital representation,
which defines the incoming data of the physical vehicle
through a digital model, and offers it to applications as a
digital shadow or even as a digital twin of the vehicle [8].
From there, we assume an optional data fusion with External
Data Sources (e.g. weather, traffic information, etc.). This data
is then consumed by Data Processors, which in turn create
new data of interest for an external Data Consumer. Data
Processors perform at least one data transformation. In the case
of ML, at least two data transformations have to be done: pre-processing and applying the actual model [9].
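The flow described above can be sketched as a chain of transformation stages that each attach provenance metadata to their output. All stage names, field names, and values here are illustrative assumptions, not the paper's implementation:

```python
import time

def with_provenance(stage_name, transform):
    """Wrap a transformation so its output records which stage
    produced it and which stage's output it consumed (illustrative)."""
    def run(record):
        out = transform(record["data"])
        return {
            "data": out,
            "meta": {
                "stage": stage_name,
                "timestamp": time.time(),
                "derived_from": record["meta"]["stage"] if "meta" in record else None,
            },
        }
    return run

# Illustrative stages of the Figure 1 flow
collect = with_provenance("DataCollector", lambda s: {"signal": s, "unit": "m/s^2"})
represent = with_provenance("DigitalRepresentation", lambda d: {**d, "vehicle": "vin-123"})
process = with_provenance("DataProcessor", lambda d: {"label": abs(d["signal"]) > 4.0})

record = {"data": 5.2, "meta": {"stage": "ECU"}}
for stage in (collect, represent, process):
    record = stage(record)
print(record["data"])            # {'label': True}
print(record["meta"]["stage"])   # DataProcessor
```

Each stage's output carries a `derived_from` pointer, so even this toy pipeline lets a consumer walk back from the final label to the ECU.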
In this paper, we propose and detail one mechanism for
establishing data provenance in real-time along a data chain
that sources driving event data and processes it into an
ML label. When the data provenance of a given dangerous-driving machine-learning label is known, a scoring model
can be applied to it that calculates the risks of consuming
this data label for system control or responsible decision-making. This example, chosen for clarity and simplicity, is
nevertheless applicable much more widely to any digital data
chain. We feel that, with time, machine-learning labels will
benefit from some degree of rating and scoring to be used
safely, legally, and/or responsibly, balancing privacy and security requirements appropriately. We support standardization
on decentralized identity primitives that make these kinds of
rating and scoring systems more portable and interoperable,
particularly in architectures where cryptographic agility can
be incorporated to maximize forward compatibility.
II. RELATED WORK
At least since ML started to go mainstream, the provenance
issues of big data have been almost universally acknowledged,
although it is less a solvable problem than a definitional
impasse. How can one standardize the capture of data from
improvised sources, which will be structured post-facto by the
ML process? Some would say that the writing has been on the
wall since at least 2009, when the authors of “Provenance: a
Future History” pointed out that the problem would undermine
ML research until, in retrospect, it would look like a glaring
problem overlooked all along [10]. Our intention in this section
is to position our methods in a wide field of complementary
approaches which all support and further the broader aim of
improving the monitoring, refinement, and reputation capabilities of ML pipelines.
Decentralized identity and decentralized PKI offer scaffolding and anchor points for accountable and traceable provenance. At the same time, a significant amount of standardization of the data exchange and integration of flexible and reflexive data capture is crucial. The complexity of the semantic
and data-capture work needed to take full advantage of that
scaffolding is not to be understated. In fact, pride of place
is given in the aforementioned 2009 provenance manifesto to
semantics and expressive data, which proved influential on the
development of many big-data-oriented initiatives within the
W3C and the semantic-web community. The system built on
top of ML and PROV schemata elaborated on by Souza et al.
in 2019 is a good example of how such cross-silo semantic
capture could be tracked and accounted for in an ML context
[7].
How to quantify and benchmark risk scoring is never a
simple matter in ML, and some recent work has tried to
address the quantification of provenance scoring specific to
contexts analogous to those described. Barclay et al. (2019),
for example, specifically address shifting ethical standards
and transparency vis-à-vis regulatory and ethical scrutiny. It
is particularly important to recognize that provenance must
always be linked to dynamic rather than static valuations and
rubrics, as regulatory and liability frameworks will likely take
decades to stabilize internationally [11].
Other related work seeks to incorporate risk scoring
throughout the training process [12], [13], expanding the scope of
the preceding efforts. Adjusting the
ML training methodology to be more reflexive throughout on
the basis of such scoring has been a major focus of efforts
at Amazon’s ML design division, and presumably will be a
feature of production-grade ML in the future, if not a core
feature of off-the-shelf product offerings for ML training and
for the lifecycle management thereof [14].
The contribution of this paper is to make an architecture proposal which achieves data transparency and data provenance
by enriching the meta information of the data itself. This is
done right where the creation or transformation of the data
takes place. We focus in this paper on the automotive sector,
but we believe that this contribution can have an impact on
achieving reliable risk scoring of ML algorithms in general.
III. PROPOSED METHODS
In order to realize data provenance in an automotive data
processing chain, the method we propose in this paper
consists of two interdependent elements:
1) Identification of entities which create data or perform
data transformations in the data processing chain.
2) Introduction of encrypted data structures for representing
Distributed Automotive Data (DAD).
To render the data provenance meaningful, it is crucial to know in which way the data has
been transformed at each step from the Data Producer to
the Data Consumer. In particular, it has to be transparent
which exact digital entity or algorithm has carried out each
transformation on the data and when. The transformation can
take place in one or many centralized or decentralized mobility
systems, so the identification of all the entities is greatly
simplified if the identities are maximally portable and not
administered within their respective closed systems. Therefore,
we propose to adopt the decentralized identifier (DID) standard
as an open, interoperable addressing scheme and to establish
mechanisms for resolving DIDs across multiple centralized
and/or decentralized mobility systems [15]. Even if these
identifier resolution schemes are not used for discovery or
communications within a given system, they are invaluable for
reconstructing or tracing data trails that cross many systems,
whether in real-time or forensically. Figure 2 shows which
identities in the data processing flow introduced in Figure 1
get provisioned with DIDs in our architecture, and thus get
cryptographic data-signing capabilities.
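Provisioning an entity with such a signing identity can be sketched as follows. The SHA-256-based identifier derivation and in-memory registry are dependency-free stand-ins chosen for this sketch: real did:ethr identifiers are derived from an Ethereum keypair, and the registry lives on-chain.

```python
import secrets
import hashlib

REGISTRY = {}  # stand-in for the immutable on-chain registry

def provision(entity_name):
    """Give a pipeline entity a key and a DID (illustrative only).
    A real did:ethr address is derived from the *public* key via
    Keccak-256; this SHA-256-of-private-key derivation is just a
    stand-in to keep the sketch dependency-free."""
    private_key = secrets.token_bytes(32)
    address = hashlib.sha256(private_key).hexdigest()[:40]
    did = f"did:ethr:0x{address}"
    REGISTRY[did] = {"entity": entity_name}  # discovery metadata
    return did, private_key

did, key = provision("DataCollector")
print(did[:11])  # did:ethr:0x
```

Once every Data Collector, algorithm, and external source holds such an identifier, every signature in the chain can be attributed to a resolvable entity.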
Fig. 2. Architecture with signing identities marked. Data provenance flow
on the left with scoring on top, realized by the provenance of the data the
algorithm relies on.

Fig. 3. Data structure as chained transformation. Each DAD includes the
DID of the transforming entity and points to the previous DAD(s), which
were used during the transformation.

DIDs were originally designed to function as identifiers
for individual people, but can readily be extended to any
entity or resource. They are derived from public/private key
pairs, registered in an immutable registry for discovery purposes. Spherity and other companies pioneering the decentralized identity technology sector use innovative cryptographic
solutions for secure key management such as private-key
sharding (multi-party computation), on-device/secure-enclave
biometrics, and HSMs to make signatures and data trails more
secure and non-repudiable [16]. The field is fast-moving and
significant progress is being made in expanding the options
for high-security use-cases and in building forward-secure
cryptographic agility into systems to accommodate these new
options.

Each domain or namespace for DIDs corresponds to a
method of encoding and decoding, making DIDs resolvable
like domain names relative to a method-specific but
interoperable resolution infrastructure [15]. For this project,
Spherity registered its DIDs on the Ethereum blockchain via
the standard W3C DID method ethr [17]. In this method, the
public key of any valid Ethereum keypair (i.e., an “Ethereum
address”) can be used as the identifier string within the
namespace defined by the ethr prefix. Thus, our DIDs look
like this:

did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35
As you can see in the 'iss' and payload['inputDataDids']
parameters of the sample data label (Figure 5), all identities are
expressed as DID references. Essentially, each transformation
appends a new link in a chain of linked and signed versions:
each data point can be updated, and each updated data point is
both signed by the transformer and linked back to its previous
state. By applying this to each transformation from the Data
Producer to the Data Consumer, Figure 3 shows the resulting
data chain, which enables the system to trace back how and
where the data has been transformed, which algorithm
was responsible for each transformation, and how the incoming
data at a transformation step influenced the outcome. In this
way, data provenance is ensured throughout the
entire data processing chain: the outcome of the algorithm
can be traced back to the Data Producer, giving greater
confidence in the algorithm's outcomes.
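A minimal sketch of such a chained DAD structure follows. The HMAC is a symmetric stand-in for the real asymmetric DID signatures, and the field names are hypothetical, loosely echoing the 'iss' and 'inputDataDids' fields of the sample label; none of this is the paper's actual schema.

```python
import hashlib
import hmac
import json

# Hypothetical signing keys, keyed by DID. In the real architecture
# these would be the private keys behind each entity's did:ethr
# identifier, used for asymmetric signatures.
KEYS = {"did:ethr:0xproducer": b"producer-key", "did:ethr:0xmodel": b"model-key"}

def dad_hash(dad):
    """Content hash used to link a DAD back to its predecessors."""
    return hashlib.sha256(json.dumps(dad, sort_keys=True).encode()).hexdigest()

def make_dad(issuer_did, payload, previous_dads=()):
    """Create a DAD: payload plus issuer DID plus links to the DADs
    used during this transformation, signed by the issuer."""
    body = {
        "iss": issuer_did,
        "payload": payload,
        "inputDataDads": [dad_hash(p) for p in previous_dads],
    }
    sig = hmac.new(KEYS[issuer_did], json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {**body, "sig": sig}

def verify(dad):
    body = {k: v for k, v in dad.items() if k != "sig"}
    expected = hmac.new(KEYS[dad["iss"]], json.dumps(body, sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, dad["sig"])

raw = make_dad("did:ethr:0xproducer", {"speed": 31.4})
label = make_dad("did:ethr:0xmodel", {"dangerous": True}, previous_dads=[raw])
print(verify(raw) and verify(label))  # True
```

Because each DAD commits to the hash of its inputs, tampering with any link invalidates both its own signature and every downstream reference to it.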
IV. RESULTS
In this paper we have presented the implementation of a
verifiable data chain for a supervised learning scenario with
an RNN algorithm detecting dangerous driving scenarios as
shown in [18]. The referenced scenario detects situations of
dangerous driving on an incoming stream of vehicle data and
classifies the maneuver (e.g. left turn, right turn, acceleration,
etc.). For every timestamp, the algorithm predicts a result
based on the previous ten frames, or data points. The input data
consists of categorical (e.g., gear, brakes pressed) and continuous signals (e.g., position, lateral and longitudinal acceleration,
etc.).
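The ten-frame windowing described above can be sketched as follows; the signal names are illustrative examples, not the model's actual feature set.

```python
WINDOW = 10  # the model predicts each timestamp from the 10 preceding frames

def make_windows(frames, window=WINDOW):
    """Yield (feature_window, target_index) pairs: the prediction at
    index t uses frames t-window .. t-1."""
    for t in range(window, len(frames)):
        yield frames[t - window:t], t

# Each frame mixes categorical and continuous signals (illustrative names)
frames = [{"gear": 3, "brake": 0, "lat_acc": 0.1 * i, "lon_acc": -0.2}
          for i in range(12)]

windows = list(make_windows(frames))
print(len(windows))        # 2: twelve frames allow predictions at t=10 and t=11
print(len(windows[0][0]))  # 10 frames per window
```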
For our solution presented here, we used a cloud environment and historic dangerous driving event data sets that were
used to train the RNN model. As shown in Figure 4, we used
the historical data to simulate a live vehicle data stream. The
historical data set contains, for each point in time, an array of
data points, which are the relevant features for the RNN model.
Fig. 4. Overview of each step taken in the prototype implementation. The
application is based on a data-stream of historic data. The data is processed
and consumed by an algorithm. The proposed provenance flow is applied.
Each array is sent to a component responsible for the data
handling and processing, which creates a DAD for every data point.
Then, a feature vector of 10 entries is prepared for the RNN
model. The outcome of the RNN Model - the classification
of the situation as dangerous, the type of maneuver and the
confidence - is then stored as another DAD. It refers to the
DIDs of the input entries, which are included in the DAD output
of the final result (see Figure 5). To show how the
application orders a stream of source events over time, we also
present the incoming data on a map, linking backwards to the
source DIDs and forwards to the resulting DADs (see Figure 6).
This way, the cryptographic data structure provides instruments for end-to-end verifiability that enable us to prove the
integrity of the data chain, identify all the entities involved in
the creation of a specific machine learning label, and request,
in turn, life-cycle credentials from these entities to feed a
scoring model for the respective machine learning label.
The end-to-end verifiability of entity attributes and quality
data about the entities involved in cyber-physical value chains
would allow us to build algorithms that accurately score
machine learning output data. Any consumer of these output
data could then assess their trustworthiness prior to processing
them in the consumer’s application.
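As an illustration of such a consumer-side assessment, here is a minimal scoring sketch. The trust ratings and the weakest-link aggregation rule are assumptions made for this example, not the paper's scoring model.

```python
# Hypothetical per-entity trust ratings, keyed by DID, as a verifier
# might accumulate from life-cycle credentials.
TRUST = {
    "did:ethr:0xecu": 0.95,
    "did:ethr:0xcollector": 0.90,
    "did:ethr:0xrnn-model": 0.80,
}

def label_risk_score(provenance_dids):
    """Score a machine-learning label by the least-trusted entity in
    its provenance chain; an unknown entity pulls the score to zero."""
    return min(TRUST.get(did, 0.0) for did in provenance_dids)

chain = ["did:ethr:0xecu", "did:ethr:0xcollector", "did:ethr:0xrnn-model"]
print(label_risk_score(chain))                           # 0.8
print(label_risk_score(chain + ["did:ethr:0xunknown"]))  # 0.0
```

The weakest-link rule is deliberately conservative; a production scorer might instead weight entities by role or recency, which the verifiable chain equally supports.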
In a second iteration of the project, we would focus on using
the proposed methods in a real-world scenario, e.g. using live
data streams from a fleet of real vehicles integrated with this
validated data-chain infrastructure.
Fig. 5. Output VC with provenance pointing back to DIDs of simulated
telemetry devices (encrypted data and cryptographic signature cropped out
for clarity)
V. CONCLUSION AND OUTLOOK
The growing complexity of data chains and the increasing
number of actors processing their data goes beyond what can
be expected of manual quality assurance and maintenance
processes. A transparent and reliable evaluation of the risk
and quality of data inference methodologies is essential, and
this requires scaffolding for such accounting.

Fig. 6. Web interface of the application presenting the results. On top there
is the visualization of the algorithm output, in the middle the raw values with
corresponding DIDs and below the algorithm DID, the raw value DIDs, the
algorithm output and the signature.

Data provenance
about the entities involved in a data processing chain and
the resulting machine learning labels (using DIDs, VCs, and
DLT to ensure uniform metadata) provides a foundation for
reliable risk scoring. Harmonized and standardized data (and
more importantly, metadata) is the key to AI explainability,
whether managed in traditional top-down ways, by new forms
of reputation, or by new forms of actuarial accounting and trustworthiness ratings. In any of these cases, verifiable credentials
about identity subjects - i.e., vehicles, pre-processing and ML
algorithms - could be consumed by algorithms at the heart of
scoring models that both assess risk and refine the labeling
process and its outputs.
Overall, we believe many different costs can be reduced significantly: those incurred by poor data quality, those resulting
from poorly-understood data flows from which inferences are
drawn by the customer or the vehicle context, and those arising
from the risk of data manipulation in any data system, whether open or closed.
New economic opportunities internal to the data marketplace
will open up, we believe, as the minimum and average level
of data quality in marketplaces rises.
As shown in Alvarez-Coello et al. [19], it is beneficial for
the industry to move towards a data-centric architecture, which
can drive major gains in the stability and reliability of data-processing flows. This stability and reliability are necessary to
maintain safety and innovation, and our solution introduced
here contributes directly to these ends.
This approach also has significant indirect benefits
for the quality-assurance and legal aspects of these systems.
The kinds of discovery and forensic audits required by both
routine regulatory compliance and dispute resolution could
be executed in a much more efficient way once entire data
processing pipelines become verifiable to any auditor with the
right consents or credentials. This also fosters innovation and
business-process agility, as individual actors (even non-human
ones!) would be better able to assess the risks of relying on
data sets, data sources, and algorithms dynamically.
We have shown how such data provenance can be applied
to data streams in an automotive context. The scenario was
based on historical data and simulated a typical live data chain.
Next steps would include applying the solution to a situation
in which a real vehicle serves as the actual Data Producer,
extending the concept from the restricted environment shown
in this work to a real world application with all different layers
of the data chain shown in Figure 1.
REFERENCES
[1] P. Yadav, S. Hassan, A. Ojo, and E. Curry, “The role of open data
in driving sustainable mobility in nine smart cities,” Jun. 2017, pp.
1248–1263.
[2] “Driving Positive Outcomes through Open Data Solutions for
Mobility,” Dell, Lero, Forum For the Future, Open DataSoft, City
of Palo Alto, Tech. Rep., Feb. 2018. [Online]. Available: https://www.dell.com/learn/pa/en/pacorp1/
corporate~corp-comm~en/documents~mobility-open-data.pdf
[3] N. Gruschka, V. Mavroeidis, K. Vishi, and M. Jensen, “Privacy
Issues and Data Protection in Big Data: A Case Study Analysis
under GDPR,” arXiv:1811.08531 [cs], Nov. 2018, arXiv: 1811.08531.
[Online]. Available: http://arxiv.org/abs/1811.08531
[4] B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and
C. Patz, “Challenges in applying the ISO 26262 for driver assistance
systems,” Tagung Fahrerassistenz, p. 23, 2012.
[5] P. Koopman and M. Wagner, “Challenges in Autonomous Vehicle
Testing and Validation,” SAE International Journal of Transportation
Safety, vol. 4, no. 1, pp. 15–24, Apr. 2016. [Online]. Available: https://
www.sae.org/content/2016-01-0128/
[6] O. Burkacky, J. Deichmann, B. Klein, K. Pototzky, and G. Scherf,
“Cybersecurity in automotive,” McKinsey, Mar. 2020. [Online].
Available: https://www.gsaglobal.org/wp-content/uploads/2020/
03/Cybersecurity-in-automotive-Mastering-the-challenge.pdf
[7] R. Souza et al., “Provenance data in the machine learning lifecycle
in computational science and engineering,” p. 10, Oct. 2019. [Online].
Available: https://www.researchgate.net/publication/336410355
[8] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn,
“Digital Twin in manufacturing: A categorical literature review and
classification,” IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022,
2018. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/
S2405896318316021
[9] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, “Data Preprocessing for Supervised Learning,” vol. 1, no. 12, p. 6, 2007.
[10] J. Cheney, S. Chong, N. Foster, M. Seltzer, and S. Vansummeren,
“Provenance: A future history,” 10 2009, pp. 957–964. [Online].
Available: https://www.researchgate.net/publication/221320910
[11] I. Barclay, A. D. Preece, I. J. Taylor, and D. C. Verma, “Quantifying
transparency of machine learning systems through analysis of
contributions,” CoRR, vol. abs/1907.03483, 2019. [Online]. Available:
http://arxiv.org/abs/1907.03483
[12] R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago,
R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez,
M. Mattoso, R. Cerqueira, and M. A. S. Netto, “Provenance Data
in the Machine Learning Lifecycle in Computational Science and
Engineering,” arXiv:1910.04223 [cs], Oct. 2019, arXiv: 1910.04223.
[Online]. Available: http://arxiv.org/abs/1910.04223
[13] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Towards Unified Data
and Lifecycle Management for Deep Learning,” in 2017 IEEE 33rd
International Conference on Data Engineering (ICDE). San Diego,
CA, USA: IEEE, Apr. 2017, pp. 571–582. [Online]. Available: http://
ieeexplore.ieee.org/document/7930008/
[14] S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert,
“Automatically tracking metadata and provenance of machine learning
experiments,” 2017. [Online]. Available: http://learningsys.org/nips17/
assets/papers/paper 13.pdf
[15] “Decentralized identifiers (dids) v1.0,” W3C Working Draft 22 June
2020. [Online]. Available: https://www.w3.org/TR/did-core/
[16] C. Allen, A. Brock, V. Buterin, J. Callas, D. Dorje, C. Lundkvist,
P. Kravchenko, J. Nelson, D. Reed, M. Sabadello, G. Slepak, N. Thorp,
and H. T. Wood, “Decentralized public key infrastructure,” Dec. 2015.
[Online]. Available: https://github.com/WebOfTrustInfo/rwot1-sf/blob/
master/final-documents/dpki.pdf
[17] ConsenSys, “Did method ethr specification v3.0,” 2020. [Online].
Available: https://github.com/decentralized-identity/ethr-did-resolver/
tree/3.0.0
[18] D. Alvarez-Coello, B. Klotz, D. Wilms, S. Fejji, J. M. Gomez, and
R. Troncy, “Modeling dangerous driving events based on in-vehicle data
using Random Forest and Recurrent Neural Network,” in 2019 IEEE
Intelligent Vehicles Symposium (IV). Paris, France: IEEE, Jun. 2019,
pp. 165–170. [Online]. Available: https://ieeexplore.ieee.org/document/
8814069/
[19] D. Alvarez-Coello, D. Wilms, A. Bekan, and J. Marx Gomez, “Towards
a Data-Centric Architecture in the Automotive Industry,” in International Conference on ENTERprise Information Systems (CENTERIS).
Algarve, Portugal: Elsevier, Oct. 2020, accepted.