Implementing a Security Architecture for Safety-Critical Railway Infrastructure

Michael Eckel∗, Don Kuzhiyelil†, Christoph Krauß‡, Maria Zhdanova∗, Stefan Katzenbeisser§, Jasmin Cosic¶, Matthias Drodt¶, Jean-Jacques Pitrolle†

∗ Fraunhofer Institute for Secure Information Technology SIT, Darmstadt, Germany, {firstname.lastname}@sit.fraunhofer.de
† SYSGO GmbH, Klein-Winternheim, Germany, {firstname.lastname}@sysgo.com
‡ Darmstadt University of Applied Sciences, Darmstadt, Germany, christoph.krauss@h-da.de
§ University of Passau, Passau, Germany, stefan.katzenbeisser@uni-passau.de
¶ DB Netz AG, Frankfurt am Main, Germany, {firstname.lastname}@deutschebahn.com

Abstract—The digitalization of safety-critical railroad infrastructure enables new types of attacks. This increases the need to integrate Information Technology (IT) security measures into railroad systems. For that purpose, we rely on a security architecture, developed in our previous work, for a railway object controller that controls field elements. Our architecture enables the integration of security mechanisms into a safety-certified railway system. In this paper, we demonstrate the practical feasibility of our architecture by using a Trusted Platform Module (TPM) 2.0 and a Multiple Independent Levels of Safety and Security (MILS) Separation Kernel (SK) for our implementation. Our evaluation includes a test bed and shows how certification and homologation can be achieved.

Index Terms—Railway, Security, Safety, TPM, MILS

I. INTRODUCTION

The safety-critical railway infrastructure is currently undergoing a digitalization process. The Operational Technology (OT) for monitoring and controlling railway Control-Command and Signaling (CCS) systems is changing with the use of Commercial Off-The-Shelf (COTS) products and IP-based communications, as well as the increase in communications between systems.
Previously used closed and manufacturer-specific systems, which are typically proprietary, monolithic, and expensive, are increasingly being replaced by standard hardware and software technologies. As part of the NeuPro project, Deutsche Bahn (DB) in Germany plans to digitalize its infrastructure by 2037 [1]. This implies a step-by-step process with several stages: from equipping all trains with European Train Control System (ETCS) equipment and building a new high-speed rail network to transitioning to a pure ETCS system in the future. This change can be observed not only in Germany at DB, but also in the European Union (EU) and worldwide. For example, further development of railway systems and implementation of the necessary safety and security logic is on the agenda of the EULYNX Cluster, a European initiative to standardize interfaces and elements of signaling systems (cf. https://eulynx.eu/).

A typical railway signaling architecture consists of two main layers: the field element layer, which contains field elements and their Object Controllers (OCs), and the interlocking layer with Maintenance and Data Management (MDM) and the Interlocking System (ILS). Field elements are sensors and actuators, such as railroad signals, gates, and switches, as well as train detection systems (TDSs). An OC usually controls exactly one field element and provides an interface to the ILS by translating digital interlocking commands into electrical signals that steer the field element and by reporting the element’s state back to the ILS. The ILS is responsible for the safe operation of trains, i.e., for determining technical dependencies for train routes and sending commands to the appropriate field elements. If an error or a fault occurs in a field element, the ILS switches to the safe state (fail-safe) and blocks the route until the dependency is restored.
The MDM is in charge of providing software updates for the components in the interlocking and field element layers, logging diagnostic data and potential security events, and time synchronization. An Operation Control Center (OCC) centralizes supervision of ILSs as an overarching layer. All components down to the OCs are connected via the so-called railroad WAN, i.e., an Ethernet- and IP-based communication network. Railroad operations that mandate resilient transport and reliable message delivery may use the Rail Safe Transport Application (RaSTA) protocol [2].

The first step in the realization of the NeuPro project is the digitalization of OCs. OCs play a crucial role in translating between digital commands received from the ILS and analog control signals for field elements. Since the DB railroad network consists of more than 3,300 ILSs and more than 200,000 field elements, the integration of IT into control processes is aimed at enabling more efficient and improved railroad operations. However, digitalization also increases the risk of IT attacks, making it imperative to jointly examine safety and security [3], [4].

Integrating security mechanisms into a safety-certified OT system without losing certification is a major challenge [5]. According to EN 50128 [6], all software components must be certified to the highest Safety Integrity Level (SIL), unless freedom of interference can be provided. In commercial deployments, security applications are often developed and verified using less rigorous methods than is the case for high SIL applications. The use of external open source libraries, such as OpenSSL, is common for the development of security applications. In addition, it is essential to regularly update security applications to fix vulnerabilities or introduce new, more secure cryptographic algorithms.
For these reasons, it is not desirable for a security application to be certified with the highest SIL, because this would mean repeating the long and costly certification process for each software update.

In previous work, we proposed a security architecture for safety-critical railway infrastructure, enabling the joint operation of safety and security mechanisms on a single hardware platform [7], [8]. The architecture consists of a hardware platform with a Trusted Platform Module (TPM) 2.0, the Multiple Independent Levels of Safety and Security (MILS) Separation Kernel (SK), and various security applications. To facilitate the use of the TPM by different security applications, we introduce a TPM Resource Manager (RM) in our architecture that manages concurrent access to the resource-limited TPM. In this paper, we describe 1) how this architecture can be implemented on an OC as a solution for replacing legacy OCs in the infrastructure of DB, 2) how such an architecture can receive certification and homologation¹, and 3) how we evaluate it, including a test environment.

The remainder of the paper is organized as follows. In Section II, we briefly describe our security architecture and in Section III the requirements it should address. The implementation is described in Section IV. Our evaluation is presented in Section V. Section VI describes how this architecture can be certified and approved. Finally, we conclude the paper and identify future work in Section VII.

¹ Homologation is the name for the approval process of railroad vehicles and railroad lines in accordance with a railroad commissioning approval regulation.

II. BACKGROUND AND PREVIOUS WORK

This paper builds on our previous work in [7], [8], where we analyze requirements and introduce a security architecture for a railway OC that enables security mechanisms to be run on a single hardware platform with a safety application. The following security goals guide the architectural design [7]:
• Availability: The system should be able to provide the expected functionality and data at any point in time.
• Integrity: The system should ensure hardware, software, and data integrity for its components and interfaces.
• Authenticity: The system should verify that any data, especially software packages and configuration updates as well as network communications, have a trusted origin.
• Confidentiality: Security information such as access credentials and cryptographic keys must be kept confidential.
• Accountability: Any change in the system, its components, or interfaces should be traceable to an authorized entity.
• Non-repudiation: An authorized entity should not be able to deny any performed change.
• Auditability: Security events need to be logged.

Figure 1 shows our proposed architecture. It consists of three main components: a hardware platform with a hardware security module in the form of a TPM 2.0, a MILS SK, and various security applications. The TPM serves as a security anchor and enables, among other things, secure storage of cryptographic keys (e.g., to secure communication connections), measured boot to record software executions in a tamper-evident manner, and remote attestation to allow authorized external parties to detect tampering with the system software. The MILS SK allows the joint operation of safety and security applications on the same hardware. The SK controls critical hardware interfaces and ensures non-interference and resource availability for a safety application. In our case, the safety application is a digital OC for NeuPro. Security applications are, e.g., anomaly detection methods that detect attacks over the network, secure software update protocols, or a classic firewall. Possible applications are not limited to these examples; the integration of further safety- or security-relevant sensors located on the tracks can also be enabled this way, as shown in the study [9].

III. SECURITY REQUIREMENTS

In [8], we analyzed relevant threats by using the requirements engineering process of DIN VDE V 0831-104 [3] and IEC 62443. We identified 14 security requirements for the proposed architecture that are summarized in Table I. Requirements regarding key storage and key protection can best be fulfilled with hardware support. We choose the TPM for hardware cryptographic support since it provides secure key generation, storage, and protection. Further, the TPM is a low-budget device and it is hardened against physical attack. With the existing TPM ecosystem, we have open-source software available, including Unified Extensible Firmware Interface (UEFI), bootloader, Operating System (OS) kernel, and a whole TPM middleware: the TPM Software Stack (TSS). We assume physical security to be ensured (e.g., using housing with burglar alarm), so requirement R1 is out of scope for this work. In Section V, we refine these requirements with regard to the chosen security technologies and discuss how our implementation fulfills them.

Fig. 1: Security Architecture for Railway Control-Command and Signaling [8]

TABLE I: Security requirements for railway CCS architecture
R1   The system shall detect unauthorized physical access to its subsystems and/or prevent relevant exploitation of physical access
R2   The system shall not allow the compromise of a communication key
R3   The system shall not disclose classified or confidential data to an illegitimate user
R4   The system shall exclude compromised endpoints from communication
R5   The system shall not use insecure transfer methods
R6   The system shall not allow any unauthorized user to access an endpoint
R7   The system shall not allow unauthorized and unauthenticated communication between endpoints
R8   The system shall not violate the runtime behavior requirements
R9   The system shall allow for the updating of security mechanisms, credentials, and configurations to patch known vulnerabilities
R10  The system shall not allow the execution of unauthorized software instances
R11  The system shall maintain the transmission system requirements defined in EN 50159
R12  The system shall provide means to detect an undesirable system state change and anomalies
R13  The system shall impede that an unauthorized user can force it into one of the fall-back levels defined by the railway safety process
R14  The system shall maintain the integrity of software, firmware, configuration, and hardware

IV. IMPLEMENTATION

In this section, we describe our implementation of the security architecture for the railway OC as proposed in [7], [8]. In order to fulfill some of our requirements, we use the TPM to protect and securely store cryptographic keys. Further, we use the TPM in conjunction with measured boot to irrevocably store measurements of boot and system software in a tamper-proof manner. With remote attestation, these measurements can be verified by an external party in order to detect potentially malicious software executions. We utilize a TPM in version 2.0 for several reasons.
In contrast to TPM 1.2, it provides cryptographic agility. That is, from a specification point of view it can support multiple cryptographic algorithms instead of a fixed set. In the future, this agility may be useful to support quantum-resistant algorithms. Moreover, we use TPM 2.0 extended authorization to further protect cryptographic keys with the help of authorization policies enforced by the TPM 2.0.

In this section, we first briefly describe the architecture of our implementation (Section IV-A). Then we describe the MILS platform, which runs safety and non-safety partitions (Section IV-B). The TPM RM plays a central role in the architecture and is explained in Section IV-C. Section IV-D details the OC application and the safety communication stack with the RaSTA protocol [2]. Based on this architectural foundation, we explain the security applications measured boot, remote attestation, and secure software update in Section IV-E.

A. Implementation Architecture

Our implementation architecture is depicted in Fig. 2. The backend at the top hosts a Security Operations Center (SOC) and the MDM. The OC in the middle constitutes the main part of our architecture, featuring the MILS SK and a TPM as well as safety and security applications. The OC controls one or more field elements, such as switches or signals.

The SOC is the central point for all security-related services and protects the IT infrastructure and data from internal and external threats. It monitors, collects, analyzes, and examines all security-relevant information for anomalies. Based on this information, the SOC raises alerts and takes countermeasures to protect systems, data, and applications. In the railway context, the SOC relies on information from MDM systems.

One MDM system is responsible for managing multiple OCs. It runs the safety-critical counterpart of the OC, sends management commands to the OC safety application, and receives status information from it.
In our architecture, we extend the MDM with security aspects: anomaly detection, safety and security monitoring, secure update, and remote attestation.

The OC consists of a hardware platform, a software platform, and safety and security partitions. The hardware platform provides a Root of Trust for Measurement (RTM), a TPM, and the UEFI. The RTM is the first component that is run after the system is powered on. It is trusted implicitly and the code is (usually) realized in hardware. The RTM constitutes the first component of the Chain of Trust (COT) in the measured boot process. All software measurements are stored securely in the TPM as well as in the boot log. Further, the TPM is used for securely generating and storing cryptographic keys. The UEFI runs firmware that is TPM-aware and supports measured boot, thus extending the COT.

The software platform encompasses a measured-boot-enabled boot loader and the MILS SK PikeOS. The MILS SK comprises, among other components, the partition updater and the partition loader. The partition updater handles updating of partitions, i.e., replacing an existing partition with an updated one. The partition loader is responsible for shutting down and starting up partitions. We modify the partition loader to record partition measurements in the TPM as well as in the partition log to keep track of all loaded partitions.

The OC is controlled by a safety application based on management commands sent by the MDM over the RaSTA protocol. Safety and security applications are mapped to separate partitions provided by the MILS SK, which ensures spatial and temporal isolation. Further, security services for secure update and remote attestation are executed on the OC and communicate with the appropriate counterparts in the MDM system.

B. MILS Platform

When combining applications of different SILs on one hardware platform, we have to provide fault containment mechanisms to prevent cascading failures, i.e., a failure in one component causing failure in another one. Also, for independent certification of applications at their required assurance levels (i.e., security applications with non-SIL/QM and safety applications with SIL4), we have to prove independence between the applications. To achieve this, we use a MILS-based partitioned architecture [10] that executes applications in sandboxes, called (resource) partitions. Communication between applications is limited to explicitly defined communication channels. Partitions can be separated in space and time using resource partitioning and time partitioning. When applications are executed in separate partitions, mutual dependencies between applications are reduced to the communication via explicitly defined communication channels.

The cornerstone of this partitioned architecture is a software SK, a special type of operating system whose primary function is to provide separation of partitions as well as explicit secure communication channels. In our architecture, we use PikeOS, a certifiable SK and hypervisor [11]. In PikeOS, resource partitioning is achieved by statically assigning computing resources, such as memory, I/O and file devices, communication objects, and cores, to partitions. PikeOS ensures that during run-time, an application has guaranteed access to the resources of its partition and that these resources are not accessible from applications belonging to other partitions. To enforce resource partitioning, PikeOS relies on the Memory Management Unit (MMU) and I/O MMU to control access to main memory and I/O memory resources, respectively.

PikeOS uses time partitioning to divide CPU time among partitions and to implement time separation. It can be used to ensure that all applications receive a predefined amount of execution time and to prevent one thread from starving others, even in the case of a faulty thread. In its simplest form, PikeOS time partitioning can be used to allocate a certain CPU quota (specified as a time partition) to each partition. Here, there is a one-to-one relationship between time and resource partitions, and the partitions are separated in time. In advanced configurations, multiple partitions can share the same time partition. In this case, application threads from different partitions mapped to the same time partition are scheduled based on thread priority, and temporal interference between applications may occur.

During design time of the partitioned architecture, the system integrator estimates the computing resources, such as CPU time and RAM, that are required by applications in order to fulfill their functional and non-functional requirements (e.g., timing and throughput requirements). The system integrator then allocates the estimated set of resources to the partitions that execute the applications. Safety and security applications are mapped to partitions that are separated in space and time. Communication channels between safety and security applications are also defined at design time. They are realized using communication objects provided by the SK.

At runtime, the SK ensures that the resource allocation policy defined by the system integrator is adhered to and that the applications only affect each other via the explicitly configured communication channels. Because of this separation, failures cannot propagate from security applications to safety applications. As a result, applications can be certified independently of each other at their respective assurance levels.

Fig. 2: Implementation Architecture for Railway Control-Command and Signaling

C. TPM Resource Manager

The TPM RM is responsible for managing the limited resources of a TPM, swapping objects, sessions, and sequences in and out as needed. For applications, the RM provides a view of the TPM as if it had (virtually) unlimited resources, similar to a virtual memory manager [12].
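The swapping behavior described above can be illustrated with a small sketch. It is not the RM implementation from this work: the class name `ContextSwapper`, the least-recently-used eviction policy, and the use of plain Python values in place of TPM contexts are all assumptions made for illustration. On a real TPM, swapping out and in would correspond to commands such as TPM2_ContextSave, TPM2_FlushContext, and TPM2_ContextLoad.

```python
from collections import OrderedDict


class ContextSwapper:
    """Toy model of an RM that hides a small number of TPM object slots.

    Hypothetical illustration: objects are plain Python values, and the
    eviction policy (least recently used) is an assumption.
    """

    def __init__(self, tpm_slots: int):
        self.tpm_slots = tpm_slots      # objects the (simulated) TPM can hold
        self.loaded = OrderedDict()     # virtual handle -> object, LRU order
        self.saved = {}                 # virtual handle -> swapped-out context
        self._next_handle = 0

    def _make_room(self) -> None:
        # Swap out least recently used objects until one slot is free.
        while len(self.loaded) >= self.tpm_slots:
            victim, context = self.loaded.popitem(last=False)
            self.saved[victim] = context

    def load_new(self, obj) -> int:
        """Load a new object and return a virtual handle for the client."""
        self._make_room()
        handle = self._next_handle
        self._next_handle += 1
        self.loaded[handle] = obj
        return handle

    def use(self, handle: int):
        """Return the object for a handle, swapping it back in if needed."""
        if handle in self.loaded:
            self.loaded.move_to_end(handle)
        else:
            context = self.saved.pop(handle)  # KeyError for unknown handles
            self._make_room()
            self.loaded[handle] = context
        return self.loaded[handle]
```

From the client's point of view, every handle stays valid even though only `tpm_slots` objects are resident at any time, which is the "virtually unlimited resources" view mentioned in the text.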
The RM is part of the TSS architecture. It sits directly on top of the TPM. Applications use one of the TSS user Application Programming Interfaces (APIs) to talk to it: System API (SAPI), Enhanced System API (ESAPI), or Feature API (FAPI). The RM has multi-user support, scheduling calls from callers on a per-command basis. For applications, the RM behaves transparently, i.e., there is no difference for an application between talking to the TPM directly and talking to the RM. Context swapping as done by the RM is particularly useful in a multi-process environment where applications are unaware of each other, such as in PikeOS. In our implementation, we need a TPM RM since we have three partitions using the TPM: partition loader, remote attestation, and secure update.

The RM specification from the Trusted Computing Group (TCG) [12] leaves it open to implementations to define command priorities and connection privileges. The official tpm2-abrmd implementation of the RM on GitHub does not implement any restrictions or privileges [13]. Further, tpm2-abrmd depends on D-Bus, which is not available in PikeOS. For these reasons, we develop a custom RM that allows us to restrict TPM resource consumption as well as to assign TPM usage priorities to PikeOS partitions. Priorities are defined per connection, and our internal command reordering gives priority to commands from privileged PikeOS partitions, such as the partition loader.

We introduce TPM proxy interfaces for all partitions that require TPM access. TPM proxy interface applications connect internally to the TPM RM partition, providing the same interface to applications as the real TPM device driver. All proxy interfaces can be located in a single partition or each in a separate one. Fig. 3 shows a use case where each TPM proxy interface runs in its own partition. Each TPM proxy interface partition has a pair of unidirectional channels to communicate with the RM. The RM allocates memory statically for each client.
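The per-connection priorities and command reordering can be sketched as a priority queue that preserves arrival order within one priority level. This is a minimal illustration under stated assumptions, not the RM code of this work: the connection names, the priority values in `CONNECTION_PRIORITY`, and the class `CommandQueue` are hypothetical, and the command strings are only labels.

```python
import heapq
import itertools

# Hypothetical per-connection priorities (lower value = served first).
# The text only states that privileged partitions such as the partition
# loader are preferred; the concrete values are assumptions.
CONNECTION_PRIORITY = {
    "partition_loader": 0,
    "remote_attestation": 1,
    "secure_update": 1,
}
DEFAULT_PRIORITY = 9  # assumed priority for unknown connections


class CommandQueue:
    """Priority queue of pending TPM commands, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps arrival order

    def submit(self, connection: str, command: str) -> None:
        priority = CONNECTION_PRIORITY.get(connection, DEFAULT_PRIORITY)
        heapq.heappush(
            self._heap, (priority, next(self._arrival), connection, command)
        )

    def next_command(self):
        """Return (connection, command) to send to the TPM next, or None."""
        if not self._heap:
            return None
        _, _, connection, command = heapq.heappop(self._heap)
        return connection, command


if __name__ == "__main__":
    queue = CommandQueue()
    queue.submit("remote_attestation", "TPM2_Quote")
    queue.submit("secure_update", "TPM2_NV_Read")
    queue.submit("partition_loader", "TPM2_PCR_Extend")
    while (item := queue.next_command()) is not None:
        print(item)
```

The arrival counter is the design choice worth noting: without it, two commands with equal priority could be reordered, which would break the per-client command sequence.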
At compile time, the RM is configured with static parameters:
• MAX_CONNECTIONS defines the maximum number of client connections handled by the RM.
• MAX_SAVED_OBJECTS defines the maximum number of objects and sequences that can be held in the RM per client.
• MAX_SESSION_PER_CLIENT defines the maximum number of sessions that can be active for each client. This also sets the maximum number of sessions that can be loaded in the RM per client.

Internally, the RM is split into two execution entities. One periodically reads client requests from the TPM proxy interfaces and stores them in a priority-based FIFO queue. The other swaps TPM contexts as needed, sends requests taken from the priority-based FIFO queue to the TPM device driver, and writes the response back to the appropriate TPM proxy interface.

D. Object Controller Application and Safety Communication Stack

The RaSTA transport protocol is used for the safety-critical communication channel between OC endpoints and the MDM. RaSTA is standardized in DIN VDE V 0831-200 [14]. It uses two independent communication channels to achieve the safety properties of availability and integrity against transmission errors, as required by the standard EN 50159. To achieve security against active attacks, RaSTA is tunneled through a Virtual Private Network (VPN) such as IPsec.

As shown in Fig. 2, we use two physically distinct Ethernet controllers for RaSTA, eth0 and eth1. The SIL4 OC application and RaSTA are executed inside safety partitions. The IP stacks and network drivers of the two redundant channels are mapped to two separate communication partitions. Memory and timing requirements for the OC application and communication stacks are computed using tools qualified according to the EN 50128 standard. During design time, safety and communication partitions are assigned the required resources, such as windows in the time schedule and stack regions, that are enforced by the SK at runtime.
Placing the network stacks of the redundant communication paths in separate partitions achieves independence between the two channels and avoids cascading failures from one channel to the other.

E. Security Applications

Our architecture introduces several security applications. Measured boot provides the foundation for remote attestation by keeping track of all software executions on the platform. Remote attestation then reports the platform's operational state in a tamper-proof manner (i.e., authenticated, integrity-protected, and ensuring freshness/recentness) to the appropriate counterpart in the MDM. Secure update of partitions provides authentication and integrity of updated partitions, using the TPM.

1) Measured Boot: In a measured boot process, all software executions on a platform are recorded in an event log as well as in the TPM to ensure integrity. Technically, a measurement is a hash digest of the software binary. TPM Platform Configuration Registers (PCRs) are used to anchor log entries in a tamper-proof manner by providing a cryptographic folding hash function, called extend():

PCR_{i+1} = hash(PCR_i | measurement_{i+1}),

where | denotes concatenation. This allows for continuously extending hashes of log entries into PCRs, forming a COT. PCRs can neither be reset nor be set to arbitrary values during runtime. They can only be extended and read, and are reset on boot to their initial values. This is an essential requirement in order to detect compromised software components.

The measured boot process starts with the RTM. After it has finished executing its main logic, it measures the next component in the boot sequence, the UEFI. It records the measurement in the boot log and extends the hash of the boot log entry to a PCR in the TPM. Then, control is passed to the UEFI. This principle of first measure, then start repeats along the system components, i.e., RTM, UEFI, boot loader, and the SK. The partition loader of the SK is responsible for measuring and starting all safety and security partitions.
It is part of the SK, and as such already measured. Whenever a partition is started, the partition loader measures it, logs it into the partition log, anchors the log entry in the TPM, and then starts it. This works exactly as before, with one significant difference: the SK stays in control and does not pass control to another component. The produced boot log can then be reported by means of remote attestation.

Fig. 3: TPM Resource Manager (RM) in PikeOS context

2) Remote Attestation: Our remote attestation implementation is based on the Challenge-Response based Remote Attestation with TPM 2.0 (CHARRA) [15]. CHARRA is a Linux-based proof-of-concept implementation in C (C99) of the “Challenge/Response Remote Attestation” interaction model of the Internet Engineering Task Force (IETF) Remote Attestation Procedures (RATS) Reference Interaction Models [16]. The remote attestation protocol employed by CHARRA is as follows:
1) The remote verifier establishes a DTLS connection with the attester.
2) The verifier requests an attestation from the attester, transmitting a challenge in the form of a random nonce, a selection of PCRs, and a key ID with which the attester is supposed to sign the attestation. The nonce is used to guarantee freshness and to prevent replay attacks.
3) The attester performs a TPM quote operation to sign the internal state of the TPM, i.e., the PCR values according to the PCR selection, incorporating the provided nonce.
4) The attester sends back to the verifier the TPM quote, the boot log, and the partition log.
5) The verifier verifies the signature of the TPM quote as well as the integrity of the logs by comparing them against the PCR values. Further, the verifier matches the log entries against a whitelist which holds known-good hashes of boot software and PikeOS partitions.
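The extend operation and the log check in step 5 can be sketched as follows. This is a simplified illustration, not the CHARRA code: it assumes a single SHA-256 PCR that starts at all zeros, treats each log entry as the measurement that was extended, and omits nonce handling and verification of the TPM quote signature. The function names are hypothetical.

```python
import hashlib
import hmac

PCR_SIZE = 32  # assumption: a single SHA-256 PCR, all zeros after boot


def measure(binary: bytes) -> bytes:
    """A measurement is the hash digest of a software binary."""
    return hashlib.sha256(binary).digest()


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """PCR_{i+1} = hash(PCR_i | measurement_{i+1})."""
    return hashlib.sha256(pcr + measurement).digest()


def replay_log(log: list) -> bytes:
    """Recompute the PCR value that the logged measurements should yield."""
    pcr = bytes(PCR_SIZE)
    for measurement in log:
        pcr = extend(pcr, measurement)
    return pcr


def verify_attestation(log: list, quoted_pcr: bytes, known_good: set) -> bool:
    """Check log integrity against the quoted PCR, then check each entry.

    Signature and nonce verification of the quote are assumed to have
    happened before this function is called.
    """
    if not hmac.compare_digest(replay_log(log), quoted_pcr):
        return False  # log was modified or does not match the TPM state
    return all(measurement in known_good for measurement in log)
```

Because each PCR value depends on all earlier measurements, removing, reordering, or altering any log entry changes the replayed value, so the comparison with the quoted PCR fails.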
In case of a deviation between the reported and expected reference values, the system may potentially be compromised, and mitigation actions, such as going to fail-safe mode, should be triggered.

To communicate between the MDM and the Remote Attestation Service PikeOS partition, CHARRA uses libcoap [17], an implementation of the Constrained Application Protocol (CoAP) [18]. Concise Binary Object Representation (CBOR) [19] is used for wire-encoding data structures, utilizing QCBOR [20]. The ESAPI [21] of the TSS is used internally to talk to the TPM. We ported CHARRA and all libraries (including mbedTLS, the TSS ESAPI, and QCBOR) to PikeOS, except for libcoap, because the porting overhead is too high. Since libcoap constitutes an essential part of CHARRA, we decided to run CHARRA in an ELinOS partition, an embedded Linux environment for PikeOS. This way, all dependencies are easily met. CoAP payload size is limited to 1 KiB per Protocol Data Unit (PDU). Due to the size of attestation data and log files, we easily exceed this limit. Accordingly, we implement CoAP block-wise transfers to compensate for this limitation.

3) Secure Update: According to the railway operational guidelines, when a software update of a safety application is performed, a technician shall validate the functionality in the field before the updated system is made operational. Due to this restriction, we do not allow remote software updates of safety applications. However, security applications, which are not subject to safety certification in our architecture, are allowed to be updated remotely in a process we call secure update. Frequent updates to security applications are necessary, e.g., to mitigate newly discovered weaknesses in cryptographic algorithms or in their implementation, to fix bugs in software, or simply to introduce new features to security applications.
To be valid in the railway environment, security applications must be approved according to a coordinated security certification process. This certification process must be done according to Common Criteria (ISO/IEC 15408) and is part of the approval process. A certificate is issued by the appropriate standardization body of a particular country, e.g., the BSI in Germany. In order to be able to provide the required level of protection, security components must be continuously updated. Due to the different boundary conditions, expected update cycles for security applications are much shorter (from one hour to one day) than those of safety systems (from one week to one month). Updates and patches can therefore be (time-)critical and at the same time require recertification, which can take up to several months. These contradictory requirements result in the deployment of the latest and secure but uncertified security components in railway systems.

The secure update mechanism used in our architecture is described below. A PikeOS partition hosting a security application is updated as a whole by shutting down the existing partition and then applying and loading the entire image of the new, updated partition. In order to secure the update process, a cryptographic key pair is generated to sign and verify partition images before they are applied. The private portion of that key remains in the MDM. The public portion is stored in the NVRAM of the TPM on the OC. We protect the TPM NVRAM area against deletion by using platform authorization with a secure passphrase in the TPM platform hierarchy. The passphrase is only known to the MDM, so that only the MDM is able to delete the public key from the NVRAM. In production, this process must happen during manufacturing of the OC. Whenever a partition update is due, the secure update component in the MDM signs the partition image with the private key.
Then, the MDM transfers the partition image and the signature to the secure update service partition in the OC. There, the partition verifies the signature using the public part of the key from the TPM, accessed through the TPM RM partition. We use OpenSSL to perform signing and verification operations. The partition image update is applied only if the signature is valid. After integrity verification of the partition update package, the partition updater reprograms the persistent memory. Once the reprogramming is done, the partition updater verifies the integrity of the updated components and communicates the status to the secure update service. Safety partitions are not affected by the partition update process, owing to the resource separation enforced by the SK.

The partition updater is responsible for providing the appropriate security properties for the update process, such as confidentiality and integrity of the communication channel, client/server authentication, remote attestation, and integrity checking of partition update packages. The partition loader provides the required safety properties for the update process. The functionality implemented by the partition loader includes system state and power management, life-cycle of updated applications, error handling, and recovery/fallback processes.

Management of security software (i.e., development, deployment, and transfer to the SOC and MDM) is the responsibility of the railway operator. We assume that all software development processes in the backend are in accordance with the necessary safety and security guidelines. Our focus is on the MDM, the secure transfer, and the secure update.

4) Cryptographic Key Management: In our concept and implementation, we use several cryptographic keys. Table II provides an overview of all cryptographic keys: who creates them, where they are stored, what they are used for, and when they are destroyed.

V. EVALUATION

In this section, we briefly describe the evaluation of our implementation by describing how the requirements are met, how we realized the test bed, and how our architecture can be integrated on SIL4 hardware.

A. Compliance with Security Requirements

For evaluation purposes, we refined the generic security requirements from Section III into specific ones that take into account the technology choices made for the implementation of the security architecture described above. The overall approach is similar to SREP (Security Requirements Engineering Process) [22], [23]: first, in our previous work [8], relevant generic security requirements were determined using the standard DIN VDE V 0831-104 (i.e., IEC 62443 applied to the railway domain) [3]; second, in this work, security experts used the specific knowledge about the solution’s architecture (including related functional limitations and security threats) to elicit system-specific requirements that can later be utilized in the solution’s validation and testing. Table III lists our refined requirements and links them to the generic ones given in its third column. In the following, we discuss the mitigations the implementation includes to meet the requirements.

For R1r, R2r, and R7r, the TPM’s protected storage and enhanced authorization functionality are employed: keys for the OC can be generated directly in the TPM and never leave it, while other credentials are stored integrity-protected. Access to the keys can be sealed to the system’s boot state and/or to the particular application or user attempting the access by means of pre-defined TPM-based policies. R3r and R4r are achieved by combining measured boot and platform integrity validation for an OC using the attestation protocol directly on start-up; the OC is included in operation only if its integrity can be verified by the MDM.
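The interplay of measured boot and start-up validation behind R3r and R4r can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the CHARRA protocol: it models a single SHA-256 PCR in software, uses the TPM extend rule (new PCR value = hash of the old value concatenated with the measurement), and lets the verifier compare the reported value with a reference. A real attestation would instead transport a TPM-signed quote bound to a fresh nonce. The component names and the functions `extend`, `measured_boot`, and `mdm_accepts` are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, Optional

PCR_SIZE = 32  # size of one register in a SHA-256 PCR bank


def extend(pcr: bytes, component: bytes) -> bytes:
    """TPM-style extend: new PCR = SHA-256(old PCR || SHA-256(component))."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + measurement).digest()


def measured_boot(components: Iterable[bytes]) -> bytes:
    """Measure every boot component in load order, starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        pcr = extend(pcr, component)
    return pcr


def mdm_accepts(reported: Optional[bytes], reference: bytes) -> bool:
    """MDM-side check: a missing or deviating PCR value means the OC stays out of operation."""
    if reported is None:  # no attestation response at all
        return False
    return hmac.compare_digest(reported, reference)


if __name__ == "__main__":
    known_good = [b"bootloader", b"separation-kernel", b"security-partition"]
    reference = measured_boot(known_good)

    manipulated = [b"bootloader", b"manipulated-kernel", b"security-partition"]
    print(mdm_accepts(measured_boot(known_good), reference))
    print(mdm_accepts(measured_boot(manipulated), reference))
    print(mdm_accepts(None, reference))
```

Because each extend folds the previous register value into the next hash, changing, dropping, or reordering any component yields a different final value, which is what makes the recorded boot state tamper-evident.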
If the state of the booted OC’s software differs from what the MDM expects (the attestation response cannot be validated or there is no response), the MDM sends an alarm to the SOC, which in turn forwards this information to the operator’s command and control center responsible for railway safety. This way, any routes and train drivers that may be at risk due to the compromised OC can be taken care of in a timely manner without any interference with safety-critical functions. To meet R5r, the system uses only DTLS from mbedTLS for all communications, such as CHARRA and secure update. For R6r, the same secure update mechanism can be used, as it allows updating any vulnerable software in security applications. As an update for a safety application needs to be validated and authorized by an operator in the field, it is currently excluded from the secure update. R8r is fulfilled by the SK-based partitioned architecture, which also allows independent restarting of partitions, thus fulfilling R9r. R10r is achieved by using the TPM to securely store a public key for which only the MDM knows the private portion. The public key is protected against deletion. An update image is signed by the MDM and verified by the OC using the protected public key. R11r is realized by the partition updater as described in Section IV-E3. R12r can be satisfied with mechanisms described in [24], using unique features of TPM 2.0; this is currently work in progress. R13r is achieved by using measured boot in the boot chain starting from the RTM. The separation mechanisms included in PikeOS ensure R14r.

B. Test Environment

To evaluate the implemented architecture, we integrated the prototype of our implementation into a testbed for railway operations simulations. The testbed is based on a digital model railway, currently with two switches and two signals as field elements, on which a model train is running (cf. Fig. 4).
TABLE II: Cryptographic Key Management

| Key | Created by | Stored | Used | Destroyed |
| --- | --- | --- | --- | --- |
| MDM DTLS Key for Secure Update (asymmetric) | MDM (during deployment) | Private part in the MDM, public part in the OC Secure Update partition | For secure update process | Usually never, unless the system (OC or MDM) is compromised |
| OC DTLS Key for Secure Update (asymmetric) | OC (during deployment) | Private part in the OC Secure Update partition, public part in the MDM | For secure update process | Usually never, unless the system (OC or MDM) is compromised |
| MDM DTLS Key for Remote Attestation (asymmetric) | MDM (during deployment) | Private part in the MDM, public part in the OC Remote Attestation partition | For remote attestation process | Usually never, unless the system (OC or MDM) is compromised |
| OC DTLS Key for Remote Attestation (asymmetric) | OC (during deployment) | Private part in the OC Remote Attestation partition, public part in the MDM | For remote attestation process | Usually never, unless the system (OC or MDM) is compromised |
| Remote Attestation Key (asymmetric) | OC (during manufacturing) | Private part on the OC (protected by the TPM), public part on the MDM | For remote attestation process | Never, unless the system is decommissioned (the key should be destroyed properly by deleting it) |
| Secure Update Signature Key (asymmetric) | MDM (during manufacturing) | Private part on the MDM, public part in the TPM of the OC (protected by the TPM) | For secure update process | Never, unless the system is decommissioned (the key should be destroyed properly by deleting it) |

TABLE III: Refined security requirements

| ID | Refined requirement |
| --- | --- |
| R1r | The system shall store credentials used to authenticate endpoints (MDM, ILS, peer OCs) or users under protection of the TPM |
| R2r | The system shall control and restrict access to the area where the authentication credentials for endpoints and users are stored (e.g., with password or policy) |
| R3r | There shall be means to detect endpoints that run non-authenticated software |
| R4r | Before bringing the OC into service, the platform integrity shall be validated, i.e., the system shall not be put into operational state when it executes unauthorized software |
| R5r | The system shall use only an industry-standard communication stack |
| R6r | There shall be means to update/replace the communication stack when a vulnerability is detected |
| R7r | Access to the credentials (e.g., keys) shall only be allowed if the system is not modified in an unauthorized manner |
| R8r | The system shall ensure that the resource allocation for fulfilling the runtime requirements (e.g., latency, boot time) computed at design time is met at runtime |
| R9r | An application/component shall be updatable without restarting the system or affecting the rest of the system |
| R10r | The authenticity and integrity of update images shall be validated |
| R11r | The system shall ensure persistence of an update image |
| R12r | The system shall provide protection against downgrade attacks during update |
| R13r | The system shall record the integrity of software components at load time in a tamper-evident manner |
| R14r | The system shall provide isolation of safety-relevant components from any other components of the system |

Generic requirements addressed (third column of Table III, listed in row order from R1r to R14r): R2 R3 R5 R6 R7 R13 R2 R3 R5 R6 R7 R13 R4 R4 R10 R13 R14 R5 R5 R6 R13 R14 R8 R11 R10 R13 R14 R9 R9 R14 R9 R9 R12 R13 R14 R13

The field elements can be controlled by a legacy OC or an OC that implements our security architecture. The hardware is a MEN G22 board equipped with a TPM 2.0.
A safety-backend implements functions for railway interlocking and operations control, and a security-backend implements MDM and SOC functionalities, providing firmware and configuration updates and performing periodic platform attestations. The communication of the OC with the model railway is handled by a Railuino (an Arduino Uno Rev3 with an ATmega328P, https://code.google.com/archive/p/railuino/). The testbed allows simulating local and remote attackers and playing through various test scenarios to evaluate our solution and its effects on railway operations. Examples of test scenarios used to evaluate the above mitigations are as follows:

Detecting compromised OCs: A local attacker boots manipulated firmware on an OC to control it as they like. To make this attack scalable, a remote attacker manipulates an update image in the MDM. This scenario can be configured to test the mitigation for the requirements R3r, R4r, R7r, R10r, R12r, R13r, and partially R14r.

Preventing OC’s resource exhaustion: A software stack used to build a (security) application on an OC has a software bug or is hit by an exploit crafted by an attacker, leading to resource exhaustion and undesired behavior in the OC. This scenario allows testing the mitigation for the requirements R8r and R14r.

Fig. 4: Testbed for railroad simulations (MEN G22 on the right)

C. Redundancy Architecture for SIL4 Object Controller

In this work, our goal is to build a certifiable reference design for augmenting safety applications with security mechanisms on the same hardware platform. We use the MEN G22 board for our implementation, which does not address the Hardware Fault Tolerance (HFT) requirement of SIL4 control logic. For fulfilling the HFT ≥ 1 requirement of the SIL4 OC main logic, we need a redundancy architecture where the control logic is executed on independent processors. Fig.
5 shows an exemplary redundancy design using a safe hardware platform with three Central Processing Units (CPUs); the MEN F75P single-board vital computer is an example of such a redundant hardware platform. This design includes two independent CPUs that are responsible for redundant execution of the OC main logic and one I/O processor that is responsible for the I/O devices in the platform: the TPM, the network controllers, and the I/O device that controls the field elements. When mapping our implementation to such a redundant platform, we execute security applications and the safety network stack, including RaSTA, on the I/O processor. This is needed as these applications require access to the hardware TPM and network controllers. A TPM has a unique secret from which initial keys are derived. A random number generator is used to generate additional secure keys. For this reason, two different TPMs always generate two different keys. Therefore, a TPM cannot be designed redundantly and is directly connected to the I/O CPU. The OC control logic is executed on the two control processors in parallel; each receives the same input from the RaSTA stack and computes the results. The control processors also implement two safety monitors which synchronize the two CPUs, compare inputs and outputs for each command received, and push the results to the I/O processor. In case of a result mismatch, the system is set into “Failure Mode” and becomes inactive.

VI. CERTIFICATION AND HOMOLOGATION

The railway infrastructure must be considered a critical infrastructure, which by law requires special technical and organizational measures to enhance its cybersecurity (e.g., the Act on the Federal Office for Information Security (BSI Act, BSIG) in Germany). Evidence of the effectiveness of the security measures needs to be reported periodically to the authorities.
At the same time, the infrastructure is safety-critical and requires safety certification in order to protect passengers and the environment from physical harm stemming from malfunctions. At the moment, there is no established standard in the railway domain that specifies the interplay between the safety and the security components as well as methods for certification (work on a European technical specification for security in the railway domain, prTS 50701 [4], is underway). In order to certify the architecture described in this paper, we thus propose to separate the concerns of safety and security to the largest possible extent, as it is likely that security components need to be updated periodically (which would trigger the need for re-certification), while safety-critical components have a significantly longer lifespan. We thus propose to certify that the architecture provides a clear separation between all safety-critical components and security components, so that the latter can be seen as a “security shell” around the former, thus implementing the concept proposed in [25]. This safety certification can be performed according to well-established safety standards (such as EN 50159), under the assumption that the security components adequately shield safety. For the security certification, we propose to follow the upcoming prTS 50701, which will likely become the de-facto standard for security in railway operation. The architecture proposed in this paper is already compliant with prTS 50701.

The railway safety standard EN 50128 defines the requirements for the software of railway control and protection systems. The standard defines development, verification, and validation processes for reducing systematic software failures to acceptable levels. Due to the rigorous processes to be followed, the cost of safety-critical software directly depends on the number of Source Lines of Code (SLOC).
Thus, keeping the SLOC as small as possible reduces the certification effort significantly. Thanks to our partitioned architecture, we can limit the safety certification efforts to the safety application alone without the need to certify the security application, even though they are integrated on the same hardware platform. According to EN 50128, the subcomponent that provides independence between safety and non-safety applications shall be certified at the same level as the highest SIL safety application. The PikeOS SK that we use in our design is certifiable to SIL4, which corresponds to the SIL required by the OC safety application. A TPM 2.0 is certifiable according to FIPS 140-2 level 1 or 2 (footnote 4) as well as Common Criteria (CC) EAL4+ Moderate based on the TPM 2.0 Protection Profile (footnote 5). Formally verifying that our implementation of remote attestation in conjunction with TPM-based measured boot is sound and yields expected results is work in progress.

Fig. 5: Redundancy architecture with three CPUs

VII. CONCLUSION

In this paper, we described the practical implementation and evaluation of our railway security architecture introduced in [7], [8]. A particular challenge was integrating the Trusted Platform Module (TPM) into the Multiple Independent Levels of Safety and Security (MILS) Separation Kernel (SK) PikeOS and enabling TPM access for all security partitions through the TPM Resource Manager (RM). Our evaluation showed that our solution is a viable approach for integrating safety and security applications on one platform, and that it has the potential for receiving certification and homologation. As future work, we plan to integrate our solution in a test field of the Deutsche Bahn (DB). This allows for close-to-real-life evaluation of the prototype with a fully functional safety application, including the RaSTA stack and a security application on a single secure hardware platform.
ACKNOWLEDGMENT

The work presented in this paper has been partly funded by the German Federal Ministry of Education and Research (BMBF) under the project “HASELNUSS” (ID 16KIS0597K). Maria Zhdanova is a member of the TALENTA program of the “Fraunhofer-Gesellschaft”.

Footnote 4: https://trustedcomputinggroup.org/resource/tcg-fips-140-2-guidance-for-tpm-2-0/
Footnote 5: https://trustedcomputinggroup.org/resource/pc-client-tpm-certification/

REFERENCES

[1] H. Leister, “ETCS und digitale Technologie für Stellwerke,” Eisenbahn-Revue International, vol. 8-9, 2017.
[2] DIN VDE V 0831-200:2015-06, “Elektrische Bahn-Signalanlagen, Teil 200: Sicheres Übertragungsprotokoll RaSTA nach DIN EN 50159 (VDE 0831-159),” 2015.
[3] DKE, “Elektrische Bahn-Signalanlagen – Teil 104: Leitfaden für die IT-Sicherheit auf Grundlage der IEC 62443 (DIN VDE V 0831-104),” Oct. 2015.
[4] CENELEC, “PD CLC/TS 50701 Railway applications - Cybersecurity,” 2020.
[5] C. Schlehuber, M. Heinrich, T. Vateva-Gurova, S. Katzenbeisser, and N. Suri, “Challenges and approaches in securing safety-relevant railway signalling,” in European Symposium on Security and Privacy Workshops (EuroS&PW), 2017.
[6] CENELEC - European Committee for Electrotechnical Standardization, EN 50128 - Railway applications - Communications, signalling and processing systems - Software for railway control and protection systems, 2010, no. EN 50128:2001 E.
[7] H. Birkholz, C. Krauß, M. Zhdanova, D. Kuzhiyelil, T. Arul, M. Heinrich, S. Katzenbeisser, N. Suri, T. Vateva-Gurova, and C. Schlehuber, “A reference architecture for integrating safety and security applications on railway command and control systems,” in 4th International Workshop on MILS: Architecture and Assurance for Secure Systems (MILS@DSN 2018), 2018.
[8] M. Heinrich, T. Vateva-Gurova, T. Arul, S. Katzenbeisser, N. Suri, H. Birkholz, A. Fuchs, C. Krauß, M. Zhdanova, D. Kuzhiyelil, S. Tverdyshev, and C. Schlehuber, “Security requirements engineering in safety-critical railway signalling networks,” Security and Communication Networks, vol. 2019, Article ID 8348925, 2019.
[9] T. Arul, J. Cosic, M. Drodt, M. Friedrich, M. Heinrich, M. Kant, S. Katzenbeisser, H. Klarer, P. Rauscher, M. Schubert, G. Still, D. Wallenhorst, and M. Zhdanova, “IT/OT-security for internet of railway things (IoRT),” https://haselnuss-projekt.de/downloads/Whitepaper IoRT-Security en.pdf, Working Group CYSIS, retrieved August 30, 2021.
[10] S. Tverdyshev, H. Blasum, B. Langenstein, J. Maebe, B. De Sutter, B. Leconte, B. Triquet, K. Müller, M. Paulitsch, A. Söding-Freiherr von Blomberg et al., “MILS architecture,” Zenodo, 2013.
[11] SYSGO GmbH, “PikeOS hypervisor webpage,” https://www.sysgo.com/products/pikeos-hypervisor/, retrieved June 29, 2021.
[12] Trusted Computing Group, “TCG TSS 2.0 TAB and Resource Manager Specification,” Apr. 2019.
[13] Fraunhofer SIT, Intel, and Infineon Technologies. (2016, Apr.) TPM2 Access Broker & Resource Manager. [Online]. Available: https://github.com/tpm2-software/tpm2-abrmd
[14] DKE, “Electric signalling systems for railways - Part 200: Safe transmission protocol according to DIN EN 50159 (VDE 0831-159). (DIN VDE V 0831-200),” June 2015.
[15] M. Eckel. (2019, Sep.) CHARRA: Challenge-Response based Remote Attestation with TPM 2.0. [Online]. Available: https://github.com/Fraunhofer-SIT/charra
[16] H. Birkholz and M. Eckel, “Reference Interaction Models for Remote Attestation Procedures,” Internet Engineering Task Force, Internet-Draft draft-ietf-rats-reference-interaction-models, Jan. 2020, work in progress. [Online]. Available: https://datatracker.ietf.org/doc/draft-ietf-rats-reference-interaction-models/
[17] O. Bergmann. (2010, Jul.) libcoap: A C implementation of the Constrained Application Protocol (RFC 7252). [Online]. Available: https://github.com/Fraunhofer-SIT/charra
[18] Z. Shelby, K. Hartke, and C. Bormann, “The Constrained Application Protocol (CoAP),” Internet Requests for Comments, RFC Editor, RFC 7252, Jun. 2014. [Online]. Available: https://tools.ietf.org/html/rfc7252
[19] C. Bormann and P. Hoffman, “Concise Binary Object Representation (CBOR),” Internet Requests for Comments, RFC Editor, RFC 8949, Dec. 2020. [Online]. Available: https://tools.ietf.org/html/rfc8949
[20] L. Lundblade. (2018, Sep.) QCBOR: an implementation of nearly everything in RFC8949. [Online]. Available: https://github.com/laurencelundblade/QCBOR
[21] Trusted Computing Group, “TCG TSS 2.0 Enhanced System API (ESAPI) Specification,” May 2020.
[22] D. Mellado, E. Fernández-Medina, and M. Piattini, “Applying a security requirements engineering process,” in Computer Security – ESORICS 2006, D. Gollmann, J. Meier, and A. Sabelfeld, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 192–206.
[23] A. Toval, J. Nicolás, B. Moros, and O. García, “Requirements reuse for improving information systems security: A practitioner’s approach,” Requirements Engineering Journal, vol. 6, pp. 205–219, 2001.
[24] A. Fuchs, C. Krauß, and J. Repp, “Advanced Remote Firmware Upgrades Using TPM 2.0,” in Proceedings of the 31st International Conference on ICT Systems Security and Privacy Protection (IFIP SEC), 2016.
[25] C. Schlehuber, M. Heinrich, T. Vateva-Gurova, S. Katzenbeisser, and N. Suri, “A security architecture for railway signalling,” in 36th International Conference on Computer Safety, Reliability and Security (SAFECOMP), 2017.

ACRONYMS

API: Application Programming Interface
CBOR: Concise Binary Object Representation
CC: Common Criteria
CCS: Control-Command and Signaling
CHARRA: Challenge-Response based Remote Attestation with TPM 2.0
CoAP: Constrained Application Protocol
COT: Chain of Trust
CPU: Central Processing Unit
DB: Deutsche Bahn
EU: European Union
ESAPI: Enhanced System API of the TPM 2.0 TSS
ETCS: European Train Control System
FAPI: Feature API of the TPM 2.0 TSS
HFT: Hardware Fault Tolerance
IETF: Internet Engineering Task Force
ILS: Interlocking System
IT: Information Technology
MDM: Maintenance and Data Management
MILS: Multiple Independent Levels of Safety and Security
MMU: Memory Management Unit
OCC: Operation Control Center
OC: Object Controller
OS: Operating System
OT: Operational Technology
PCR: Platform Configuration Register
PDU: Protocol Data Unit
RaSTA: Rail Safe Transport Application
RATS: Remote Attestation Procedures
RM: Resource Manager
RTM: Root of Trust for Measurement
SAPI: System API of the TPM 2.0 TSS
SIL: Safety Integrity Level
SK: Separation Kernel
SLOC: Source Lines of Code
SOC: Security Operations Center
TCG: Trusted Computing Group
TDS: Train Detection System
TPM: Trusted Platform Module
TSS: TPM Software Stack
UEFI: Unified Extensible Firmware Interface
VPN: Virtual Private Network