Scientific Computing in Space Using COTS Processors
Jeremy Ramos, Honeywell DSES, jeremy.ramos@honeywell.com
Roger Sowada, Honeywell DSES, roger.j.sowada@honeywell.com
David Lupia, Honeywell DSES, david.lupia@honeywell.com

Agenda
Introduction
Background
Detailed Description
Implementation Approach
Development Efforts

Acknowledgements
NASA New Millennium Program: Program Sponsor
University of Florida: Alan George and the High-performance Computing and Simulation Lab, key contributors to software prototype effort and research
Physical Sciences Inc.: SEU Sensor Provider, Gary Galica and Robin Cox
WW Technologies Inc.: RPI Middleware Provider, Chris Walters and Technical Staff

Ramos, 150/MAPLD 2005

Processing Platforms for New Science
The success of recent rover missions is a perfect example of the type of science we want to support.
Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capabilities.
Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capabilities.
In all cases, increases in science data returns are dependent on the capabilities of the spacecraft’s processing platform.

Payload Processing Conceptual Model
[Figure: payload processing conceptual model. Elements shown: Sensor Array; Time Dependent Processing (TDP); Object Dependent Processing (ODP); Mission Dependent Processing (MDP); Telemetry (low BW). Processing labels: Sample-Level Signal Processing; Frame-Level Signal Processing; High-Level Logic Operations. Axes: Data Rates (Mbps); Operations/Sec; Algorithm Complexity and Abstraction (MIPS/MOPS), each graded LOW/MED/HIGH.]

Technology Advance
A spacecraft onboard payload data processing system architecture, including a software framework and a set of fault tolerance techniques, which provides:
A.
An architecture and methodology that enables COTS-based, high-performance, scalable multi-computer systems, incorporating reconfigurable co-processors and supporting parallel/distributed processing for science codes, and that accommodates future COTS parts/standards through upgrades.
B. An application software development and runtime environment that is familiar to science application developers and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
C. An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality, and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
D. Methods and tools that allow prediction of the system’s behavior in the space environment, including predictions of availability, dependability, fault rates/types, and system-level performance.

Radiation Environments
Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments.
Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and latch-up immune).
In most cases Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs).
[Figure: Upset rate as a function of orbit location. Vertical axis: upsets per bit-day (log scale, 1.00E-12 to 1.00E-04). Horizontal axis: Orbit Location (with precession). Series: heavy ion upsets; proton upsets; total upsets. Discrete simulation for 7 orbits of a Xilinx V2 FPGA; shows trend driven by changes in particle flux. Orbit: 300 km perigee, 1400 km apogee, 70° inclination. Label: Natural Radiation.]

N-Modular Redundancy
The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
This technique can be applied at all levels of the system hierarchy, from circuit to box.
One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.
Example N-Mod Redundancy: TMR (Triple Modular Redundancy), typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
[Figure: Module 1, Module 2, and Module 3 feeding a Majority Voter.]

Adaptive Fault Tolerance
Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques.
Triple Modular Redundancy (TMR) is typically a hard-wired design approach for Rad Tolerant G4 PPC processors and Xilinx FPGAs.
The effectiveness and performance (MIPS/W) gains that a COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
EAFTC enables the computer subsystem to take advantage of changing orbital environments during a mission life, using the COTS processing elements more efficiently as the environment allows.
This allows the EAFTC system to adaptively trade performance versus reliability in real time.
[Figure: EAFTC Based System. Labels: Software Implemented FT; COTS Processing Components in a Reconfigurable Arch; Environmental Sensory (Radiation, position); Adaptive Control Algorithms.]

EAFTC Operational Scenario
[Figure: SEU rates and MIPS per Watt plotted against orbit position. Series: MIPS/Watt for worst case design; average MIPS/Watt for EAFTC design.]
EAFTC exploits the relation between SEU rate and orbit position as well as the variable criticality of system tasks.
The fundamental process implemented in the system consists of three steps:
1. Measure the environment and system state.
2. Assess the environmental threat to the applications' availability.
3. Adapt the processing applications' configuration (i.e., fault tolerance) to effectively mitigate the threat presented by the environment.
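The measure, assess, adapt process can be sketched as a small control function. This is a minimal illustration under stated assumptions, not the EAFTC controller itself: the mode names (`SIMPLEX`, `DUPLEX`, `TRIPLEX`), the two rate thresholds, and every function and field name below are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class FtMode(Enum):
    """Hypothetical fault tolerance configurations, least to most redundant."""
    SIMPLEX = 1   # every node runs independent work
    DUPLEX = 2    # two replicas compare results
    TRIPLEX = 3   # three replicas with majority voting


@dataclass
class Measurement:
    """Step 1 output: sensed environment plus system state (assumed fields)."""
    upsets_per_bit_day: float
    task_is_critical: bool


def assess(m: Measurement, low: float = 1e-9, high: float = 1e-6) -> int:
    """Step 2: map the measured upset rate to a threat level 0..2.

    The thresholds are placeholders chosen only for illustration.
    """
    if m.upsets_per_bit_day >= high:
        return 2
    if m.upsets_per_bit_day >= low:
        return 1
    return 0


def adapt(threat: int, m: Measurement) -> FtMode:
    """Step 3: choose a configuration; critical tasks get one extra level."""
    level = min(threat + (1 if m.task_is_critical else 0), 2)
    return (FtMode.SIMPLEX, FtMode.DUPLEX, FtMode.TRIPLEX)[level]


def control_step(m: Measurement) -> FtMode:
    """One pass of assess and adapt for a measurement that was already taken."""
    return adapt(assess(m), m)


if __name__ == "__main__":
    for rate in (1e-11, 1e-8, 1e-5):
        mode = control_step(Measurement(rate, task_is_critical=False))
        print(f"{rate:.0e} upsets/bit-day -> {mode.name}")
```

In this sketch redundancy is only raised when the sensed rate or the task criticality calls for it, which is the source of the capacity gain over a fixed worst-case design.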
On average, more computation can be performed with less energy using EAFTC.

Hardware Architecture
APC Cluster
- Consists of several APC nodes networked together with RapidIO
Adaptive Processing Computer
- Reconfigurable processing node
- Multiple modes/configurations
- High-performance COTS processor (PPC)
- RapidIO network interface
- Reconfigurable co-processor
SEU Sensor
- Provides a measure of SEU-inducing flux and particle energy
- Used by the EAFTC controller to determine the real-time SEU threat level
- Separate heavy ion and proton sensors
System Controller
- Controller for the APC cluster
- Hosts EAFTC controller software and other experiment-related control software
- RadHard processor and interfaces for reliable control of the COTS cluster
[Figure: hardware architecture block diagrams. Labels: Memory (Boot and System); 750 FX Power PC; Spacecraft I/F; Data Processor 1 ... Data Processor N; System Controller A; System Controller B; N Ports; FPGA Co-Processor; High-Speed Network Interface; Instruments; Processor Controller; I/O Interface; Network A; Network B; Mission Specific Devices; Ion scintillator and Photo Detector; Proton scintillator and Photo Detector; Threshold; Output Alarm; Analog/Digital Electronics; Control and Data; SSM Controller FPGA; Thermistor; COTS; PWR (3.3, 1.5, +/-12V); SEU Alarm; SSIO; System Controller cPCI Connector.]

Adaptive Processing Computer Conceptual Block Diagram
[Figure: block diagram. Labels: BOOT Memory 512KB; Reprogrammable Non-volatile Memory With EDAC 128MB; RAM 1GB (Error Correction with Scrubbing); Co-Processor FPGA; Health and Status; Power PC; UART; Processor Controller; SSIO; Discretes; Network I/F; Clock Generation; External Reset; Reset Generation; High Speed Network Switch (3 Ports, 3 Ports); 32-Bit PCI; PWR Detection and Control; Current Sensor; JTAG Port; Temperature Sensor.]

EAFTC Application Platform
• Scientific Application
• Application
Specific FT
• FT Manager
• EAFTC Controller
• Job Manager
[Figure: layered software stacks. System Controller stack: Policies; Configuration Parameters; Mission Specific FT Control Applications; FT Middleware; OS; Hardware. Data Processor stack: Application; Application Programming Interface (API); FT Lib; Co Proc Lib; FT Middleware; SAL (System Abstraction Layer); OS; Hardware; FPGA; Network. Layer categories: Mission Specific; Application Specific; Generic Fault Tolerant Framework; OS/Hardware Specific.]
• Local Management Agents
• Replication Services
• Fault Detection

EAFTC Middleware
Provides a high-performance platform for parallel/distributed applications
- Cluster and job management to provide a single system view to the application
- Message Passing Interface API
- Platform abstraction covering OS system calls and hardware registers
- Mission-level customization through policies
- Scalable architecture to support clustering of resources on a multi-computer system
- Reconfigurable co-processor devices for application acceleration
Provides a high-availability platform for applications
- An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
- Checkpoint and rollback service for application recovery in the event of a fault
- Application-level replication services to facilitate reliable deployment of applications on SEU-susceptible COTS processing resources
EAFTC Middleware offers numerous benefits as a system platform
- Capitalizes on cost savings in the use of commercial hardware
- Capitalizes on the latest processing technology through technology refresh
- Reduces cost and extends system life through a software-based middleware solution
- Scales to meet system requirements
- Customizable degree of fault tolerance to meet specific system needs

EAFTC Software Architecture
[Figure: software architecture block diagram. Labels: System Controller; Data Processors; Data Processor with FPGA Co-Processor; Mission Specific Parameters; Application Process; ESM; JM; FTM; JMA; FTMA; RS; MPI; CR; FCPS; DMS, CMS, AMS, and RDB (shown twice); VxWorks OS, network stack, and drivers; Linux OS and drivers; Network and sideband signals. Color legend: Mission Specific Components; EAFTC Specific Components; Self Reliant Components; Platform Components; Application Components.]
ESM: Environmental Sensor Monitor
JM: Job Management
FTM: Fault Tolerance Manager
MPI: Message Passing Interface
FCPS: FPGA Co-Processor Services
CR: Checkpoint and Rollback
CMS: Cluster Management Services
AMS: Availability Management Services
DMS: Distributed Messaging Services
RDB: Replicated Database

EAFTC Software Components Collaboration
[Figure: component collaboration diagram. Labels: ESM; FTM; JM; SR; JMA; FTMA; RS; MPI; P1.1; P1.2; P1.3; T1; T2; T3.]

EAFTC Technology Advances to TRL7 Flight Experiment
[Figure: testbed and roadmap, arrow labeled "Increasing fidelity and capability". Testbed labels: cPCI Chassis with Power Instrumentation; Instrumentation Bus; System Controller (Ganymede); Data Processor 1 (Motorola SBC with FPGA PMC); Data Processor 2 (Motorola SBC with FPGA PMC); Data Processor 3 (Motorola SBC); Data Processor 4 (Motorola SBC); Experiment Controller and Data Collection; throughput labels ~10,000 MIPS, ~10,000 MIPS, ~150 MIPS, ~1500 MIPS, ~1500 MIPS; 1 Gb/s.]
[Figure labels, testbed network: Gigabit Ethernet Switch; 1 Gb/s per link; 100 Mb/s. Roadmap steps: TRL4 Technology Validation; TRL5 Technology Validation; TRL6 Technology Validation; TRL7 Technology Validation.]
TRL4 Validation
- Demonstrated basic EAFTC technologies in a laboratory environment on COTS hardware testbed including radiation source and sensor
- Environment Sensor
- Alert Generator
- High Availability Middleware
- Replication Services
NASA adds requirement for fault tolerant cluster and MPI capability
TRL5 Validation
- Demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated Fault Tolerance Services
- Develop predictive models
- Validate and refine predictive models and predictive model parameters with experiment data
- Partial set of canonical fault injection experiments
TRL6 Validation
- Demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware, including exposure to a radiation beam
- Validate and refine predictive models and predictive model parameters with experiment data
- Complete set of canonical fault injection experiments
TRL7 Validation
- Demonstrate EAFTC technologies in a real space environment
- Validate predictive models and predictive model parameters with experiment data
- TRL7 experiments will be identical to those performed and wrung out during TRL6 demonstration and validation

EAFTC Model Flow
Inputs:
• Orbit
• Epoch
• Radiation characterization of components
• System architecture
• HW architecture
Inputs:
• Decomposed HW Architecture
• Comprehensive Fault Model
[Figure: model flow. Blocks: Rad Effects Model; Canonical Fault Model; HW SEU Susceptibility Model. Flow labels: particle fluxes, energies, & component SEE effects; canonical fault types; fault rates for each fault type in the canonical fault model (ln).]
Inputs:
• Probability that a fault affects the application
• Detection coverage for each fault/error type in the canonical model
• Recovery coverage for each fault/error type in the canonical fault
model
• Detection and recovery latencies for each fault
• Number of mode change types and rates
• Time to effect mode change
• Probability that mode change is successful
Inputs:
• Mission application characterization and constraints
• Peak throughput per CPU
• Number of nodes in cluster
• Algorithm/architecture coupling efficiency for application
• Network-level parallelization efficiency
• Measured OS and FT Services overhead
• Measured execution times for applications
[Figure: model flow outputs. Blocks: Availability & Reliability Models; Performance Model. Output labels: Availability & Reliability; Delivered Throughput; Delivered Throughput Density; Effective System Utilization.]

TRL4 EAFTC System Technology Demonstration
Successful demonstration of EAFTC system
The EAFTC prototype comprises key technology elements
- Cluster Computer
- Autonomous Controller
- Replication Services
Environment input is simulated via SPENVIS radiation models
Instrumentation for power utilization is included in the model
Profiling is integrated on Data Processors for CPU utilization measurement
Workload is provided via a synthetic benchmark application on Data Processors

Computer Capacity Experiment
                                TMR 3-node system    EAFTC 4-node system
Average power                   72 Watts             97 Watts
Average system effective MIPS   973 MIPS             2661 MIPS
Average system efficiency       13 MIPS/Watt         28 MIPS/Watt
Comparison: 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency

TRL5 Platform
Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs).
The SBCs will implement a PPC 750FX microprocessor running the Linux operating system and a Software Fault Injector for fault simulation.
Each PMC will implement a Xilinx Virtex2 FPGA that will serve as the co-processor for its host SBC.
The System Controller will be implemented with a software development unit of our flight SBC.
All nodes in the cluster will be interconnected via a GigE switch.
A Development Workstation will be used for software development, experiment control, and instrumentation data collection.
Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults.
Other methods may be used, such as manual node resets, network traffic fault injection (via software or hardware fault injection methods), and faults inserted through the test port.

New Millennium Program Space Technology 8
New Millennium Program
- NASA program for technology development
- Currently working on its 8th technology development program
- In Formulation phase to evaluate 4 subsystem technologies (one of them EAFTC)
The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
Milestone   Date
SSR         7/05
PDR         5/06 (TRL5)
CDR         5/07 (TRL6)
Launch      12/08 (TRL7 after 6-month on-orbit experiment)
Our team’s overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance onboard payload processing.
We will demonstrate that by using EAFTC we can maximize and significantly improve the performance of a COTS-based computer in orbit.

Summary
EAFTC is an enabling technology for high-performance spacecraft computing.
As part of our NMP-sponsored efforts, a TRL4 system has been demonstrated.
Efforts continue towards a TRL5 system demonstration.
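
As a worked check on the TRL4 Computer Capacity Experiment figures reported earlier (72 Watts and 973 MIPS for the TMR 3-node system, 97 Watts and 2661 MIPS for the EAFTC 4-node system), the short sketch below recomputes the percentage increases. The helper name `percent_increase` and the dictionary layout are illustrative assumptions; only the input numbers come from the slides.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures as reported on the Computer Capacity Experiment slide.
tmr = {"power_w": 72.0, "mips": 973.0, "mips_per_w": 13.0}
eaftc = {"power_w": 97.0, "mips": 2661.0, "mips_per_w": 28.0}

power_gain = percent_increase(tmr["power_w"], eaftc["power_w"])
mips_gain = percent_increase(tmr["mips"], eaftc["mips"])

# Efficiency gain from the rounded MIPS/Watt values printed on the slide.
eff_gain_reported = percent_increase(tmr["mips_per_w"], eaftc["mips_per_w"])

# Efficiency gain recomputed by dividing average MIPS by average power.
eff_gain_recomputed = percent_increase(
    tmr["mips"] / tmr["power_w"], eaftc["mips"] / eaftc["power_w"]
)

if __name__ == "__main__":
    print(f"power increase:              {power_gain:.1f}%")
    print(f"effective MIPS increase:     {mips_gain:.1f}%")
    print(f"efficiency (slide values):   {eff_gain_reported:.1f}%")
    print(f"efficiency (MIPS / power):   {eff_gain_recomputed:.1f}%")
```

The power and effective-MIPS increases match the slide's 35% and 173%, and the 115% efficiency figure follows from the rounded 13 and 28 MIPS/Watt values. Dividing average MIPS by average power gives about 103% instead; one possible explanation, which the slides do not confirm, is that the reported efficiencies were averaged over time rather than derived from the two averages.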