Scientific Computing in Space Using COTS Processors

Jeremy Ramos
Honeywell DSES
Roger Sowada
Honeywell DSES
David Lupia
Honeywell DSES
jeremy.ramos@honeywell.com
roger.j.sowada@honeywell.com
david.lupia@honeywell.com
Agenda
• Introduction
• Background
• Detail Description
• Implementation Approach
• Development Efforts
Acknowledgements

• NASA New Millennium Program
  - Program Sponsor
• University of Florida
  - Key contributors to software prototype effort and research
  - Alan George and the High-performance Computing and Simulation Lab
• Physical Sciences Inc.
  - SEU Sensor Provider
  - Gary Galica and Robin Cox
• WW Technologies Inc.
  - RPI Middleware Provider
  - Chris Walters and Technical Staff
150/MAPLD 2005
Processing Platforms for New Science
• The success of recent rover missions is a perfect example of the type of science we want to support
• Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capabilities
• Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capabilities
• In all cases, increases in science data returns are dependent on the spacecraft's processing platform capabilities
Payload Processing Conceptual Model
[Figure: payload processing chain from sensor array to low-bandwidth telemetry in three stages: Time Dependent Processing (TDP, sample-level signal processing), Object Dependent Processing (ODP, frame-level signal processing), and Mission Dependent Processing (MDP, high-level logic operations). Data rates (Mbps) fall from high to low across the stages while algorithm complexity and abstraction rise from low to high; a further axis shows operations/sec (MIPS/MOPS).]
Technology Advance
A spacecraft onboard payload data processing system architecture, including a software framework and set of fault tolerance techniques, which provides:
A. An architecture and methodology that enables COTS-based, high-performance, scalable, multi-computer systems, incorporating reconfigurable co-processors, and supporting parallel/distributed processing for science codes, that accommodates future COTS parts/standards through upgrades.
B. An application software development and runtime environment that is familiar to science application developers, and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
C. An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
D. Methods and tools which allow the prediction of the system's behavior in the space environment, including: predictions of availability, dependability, fault rates/types, and system level performance.
Radiation Environments
• Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments
• Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and latch-up immune)
• In most cases Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs)
[Figure: Natural Radiation. Upset rate as a function of orbit location: heavy ion upsets, proton upsets, and total upsets in upsets per bit-day (log scale, 1.00E-12 to 1.00E-04) versus orbit location (with precession).]
• Discrete simulation for 7 orbits of a Xilinx V2 FPGA
• Shows trend driven by changes in particle flux
• Orbit: 300 km perigee, 1400 km apogee, 70° inclination
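The upsets-per-bit-day unit on the plot above converts directly into an expected upset count for a device. The following is a minimal sketch of that arithmetic; the device size and the rate in the usage note are hypothetical values chosen for illustration, not figures from the simulation, and the Poisson assumption is ours.

```python
import math


def expected_upsets(rate_per_bit_day: float, bits: int, days: float) -> float:
    """Expected number of upsets for a device with `bits` storage bits
    exposed for `days` days at a constant per-bit upset rate."""
    return rate_per_bit_day * bits * days


def prob_at_least_one(mean_upsets: float) -> float:
    """Probability of one or more upsets, assuming upsets arrive as a
    Poisson process (an assumption made for this sketch)."""
    return 1.0 - math.exp(-mean_upsets)
```

For example, a hypothetical 8,000,000-bit configuration memory at 1.00E-07 upsets per bit-day gives `expected_upsets(1e-7, 8_000_000, 1.0)`, about 0.8 upsets per day. Because the rate varies by several orders of magnitude around the orbit, the same device would see far fewer upsets in the quiet parts of the orbit.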
N-Modular Redundancy
• The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
• This technique can be applied at all levels of the system hierarchy, from circuit to box.
• One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.
Example N-Mod Redundancy
• TMR (Triple Modular Redundancy)
• Typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
[Figure: Module 1, Module 2, and Module 3 feeding a majority voter.]
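The majority voter in the TMR example can be sketched in software as follows. This is an illustrative sketch only, not the voter used in any of the designs mentioned; the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of the redundant
    modules, or None when no value has a majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def bitwise_tmr(a: int, b: int, c: int) -> int:
    """Bit-by-bit two-out-of-three vote, the form usually built in hardware:
    each output bit is set when at least two of the inputs have it set."""
    return (a & b) | (a & c) | (b & c)
```

With three modules, `majority_vote([42, 42, 7])` masks the single faulty output and returns 42, while three mutually disagreeing outputs yield `None`. The cost noted on the slide is visible here: three modules do the work of one.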
Adaptive Fault Tolerance
• Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques.
  - Triple Modular Redundancy (TMR) is typically a hard-wired design approach for Rad Tolerant G4 PPC processors and Xilinx FPGAs
• The effectiveness and performance (MIPS/W) gains that the COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
• EAFTC (Environmentally Adaptive Fault Tolerant Computing) enables the computer subsystem to take advantage of changing orbital environments during a mission life to utilize the COTS processing elements more efficiently as the environment allows. This allows the EAFTC system to adaptively trade performance versus reliability in real time.
[Figure: EAFTC-based system combining software-implemented FT, COTS processing components in a reconfigurable architecture, environmental sensing (radiation, position), and adaptive control algorithms.]
EAFTC Operational Scenario
[Figure: SEU rates and MIPS per Watt versus orbit position, comparing MIPS/Watt for the worst-case design with the average MIPS/Watt for the EAFTC design.]
• EAFTC exploits the relation between SEU rate and orbit position as well as the variable criticality of system tasks
• The fundamental process implemented in the system consists of three steps:
  - measure the environment and system state
  - assess the environmental threat to the application's availability
  - adapt the processing application's configuration (i.e. fault tolerance) to effectively mitigate the threat presented by the environment
• On average, more computation can be performed with less energy using EAFTC
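The measure, assess, adapt steps can be sketched as a small control function. Everything specific here is an assumption made for illustration: the threshold values, the three threat levels, and the mapping from threat level to replica count are hypothetical and are not taken from the EAFTC controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Step 1 (measure): environment and system state."""
    upset_rate: float    # upsets per bit-day reported by the sensor (assumed unit)
    task_critical: bool  # whether the running task is mission critical


def assess(m: Measurement, low: float = 1e-9, high: float = 1e-7) -> str:
    """Step 2 (assess): classify the threat using hypothetical thresholds."""
    if m.upset_rate >= high:
        return "high"
    if m.upset_rate >= low:
        return "medium"
    return "low"


def adapt(threat: str, task_critical: bool) -> int:
    """Step 3 (adapt): choose how many replicas of the task to run."""
    if threat == "high" or (task_critical and threat == "medium"):
        return 3  # triple redundancy with voting
    if threat == "medium" or task_critical:
        return 2  # duplicate and compare
    return 1      # no replication; all nodes do useful work


def control_step(m: Measurement) -> int:
    """One pass through measure -> assess -> adapt."""
    return adapt(assess(m), m.task_critical)
```

In a quiet part of the orbit, `control_step(Measurement(1e-12, False))` selects a single copy, freeing the other nodes for additional work, which is the source of the efficiency gain the slide describes.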
Hardware Architecture
• APC Cluster
  - Consists of several APC nodes
  - Networked together with RapidIO
• Adaptive Processing Computer
  - Reconfigurable based processing node
  - Multiple modes/configurations
  - High-performance COTS processor (PPC)
  - RapidIO network interface
  - Reconfigurable co-processor
• SEU Alarm
  - Provides measure of SEU-inducing flux & particle energy
  - Used by EAFTC controller to determine real-time SEU threat level
  - Separate heavy ion and proton sensors
• System Controller
  - Controller for APC cluster
  - Hosts EAFTC controller software and other experiment-related control software
  - RadHard processor and interfaces for reliable control of the COTS cluster
[Figure: APC cluster block diagram. System Controllers A and B and Data Processors 1 to N on Network A and Network B (N ports), with spacecraft interfaces, instruments, and mission-specific devices. Node blocks: 750FX PowerPC, memory (boot and system), processor controller, FPGA co-processor, high-speed network interface, I/O interface.]
[Figure: SEU alarm block diagram. Ion and proton scintillators with photo detectors, each with threshold, output, alarm analog/digital electronics, and control and data lines, connected to an SSM controller FPGA; thermistor; power (3.3, 1.5, +/-12 V); SSIO; cPCI connector.]
Adaptive Processing Computer Conceptual Block Diagram
[Figure: APC block diagram. PowerPC with processor controller; boot memory (512KB); reprogrammable non-volatile memory with EDAC (128MB); RAM (1GB, error correction with scrubbing); co-processor FPGA; high-speed network switch (3 ports, 3 ports) and network I/F; 32-bit PCI; UART, SSIO, and discretes; clock generation; reset generation and external reset; health and status with power detection and control, current sensor, and temperature sensor; JTAG port.]
EAFTC Application Platform
[Figure: layered software stacks on the two node types.
System Controller: policies and configuration parameters; mission-specific FT control applications; FT middleware; OS; hardware.
Data Processor: application; Application Programming Interface (API) with FT Lib and Co Proc Lib; FT middleware; SAL (System Abstraction Layer); OS; hardware (FPGA, network).
Layer categories: application specific; generic fault tolerant framework; OS/hardware specific.
Callouts: scientific application and application-specific FT; FT manager, EAFTC controller, and job manager; local management agents, replication services, and fault detection.]
EAFTC Middleware
• Provides a high-performance platform for parallel/distributed applications
  - Cluster and job management to provide a single system view to the application
  - Message Passing Interface API
  - Platform abstraction to include OS system calls and hardware registers
  - Mission-level customization through policies
  - Scalable architecture to support clustering of resources on multi-computer system
  - Reconfigurable co-processor devices for application acceleration
• Provides a high-availability platform for applications
  - An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
  - Checkpoint and rollback service for application recovery in the event of a fault
  - Application-level replication services to facilitate reliable deployment of applications in SEU-susceptible COTS processing resources
• EAFTC Middleware offers numerous benefits as a system platform
  - Capitalize on cost savings in the use of commercial hardware
  - Capitalize on latest processing technology through technology refresh
  - Reduces cost and extends system life through a software-based middleware solution
  - Scales to meet system requirements
  - Customizable degree of fault tolerance to meet specific system needs
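The checkpoint and rollback service listed above can be illustrated with a minimal in-memory sketch. The class and function names are hypothetical and do not reflect the EAFTC middleware API; a real service would write checkpoints to protected storage rather than keep a copy in the same memory, and would detect faults by other means than a Python exception.

```python
import copy


class CheckpointedTask:
    """Holds application state plus the last known-good copy of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Record the current state as the recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the recovery point."""
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(task, steps, max_retries=3):
    """Run each step; on a detected fault (modelled here as RuntimeError),
    roll back to the last checkpoint and retry the step."""
    for step in steps:
        for _attempt in range(max_retries + 1):
            try:
                step(task.state)
                task.checkpoint()  # step completed, so advance the recovery point
                break
            except RuntimeError:
                task.rollback()    # undo any partial update made by the step
        else:
            raise RuntimeError("step failed after all retries")
    return task.state
```

A step that corrupts state and then fails is undone by the rollback, so the retry starts from the last consistent state rather than from the partial update.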
EAFTC Software Architecture
[Figure: software stacks joined by network and sideband signals.
System Controller: mission-specific parameters; ESM, JM, FTM; DMS, CMS, AMS, and RDB; VxWorks OS, network stack, and drivers.
Data Processor with FPGA Co-Processor: application process; JMA, FTMA, RS, MPI, CR, FCPS; DMS, CMS, AMS, and RDB; Linux OS and drivers.
Color key: mission-specific components, EAFTC-specific components, self-reliant components, platform components, application components.]
ESM – Environmental Sensor Monitor
JM – Job Management
FTM – Fault Tolerance Manager
MPI – Message Passing Interface
FCPS – FPGA Co-Processor Services
CR – Checkpoint and Rollback
CMS – Cluster Management Services
AMS – Availability Management Services
DMS – Distributed Messaging Services
RDB – Replicated Database
EAFTC Software Components Collaboration
[Figure: collaboration diagram showing the ESM, FTM, JM, and SR components together with JMA, FTMA, RS, and MPI instances serving tasks T1 to T3 and processes P1.1 to P1.3.]
EAFTC Technology Advances to TRL7 Flight Experiment
[Figure: technology validation roadmap from TRL4 to TRL7, with increasing fidelity and capability.]
Testbed: cPCI chassis with power instrumentation
• System Controller (Ganymede): ~150 MIPS
• Data Processors 1 and 2 (Motorola SBC with FPGA PMC): ~10,000 MIPS each
• Data Processors 3 and 4 (Motorola SBC): ~1500 MIPS each
• Experiment controller and data collection
• Gigabit Ethernet switch (1 Gb/s per link; 100 Mb/s) and instrumentation bus
TRL4 Validation
- Demonstrated basic EAFTC technologies in a laboratory environment on COTS hardware testbed including radiation source and sensor
- Environment Sensor
- Alert Generator
- High Availability Middleware
- Replication Services
- NASA adds requirement for fault tolerant cluster and MPI capability
TRL5 Validation
- Demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated Fault Tolerance Services
- Develop predictive models
- Validate and refine predictive models and predictive model parameters with experiment data
- Partial set of canonical fault injection experiments
TRL6 Validation
- Demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware including exposure to radiation beam
- Validate and refine predictive models and predictive model parameters with experiment data
- Complete set of canonical fault injection experiments
TRL7 Validation
- Demonstrate EAFTC technologies in a real space environment
- Validate predictive models and predictive model parameters with experiment data
- TRL7 experiments will be identical to those performed and wrung out during TRL6 demonstration and validation
EAFTC Model Flow
[Figure: model flow linking the Rad Effects Model, Canonical Fault Model, HW SEU Susceptibility Model, Availability & Reliability Models, and Performance Model.]
Models and outputs:
• Rad Effects Model: particle fluxes, energies, and component SEE effects
• Canonical Fault Model: canonical fault types
• HW SEU Susceptibility Model: fault rates for each fault type in the canonical fault model (λn)
• Availability & Reliability Models: availability and reliability
• Performance Model: delivered throughput, delivered throughput density, effective system utilization
Inputs:
• Orbit
• Epoch
• Radiation characterization of components
• System architecture
• HW architecture
Inputs:
• Decomposed HW architecture
• Comprehensive fault model
Inputs:
• Probability that fault affects application
• Detection coverage for each fault/error type in the canonical fault model
• Recovery coverage for each fault/error type in the canonical fault model
• Detection and recovery latencies for each fault
• Number of mode change types and rates
• Time to effect mode change
• Probability that mode change is successful
Inputs:
• Mission application characterization and constraints
• Peak throughput per CPU
• Number of nodes in cluster
• Algorithm/architecture coupling efficiency for application
• Network-level parallelization efficiency
• Measured OS and FT Services overhead
• Measured execution times for applications
TRL4 EAFTC System Technology Demonstration
• Successful demonstration of EAFTC system
• The EAFTC prototype comprises key technology elements
  - Cluster Computer
  - Autonomous Controller
  - Replication Services
• Environment input is simulated via SPENVIS radiation models
• Instrumentation for power utilization is included in the model
• Profiling is integrated on Data Processors for CPU utilization measurement
• Workload is provided via synthetic benchmark application on Data Processors
Computer Capacity Experiment
TMR 3 node system
• average power: 72 Watts
• average system effective MIPS: 973 MIPS
• average system efficiency: 13 MIPS/Watt
EAFTC 4 node system
• average power: 97 Watts
• average system effective MIPS: 2661 MIPS
• average system efficiency: 28 MIPS/Watt
Comparison: 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency
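The comparison percentages follow from the reported averages. The short sketch below reproduces that arithmetic; the helper name is ours, and only the numbers quoted on the slide are used.

```python
def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Values reported for the TMR 3 node and EAFTC 4 node systems.
power_increase = percent_increase(72, 97)      # about 34.7, reported as 35%
mips_increase = percent_increase(973, 2661)    # about 173.5, reported as 173%
efficiency_increase = percent_increase(13, 28) # about 115.4, reported as 115%

# Dividing average MIPS by average power gives slightly different
# efficiencies than the reported 13 and 28 MIPS/Watt.
tmr_ratio = 973 / 72      # about 13.5 MIPS/Watt
eaftc_ratio = 2661 / 97   # about 27.4 MIPS/Watt
```

The reported efficiency figures need not equal the ratio of the two averages; one possible explanation, which the slide does not state, is that efficiency was averaged over time rather than computed from the averaged power and MIPS.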
TRL5 Platform
• Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs)
• Each SBC will implement a PPC 750FX microprocessor running the Linux operating system and a software fault injector for fault simulation.
• Each PMC will implement a Xilinx Virtex2 FPGA that will serve as the co-processor for its host SBC
• The System Controller will be implemented with a software development unit of our flight SBC.
• All nodes in the cluster will be interconnected via a GigE switch.
• A Development Workstation will be used for software development, experiment control, and instrumentation data collection.
• Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults. Other methods may be used, such as manual node resets, network traffic fault injections (via software or hardware fault injection methods), and test port inserted faults
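At its simplest, software implemented fault injection amounts to flipping bits in a target's memory to emulate SEUs. The sketch below shows that idea on a plain byte buffer; it is a toy illustration with hypothetical function names, not the fault injector planned for the testbed, which would have to target processor registers and process memory.

```python
import random


def flip_bit(data: bytearray, bit_index: int) -> None:
    """Invert one bit in place, emulating a single event upset."""
    byte, bit = divmod(bit_index, 8)
    data[byte] ^= 1 << bit


def inject_random_upsets(data: bytearray, count: int, rng: random.Random) -> list[int]:
    """Flip `count` distinct, randomly chosen bits and return their positions
    so the experiment log can record exactly what was injected."""
    positions = rng.sample(range(len(data) * 8), count)
    for position in positions:
        flip_bit(data, position)
    return positions
```

Passing a seeded `random.Random` makes an injection campaign repeatable, which helps when the same set of fault injection experiments has to be rerun on later hardware.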
New Millennium Program Space Technology 8
• New Millennium Program
  - NASA program for technology development
  - Currently working on its 8th technology development program
  - In Formulation phase to evaluate 4 subsystem technologies (one of them EAFTC)
• The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
  - SSR: 7/05
  - PDR: 5/06 (TRL5)
  - CDR: 5/07 (TRL6)
  - Launch: 12/08 (TRL7 after 6 month on-orbit experiment)
• Our team's overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance onboard payload processing.
  - We will demonstrate that by using EAFTC we can maximize and significantly improve the performance of a COTS-based computer in orbit.
Summary
• EAFTC is an enabling technology for high-performance spacecraft computing.
• As part of our NMP-sponsored efforts, a TRL4 system has been demonstrated
• Efforts continue towards a TRL5 system demonstration.