Presentation - IEEE High Performance Extreme Computing

Scrubbing Optimization via Availability Prediction (SOAP) for Reconfigurable Space Computing
HPEC 2012
Quinn Martin
Alan George
SOAP
Background
  FPGAs and Radiation in Space
  Traditional Scrubbing Methods
SOAP Approach
  Mission Parameters
  Markov Models
Mission Case Studies
Results
Conclusions
FPGAs

Field-Programmable Gate Arrays (FPGAs)
  Implement custom digital logic hardware with a fabric of logic resources and interconnect
    Lookup tables (LUTs) implement combinational logic
    User flip-flops (FFs) implement sequential logic
    Switch and connection boxes route signals among resources
  Many are reconfigurable
    Reconfiguration allows update of routing and logic state
    Partial reconfiguration can update one partition of the device
    E.g., Virtex from Xilinx and Stratix from Altera

Reconfigurable FPGAs in Space

Advantages
  Very high performance/power ratio
  Reconfigurable (fully and partially)
    Adaptable to changing environments and mission requirements
    Can update design after launch
Disadvantages
  Relatively difficult to design/test applications
  Configuration memory vulnerable to radiation
    Upsets can change application processor architecture in unpredictable ways
    Must repair upsets via configuration scrubbing

Radiation Effects on FPGAs
Single-event Effects (SEE)
  Single-event Latchup (SEL) – Causes current spike that may damage device
  Single-event Upset (SEU) – Changes state of bit(s), e.g., from logic ‘0’ to ‘1’
    Can be single-bit upset (SBU) or multi-bit upset (MBU)
  Single-event Functional Interrupt (SEFI) – Like SEU, but affecting critical device resource
Total Ionizing Dose
  Degrades performance over time, leading to eventual device failure

Xilinx V-5/V-6 Configuration
Programmed via SelectMAP interface
  Runtime configuration interface
  Also allows readback of existing configuration
  32 bits per configuration word
  Parallel bus width of 8, 16, or 32 bits
  Max clock frequency 100 MHz
Configuration memory arranged in frames
  Frame is the minimum unit of access to config. memory
  Virtex-5 – 41 words per frame
  Virtex-6 – 81 words per frame

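The word and frame sizes above fix how many bits one frame access has to move. The sketch below is a rough, idealized estimate, not a figure from the presentation. It assumes one bus-width transfer per configuration clock cycle and ignores all command and addressing overhead. The function names are hypothetical. The 8-bit, 33 MHz setting is the one used later in the case study.

```python
WORD_BITS = 32                                      # bits per configuration word (from the slide)
WORDS_PER_FRAME = {"Virtex-5": 41, "Virtex-6": 81}  # words per frame (from the slide)


def frame_bits(device):
    """Number of configuration bits in one frame of the given device."""
    return WORDS_PER_FRAME[device] * WORD_BITS


def frame_transfer_time_s(device, bus_width_bits, cclk_hz):
    """Idealized time to move one frame over the configuration bus.

    Assumption: one bus-width transfer per clock cycle, no protocol overhead.
    """
    cycles = frame_bits(device) / bus_width_bits
    return cycles / cclk_hz


for device in WORDS_PER_FRAME:
    t = frame_transfer_time_s(device, bus_width_bits=8, cclk_hz=33e6)
    print(f"{device}: {frame_bits(device)} bits/frame, {t * 1e6:.2f} us per frame")
```

Under these assumptions a wider bus or a faster clock shortens each frame access in direct proportion, which is why bus width and clock speed appear later as free system parameters.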
FPGA Scrubbing
FPGA Configuration Scrubbing
  Quickly repairs SEUs before accumulation
    Accumulation defeats redundancy strategies (e.g., TMR)
    Fast repair can prevent SEUs from manifesting as errors
  Can be decomposed into basic scrubbing techniques
    Correction techniques repair upsets
    Detection techniques discover and locate upsets

FPGA Scrubbing Techniques

Correction Techniques
  Golden Copy – Repairs configuration based on known “golden” copy (e.g., in rad-hard PROM)
  Frame ECC – Repairs based on per-frame error syndrome code stored on-chip
Detection Techniques
  Frame ECC – Detects based on per-frame SECDED Hamming code
  CRC-32 – Detects using device-wide CRC-32

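To illustrate checksum-based detection, the sketch below flips one bit in a made-up configuration image and shows that a CRC-32 over the whole image no longer matches the reference value. It uses `zlib.crc32` from the Python standard library as a stand-in. The on-chip CRC hardware may use a different polynomial and bit ordering, and the image contents and the `flip_bit` helper are hypothetical.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a single-bit upset."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


golden = bytes(range(256)) * 4           # hypothetical 1 KiB configuration image
golden_crc = zlib.crc32(golden)          # reference checksum computed once

upset = flip_bit(golden, 1234)           # inject a single-bit upset

print(zlib.crc32(golden) == golden_crc)  # intact image matches the reference
print(zlib.crc32(upset) == golden_crc)   # upset image does not match
```

A single image-wide checksum like this signals that something changed but does not by itself say where, which is one reason a per-frame code is attractive when the location of the upset matters.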
FPGA Scrubbing Strategies
Scrubbing Strategies
  Any combination of detection and correction techniques with controller to implement algorithm
  Blind Scrubbing – Golden copy correction only
  Readback Scrubbing – Some detection technique used

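The difference between the two strategies can be sketched in a few lines. This is a simplified software model under stated assumptions, not the controller logic from the presentation. Frames are plain byte strings, detection is a CRC-32 over all frames (computed with `zlib.crc32` as a stand-in for the device's checksum), and every function name is hypothetical.

```python
import zlib


def blind_scrub(device_frames, golden_frames):
    """Blind scrubbing: rewrite every frame from the golden copy, with no detection."""
    for i, frame in enumerate(golden_frames):
        device_frames[i] = frame
    return len(golden_frames)  # number of frames written


def device_crc(frames):
    """Running CRC-32 over all frames, standing in for a device-wide checksum."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)
    return crc


def readback_scrub(device_frames, golden_frames, golden_crc):
    """Readback scrubbing: read back, check the CRC, and correct only on a mismatch."""
    if device_crc(device_frames) == golden_crc:
        return 0  # nothing detected, nothing written
    return blind_scrub(device_frames, golden_frames)


# Eight hypothetical frames of 164 bytes each (41 words x 4 bytes).
golden = [bytes([i]) * 164 for i in range(8)]
golden_crc = device_crc(golden)

device = list(golden)
damaged = bytearray(device[3])
damaged[0] ^= 0x01                      # inject a single-bit upset into frame 3
device[3] = bytes(damaged)

print(readback_scrub(device, golden, golden_crc))  # upset detected, frames rewritten
print(readback_scrub(device, golden, golden_crc))  # clean pass, no writes
```

In this model the readback variant writes only when an upset is found, while the blind variant writes on every pass regardless of device state.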
SOAP Approach

Scrubbing Optimization via Availability Prediction (SOAP)
  Uses system availability as primary metric for scrubbing efficacy
  Models scrubbing strategies as Markov diagrams
  Varies free parameters to find optimal scrubbing system
    Environmental parameters λ and α (orbits)
    System parameters B and fCCLK (memory and pin constraints)
    Scrubbing parameters μ and γ (device configuration capability)

Environmental Parameters

λ – SEU rates for devices in various orbits of interest
  Calculated per-bit and per-device using CREME96
α – Correction factors for single-bit and multi-bit upsets (SBU/MBU)
  From beam tests on Virtex-5 devices

System Parameters
Factors chosen by the system designer based on available memories, power budget, etc.
Affect scrubbing detection and correction rates (see equations on next slide)
B – Configuration bus width in bits
fCCLK – Configuration clock speed in Hz

Scrubbing Parameters
μ – Repair rate for scrubbing technique (per second)
γ – Detection rate for scrubbing technique (per second)

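As a rough illustration of how rates such as λ, γ, and μ can combine into an availability figure, the sketch below solves two generic cyclic Markov chains in steady state. One is repair-only (up, then upset, then up again). The other is detect-then-repair (up, then undetected upset, then detected upset, then up again). This is a textbook calculation and not the presentation's actual models, and all numeric rates and function names are hypothetical placeholders.

```python
def cyclic_steady_state(rates):
    """Steady-state probabilities for a chain 0 -> 1 -> ... -> n-1 -> 0.

    rates[i] is the exit rate (per second) of state i. In a pure cycle the
    probability of each state is proportional to its mean dwell time 1/rate.
    """
    dwell = [1.0 / r for r in rates]
    total = sum(dwell)
    return [d / total for d in dwell]


def availability_repair_only(lam, mu):
    """Two states: up --lam--> upset --mu--> up. Only 'up' counts as available."""
    return cyclic_steady_state([lam, mu])[0]


def availability_detect_then_repair(lam, gamma, mu):
    """Three states: up --lam--> undetected --gamma--> detected --mu--> up."""
    return cyclic_steady_state([lam, gamma, mu])[0]


lam = 1e-5    # hypothetical upset rate (per second)
gamma = 2.0   # hypothetical detection rate (per second)
mu = 1.0      # hypothetical repair rate (per second)

print(availability_repair_only(lam, mu))
print(availability_detect_then_repair(lam, gamma, mu))
```

For the two-state case the result reduces to the familiar μ / (λ + μ). Adding a detection stage adds dwell time outside the up state, so in this simple model faster detection and repair both raise availability.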
Markov Algorithm Models

Blind
  No detection
Built-in CRC-32
  Basic detection
Frame ECC with CRC-32
  CRC acts as “safety net” for upsets undetected by Frame ECC
Frame ECC with CRC-32 and Essential Bits (EB)
  Only scrubs errors that may be critical

Blind Scrubbing
Readback CRC-32 Scrubbing
CRC-32 w/ Frame ECC Scrubbing

Case Study
Applies SOAP method to hypothetical systems with realistic parameters
Devices
  Xilinx Virtex-5
  Xilinx Virtex-6
Orbits
  ISS low Earth orbit (LEO)
  Molniya highly elliptical orbit (HEO)
8-bit SelectMAP bus at 33 MHz
  Accounts for access speed of slow rad-hard PROM

Case Study

Two mission types
  Non-upset-critical (non-UC) – System continues to run upon detection and correction of upset
    Only critical upsets count as system “unavailable”
  Upset-critical (UC) – System requires reset upon detection of upset to ensure state integrity
    Requires detection
    All detected upsets render system unavailable for reset period
    Will benefit from essential bits mask used in detection

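The upset-critical case can be illustrated with a very simple alternating model in which every detected upset costs one reset period. The sketch below is an assumption-laden illustration, not the presentation's model. The upset rate, reset period, and essential-bit fraction are hypothetical values chosen only to show the arithmetic, and the helper names are made up.

```python
import math


def uc_availability(detected_upset_rate, reset_time_s):
    """Availability when each detected upset forces a reset.

    Mean up time is 1/rate and mean down time is the reset period.
    """
    up_time = 1.0 / detected_upset_rate
    return up_time / (up_time + reset_time_s)


def nines(availability):
    """Express availability as a count of nines, e.g. 0.999 is about 3."""
    return -math.log10(1.0 - availability)


rate = 1e-4               # hypothetical detected upsets per second
reset_time = 10.0         # hypothetical reset period in seconds
essential_fraction = 0.1  # hypothetical share of upsets that hit essential bits

a_all = uc_availability(rate, reset_time)
a_eb = uc_availability(rate * essential_fraction, reset_time)

print(f"no mask: {a_all:.6f} ({nines(a_all):.2f} nines)")
print(f"EB mask: {a_eb:.6f} ({nines(a_eb):.2f} nines)")
```

In this toy model, masking detection down to a tenth of the upsets cuts the reset rate tenfold and adds about one nine of availability. The real benefit depends on the actual essential-bit fraction of the design.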
Non-UC Results
Continuous blind scrubbing offers highest availability
CRC-32 offers similar availability with low implementation complexity
Frame ECC suffers because triple-bit upsets (TBUs) can be falsely corrected, resulting in further errors

UC Results
Results

Frame ECC with CRC-32 and Essential Bits mask offers highest availability
  Roughly one extra nine over other methods
  Xilinx-provided soft-error mitigation (SEM) core implements similar strategy
Other strategies still competitive
  Complex state machine or software and additional memory required for Frame ECC/EB
  Model does not account for vulnerability associated with internal scrubbing

Conclusions
Predicts availability for various FPGA scrubbing strategies on real and hypothetical platforms
Uses analytical models rather than experimentation
Markov availability modeling with parametric approach
Allows optimization of scrubbing strategy during design phase
In case study, blind scrubbing best for non-UC and Frame ECC with EB mask best for UC