Architectural Optimizations for Low-Power
Real-Time Speech Recognition
Rajeev Krishna, Scott Mahlke, Todd Austin
Advanced Computer Architecture Lab
University of Michigan
CASES 2003
Rajeev Krishna
What is Speech Recognition?
• Large vocabulary, speaker independent
• Representative of the class of natural I/O applications
• Complicated by natural variations in acoustics and meaning
• Performance constraints preclude use in portable systems
[Figure: words per minute (y-axis, 0 to 250) by processor type (SA-1110 206 MHz, XScale 400 MHz, PIII 600 MHz, PIII 900 MHz, PIII 1 GHz), with reference levels for excited speech and unexcited speech. Bar annotations include 6 hrs, 2 hrs, 14 min, 7 min, and 6 min.]
Performance Characteristics
• Signal processing (DSP style)
• Search (Gaussian scoring, model evaluation)
• Hidden Markov Models used to describe language
• Characteristics violate design assumptions of modern processors
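The "Gaussian scoring" step named above can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance Gaussian mixtures (an assumption made here; the slides do not give the model form). The function names and parameters are hypothetical and are not taken from the recognizer's code.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_score(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture, using log-sum-exp
    so that very small component likelihoods do not underflow."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical 2-dimensional feature vector scored against a 2-component mixture.
score = log_mixture_score(
    [0.1, -0.3],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

Each score needs the mean and variance of every component, so the work per frame is dominated by streaming model parameters rather than by control flow.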
The Source of the Problem: Search
“Their Car” = DH EH R [word] K AA R
[Figure, built up over successive slides: the search graph for this phoneme sequence. It starts from the single phoneme DH, annotated P(“DH”), then branches to alternative phonemes (EH, AX, IH, AH, IY, R) with the word labels “Their”, “The”, and “Ear” ahead of the [word] boundary. After the boundary it branches again (K, AA, AE, P, R, T) with the word labels “Car”, “Cap”, and “Cat”. The last builds drop the word labels and show the set of candidate phonemes growing at every step.]
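The growth of candidate phonemes in the search slides can be illustrated with a small token-passing Viterbi search with beam pruning. This is a simplified sketch, not the recognizer's implementation: it assumes one state per phoneme and one frame per phoneme, and every score, the `beam` parameter, and the function name are hypothetical.

```python
import math

def viterbi_beam(frames, transitions, start, beam):
    """Token-passing Viterbi search with beam pruning.

    frames:      list of dicts, phoneme -> acoustic log score for that frame
    transitions: dict, phoneme -> list of (next_phoneme, log transition score)
    start:       phoneme assumed for the first frame
    beam:        states scoring more than `beam` below the best are dropped
    Returns the best phoneme path and the number of active states per frame.
    """
    missing = -math.inf
    active = {start: (frames[0].get(start, missing), [start])}
    active_counts = [1]
    for frame in frames[1:]:
        candidates = {}
        for state, (score, path) in active.items():
            for nxt, log_p in transitions.get(state, []):
                new_score = score + log_p + frame.get(nxt, missing)
                # Keep only the best-scoring token that reaches each state.
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        if not candidates:
            raise ValueError("no transitions out of the active states")
        best = max(s for s, _ in candidates.values())
        active = {st: sp for st, sp in candidates.items() if sp[0] >= best - beam}
        active_counts.append(len(active))
    best_state = max(active, key=lambda st: active[st][0])
    return active[best_state][1], active_counts

# Hypothetical scores for the first three phonemes of "Their".
frames = [{"DH": -1.0}, {"EH": -1.0, "AX": -2.0, "IY": -6.0}, {"R": -1.0}]
transitions = {
    "DH": [("EH", -1.0), ("AX", -1.0), ("IY", -1.0)],
    "EH": [("R", -0.5)],
    "AX": [("R", -0.5)],
    "IY": [("R", -0.5)],
}
path, counts = viterbi_beam(frames, transitions, "DH", beam=3.0)
```

Widening the beam keeps more hypotheses alive per frame, which is the fan-out the slides animate. The work available at each step is large but irregular.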
This Work
• Focus on exposing parallelism
• Architectural Model
– Hybrid CMP/SMT architecture
• Programming Model
– Programmer exposes concurrency
– Architecture matches to resource availability
• Analysis of bottlenecks
– Parallelization Overhead
– Communication Overhead
– Architectural Constraints
– Memory Constraints
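As a software analogy for the programming model above, the sketch below has the programmer express work as independent tasks while the runtime maps them onto however many contexts exist. It uses Python's standard thread pool as a stand-in for the hardware mechanism, which the slides do not detail. `evaluate_model`, `run_frame`, and `num_contexts` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_model(model_id, frame_score):
    """Stand-in for one independent unit of search work (hypothetical)."""
    return model_id, frame_score * 2

def run_frame(tasks, num_contexts):
    """Run independent tasks on however many contexts are available.

    The task list (the exposed concurrency) does not change with
    num_contexts; only the mapping onto execution resources does.
    """
    with ThreadPoolExecutor(max_workers=num_contexts) as pool:
        futures = [pool.submit(evaluate_model, m, s) for m, s in tasks]
        return dict(f.result() for f in futures)

tasks = [(0, 1.0), (1, 2.0), (2, 3.0)]
one_context = run_frame(tasks, num_contexts=1)
four_contexts = run_frame(tasks, num_contexts=4)
```

The same task list gives the same results on one context or four, which is the property that lets the architecture match the work to resource availability.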
Architectural Model - Overview
• Base XScale 400 MHz embedded processor
• Speech processing unit
• Memory System Interface
Architectural Model – Processing Element
• Execution model based on simple integer pipeline
• Per-thread register contexts
• Control logic
• Small cache
Performance Analysis
• Detailed multiprocessor simulator based on SimpleScalar/ARM
• Hand-parallelized copy of the CMU-SPHINX library
– First-cut static load balancing via hMETIS
• Ideal Memory System
– Fixed memory latency (100 processor cycles), unlimited bandwidth
• True Memory System
– Detailed SDRAM simulator by Wang/Jacob (University of Maryland)
• Workload Energy Consumption
– Combine estimates from multiple sources (details in paper).
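To make the idea of static load balancing concrete, here is a small greedy sketch that assigns weighted work items to processing elements ahead of time. This is not hMETIS, which partitions hypergraphs and also accounts for connectivity between items. It is a simpler longest-processing-time heuristic swapped in for illustration, and the item names and weights are hypothetical.

```python
import heapq

def greedy_partition(weights, num_parts):
    """Assign items to parts, heaviest first, always to the lightest part.

    weights:   dict, item name -> estimated work
    num_parts: number of processing elements
    Returns (assignment dict, list of per-part loads).
    """
    heap = [(0, p) for p in range(num_parts)]  # (current load, part index)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending weight, then by name so ties are deterministic.
    for item, w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, part = heapq.heappop(heap)
        assignment[item] = part
        heapq.heappush(heap, (load + w, part))
    loads = [0] * num_parts
    for item, part in assignment.items():
        loads[part] += weights[item]
    return assignment, loads

# Hypothetical work estimates for four groups of models.
assignment, loads = greedy_partition({"a": 5, "b": 4, "c": 3, "d": 2}, 2)
```

A static assignment like this is fixed before the run, so its quality depends on how well the weights predict the actual work.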
Idealized Performance
• Idealized Model:
– Free inter-processor communication
– 100-cycle memory latency
– Unlimited bandwidth
• 40% overhead
• Multi-threading effective
Idealized Workload Energy Consumption
• Energy for ideal system
• Reduction in energy due to reduced time dissipating static power
• Demonstrates potential to offset increased energy consumption of hardware
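The static-power argument above can be written as a one-line energy model: workload energy is the energy spent switching plus static power multiplied by runtime. The sketch below assumes that split. All numbers are hypothetical and chosen only to show the direction of the effect; they are not results from the paper.

```python
def workload_energy(dynamic_energy_j, static_power_w, runtime_s):
    """Total workload energy in joules: switching energy plus static
    power dissipated for the whole runtime."""
    return dynamic_energy_j + static_power_w * runtime_s

# Hypothetical baseline: one processor, long runtime.
baseline = workload_energy(dynamic_energy_j=10.0, static_power_w=0.5, runtime_s=20.0)

# Hypothetical parallel design: extra hardware raises static power and adds
# some switching overhead, but the workload finishes four times sooner.
parallel = workload_energy(dynamic_energy_j=11.0, static_power_w=0.8, runtime_s=5.0)
```

Under these made-up numbers the parallel design uses less total energy despite its higher power draw, because it spends far less time dissipating static power.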
Tolerance of Memory Delays
• Relative performance with 100-cycle memory latency compared to 50-cycle memory latency
• Still unlimited bandwidth
• Added contexts tolerate much of the added delay
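A common first-order model of multithreading shows why added contexts can hide memory latency. It assumes each thread computes for a fixed number of cycles and then stalls for one memory access, and that the pipeline stays busy whenever some context is ready. This is a textbook approximation, not the simulator used in the study, and `compute_cycles` is a hypothetical parameter.

```python
def pipeline_utilization(contexts, compute_cycles, memory_latency):
    """Fraction of cycles doing useful work when each thread alternates
    compute_cycles of work with one memory stall of memory_latency cycles."""
    return min(1.0, contexts * compute_cycles / (compute_cycles + memory_latency))

def relative_performance(contexts, compute_cycles, slow=100, fast=50):
    """Performance at the slow memory latency relative to the fast one."""
    return (pipeline_utilization(contexts, compute_cycles, slow)
            / pipeline_utilization(contexts, compute_cycles, fast))

# Hypothetical: 25 cycles of work between memory accesses.
for n in (1, 2, 4, 8):
    print(n, round(relative_performance(n, 25), 2))
```

In this model a single context loses a large share of its performance when latency doubles, while enough contexts keep the pipeline saturated at either latency.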
Variations that Affected Performance
• Static Partition Quality
– 15-20% speedup with profile-based partition
• Dynamic Load Balancing
– 10% speedup with few contexts
• Work Queue Size
– 10% speedup with small work queue with few contexts
Variations that Did Not Affect Performance
• Thread Spawn Latency
– Large latency has minimal impact
• Control Network
– Constraining to an 8-bit bus with 2-cycle protocol overhead has minimal impact
• Global Locking
– Performance effect is minor relative to other factors
– Easily tolerated by added contexts
Full Memory System Simulation
Performance and energy for 100 MHz, 1-channel, 64-bit DRAM
DRAM Request Rate
200 MHz vs. 100 MHz
2 simultaneous requests vs. 1
DRAM Transfer / Resource Conflict Rate
Data placement by bank vs. standard
16-byte channel width vs. 8-byte
The Punch Line: Request Rate is key.
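For reference, the peak transfer bandwidth of the DRAM configurations named in these slides follows from clock rate times channel width. The sketch assumes one transfer per clock per channel (single-data-rate SDRAM), which is an assumption made here, and the function name is hypothetical.

```python
def peak_bandwidth_mb_s(clock_mhz, width_bytes, channels=1, transfers_per_clock=1):
    """Peak DRAM bandwidth in MB/s (10**6 bytes per second).

    MHz times bytes per transfer gives MB/s directly.
    """
    return clock_mhz * width_bytes * channels * transfers_per_clock

baseline = peak_bandwidth_mb_s(100, 8)       # 100 MHz, 64-bit channel
faster_clock = peak_bandwidth_mb_s(200, 8)   # 200 MHz, 64-bit channel
wider_channel = peak_bandwidth_mb_s(100, 16) # 100 MHz, 16-byte channel
```

Doubling the clock and doubling the channel width give the same peak figure, so peak bandwidth alone cannot separate the two. The slides point to request rate as the factor that matters.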
Future Directions
• Focus on memory system optimizations
– Partition reference stream between mutable and immutable data
– Potential benefit to a large level-2 cache
– Processor-on-memory: shift delay to control network
• Domain-specific ISA extensions
– Reduce need for memory by adding computation capability to processors
– Reduce instructions executed, improve efficiency
Summary and Conclusion
• This paper:
– presents a hybrid SMT/CMP architecture for low-power continuous speech recognition
– evaluates performance and bottlenecks under a number of architectural constraints
– focuses on evaluation of parallelism
• Architectural Constraints
– SMT capabilities tolerate a number of system latencies
– Programming model is effective at exploiting concurrency
• Memory System Constraints
– Memory system bandwidth is the most significant performance bottleneck
Questions
Knowledge Base
• Language model generated with Cambridge Statistical Modeling Toolkit
• Input corpus from famous speeches and text from Project Gutenberg
• Experiments performed with 11,400-word vocabulary
• Worst-case results of trial inputs
Energy Estimation
• XScale Power
– PXA250 power consumption (active / idle) from product datasheet
• Processing Element
– Conservative area-scale estimate of relevant XScale die area
• Cache / Register Contexts
– Cache active energy taken from CACTI 3; idle energy is approximately 25% of active
– Thread register contexts also taken from CACTI 3, compared to area estimate
• RAM
– Micron Technologies SDRAM system power estimator
– Considers Rd/Wr, Active, Precharge, Background, and Refresh power