Architectural Optimizations for Low-Power Real-Time Speech Recognition
Rajeev Krishna, Scott Mahlke, Todd Austin
Advanced Computer Architecture Lab, University of Michigan
CASES 2003

What is Speech Recognition?
• Large vocabulary, speaker independent
• Representative of the class of natural I/O applications
• Complicated by natural variations in acoustics and meaning
• Performance constraints preclude use in portable systems
[Figure: recognition rate in Words per Minute (axis 0 to 250) by Processor Type: SA-1110 206MHz, XScale 400MHz, PIII 600MHz, PIII 900MHz, PIII 1GHz. Reference levels marked for excited and unexcited speech. Annotations on the chart: 7 min, 6 min, 14 min, 2 hrs, 6 hrs.]

Performance Characteristics
• Signal Processing (DSP Style)
• Search (Gaussian Scoring, Model Evaluation)
• Hidden Markov Models used to describe language
• Characteristics violate design assumptions of modern processors

The Source of the Problem: Search
[Figure: animation spread over several slides. The utterance “Their Car” = DH EH R [word] K AA R is decoded phoneme by phoneme. The first frame scores P(“DH”); later frames add the alternative phonemes EH, AX, IH, AH, IY and R after DH, the candidate words “Their”, “The” and “Ear”, and the [word] boundary that follows them.]
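The search slides show hypotheses fanning out from DH toward several candidate words. A minimal sketch of that idea is a Viterbi-style beam search over a small pronunciation lexicon. This is an illustration under stated assumptions, not the CMU-SPHINX implementation: the function `beam_search`, the `LEXICON` entries for "the" and "ear", the per-frame scores in `FRAMES`, and the beam width are all hypothetical. Only the pronunciation DH EH R for "their" comes from the slides.

```python
import math

# Hypothetical lexicon; only "their" = DH EH R is taken from the slides.
LEXICON = {
    "their": ["DH", "EH", "R"],
    "the": ["DH", "AX"],
    "ear": ["IH", "R"],
}

# Hypothetical per-frame log-likelihoods for each phoneme (made-up numbers).
FRAMES = [
    {"DH": -0.1, "IH": -3.0},
    {"EH": -0.2, "AX": -1.5, "IH": -2.5, "DH": -4.0},
    {"R": -0.1, "AX": -3.0, "EH": -4.0},
]


def beam_search(frames, lexicon, beam=5.0):
    """Return (best_word, log_score) for a non-empty single-word utterance.

    A hypothesis is (word, index of the current phoneme). On every frame a
    hypothesis either stays on its phoneme or advances to the next one.
    Hypotheses scoring more than `beam` below the best one are pruned.
    """
    floor = -1e9  # score for a phoneme that the frame does not list
    active = {
        (word, 0): frames[0].get(phones[0], floor)
        for word, phones in lexicon.items()
    }
    for frame in frames[1:]:
        expanded = {}
        for (word, i), score in active.items():
            phones = lexicon[word]
            for j in (i, i + 1):  # stay on phoneme i, or advance to i + 1
                if j < len(phones):
                    s = score + frame.get(phones[j], floor)
                    if s > expanded.get((word, j), -math.inf):
                        expanded[(word, j)] = s
        best = max(expanded.values())
        # Beam pruning: this is what keeps the fan-out manageable.
        active = {h: s for h, s in expanded.items() if s >= best - beam}
    # Only hypotheses that reached the last phoneme of their word are complete.
    finished = {
        word: s for (word, i), s in active.items()
        if i == len(lexicon[word]) - 1
    }
    if not finished:
        return None, -math.inf
    word = max(finished, key=finished.get)
    return word, finished[word]


word, score = beam_search(FRAMES, LEXICON)
print(word)
```

Each active hypothesis is independent work within a frame, which is the kind of concurrency a multi-context architecture could exploit; the number of active hypotheses per frame is what drives both compute and memory traffic.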
The Source of the Problem: Search (continued)
[Figure: further animation frames. After the [word] boundary the candidates include “Car”, “Cap” and “Cat”; in the final frames the lattice fans out to dozens of competing phonemes.]

This Work
• Focus on exposing parallelism
• Architectural Model
  – Hybrid CMP/SMT architecture
• Programming Model
  – Programmer exposes concurrency
  – Architecture matches to resource availability
• Analysis of bottlenecks
  – Parallelization Overhead
  – Communication Overhead
  – Architectural Constraints
  – Memory Constraints

Architectural Model - Overview
• Base XScale 400MHz Embedded Processor
• Speech processing unit
• Memory System Interface

Architectural Model – Processing Element
• Execution model based on simple integer pipeline
• Per-thread register contexts
• Control logic
• Small cache

Performance Analysis
• Detailed multiprocessor simulator based on SimpleScalar/ARM
• Hand-parallelized copy of CMU-SPHINX library
  – First-cut static load balancing via hMetis
• Ideal Memory System
  – Fixed memory latency (100 processor cycles), unlimited bandwidth
• True Memory System
  – Detailed SDRAM simulator by Wang/Jacobs (University of Maryland)
• Workload Energy Consumption
  – Combine estimates from multiple sources (details in paper)

Idealized Performance
• Idealized Model:
  – Free inter-processor communication
  – 100 cycle memory latency
  – Unlimited bandwidth
• 40% overhead
• Multi-threading effective

Idealized Workload Energy Consumption
• Energy for ideal system
• Reduction in energy due to reduced time dissipating static power
• Demonstrates potential to offset increased energy consumption of hardware

Tolerance of Memory Delays
• Relative performance of 100 cycle memory latency compared to 50 cycle memory latency
• Still unlimited bandwidth
• Added contexts tolerate much of the added delay

Variations that Affected Performance
• Static Partition Quality
  – 15-20% speedup with profile-based partition
• Dynamic Load Balancing
  – 10% speedup with few contexts
• Work Queue Size
  – 10% speedup with small work queue with few contexts

Variations that Did Not Affect Performance
• Thread Spawn Latency
  – Large latency has minimal impact
• Control Network
  – Constraining to 8-bit bus with 2-cycle protocol overhead has minimal impact
• Global Locking
  – Performance effect is minor relative to other factors
  – Easily tolerated by added contexts

Full Memory System Simulation
[Figure: performance and energy for 100MHz, 1 channel, 64-bit DRAM]

DRAM Request Rate
[Figures: 200MHz vs. 100MHz; 2 simultaneous requests vs. 1]

DRAM Transfer / Resource Conflict Rate
[Figures: data placement by bank vs. standard; 16-byte channel width vs. 8-byte]
The Punch Line: Request Rate is key.

Future Directions
• Focus on memory system optimizations
  – Partition reference stream between mutable and immutable data
  – Potential benefit to a large level-2 cache
  – Processor-on-memory: shift delay to control network
• Domain-specific ISA extensions
  – Reduce need for memory by adding computation capability to processors
  – Reduce instructions executed, improve efficiency

Summary and Conclusion
• This paper:
  – Presents a hybrid SMT/CMP architecture for low-power continuous speech recognition
  – Evaluates performance and bottlenecks under a number of architectural constraints
  – Focuses on evaluation of parallelism
• Architectural Constraints
  – SMT capabilities tolerate a number of system latencies
  – Programming model is effective at exploiting concurrency
• Memory System Constraints
  – Memory system bandwidth is the most significant performance bottleneck

Questions?
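The claim that added thread contexts tolerate memory latency can be illustrated with a first-order textbook model of multithreading. This is not the authors' simulator, only a back-of-the-envelope sketch. It assumes each context computes for `run_cycles` between memory references, then stalls for the full memory latency, and that context switches are free. The value `run_cycles=25` is hypothetical; the 50 and 100 cycle latencies are the ones compared in the talk.

```python
def utilization(contexts, run_cycles, miss_latency):
    """Fraction of cycles the pipeline does useful work.

    One context is busy run_cycles out of every (run_cycles + miss_latency)
    cycles; additional contexts fill the stall time until the pipeline
    saturates at 1.0.
    """
    return min(1.0, contexts * run_cycles / (run_cycles + miss_latency))


def relative_performance(contexts, run_cycles, slow=100, fast=50):
    """Performance at the slow memory latency relative to the fast one."""
    return (utilization(contexts, run_cycles, slow)
            / utilization(contexts, run_cycles, fast))


# run_cycles=25 is an assumed value chosen only for illustration.
for n in (1, 2, 4, 5, 8):
    print(n, round(relative_performance(n, run_cycles=25), 3))
```

Under these assumptions a single context loses a large share of its performance when latency doubles, while a handful of contexts hides the extra delay entirely. The model assumes unlimited bandwidth, so it says nothing about the request-rate bottleneck seen with the full DRAM simulation.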
Knowledge Base
• Language model generated with Cambridge Statistical Modeling Toolkit
• Input corpus from famous speeches and text from Project Gutenberg
• Experiments performed with 11400 word vocabulary
• Worst case results of trial inputs

Energy Estimation
• XScale Power
  – PXA250 power consumption (active / idle) from product datasheet
• Processing Element
  – Conservative area-scale estimate of relevant XScale die area
• Cache / Register Contexts
  – Cache active energy taken from Cacti 3, idle ~ 25% of active
  – Thread register contexts also taken from Cacti 3, compared to area estimate
• RAM
  – Micron Technology SDRAM system power estimator
  – Considers Rd/Wr, Active, Precharge, Background, and Refresh power
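The energy estimation approach above combines per-component active and idle power figures over the workload's run time. A minimal sketch of that bookkeeping follows. The `Component` class, the function `workload_energy_mj`, and every power number are hypothetical placeholders, not values from the paper; the only figure borrowed from the slide is the rule that cache idle power is about 25% of active power.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    active_mw: float  # power while doing work, in milliwatts
    idle_mw: float    # power while idle, in milliwatts


def workload_energy_mj(components, runtime_s, active_fraction):
    """Total energy in millijoules (mW x s) over one workload run.

    active_fraction maps a component name to the share of the run time the
    component spends active; unlisted components are treated as always idle.
    """
    total = 0.0
    for c in components:
        f = active_fraction.get(c.name, 0.0)
        total += runtime_s * (f * c.active_mw + (1.0 - f) * c.idle_mw)
    return total


# All numbers below are made up for illustration.
parts = [
    Component("core", active_mw=400.0, idle_mw=100.0),
    Component("cache", active_mw=100.0, idle_mw=25.0),  # idle = 25% of active
]
energy = workload_energy_mj(parts, runtime_s=10.0,
                            active_fraction={"core": 1.0, "cache": 0.5})
print(energy)
```

Because idle power is charged for the whole run time, finishing the same work sooner reduces the idle term, which is the effect the idealized energy slide attributes to reduced time dissipating static power.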