Two Case Studies in Predictable Application Scheduling Using Rialto/NT

Michael B. Jones – Microsoft Research
John Regehr – University of Virginia
Stefan Saroiu – University of Washington

Application Case Studies
- Two applications needing predictable execution on Windows 2000:
  - Soft Modem Driver
  - Digital Audio Player
- The case studies:
  - analyze behavior on normal Windows 2000
  - study improvements possible using the Rialto/NT CPU Reservation mechanism

Consumer Real-Time
- General-purpose operating systems, such as Windows 2000:
  - maximize aggregate throughput
  - approximate fair sharing of the resources
- Increasing use of time-dependent tasks:
  - signal processing, audio, video
- Need support for:
  - predictable scheduling for independently developed applications
  - low latency responses
  - explicit resource allocation mechanisms

Rialto/NT Abstractions
- Two real-time software abstractions:
  - CPU Reservations – ongoing reservation for at least X time units out of every Y units for a thread
  - Time Constraints – one-shot time reservation for a specified amount of work between a start time and a deadline
- Case studies use only CPU Reservations

Rialto/NT Implementation
- Rialto/NT developed on top of the Windows 2000 priority scheduler
- Limitations:
  - CPU Reservations must be integer multiples of milliseconds
  - Frequency of reservations must be a power-of-two multiple of 1 ms

First Case Study
Predictable Scheduling for a Soft Modem

Why Study Soft Modems?
- Signal processing done on host CPU:
  - while coexisting with other system activities
  - requires predictable scheduling
  - requires low latency responses
- Soft Modem is a background real-time task
- Successful in home computer market:
  - Low cost
  - Easy to update – software upgrade

Methodology
- Measurement environment: instrumented Windows 2000 performance kernel
  - Logs predefined and custom events
  - Writes them to a memory buffer
  - Dumps buffers to disk at end of trace
- Driver software:
  - No source for signal processing code
- All experiments run with a normal-priority spinning competitor thread
- System:
  - Windows 2000 Professional
  - Pentium II 450 MHz (uniprocessor)
  - 384 MB ECC SDRAM, 100 MB allocated to logging

Vendor Driver Version: Processing in Interrupt (INT)
- Operation of the modem:
  1. DMA transfers between A/D and D/A and physical memory
  2. When enough data samples have accumulated, the modem raises an interrupt
  3. Inside the ISR, process incoming data and provide outgoing samples before the buffers are exhausted
- Uses input and output data buffers holding 512 16-bit samples (1024 bytes/buffer)

Three Additional Versions
- DPC Version (DPC)
  - The ISR queues a DPC
  - DPC performs signal processing
- Thread Version (THR)
  - The ISR queues a DPC that signals a thread via a semaphore
  - Thread performs signal processing
  - Experimented with several different priorities
- Rialto/NT Version (RES)
  - Same as THR, but thread scheduled using a Rialto/NT real-time periodic CPU Reservation

Interrupt Rate
- 3 different phases, interrupts very regular
- Falls within PC 99 recommended interrupt rates of 3-16 ms
[Figure: Rate of Interrupts (INT). x-axis: Time (seconds); y-axis: Milliseconds; phases marked: On-hook, Dialing, Training, Connected]

Elapsed Times in ISR (INT)
- 1.8 ms average with repeatable worst case of 3.3 ms
[Figure: Elapsed Times in Interrupt Handler (INT). x-axis: Time (seconds); y-axis: Milliseconds; phases marked: On-hook, Dialing, Training, Connected]
- PC 99 recommends that the maximum time during which a driver-based modem disables interrupts should not exceed
  100 µs

CPU Utilization
- 14.7% sustained load on 450 MHz Pentium II
[Figure: CPU Load. x-axis: Time (seconds); y-axis: CPU Load (percent); phases marked: On-hook, Dialing, Training, Connected]

Elapsed Times in ISR (DPC)
- ISR times now small, typically < 6 µs
[Figure: Elapsed Times In Interrupt Handler (DPC). x-axis: Time (seconds); y-axis: Microseconds; phases marked: On-hook, Dialing, Training, Connected]

Elapsed Times in Queued DPC
- But now long DPC times: 1.8 ms avg., 3.3 ms max (same as elapsed times in ISR for INT)
- PC 99 recommends that the total execution time required for all queued DPCs should not exceed 500 µs
[Figure: Elapsed Times In Queued DPC (DPC). x-axis: Time (seconds); y-axis: Milliseconds; phases marked: On-hook, Dialing, Training, Connected]

Samples Pending to be Processed (INT & THR 24)
- Small relative to 512-sample buffer size
[Figure: Samples Pending to be Processed (INT). x-axis: Time (seconds); y-axis: Unprocessed Samples; phases marked: On-hook, Dialing, Training, Connected]
[Figure: Samples Pending to be Processed (THR 24). x-axis: Time (seconds); y-axis: Unprocessed Samples; phases marked: On-hook, Dialing, Training, Connected]

Samples Pending to be Processed (THR 8)
- Unsurprisingly, contention kills modem
[Figure: Samples Pending to be Processed (THR 8). x-axis: Time (seconds); y-axis: Unprocessed Samples; phases marked: On-hook, Dialing; annotation: "Please hang up and try your call again"]

Latency Results
- Set the multimedia timers to fire once every millisecond
- Register a routine to be called every millisecond
- Routine does very little work
  - Stores cycle counter value and sleeps again
- Histograms show differences between recorded times and ideal times

Coexisting Thread Latencies (Control Case - No Modem)
- Maximum 1978 µs between wakeups
[Figure: histogram, Control Case - No Modem. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks; one bar labeled 96.8%]

Coexisting Thread Latencies (INT)
- Maximum 5313 µs
  between wakeups
[Figure: histogram, INT Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks; one bar labeled 83.1%]

Coexisting Thread Latencies (DPC)
- Maximum 4396 µs between wakeups
[Figure: histogram, DPC Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks; one bar labeled 82.6%]

Coexisting Thread Latencies (THR 24)
- Maximum 2239 µs between wakeups
[Figure: histogram, THR Version (24). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks; one bar labeled 93.8%]

What Have We Learned So Far?
- Signal processing in the context of the interrupt handler is:
  - unnecessary
  - detrimental to the latencies and predictability of coexisting activities
- Vendor choice understandable:
  - For any priority there is a potentially unbounded delay between the interrupt and the thread running
  - In practice, delays are reasonable for well-configured systems [Intel OSDI ’99]
- Using interrupts is an extreme form of priority inflation

Two Possible Solutions
- Rate Monotonic Analysis – determine the “right” priority assignments among all threads. Two problems:
  - Assumes cooperative priority assignment among all threads, which is unrealistic
  - Working priority assignment dependent upon timing requirements of all threads
    - Changes in application mix may require changes in priority assignments
- Use a time-based real-time scheduler
  - Such as Rialto/NT

Samples Pending to be Processed (RES 2ms/8ms – 25%)
- Fits well within 512-sample buffer size
[Figure: Samples Pending to be Processed (RES 2ms/8ms). x-axis: Time (seconds); y-axis: Unprocessed Samples; phases marked: On-hook, Dialing, Training, Connected]

Coexisting Thread Latencies (RES 2ms/8ms – 25%)
- Maximum 1971 µs between wakeups
[Figure: histogram, RES Version (2ms/8ms). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks; one bar labeled 85.5%]

File Transfer Times
- Results for 10 copies of 200,000 bytes each

Version        Min      Max      Mean     Std Dev  Passed
INT            36.334   36.398   36.367   0.029    10
DPC            36.272   36.447   36.396   0.048    10
THR Pri 24     36.319   36.475   36.384   0.056    10
RES 1ms/7ms    36.333   36.724   36.426   0.112    10
RES 2ms/13ms   36.288   36.975   36.547   0.232    10
RES 2ms/14ms   38.631   91.713   65.172   37.535   2
RES 3ms/15ms   36.275   36.586   36.387   0.108    10
RES 3ms/16ms   97.289   180.415  110.523  26.408   9
RES 4ms/16ms   36.255   37.116   36.415   0.256    10
RES 8ms/20ms   36.347   36.476   36.394   0.039    10

- For 1/8, 2/15, 3/17, 4/17, 7/20 no test passed

Modem Reservation Ranges
- Sensitivity to both percentage and gaps
[Figure: Modem Reservation Operating Ranges. x-axis: Reservation Period (ms); y-axis: Reservation Amount (ms); regions: Sufficient CPU Percentage and Frequency, Gaps Too Long, Insufficient Percentage; legend: Sufficient, Marginal, Insufficient]
- Actual: 14.7% of CPU, 12.5 ms gaps
- If period < 12.5 ms, must get 14.7% to work
- If period > 12.5 ms, (period – amount) <= 12.5 ms must also hold

Soft Modem Conclusions
- Signal processing in interrupt context is:
  - Unnecessary
  - Detrimental to the predictability and latencies of the coexisting activities
- The DPC version has similar problems
- Threads help alleviate these problems
  - Modem runs well with real-time priorities and non-realtime competition
  - However, modem threads may interfere with other threads
- Real-time scheduler allows:
  - Control over modem’s degree of interference with other time-sensitive activities
  - Performance isolation for threads using reservations

Industry Perspective
- Vendor did try their own THR version
  - Worked fine during normal load
  - However, modem was starved when:
    - Copying data between two IDE devices
    - Using USB scanner (Intel 440BX chipset) that turned off interrupts for 30-50 ms
  - Therefore they shipped the INT version
- Vendor is willing to be a “good citizen” only if ensured that others would be as well
- Systematic latency
  timing verification of components is needed to enforce good behavior

Soft DSL is Coming
- More demanding than soft modems
- G.lite
  - 1.531 Mbps downstream and 512 Kbps upstream
  - ~25% of a 600 MHz Pentium III
- Full rate DSL
  - 4 ms processing period
  - 3.062 Mbps downstream and 512 Kbps upstream
  - Nearly 50% of a 600 MHz Pentium III
- Soft Bluetooth period 312.5 µs

Further Soft Modem Studies
- Software-based Digital Subscriber Line (SoftDSL) studies
- Multiple Soft Modems within the same machine
- Similar studies on multiprocessors

Second Case Study
Predictable Scheduling for Digital Audio

Methodology
- Empirically reverse-engineer thread requirements in a complex, legacy soft real-time application
- Assign CPU reservations to threads
  - without use of source code
  - without modifying the application
- Measure application behavior during contention

Windows Media Player
- Default player for mp3, wav, avi, mpeg
- Experimental method:
  - Modelled contention using spinning thread at various priorities
  - Gave CPU Reservations to media player threads
  - Played an mp3 song
  - Listened for glitches
  - Used instrumented kernel to detect buffer under-runs

Media Player Thread Structure (Simplified)

Thread             Period (ms)  Priority
Kernel Mixer (*)   10           24
MP3 Decoder (*)    100          9
User Interface     45           8
Disk Reader        2000         8

(*) Received CPU Reservations in some experiments.
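
The slides say thread periods were reverse-engineered empirically but do not show how. As a rough illustration only (not the authors' tooling), one simple approach is to take the median gap between consecutive wakeup timestamps from a trace. The function name and the sample timestamps below are invented for this sketch.

```python
from statistics import median


def estimate_period_ms(wakeup_times_ms):
    """Estimate a thread's period as the median gap between consecutive wakeups.

    wakeup_times_ms: timestamps (in ms) at which the thread was observed to run.
    The median is used so that an occasional late or missed wakeup does not
    skew the estimate.
    """
    if len(wakeup_times_ms) < 2:
        raise ValueError("need at least two wakeups to estimate a period")
    ordered = sorted(wakeup_times_ms)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)


# Hypothetical traces (made-up numbers, not measurements from the study).
kmixer_like = [0.0, 10.1, 19.9, 30.0, 40.2, 50.0]   # roughly every 10 ms
decoder_like = [3.0, 103.4, 202.9, 303.0]           # roughly every 100 ms

print(estimate_period_ms(kmixer_like))
print(estimate_period_ms(decoder_like))
```

An estimate like this recovers only the period. As the Implications slide notes, the amount of CPU time needed per period is platform-dependent and harder to determine.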

MP3 Playback w/o Contention
- Kmixer thread (top) runs every 10 ms
- MP3 decoder (4th line) runs every 100 ms
- Works fine

Starvation Caused by Competing Thread @ Priority 10
- Media Player runs only when NT priority inversion avoidance logic kicks in

Media Player + Reservation
- 1 ms every 16 ms reserved for decoder thread
- Competing with priority 10 thread
- Works fine

Priority Inversion Caused by Competing Thread
- Competitor thread (priority 9) preempts the MP3 decoder while the decoder holds the Kmixer buffer lock
- Kmixer misses next two time slots (marked x in the trace)
- Kmixer starves, causing an audio glitch
- Fix: raise decoder priority before grabbing lock

Media Player Deadlock
- Circular wait among Media Player threads
- Deadlock broken by a timeout
- Fix: file a bug report…

Media Player Results
- Expected:
  - In the presence of contention, the Windows priority scheduler allows real-time apps to starve
  - This can be fixed by giving real-time threads a CPU Reservation
- Unexpected:
  - Competitor thread changes sequencing, exposes races in Media Player
  - Hard to write correct programs with many threads & mutexes
  - Fixed using priority ceiling emulation

Implications of Results
- Periods of threads in complex legacy apps can be reverse engineered
  - Amounts are platform-dependent and are harder
- Next step: store application requirements and use middleware to automatically assign reservations
  - No application support needed
  - Potentially a way around the chicken/egg problem of using reservations in a world of legacy OSs and applications

Possible Continued Media Experiments
- Study software DVD player
  - CPU intensive and time sensitive

Overall Conclusions
- Status quo insufficient. Applications either:
  - inflate their priorities, as did the soft modem driver, or
  - are at the mercy of applications that may be run at higher priorities, as is the case with the digital audio player
- CPU Reservations solve this problem by allowing applications to reliably obtain the time they need while allowing other applications to do the same

For More Information
- See Mike Jones (mbj@microsoft.com): http://research.microsoft.com/~mbj/
- or John Regehr (regehr@cs.utah.edu): http://www.cs.utah.edu/~regehr/
- or Stefan Saroiu (tzoompy@cs.washington.edu): http://www.cs.washington.edu/homes/tzoompy/
- Related papers at Mike’s web site
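
Worked sketch for the Modem Reservation Ranges slide. The two conditions there can be combined into one first-order check. This is only an illustration, not code from the study. It reads the gap condition as (period - amount) <= 12.5 ms, which is what the "Gaps Too Long" region and the file transfer results suggest. The function and constant names are invented here.

```python
MODEM_CPU_SHARE = 0.147  # sustained CPU load reported on the CPU Utilization slide
MAX_GAP_MS = 12.5        # longest gap figure from the Modem Reservation Ranges slide


def reservation_sufficient(amount_ms, period_ms,
                           share=MODEM_CPU_SHARE, max_gap_ms=MAX_GAP_MS):
    """First-order check of a CPU reservation of amount_ms out of every period_ms.

    Two conditions must both hold:
      1. the reserved fraction covers the modem's sustained CPU share;
      2. the longest stretch without reserved time (period - amount) stays
         within the tolerable gap.
    """
    if amount_ms <= 0 or period_ms <= 0 or amount_ms > period_ms:
        raise ValueError("require 0 < amount_ms <= period_ms")
    enough_cpu = amount_ms / period_ms >= share
    gap_ok = (period_ms - amount_ms) <= max_gap_ms
    return enough_cpu and gap_ok


# Reservations taken from the slides, written as (amount_ms, period_ms).
for amount, period in [(2, 8), (4, 16), (1, 8), (3, 17)]:
    print(f"{amount}ms/{period}ms ->", reservation_sufficient(amount, period))
```

The check is deliberately coarse. In the File Transfer Times table, 1ms/7ms (about 14.3%) passed all ten transfers even though it falls just under 14.7%, so borderline reservations still need to be measured.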