Powerpoint slides - University of Utah

Two Case Studies in Predictable Application Scheduling Using Rialto/NT
Michael B. Jones – Microsoft Research
John Regehr – University of Virginia
Stefan Saroiu – University of Washington
1
Application Case Studies

- Two applications needing predictable execution on Windows 2000
  - Soft Modem Driver
  - Digital Audio Player
- The case studies
  - analyze behavior on normal Windows 2000
  - study improvements possible using Rialto/NT CPU Reservation mechanism

2
Consumer Real-Time

- General-purpose Operating Systems, such as Windows 2000:
  - maximize aggregate throughput
  - approximate fair sharing of the resources
- Increasing use of time-dependent tasks
  - signal processing, audio, video
- Need support for:
  - predictable scheduling for independently developed applications
  - low latency responses
  - explicit resource allocation mechanisms
3
Rialto/NT Abstractions

- Two real-time software abstractions:
  - CPU Reservations – ongoing reservation for at least X time units out of every Y units for a thread
  - Time Constraints – one-shot time reservation for specified amount of work between start time and deadline
- Case studies use only CPU Reservations
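As a rough illustration of the CPU Reservation abstraction (at least X time units out of every Y), the sketch below models a reservation as an (amount, period) pair and applies a simple utilization-sum admission test. The `Reservation` class, the `admit` function, and the admission rule itself are assumptions made for this example; the slides do not describe how Rialto/NT decides whether to grant a request.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Reservation:
    """Hypothetical model: at least `amount_ms` of CPU in every `period_ms`."""
    amount_ms: int
    period_ms: int

    def __post_init__(self):
        if not 0 < self.amount_ms <= self.period_ms:
            raise ValueError("need 0 < amount <= period")

    @property
    def fraction(self) -> Fraction:
        # Share of the CPU this reservation guarantees.
        return Fraction(self.amount_ms, self.period_ms)


def admit(existing, request, capacity=Fraction(1)):
    """Assumed admission rule: grant only if the total reserved share fits."""
    total = sum((r.fraction for r in existing), Fraction(0)) + request.fraction
    return total <= capacity


modem = Reservation(2, 8)      # 2ms every 8ms, i.e. 25%
decoder = Reservation(1, 16)   # 1ms every 16ms
print(float(modem.fraction))               # 0.25
print(admit([modem], decoder))             # True
print(admit([modem], Reservation(7, 8)))   # False: 25% + 87.5% > 100%
```

The two example reservations (2ms/8ms and 1ms/16ms) are the ones used later in the deck for the modem and the MP3 decoder.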
4
Rialto/NT Implementation


- Rialto/NT developed on top of Windows 2000 priority scheduler
- Limitations:
  - CPU Reservations must be integer multiples of milliseconds
  - Frequency of reservations must be power-of-two multiple of 1ms

5
First Case Study
Predictable Scheduling for a Soft Modem
6
Why Study Soft Modems?

- Signal Processing done on host CPU:
  - requires predictable scheduling
  - requires low latency responses
- While coexisting with other system activities
  - Soft Modem is a background real-time task
- Successful in home computer market:
  - Low cost
  - Easy to update – software upgrade
7
Methodology

- Instrumented Windows 2000 performance kernel:
  - Logs predefined and custom events
  - Writes them to a memory buffer
  - Dumps buffers to disk at end of trace
- Driver Software:
  - No source for signal processing code
- Measurement Environment:
  - All experiments run with normal-priority spinning competitor thread
- System:
  - Windows 2000 Professional
  - Pentium II 450 MHz (uniprocessor)
  - 384 MB ECC SDRAM - 100 MB allocated to logging
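The logging scheme described above (record events in memory, write to disk only once the trace is over) can be sketched in user-level Python. This is a minimal sketch, not the instrumented kernel's mechanism: the `TraceBuffer` class, its method names, and the JSON-lines output format are all hypothetical.

```python
import json
import os
import tempfile
import time


class TraceBuffer:
    """Hypothetical in-memory event log, flushed to disk only at end of trace."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []      # (timestamp_ns, event_name, payload)
        self.dropped = 0

    def log(self, name, payload=None):
        # Logging only touches memory, to perturb timing as little as possible.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return
        self.events.append((time.perf_counter_ns(), name, payload))

    def dump(self, path):
        # Disk I/O happens once, after the measured interval is over.
        with open(path, "w") as f:
            for ts, name, payload in self.events:
                f.write(json.dumps({"t_ns": ts, "event": name, "data": payload}) + "\n")
        return len(self.events)


trace = TraceBuffer(capacity=3)
for i in range(5):
    trace.log("isr_enter", {"n": i})
path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")
print(trace.dump(path), trace.dropped)   # 3 2
```

Keeping disk writes out of the measured interval is the point of the design: the cost is a fixed memory budget, which is why the test system sets aside memory for logging.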
8
Vendor Driver Version: Processing in Interrupt (INT)

- Operation of the modem:
  1. DMA transfers data between the A/D and D/A converters and physical memory
  2. When enough data samples have accumulated, the modem raises an interrupt
  3. Inside the ISR, process incoming data and provide outgoing samples before the buffers are exhausted
- Uses input and output data buffers holding 512 16-bit samples (1024 bytes/buffer)
9
Three Additional Versions

- DPC Version (DPC)
  - The ISR queues a DPC
  - DPC performs signal processing
- Thread Version (THR)
  - The ISR queues a DPC that signals a thread via a semaphore
  - Thread performs signal processing
  - Experimented with several different priorities
- Rialto/NT Version (RES)
  - Same as THR, but thread scheduled using Rialto/NT real-time periodic CPU Reservation
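The hand-off used by the THR and RES versions (ISR queues a DPC, the DPC signals a thread through a semaphore, the thread does the work) can be imitated with ordinary Python threads. This is only a sketch of the signalling pattern: the function names are hypothetical, the "ISR" and "DPC" are plain function calls, and nothing here is Windows driver code.

```python
import threading
from collections import deque

BUFFERS = 5
ready = threading.Semaphore(0)   # counts buffers waiting for the thread
pending = deque()                # buffers handed over by the DPC
processed = []


def dpc(buffer_id):
    # The DPC does no signal processing; it only hands the buffer to the thread.
    pending.append(buffer_id)
    ready.release()


def isr(buffer_id):
    # The ISR stays short. A real driver would queue the DPC; here we call it.
    dpc(buffer_id)


def modem_thread():
    for _ in range(BUFFERS):
        ready.acquire()                       # block until the DPC signals
        processed.append(pending.popleft())   # stand-in for signal processing


worker = threading.Thread(target=modem_thread)
worker.start()
for n in range(BUFFERS):
    isr(n)
worker.join()
print(processed)   # [0, 1, 2, 3, 4]
```

Because the semaphore counts signals, buffers that arrive while the thread is delayed are not lost; they accumulate as pending work, which is what the "samples pending" measurements later in the deck track.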
10
Interrupt Rate
- 3 different phases, interrupts very regular
[Figure: Rate of Interrupts (INT). x-axis: Time (seconds), 0-30; y-axis: Milliseconds, 0-35; phases marked: On-hook, Dialing, Training, Connected.]
- Falls within PC 99 recommended interrupt rates of 3-16ms
11
Elapsed Times in ISR (INT)
- 1.8 ms average with repeatable worst case of 3.3 ms
[Figure: Elapsed Times in Interrupt Handler (INT). x-axis: Time (seconds), 0-30; y-axis: Milliseconds, 0-3.5; phases marked: On-hook, Dialing, Training, Connected.]
- PC 99 recommends that the maximum time during which a driver-based modem disables interrupts should not exceed 100 µs
12
CPU Utilization
- 14.7% sustained load on 450MHz Pentium II
[Figure: CPU Load. x-axis: Time (seconds), 0-30; y-axis: CPU Load, 0%-35%; phases marked: On-hook, Dialing, Training, Connected.]
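A back-of-the-envelope check relates the 14.7% sustained load to the 1.8 ms average processing time reported for the interrupt handler. It assumes that essentially all of the load comes from one processing pass per interrupt, so that load is roughly work divided by period; that model and the `implied_period_ms` helper are assumptions of this sketch, not something the slides state.

```python
def implied_period_ms(avg_work_ms, cpu_load):
    """Period between work items if load = work / period (an assumed model)."""
    return avg_work_ms / cpu_load


period = implied_period_ms(avg_work_ms=1.8, cpu_load=0.147)
print(round(period, 2))    # 12.24
print(3 <= period <= 16)   # True: inside the PC 99 range of 3-16ms
```

Under that assumption the modem would be doing about 1.8 ms of work roughly every 12 ms.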
13
Elapsed Times in ISR (DPC)
- ISR times now small, typically < 6µs
[Figure: Elapsed Times In Interrupt Handler (DPC). x-axis: Time (seconds), 0-30; y-axis: Microseconds, 0-16; phases marked: On-hook, Dialing, Training, Connected.]
14
Elapsed Times in Queued DPC
- But now long DPC times: 1.8ms avg., 3.3ms max (same as elapsed times in ISR for INT)
[Figure: Elapsed Times In Queued DPC (DPC). x-axis: Time (seconds), 0-30; y-axis: Milliseconds, 0-3.5; phases marked: On-hook, Dialing, Training, Connected.]
- PC 99 recommends that the total execution time required for all queued DPCs should not exceed 500 µs
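The two PC 99 limits quoted in this part of the deck can be compared against the reported measurements with a few lines of code. This is a sketch with several assumptions: ISR elapsed time is treated as time with interrupts disabled, the INT version is treated as queuing no DPC work, and for the DPC version's ISR the "typical" 6 µs figure stands in for a maximum the slides do not give.

```python
# Limits quoted on the slides (PC 99 recommendations), in microseconds.
PC99_MAX_INTERRUPTS_DISABLED_US = 100
PC99_MAX_QUEUED_DPC_US = 500

# Values taken from the slides, with the assumptions named in the text above.
measured_us = {
    "INT": {"isr": 3300, "dpc": 0},    # all processing inside the ISR
    "DPC": {"isr": 6, "dpc": 3300},    # processing moved into the DPC
}


def check(version):
    m = measured_us[version]
    return {
        "isr_ok": m["isr"] <= PC99_MAX_INTERRUPTS_DISABLED_US,
        "dpc_ok": m["dpc"] <= PC99_MAX_QUEUED_DPC_US,
    }


for version in measured_us:
    print(version, check(version))
# INT {'isr_ok': False, 'dpc_ok': True}
# DPC {'isr_ok': True, 'dpc_ok': False}
```

Moving the work from the ISR to a DPC fixes one recommendation and breaks the other, since the same 3.3 ms worst case simply moves.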
15
Samples Pending to be Processed (INT & THR 24)
- Small relative to 512 sample buffer size
[Figure: Samples Pending to be Processed (INT). x-axis: Time (seconds), 0-30; y-axis: Unprocessed Samples, 0-35; phases marked: On-hook, Dialing, Training, Connected.]
[Figure: Samples Pending to be Processed (THR 24). x-axis: Time (seconds), 0-30; y-axis: Unprocessed Samples, 0-35; phases marked: On-hook, Dialing, Training, Connected.]
16
Samples Pending to be Processed (THR 8)
- Unsurprisingly, contention kills modem
[Figure: Samples Pending to be Processed (THR 8). x-axis: Time (seconds), 0-35; y-axis: Unprocessed Samples, 0-600; phases marked: On-hook, Dialing; annotation: "Please hang up and try your call again".]
17
Latency Results



- Set the multimedia timers to fire once every millisecond
- Register a routine to be called every millisecond
- Routine does very little work
  - Stores cycle counter value and sleeps again
- Histograms show differences between recorded times and ideal times
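One way to turn the recorded wakeup times into such a histogram is sketched below. It is one reading of "differences between recorded times and ideal times": the ideal time of callback k is assumed to be the first timestamp plus k periods, and the function name, bucket width, and sample data are all made up for the example.

```python
from collections import Counter


def lateness_histogram(recorded_us, period_us=1000, bucket_us=50):
    """Count how late each callback ran relative to an ideal periodic schedule.

    The ideal time of callback k is assumed to be recorded_us[0] + k * period_us.
    """
    start = recorded_us[0]
    hist = Counter()
    for k, t in enumerate(recorded_us):
        lateness = t - (start + k * period_us)
        hist[(lateness // bucket_us) * bucket_us] += 1
    return dict(sorted(hist.items()))


# Synthetic timestamps in microseconds; real ones would come from a cycle counter.
samples = [0, 1000, 2010, 3060, 4000, 5990]
print(lateness_histogram(samples))   # {0: 4, 50: 1, 950: 1}
```

Each key is the lower edge of a 50 µs bucket, so a well-behaved system puts almost every callback in the first bucket and disturbances show up as a thin tail to the right.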
18
Coexisting Thread Latencies (Control Case - No Modem)
- Maximum 1978µs between wakeups
[Figure: histogram, Control Case - No Modem. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0%-3.0%; annotation: 96.8%.]
19
Coexisting Thread Latencies (INT)
- Maximum 5313µs between wakeups
[Figure: histogram, INT Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0%-3.0%; annotation: 83.1%.]
20
Coexisting Thread Latencies (DPC)
- Maximum 4396µs between wakeups
[Figure: histogram, DPC Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0%-3.0%; annotation: 82.6%.]
21
Coexisting Thread Latencies (THR 24)
- Maximum 2239µs between wakeups
[Figure: histogram, THR Version (24). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0%-3.0%; annotation: 93.8%.]
22
What Have We Learned So Far?

- Signal processing in the context of the interrupt handler is:
  - unnecessary
  - detrimental to the latencies and predictability of coexisting activities
- Vendor choice understandable
  - For any priority there is a potentially unbounded delay between the interrupt and the thread running
- In practice
  - Delays are reasonable for well-configured systems [Intel OSDI ’99]
  - Using interrupts is an extreme form of priority inflation
23
Two Possible Solutions

- Rate Monotonic Analysis – determine the “right” priority assignments among all threads - two problems:
  - Assumes cooperative priority assignment among all threads - unrealistic
  - Working priority assignment dependent upon timing requirements of all threads
    - Changes in application mix may require changes in priority assignments
- Use a time-based real-time scheduler
  - Such as Rialto/NT
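To make the first option concrete, here is a small sketch of rate-monotonic priority assignment together with the classic Liu and Layland utilization bound, n(2^(1/n) - 1), which is a sufficient (not necessary) schedulability test. The task list is hypothetical: the periods echo values that appear elsewhere in the deck, but the per-period costs are invented for illustration.

```python
def rate_monotonic_priorities(tasks):
    """Rate-monotonic rule: the shorter the period, the higher the priority."""
    by_period = sorted(tasks, key=lambda t: t["period_ms"], reverse=True)
    # Priority 1 is lowest; larger numbers are more urgent.
    return {t["name"]: prio for prio, t in enumerate(by_period, start=1)}


def liu_layland_bound(n):
    # Utilization below this bound guarantees n periodic tasks are schedulable.
    return n * (2 ** (1 / n) - 1)


def utilization(tasks):
    return sum(t["cost_ms"] / t["period_ms"] for t in tasks)


tasks = [
    {"name": "modem", "cost_ms": 2, "period_ms": 8},       # costs are made up
    {"name": "mixer", "cost_ms": 1, "period_ms": 10},
    {"name": "decoder", "cost_ms": 10, "period_ms": 100},
]
print(rate_monotonic_priorities(tasks))
# {'decoder': 1, 'mixer': 2, 'modem': 3}
print(utilization(tasks) <= liu_layland_bound(len(tasks)))   # True

tasks.append({"name": "newcomer", "cost_ms": 1, "period_ms": 9})
print(rate_monotonic_priorities(tasks))
# {'decoder': 1, 'mixer': 2, 'newcomer': 3, 'modem': 4}
```

Adding one task changes the modem's priority from 3 to 4 even though the modem itself did not change, which is the dependence on the whole application mix that the slide objects to.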
24
Samples Pending to be Processed (RES 2ms/8ms – 25%)
- Fits well within 512-sample buffer size
[Figure: Samples Pending to be Processed (RES 2ms/8ms). x-axis: Time (seconds), 0-35; y-axis: Unprocessed Samples, 0-160; phases marked: On-hook, Dialing, Training, Connected.]
25
Coexisting Thread Latencies (RES 2ms/8ms – 25%)
- Maximum 1971µs between wakeups
[Figure: histogram, RES Version (2ms/8ms). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0%-7.0%; annotation: 85.5%.]
26
File Transfer Times
Results for 10 copies of 200,000 bytes each

Version            Min      Max     Mean  Std Dev  Passed
INT             36.334   36.398   36.367    0.029      10
DPC             36.272   36.447   36.396    0.048      10
THR Pri 24      36.319   36.475   36.384    0.056      10
RES 1ms/7ms     36.333   36.724   36.426    0.112      10
RES 2ms/13ms    36.288   36.975   36.547    0.232      10
RES 2ms/14ms    38.631   91.713   65.172   37.535       2
RES 3ms/15ms    36.275   36.586   36.387    0.108      10
RES 3ms/16ms    97.289  180.415  110.523   26.408       9
RES 4ms/16ms    36.255   37.116   36.415    0.256      10
RES 8ms/20ms    36.347   36.476   36.394    0.039      10

For 1/8, 2/15, 3/17, 4/17, 7/20 no test passed
27
Modem Reservation Ranges
- Sensitivity to both percentage and gaps
[Figure: Modem Reservation Operating Ranges. x-axis: Reservation Period (ms), 0-22; y-axis: Reservation Amount (ms), 0-10; regions: "Sufficient CPU Percentage and Frequency", "Gaps Too Long", "Insufficient Percentage"; legend: Sufficient, Marginal, Insufficient.]
- Actual: 14.7% of CPU, 12.5ms gaps
- If period < 12.5ms, must get 14.7% to work
- If period > 12.5ms, (period – amount) <= 12.5ms must also hold
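The two conditions can be written as a small classifier. This is a sketch under a stated reading: the gap condition is taken as an upper bound on (period - amount), which is what the "Gaps Too Long" region of the chart implies. The function name and the strict two-threshold rule are assumptions of the example.

```python
REQUIRED_FRACTION = 0.147   # sustained CPU load reported for the modem
MAX_GAP_MS = 12.5           # longest tolerable stretch without CPU time


def classify(amount_ms, period_ms):
    if amount_ms / period_ms < REQUIRED_FRACTION:
        return "insufficient percentage"
    if period_ms - amount_ms > MAX_GAP_MS:
        return "gaps too long"
    return "sufficient"


for amount, period in [(4, 16), (4, 17), (1, 8), (8, 20), (7, 20)]:
    print(f"{amount}ms/{period}ms: {classify(amount, period)}")
# 4ms/16ms: sufficient
# 4ms/17ms: gaps too long
# 1ms/8ms: insufficient percentage
# 8ms/20ms: sufficient
# 7ms/20ms: gaps too long
```

For these five settings the result agrees with the file transfer outcomes reported in the deck. The rule is still a simplification: it has no "marginal" category, and it would reject 1ms/7ms (about 14.3%), which the file transfer table reports as passing.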
28
Soft Modem Conclusions

- Signal Processing in interrupt context is:
  - Unnecessary
  - Detrimental to the predictability and latencies of the coexisting activities
- The DPC version has similar problems
- Threads help alleviate these problems
  - Modem runs well with real-time priorities and non-real-time competition
  - However modem threads may interfere with other threads
- Real-time scheduler allows
  - Control over modem’s degree of interference with other time-sensitive activities
  - Performance isolation for threads using reservations
29
Industry Perspective

- Vendor did try their own THR version
  - Worked fine during normal load
  - However, modem was starved when:
    - Copying data between two IDE devices
    - Using USB scanner (Intel 440BX chipset) that turned off interrupts for 30-50 ms
  - Therefore they shipped the INT version
- Vendor is willing to be a “good citizen” only if ensured that others would be as well
- Systematic latency timing verification of components is needed to enforce good behavior
30
Soft DSL is Coming

- More demanding than soft modems
- G.lite
  - 1.531Mbps downstream and 512Kbps upstream
  - ~ 25% of a 600 MHz Pentium III
- Full rate DSL
  - 4ms processing period
  - 3.062Mbps downstream and 512Kbps upstream
  - Nearly 50% of a 600 MHz Pentium III
- Soft Bluetooth period 312.5µs
31
Further Soft Modem Studies



- Software-based Digital Subscriber Line (SoftDSL) studies
- Multiple Soft Modems within the same machine
- Similar studies on multiprocessors
32
Second Case Study
Predictable Scheduling for Digital Audio
33
Methodology

- Empirically reverse-engineer thread requirements in a complex, legacy soft real-time application
  - without use of source code
- Assign CPU reservations to threads
  - without modifying the application
- Measure application behavior during contention
34
Windows Media Player


- Default player for mp3, wav, avi, mpeg
- Experimental method
  - Modelled contention using spinning thread at various priorities
  - Gave CPU Reservations to media player threads
  - Played an mp3 song
  - Listened for glitches
  - Used instrumented kernel to detect buffer under-runs

35
Media Player Thread Structure (Simplified)

Thread             Period (ms)  Priority
Kernel Mixer (*)            10        24
MP3 Decoder (*)            100         9
User Interface              45         8
Disk Reader               2000         8

(*) Received CPU Reservations in some experiments.
36
MP3 Playback w/o Contention



- Kmixer thread (top) runs every 10ms
- MP3 decoder (4th line) runs every 100ms
- Works fine
37
Starvation Caused by Competing Thread @ Priority 10

- Media Player runs only when NT priority inversion avoidance logic kicks in
38
Media Player + Reservation



- 1ms every 16ms reserved for decoder thread
- Competing with priority 10 thread
- Works fine
39
Priority Inversion Caused by Competing Thread

- Competitor thread (priority 9) preempts MP3 decoder while the decoder holds the Kmixer buffer lock
- Kmixer misses next two time slots (x)
  - Starves, causes audio glitch
- Fix: raise decoder priority before grabbing lock
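The fix named on this slide, which the results slide later calls priority ceiling emulation, can be shown with a toy model. Nothing here touches real OS threads or locks: `ModelThread`, `ceiling_lock`, and `can_take_cpu` are hypothetical names, the lock itself is not modelled, and equal priorities are assumed to share the CPU (so a priority 9 competitor can displace a priority 9 decoder).

```python
from contextlib import contextmanager


class ModelThread:
    """Toy thread with a mutable priority; not an OS thread."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority


@contextmanager
def ceiling_lock(thread, ceiling):
    # Raise the priority *before* taking the lock; restore it after release.
    saved = thread.priority
    thread.priority = max(saved, ceiling)
    try:
        yield
    finally:
        thread.priority = saved


def can_take_cpu(competitor, holder):
    # Assumed model: equal or higher priority can displace the holder.
    return competitor.priority >= holder.priority


KMIXER_PRIORITY = 24   # ceiling: highest priority among threads using the lock
decoder = ModelThread("MP3 decoder", 9)
competitor = ModelThread("competitor", 9)

print(can_take_cpu(competitor, decoder))        # True: decoder can lose the CPU
with ceiling_lock(decoder, KMIXER_PRIORITY):
    print(can_take_cpu(competitor, decoder))    # False while the lock is held
print(decoder.priority)                         # 9
```

While the decoder is inside the critical section it runs at the mixer's priority, so the competitor cannot delay the release of the lock that the mixer is waiting for.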
40
Media Player Deadlock



- Circular wait among Media Player threads
- Deadlock broken by a timeout
- Fix: file a bug report…
41
Media Player Results

- Expected
  - In the presence of contention, the Windows priority scheduler allows real-time apps to starve
  - This can be fixed by giving real-time threads a CPU Reservation
- Unexpected
  - Competitor thread changes sequencing, exposes races in Media Player
  - Hard to write correct programs with many threads & mutexes
  - Fixed using priority ceiling emulation
42
Implications of Results

- Periods of threads in complex legacy apps can be reverse engineered
  - Amounts are platform-dependent and are harder
- Next step is to store application requirements and use middleware to automatically assign reservations
  - No application support needed
  - Potentially a way around the chicken/egg problem of using reservations in a world of legacy OSs and applications

43
Possible Continued Media Experiments

- Study software DVD player
  - CPU intensive and time sensitive
44
Overall Conclusions

- Status quo insufficient
  - Applications either inflate their priorities
    - as did the soft modem driver
  - or are at the mercy of applications that may be run at higher priorities
    - as is the case with the digital audio player
- CPU Reservations solve this problem
  - by allowing applications to reliably obtain the time they need
  - while allowing other applications to do the same

45
For More Information

- See Mike Jones (mbj@microsoft.com):
  - http://research.microsoft.com/~mbj/
- or John Regehr (regehr@cs.utah.edu):
  - http://www.cs.utah.edu/~regehr/
- or Stefan Saroiu (tzoompy@cs.washington.edu):
  - http://www.cs.washington.edu/homes/tzoompy/
- Related papers at Mike’s web site
46