Measuring and Modeling Hyper-threaded Processor Performance
Ethan Bolker
UMass-Boston
September 17, 2003
• Joint work with Yiping Ding, Arjun Kumar
(BMC Software)
• Accepted for presentation at CMG32,
December 2003
• Paper (with references) available on request
Improving Processor Performance
• Speed up clock
• Invent revolutionary new architecture
• Replicate processors (parallel application)
• Remove bottlenecks (use idle ALU)
– caches
– pipelining
– prefetch
Default for new Intel high end chips
• One ALU
• Duplicate state of computation (registers) to create two logical processors
(chip size *= 1.05)
• Parallel instruction preparation (decode)
• ALU should see ready work more often
(provided there are two active threads)
Intel Technology Journal, Volume 06 Issue 01, February 14, 2002, p8
• Treat processor as a black box
• Experiment to observe behavior
• Model to predict behavior
• Batch workload: repeated dispatch of identical compute intensive jobs
– vary number of threads
– measure throughput (jobs/second)
[Figure: throughput (jobs/second, axis 0 to 1000) versus number of threads (1 to 8). Four curves: one CPU, HTT off; two CPUs, HTT off; one CPU, HTT on; two CPUs, HTT on. Brace annotations mark one group of results "puzzling" and the other two "makes sense".]
• More interesting than batch
• Random size jobs arrive at random times
• M/M/1
M = “Markov”
M/*/* : arrival stream is Poisson, rate λ
*/M/* : job size exponentially distributed, mean s
*/*/1 : single processor
• Utilization: U = λs
U is dimensionless: jobs/sec * sec/job
U < 1, else saturation
• Response time: r = s/(1-U) (queueing delay caused by randomness)
each job sees a (virtual) processor slowed down by the other jobs by a factor 1/(1-U), so accumulating s seconds of real work takes r = s/(1-U) seconds of real time
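A small worked sketch of the two formulas above may help. The function name `mm1_response_time` and the sample arrival rates are illustrative assumptions, not part of the talk.

```python
def mm1_response_time(arrival_rate, mean_service):
    """Mean M/M/1 response time r = s / (1 - U), with U = arrival_rate * s."""
    utilization = arrival_rate * mean_service  # dimensionless: jobs/sec * sec/job
    if utilization >= 1.0:
        raise ValueError("U >= 1: the queue is saturated")
    return mean_service / (1.0 - utilization)


if __name__ == "__main__":
    s = 1.0  # mean service time in seconds (illustrative value)
    for lam in (0.2, 0.5, 0.8, 0.9):
        print(f"U = {lam * s:.1f}  r = {mm1_response_time(lam, s):.2f} s")
```

With s = 1 second, half utilization doubles the response time, and at U = 0.8 a job takes about five times its service time.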
• Java driver
– chooses interarrival times and service times from exponential distributions,
– dispatches each job in its own thread,
– records actual job CPU usage, response time
• Input parameters
– job arrival rate λ
– mean job service time s
• Fix s = 1 second, vary λ (hence U), track r
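The experiment can be imitated in simulation. The sketch below is not the Java driver from the talk: it is a Python simulation, assuming a single first-come-first-served server and using the Lindley recursion for waiting times, whereas the real driver dispatches each job in its own thread on real hardware. The name `simulate_mm1` and its parameters are hypothetical.

```python
import random


def simulate_mm1(arrival_rate, mean_service, num_jobs, seed=0):
    """Simulate an M/M/1 FCFS queue and return the mean response time.

    Interarrival and service times are drawn from exponential
    distributions, as in the benchmark description.
    """
    rng = random.Random(seed)
    wait = 0.0            # time the current job spends queued before service
    total_response = 0.0
    for _ in range(num_jobs):
        service = rng.expovariate(1.0 / mean_service)
        total_response += wait + service
        interarrival = rng.expovariate(arrival_rate)
        # Lindley recursion: the next job waits for whatever work is left
        # when it arrives, or not at all if the server is already idle.
        wait = max(0.0, wait + service - interarrival)
    return total_response / num_jobs


if __name__ == "__main__":
    s = 1.0
    for lam in (0.2, 0.4, 0.6, 0.8):
        measured = simulate_mm1(lam, s, 200_000, seed=1)
        predicted = s / (1.0 - lam * s)
        print(f"U = {lam * s:.1f}  simulated = {measured:.2f}  theory = {predicted:.2f}")
```

In a simulation the measured mean tracks s/(1-U) closely, which is what makes a mismatch on real hardware worth investigating.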
[Figure: R = 1/(1-U). Response time (axis 0 to 4.5) versus utilization (0 to 1). Series: predicted (theory: M/M/1), measured (practice), and the ratio measured/predicted.]
• “In theory, there is no difference between theory and practice. In practice, there is no relationship between theory and practice.”
Grant Gainey
• “The gap between theory and practice in practice is much larger than the gap between theory and practice in theory.”
Jeff Case
• Examine, tune benchmark driver
• Compute actual coefficients of variation, incorporate in corrected M/M/1 formula
• Nothing helps
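The talk does not say which corrected formula was tried. As one possibility, the sketch below computes coefficients of variation from recorded samples and plugs them into Kingman's G/G/1 approximation, which reduces to the M/M/1 formula when both coefficients equal 1. The choice of Kingman's approximation and all function names here are assumptions.

```python
import statistics


def coefficient_of_variation(samples):
    """Standard deviation divided by mean; close to 1 for exponential data."""
    return statistics.pstdev(samples) / statistics.mean(samples)


def gg1_response_time(arrival_rate, mean_service, cv_arrival, cv_service):
    """Kingman's approximation for mean response time in a G/G/1 queue.

    wait ~= U/(1-U) * (ca^2 + cs^2)/2 * s, and response = s + wait.
    With ca = cs = 1 this is exactly s/(1-U).
    """
    utilization = arrival_rate * mean_service
    if utilization >= 1.0:
        raise ValueError("U >= 1: the queue is saturated")
    variability = (cv_arrival ** 2 + cv_service ** 2) / 2.0
    wait = utilization / (1.0 - utilization) * variability * mean_service
    return mean_service + wait
```

If the measured coefficients of variation are near 1, this correction barely moves the prediction, consistent with the observation that it did not close the gap.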
• Postpone worry – in the meanwhile …
• Use this benchmark to measure the effect of hyper-threading on response time
• Use throughput (λ) as the independent variable
• “Utilization” is ambiguous (digression)
[Figure: response time (axis 0 to 4) versus throughput (0 to 1). Series: htt on, htt off, and the ratio on/off.]
• Hyper-threading allows more of the application parallelism to make its way to the ALU
• Can we understand this quantitatively?
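[Diagram: arrivals at rate λ split evenly (λ/2 each) between two preparatory phases, each with service time s_1; both feed a single execution phase with service time s_2.]

$$ r = \frac{s_1}{1 - (\lambda/2)\, s_1} + \frac{s_2}{1 - \lambda s_2} $$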
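The model formula r = s_1/(1 - (λ/2) s_1) + s_2/(1 - λ s_2) is easy to evaluate directly. The sketch below assumes that reading of the slide (each preparatory queue sees half the arrival rate, the execution queue sees all of it); the function name and the sample inputs are illustrative.

```python
def htt_on_response_time(arrival_rate, s1, s2):
    """Response time for the 2+1 tandem model of a hyper-threaded processor.

    Two parallel preparatory queues each see arrival_rate / 2 with service
    time s1; a single execution queue sees the full arrival_rate with
    service time s2. Each stage contributes an M/M/1-style term.
    """
    prep_util = (arrival_rate / 2.0) * s1
    exec_util = arrival_rate * s2
    if prep_util >= 1.0 or exec_util >= 1.0:
        raise ValueError("a stage is saturated")
    return s1 / (1.0 - prep_util) + s2 / (1.0 - exec_util)


if __name__ == "__main__":
    # Illustrative parameter values; any s1, s2 below saturation will do.
    for lam in (0.2, 0.4, 0.6, 0.8):
        print(f"throughput = {lam:.1f}  r = {htt_on_response_time(lam, 0.13, 0.81):.3f}")
```

Because the execution stage carries the full load, it dominates the response time as throughput approaches 1/s_2.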
[Figure: response time (axis 0 to 3.5) versus throughput (0 to 1). Series: measured, predicted, and the ratio measured/predicted. Model parameters: s_1 = 0.13, s_2 = 0.81.]
• To compute response time r from the model, need (virtual) service parameters s_1, s_2 (λ is known)
• Finding s_1, s_2
– eyeball measured data
– fit two data points
– maximum likelihood
– derive from first principles
• s_1 = 0.13, s_2 = 0.81 make sense
about 15% of the work is preparatory, 85% execution
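One way to automate the parameter search is a brute-force least-squares fit over a grid of candidate (s_1, s_2) pairs. This is a swapped-in technique for illustration, not one of the methods listed in the talk, and the data points in the demo are synthetic values generated from the model itself, not the measurements from the talk. All names are hypothetical.

```python
def model_response_time(lam, s1, s2):
    """2+1 tandem model: r = s1/(1 - (lam/2) s1) + s2/(1 - lam s2)."""
    prep_util = (lam / 2.0) * s1
    exec_util = lam * s2
    if prep_util >= 1.0 or exec_util >= 1.0:
        return float("inf")  # saturated: treat as an infinitely bad fit
    return s1 / (1.0 - prep_util) + s2 / (1.0 - exec_util)


def fit_service_times(points, step=0.005, s_max=1.0):
    """Grid search for the (s1, s2) minimizing the sum of squared errors.

    points is a list of (throughput, response_time) pairs.
    """
    best_sse, best_s1, best_s2 = float("inf"), None, None
    steps = int(round(s_max / step))
    for i in range(1, steps + 1):
        s1 = i * step
        for j in range(1, steps + 1):
            s2 = j * step
            sse = sum((model_response_time(lam, s1, s2) - r) ** 2
                      for lam, r in points)
            if sse < best_sse:
                best_sse, best_s1, best_s2 = sse, s1, s2
    return best_s1, best_s2


if __name__ == "__main__":
    # Synthetic data generated from the model, NOT measured values.
    synthetic = [(lam, model_response_time(lam, 0.13, 0.81))
                 for lam in (0.2, 0.4, 0.6, 0.8)]
    s1, s2 = fit_service_times(synthetic)
    print(f"s1 = {s1:.3f}, s2 = {s2:.3f}, preparatory share = {s1 / (s1 + s2):.0%}")
```

The preparatory share s_1/(s_1 + s_2) gives the split of work between the two phases; for s_1 = 0.13 and s_2 = 0.81 it comes to about 14%.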
Benchmark validation (reprise)
• Chip hardware unchanged when HTT off
• Assume one path used
• Tandem queue
• Parameter estimation as before
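Written out, the tandem model with a single preparatory path would look as follows. This is an inferred sketch: it assumes the one path in use carries the full arrival rate λ, so the preparatory term loses the factor 1/2 that appears in the hyper-threaded model. The numerical example uses the fitted values s_1 = 0.045 and s_2 = 0.878 reported for this model.

```latex
% HTT off: one preparatory queue in tandem with one execution queue,
% both seeing the full arrival rate \lambda (assumed reading of the slide).
r_{\text{off}} = \frac{s_1}{1 - \lambda s_1} + \frac{s_2}{1 - \lambda s_2}

% Example at \lambda = 0.5 with s_1 = 0.045, s_2 = 0.878:
r_{\text{off}} = \frac{0.045}{1 - 0.0225} + \frac{0.878}{1 - 0.439}
               \approx 0.046 + 1.565 \approx 1.611
```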
[Figure: response time (axis 0 to 3.5) versus throughput (0 to 1). Series: measured, predicted, and the ratio measured/predicted. Model parameters: s_1 = 0.045, s_2 = 0.878.]
• Do serious statistics
• Does the simpler 1+1 tandem queue model predict hyper-threading response time as well as the more complex 2+1 model?
• Understand two-processor machine puzzle
• Explore how s_1 and s_2 vary with application
(e.g. fixed vs floating point)
• Find ways to estimate s_1 and s_2 from first principles
• Hyper-threading is …
• Abstraction (modeling) leverages information: you can often understand a lot even when you know very little
• r = s/(1-U) is worth remembering
• You do need to connect theory and practice
– and practice is harder than theory
• Questions?