
Towards Optimally Scheduled
Manycore Clouds
Alexey Tumanov
Wei Lin
18-742, Spring 2011
Prof. Onur Mutlu
Papers Presented
An Operating System for Multicore and Clouds:
Mechanisms and Implementation
David Wentzlaff et al.
Handling the Problems and Opportunities Posed by Multiple
On-Chip Memory Controllers
Manu Awasthi et al.
Contention-aware Application Mapping for Network-on-Chip
Communication Architectures
Chen-Ling Chou & Radu Marculescu
Grand unifying theme:
optimally scheduled manycore clouds
Cloud+Manycore: Common Challenges
scalability
managing elasticity of both
resource demand
resource supply
fault isolation
programmability
coherence of programming & communication model
workload scheduling*
our contribution to the pool of common challenges
Factored Operating System
Vision:
to have a single system image OS that manages
hardware resources on a local manycore chip as well as
across the cluster of such chips
Trends:
the rise and fall of single-core application
performance scaling
cloud computing
manycore architectures
Single System Image Advantages
Ease of administration
manage one OS
Transparent sharing
e.g. memory & disk -- paging across machines
Consistent communication & programming model
no more fractured communication & programming abstractions
e.g. shared memory vs message passing
Easy process migration
agility in a dynamically changing environment
easier debugging of distributed multi-VM applications
FOS: Design Principles
Space multiplexing replaces time multiplexing
the number of available cores will soon exceed the number of threads
OS functionality factored into services
each implemented as a parallel distributed service
libfos is used to communicate with services
Adapts utilization to scale with demand
services are monitored; those under higher demand are
given more resources (cores)
Fault tolerance
a watchdog process monitors services
in the event of failure, new service instances are created
the naming service relinks communication channels
FOS: Architecture Overview
Microkernel
one instance per core
provides protected messaging layer
name cache
basic context switching
API for modification of address space and thread creation
Memory Management & scheduling in user space
Messaging -- inter-process communication
process-to-process API
shared data is explicit
implementation/distance agnostic
mailboxes (inbound/outbound/internal)
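To make the messaging model concrete, here is a minimal sketch in Python (not the real libfos API; the names and semantics are illustrative): a mailbox owned by a server, where senders enqueue explicit copies of their data and stay agnostic to where the receiver runs.

    from collections import deque

    class Mailbox:
        """Inbound queue owned by one server; senders only enqueue copies."""
        def __init__(self, name):
            self.name = name          # e.g. "/sys/fs" -- hypothetical name
            self.queue = deque()

        def send(self, payload):
            # Shared data is explicit: the body is copied, never aliased,
            # so the same call works whether the peer is one core away or
            # on another machine (implementation/distance agnostic).
            self.queue.append(dict(payload))

        def recv(self):
            return self.queue.popleft() if self.queue else None

    # Usage: a client messages the file-system server by name only.
    fs_inbox = Mailbox("/sys/fs")
    fs_inbox.send({"op": "open", "path": "/etc/motd"})
    print(fs_inbox.recv())   # {'op': 'open', 'path': '/etc/motd'}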
FOS: Architecture Overview
Naming
all servers assigned a name
nameserver determines message routing
globally coherent namespace
frequent updates
resource pool servers may share the same name
round robin, closest server, <your idea here>...
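As a rough illustration of that last point, a toy name server in Python (the names and the policy hook are assumptions, not the paper's interface): several servers register under one name, and the resolution policy picks an instance per lookup.

    import itertools

    class NameServer:
        def __init__(self):
            self.fleets = {}       # name -> list of registered server ids
            self.cursors = {}      # name -> round-robin cursor

        def register(self, name, server_id):
            self.fleets.setdefault(name, []).append(server_id)
            self.cursors[name] = itertools.cycle(self.fleets[name])

        def resolve(self, name):
            # Round-robin policy; a "closest server" policy would rank
            # instances by hop distance from the requesting core instead.
            return next(self.cursors[name])

    ns = NameServer()
    ns.register("/sys/fs", "fs@core3")
    ns.register("/sys/fs", "fs@core7")
    print(ns.resolve("/sys/fs"), ns.resolve("/sys/fs"))  # fs@core3 fs@core7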
OS Services
fleet of spatially distributed, cooperating servers
processes added/removed to scale with demand
servers would be complex to implement without support from FOS, which provides:
cooperative multithreaded programming model (sketched below)
remote procedure calls
data structures
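A hedged sketch of what that cooperative model buys, using Python generators as a stand-in for FOS's actual support: a handler that blocks on a remote call yields, and the single server thread interleaves the rest.

    def handle(request):
        # The handler parks itself while a (pretend) RPC to another server
        # is outstanding; the dispatcher resumes it when the reply "arrives".
        yield ("rpc", request)
        return "done:" + request

    def dispatcher(requests):
        # One kernel thread per server; concurrency via interleaving only.
        pending = [handle(r) for r in requests]
        done = []
        while pending:
            task = pending.pop(0)
            try:
                next(task)            # run until the handler blocks
                pending.append(task)  # park until its reply comes back
            except StopIteration as fin:
                done.append(fin.value)
        return done

    print(dispatcher(["open", "read"]))   # ['done:open', 'done:read']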
FOS: Preliminary Results
Operating system overhead and performance

Null system calls
Table 1: intramachine response (cycles)
         min     avg     max     stddev
fos      11058   11892   28700   513.1
Linux    1321    1328    9985    122.8

Remote procedure call overhead
Table 2: intermachine response (ms)
         min     avg     max     stddev
fos      0.232   0.353   0.548   0.028
Linux    0.199   0.274   0.395   0.0491

Ping response
Table 3: ping response (ms)
         min     avg     max     mdev
fos      0.029   0.031   0.059   0.008
Linux    0.017   0.032   0.064   0.009

Process spawn time
Table 4: process creation time (ms)
         min     avg     max     stddev
Local    0.977   1.549   10.356  1.881
Remote   1.615   4.070   13.337  4.601
Performance optimization: data placement
Vision:
Improve the performance of applications by intelligent
page placement and migration techniques
Trends:
Increase in cores results in a memory controller bottleneck
Number of memory controllers cannot scale with cores
each controller must service more cores
Latency due to memory controller policy is not necessarily
trivial compared to DRAM access latency
Performance optimization: data placement
Requirements
minimize access latency to requested page
distance
network load
queuing delay at MC
bank and rank contention
row buffer hit rate
not significantly degrade accesses to other pages on the
same MC
must be dynamically determinable at runtime
Techniques
adaptive first touch (AFT) page placement
dynamic migration of data
Adaptive First Touch Page Placement
Threads are assigned to processors arbitrarily
On first touch, a cost function is minimized to choose the MC
load_j = average queuing delay at MC_j
rowhits_j = average rate of row-buffer hits at MC_j
distance_j = number of hops from core to MC_j
each factor is weighted by importance
a cache of the last 5 mappings speeds up the decision when
more than one fault triggers within 5000 cycles
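Putting the cost function together, a hedged sketch in Python (the weights and the inverted row-hit term are my reading of "each factor is weighted by importance"; the paper's exact coefficients differ):

    # Illustrative weights -- assumptions, not the paper's values.
    ALPHA, BETA, GAMMA = 1.0, 100.0, 10.0

    def aft_cost(load_j, rowhits_j, distance_j):
        # load_j: average queuing delay at MC_j (cycles)
        # rowhits_j: average row-buffer hit rate at MC_j, in (0, 1]
        # distance_j: hops from the faulting core to MC_j
        # Row hits are good, so the minimized cost uses (1 - rowhits_j).
        return ALPHA * load_j + BETA * (1.0 - rowhits_j) + GAMMA * distance_j

    def place_page(mcs):
        """mcs: dict mc_id -> (load, rowhits, distance); returns best MC."""
        return min(mcs, key=lambda m: aft_cost(*mcs[m]))

    mcs = {"MC0": (120, 0.80, 2), "MC1": (90, 0.60, 4), "MC2": (200, 0.90, 1)}
    print(place_page(mcs))   # MC0 under these illustrative numbers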
Page migration
stable applications are evaluated in epochs
Donor MC
MC that drops 10% or more in row buffer hit rate from
last epoch
Recipient MC
physically proximal to the donor
lowest number of page conflicts in the last epoch
once migrated, a page is not moved again for at least 2 epochs
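A hedged sketch of the per-epoch donor/recipient selection (the data layout and the proximity-before-conflicts tie-break are assumptions; the thresholds follow the slide):

    def pick_migrations(mcs, prev_hits, curr_hits, conflicts, hops):
        moves = []
        for d in mcs:
            # Donor: row-buffer hit rate dropped 10% or more since last epoch.
            if curr_hits[d] <= 0.9 * prev_hits[d]:
                # Recipient: physically proximal to the donor, then fewest
                # page conflicts in the last epoch.
                r = min((m for m in mcs if m != d),
                        key=lambda m: (hops[d][m], conflicts[m]))
                moves.append((d, r))
        return moves

    mcs = ["MC0", "MC1", "MC2"]
    prev = {"MC0": 0.80, "MC1": 0.70, "MC2": 0.75}
    curr = {"MC0": 0.65, "MC1": 0.70, "MC2": 0.74}   # MC0 dropped ~19%
    conflicts = {"MC0": 40, "MC1": 12, "MC2": 30}
    hops = {"MC0": {"MC1": 1, "MC2": 2}, "MC1": {"MC0": 1, "MC2": 1},
            "MC2": {"MC0": 2, "MC1": 1}}
    print(pick_migrations(mcs, prev, curr, conflicts, hops))  # [('MC0', 'MC1')]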
Maintaining correctness
TLB Update/Invalidate
Cache Invalidate
dirty writebacks
Where are you getting all this data?
Counters
Row-Hit
Row-Miss
Row-Conflict
a system-level daemon polls for this information
Queuing delay
Static baseline latency of memory
Runtime monitoring of current latency
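A minimal sketch of how the policy inputs could be derived from those counters (the baseline constant and the names are assumptions):

    BASELINE_DRAM_LATENCY = 200   # cycles: static, uncontended access (assumed)

    def row_hit_rate(row_hits, row_misses, row_conflicts):
        total = row_hits + row_misses + row_conflicts
        return row_hits / total if total else 0.0

    def queuing_delay(observed_latency):
        # Runtime-monitored latency minus the static baseline leaves the
        # time a request spent waiting at the memory controller.
        return max(0, observed_latency - BASELINE_DRAM_LATENCY)

    print(row_hit_rate(700, 200, 100))   # 0.7
    print(queuing_delay(260))            # 60 cycles of queuing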
Results: Summary
Adaptive First Touch: 17% Improvement
Dynamic Page Migration: 34% Improvement
Improved row-buffer hit rates
first touch: 16.6%
migration: 22.7%
Overhead
up to 15.6% increase in network traffic
average of about 4.4%
Recap
lots of execution containers to schedule
larger-granularity time quanta for running threads
optimal relative placement of threads and data
data placement & page migration are just one way
thread placement is the other -- arguably superior
the truth is probably somewhere in the middle
data and thread migration is an attractive tradeoff space
Performance optimization: thread placement
Goal: improve application performance by minimizing:
communication energy consumption
due to NoC link traversal
inter-tile contention* (authors' contribution)
Proposed solution
integer linear programming (ILP) formulation
NP-hard
techniques to relax the complexity of the problem
make it LP
branch & bound optimizations
Result:
improvement of average packet latency
throughput improvement
Logical Core Placement: Problem Statement
Problem formulation
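The formulation on this slide was a figure. As a hedged reconstruction, a standard energy-driven mapping problem reads roughly as follows (notation mine; the paper's contention-aware version adds path-contention terms that are omitted here):

\[
\min \sum_{i,j \in T}\ \sum_{p,q \in P} b_{ij}\, d_{pq}\, x_{ip}\, x_{jq}
\qquad \text{s.t.} \quad
\sum_{p \in P} x_{ip} = 1 \ \ \forall i \in T, \qquad
\sum_{i \in T} x_{ip} \le 1 \ \ \forall p \in P, \qquad
x_{ip} \in \{0,1\}
\]

where $T$ is the set of threads, $P$ the set of tiles, $b_{ij}$ the bandwidth demand between threads $i$ and $j$, $d_{pq}$ the hop distance between tiles $p$ and $q$, and $x_{ip} = 1$ iff thread $i$ is mapped to tile $p$. The $x_{ip} x_{jq}$ product is quadratic; linearizing it, or dropping integrality as the next slide does, is what makes the problem tractable.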
Logical Core Placement: Algorithm
ILP problem was relaxed to LP
that means: drop the integrality requirement on the variables
The resulting heuristic algorithm is non-intuitive
Thread placement: Results
able to achieve within 5% of minimal cost produced by an
ILP solver
caveat: the paper says nothing about accuracy of
mapping function solution
comparison of energy-aware and contention-aware mappings:
throughput improvement up to 24% is possible at the
expense of average packet latency increase
throughput improvement at the expense of communication
energy increase
for real applications: up to 20% throughput gained by
giving up to 8.8% NoC latency
Caveats
biggest caveat of all: path-based contention
there's nothing in the model that prevents us from
including source- and destination-based contention as well
Summary
we're observing a confluence of trends: cloud computing
(IaaS) and multicore
tangible intersection of challenges
our belief: a single set of techniques/algorithms can
address those challenges for both
Factored OS - single system image OS powering manycore
chips across a cluster of datacentre machines
need for optimality in workload scheduling
data placement
thread placement
The Blank Slide
the end of the presentation :-)
Simulation Overview
Virtutech Simics
16 Cores
Distributed L2 cache
4 Memory Controllers
DRAMSim Framework
Workloads
PARSEC
SPECjbb2005
Stream
Migration of 10 pages
Epoch is 5 million cycles
TLB invalidation is 5000 cycles
500 million instructions
Results: Relative Throughput
Results: Row Buffer Hits
Data Placement vs Thread Placement
Data Placement
Advantages:
doesn't necessitate thread migration
preservation of cached state
Disadvantages:
consumes NoC bandwidth to migrate pages
*** will not scale with the growing number of cores!
if MCs are positioned on the perimeter of the grid
e.g. threads at the centre of the grid -- data
placement fails to improve their performance
Thread Placement
Advantages
ability to achieve minimal communication cost
more flexible - captures core-to-MC as well as core-to-core
communication cost minimization
Disadvantages
Thread Migration Advantages (cont'd)
no need to flush TLB on all cores
no need to invalidate migrated page's cache lines on all
cores
scales with increase in cores much better than page
migration
(page migration is the preferred method for traditional
NUMA systems)
does not require
use of the NoC bandwidth for page migration OR
dedicated MC-to-MC communication fabric