Capriccio: Scalable Threads for Internet Service

Introduction


Internet services have ever-increasing
scalability demands
Current hardware is meeting these
demands


Software has lagged behind
Recent approaches are event-based

Pipeline stages of events
Drawbacks of Events

Events systems hide the control flow


Difficult to understand and debug
Eventually evolved into call-and-return
event pairs



Programmers need to match related events
Need to save/restore state
Capriccio: instead of the event-based model, fix the thread-based model
Goals of Capriccio
Support for existing thread API


Scalability to thousands of threads


Little changes to existing applications
One thread per execution
Flexibility to address application-specific needs
[Figure: threads and events plotted on axes of ease of programming vs. performance; the ideal combines the ease of threads with the performance of events]
Thread Design Principles



Kernel-level threads are for true
concurrency
User-level threads provide a clean
programming model with useful
invariants and semantics
Decouple user-level from kernel-level threads

More portable
Capriccio



Thread package
All thread operations are O(1)
Linked stacks



Address the problem of stack allocation
for large numbers of threads
Combination of compile-time and runtime analysis
Resource-aware scheduler
Thread Design and Scalability

POSIX API

Backward compatible
User-Level Threads
+ Performance
+ Flexibility
- Complex preemption
- Bad interaction with kernel scheduler
Flexibility




Decoupling user and kernel threads
allows faster innovation
Can use new kernel thread features
without changing application code
Scheduler tailored for applications
Lightweight
Performance



Reduce the overhead of thread
synchronization
No kernel crossing for preemptive
threading
More efficient memory management at
user level
Disadvantages

Need to replace blocking calls with
nonblocking ones to hold the CPU


Translation overhead
Problems with multiple processors

Synchronization becomes more
expensive
Context Switches


Built on top of Edgar Toernig’s
coroutine library
Fast context switches when threads
voluntarily yield
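A minimal sketch of cooperative user-level threading (Python for illustration only; Capriccio itself builds on a C coroutine library, and all names here are invented): each "thread" is a generator, a voluntary yield is a context switch, and the scheduler simply resumes the next runnable thread, with no kernel crossing involved.

```python
# Cooperative user-level threading sketch: generators stand in for
# coroutine stacks; yielding returns control to the scheduler.
from collections import deque

def scheduler(threads):
    """Round-robin over generator-based threads until all finish."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))        # run the thread to its next yield
            ready.append(t)              # still runnable: requeue it
        except StopIteration:
            pass                         # thread finished
    return trace

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"              # voluntary yield == context switch

print(scheduler([worker("a", 2), worker("b", 2)]))
# ['a:0', 'b:0', 'a:1', 'b:1']
```

Because nothing is preempted between yields, switching threads needs no locks and no system calls, which is why these switches are so cheap.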
I/O


Capriccio intercepts blocking I/O calls
Uses epoll for asynchronous I/O
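The interception idea can be sketched as follows (Python for illustration; the function name `capriccio_read` and the inline `poll()` are simplifications, not Capriccio's actual API): make the descriptor nonblocking, try the call, and on would-block register the descriptor with epoll and yield to the scheduler until it becomes ready.

```python
# Blocking-call interception sketch (Linux-only: uses epoll).
import os
import select

def capriccio_read(fd, n, poller, yield_fn):
    os.set_blocking(fd, False)
    while True:
        try:
            return os.read(fd, n)        # nonblocking attempt
        except BlockingIOError:          # would have blocked the kernel thread
            poller.register(fd, select.EPOLLIN)
            yield_fn()                   # let other user-level threads run
            poller.poll()                # wait until the fd is readable
            poller.unregister(fd)

r, w = os.pipe()
poller = select.epoll()
events = []

def yield_to_scheduler():                # stands in for the real scheduler
    events.append("yielded")
    os.write(w, b"hi")                   # "another thread" produces data

data = capriccio_read(r, 2, poller, yield_to_scheduler)
print(data, events)   # b'hi' ['yielded']
```

In Capriccio the scheduler, not the wrapper, performs the polling; folding it into the wrapper here just keeps the sketch self-contained.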
Scheduling


Very much like an event-driven
application
Events are hidden from programmers
Synchronization

Supports cooperative threading on
single-CPU machines

Requires only Boolean checks
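A sketch of why this works (Python for illustration; names are invented): on a single CPU with no preemption between yield points, acquiring an uncontended lock is a plain Boolean test, with no atomic instructions or kernel calls.

```python
# Cooperative mutex sketch: safe without atomics because a thread
# cannot be preempted between its check and its set.
class CoopMutex:
    def __init__(self):
        self.locked = False
        self.waiters = []                # thread ids blocked on this mutex

    def acquire(self, thread_id):
        if not self.locked:              # cannot race: no preemption here
            self.locked = True
            return True
        self.waiters.append(thread_id)   # caller must yield and wait
        return False

    def release(self):
        self.locked = False
        return self.waiters.pop(0) if self.waiters else None  # who to wake

m = CoopMutex()
print(m.acquire("t1"))   # True: lock taken with a Boolean test
print(m.acquire("t2"))   # False: t2 is queued
print(m.release())       # t2: the waiter to wake
```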
Threading Microbenchmarks





SMP, two 2.4 GHz Xeon processors
1 GB memory
two 10 K RPM SCSI Ultra II hard
drives
Linux 2.5.70
Compared Capriccio, LinuxThreads,
and Native POSIX Threads for Linux
Latencies of Thread Primitives

Latency in microseconds:

                         Capriccio   LinuxThreads   NPTL
Thread creation          21.5        21.5           17.7
Thread context switch    0.24        0.71           0.65
Uncontended mutex lock   0.04        0.14           0.15
Thread Scalability




Producer-consumer microbenchmark
LinuxThreads begins to degrade after 20 threads
NPTL degrades after 100
Capriccio scales to 32K producers and
consumers (64K threads total)
I/O Performance

Network performance





Token passing among pipes
Simulates the effect of slow client links
10% overhead compared to epoll
Twice as fast as both LinuxThreads and NPTL with more than 1000 threads
Disk I/O comparable to kernel threads
Linked Stack Management


LinuxThreads allocates 2 MB per stack
1 GB of VM then holds only about 500 threads
Fixed Stacks
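The arithmetic behind that figure: a fixed 2 MB reservation per stack exhausts 1 GB of virtual address space after roughly 500 threads.

```python
# 1 GB of virtual memory divided by a fixed 2 MB stack reservation
GB = 1 << 30
MB = 1 << 20
print(GB // (2 * MB))   # 512
```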
Linked Stack Management


But most threads consume only a few KB of stack space at any given time
Dynamic stack allocation can significantly reduce virtual memory usage
Linked Stack
Compiler Analysis and Linked Stacks

Whole-program analysis



Based on the call graph
Problematic for recursion
Static estimation may be too conservative
Compiler Analysis and Linked Stacks

Grow and shrink the stack size on
demand


Insert checkpoints that determine whether more stack must be allocated before the next checkpoint
Results in noncontiguous stacks
Placing Checkpoints


Place one checkpoint in every cycle of the call graph
Bound stack growth between checkpoints by the deepest call path
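At runtime, each checkpoint compares the space left in the current stack chunk against the compiler's worst-case bound on stack use before the next checkpoint, linking in a fresh chunk when needed. A simplified sketch (Python for illustration; the chunk size is invented, and real chunk freeing and shrinking are omitted):

```python
# Stack-linking checkpoint sketch: the compiler inserts checkpoint(bound)
# calls, where bound is the worst-case stack use until the next checkpoint.
CHUNK_SIZE = 4096   # bytes per stack chunk (illustrative value)

class LinkedStack:
    def __init__(self):
        self.chunks = [CHUNK_SIZE]          # free bytes left in each chunk

    def checkpoint(self, bound_to_next):
        if self.chunks[-1] < bound_to_next:
            self.chunks.append(CHUNK_SIZE)  # link a fresh, noncontiguous chunk
        self.chunks[-1] -= bound_to_next    # frames consume space from here

stack = LinkedStack()
stack.checkpoint(3000)              # fits in the first chunk
stack.checkpoint(3000)              # would overflow, so a chunk is linked
print(len(stack.chunks))            # 2
```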
Dealing with Special Cases

Function pointers


Don’t know what procedure to call at
compile time
Can find a potential set of procedures
Dealing with Special Cases

External functions


Allow programmers to annotate external
library functions with trusted stack
bounds
Allow larger stack chunks to be linked for
external functions
Tuning the Algorithm

Stack space can be wasted


Internal and external fragmentation
Tradeoffs


Number of stack linkings
External fragmentation
Memory Benefits


Tuning can be application-specific
No preallocation of large stacks


Reduced memory requirement for running a large number of threads
Better paging behavior

Stacks are used in LIFO order
Case Study: Apache 2.0.44



Maximum stack allocation chunk: 2KB
Apache under SPECweb99
Overall slowdown is about 3%



Dynamic allocation: 0.1%
Linking large chunks for external functions: 0.5%
Stack removal: 10%
Resource-Aware Scheduling

Advantages of event-based scheduling



Tailored for applications with event handlers
Events provide two important pieces of
information for scheduling


Whether a process is close to completion
Whether a system is overloaded
Resource-Aware Scheduling



Thread-based
View applications as a sequence of stages separated by blocking calls
Analogous to event-based scheduler
Blocking Graph



Node: a location in the program that blocked
Edge: connects two nodes that were consecutive blocking points
Generated at runtime
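Building the blocking graph at runtime is cheap: each time a thread blocks, the location becomes a node, and an edge is recorded from the thread's previous blocking point to the current one. A sketch (Python for illustration; names are invented):

```python
# Runtime blocking-graph construction sketch.
from collections import defaultdict

class BlockingGraph:
    def __init__(self):
        self.edges = defaultdict(int)   # (prev, curr) -> traversal count
        self.last = {}                  # thread id -> last blocking location

    def blocked(self, thread, location):
        prev = self.last.get(thread)
        if prev is not None:
            self.edges[(prev, location)] += 1
        self.last[thread] = location

g = BlockingGraph()
for loc in ["accept", "read", "write", "read", "write"]:
    g.blocked("t1", loc)                # one thread's blocking sequence
print(dict(g.edges))
# {('accept', 'read'): 1, ('read', 'write'): 2, ('write', 'read'): 1}
```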
Resource-Aware Scheduling
1. Keep track of resource utilization
2. Annotate each node and its outgoing edges with the resources used
3. Dynamically prioritize nodes

Prefer nodes that release resources
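The prioritization step can be sketched as follows (Python for illustration; the 90% threshold and all names are invented, not Capriccio's policy): when a resource nears its limit, pick the runnable thread whose current blocking-graph node tends to release that resource.

```python
# Resource-aware prioritization sketch.
from collections import namedtuple

Thread = namedtuple("Thread", "name node")

def pick_next(runnable, node_delta, usage, limit):
    """node_delta: blocking-graph node -> average change in resource use
    at that node (negative means the node releases the resource)."""
    if usage / limit > 0.9:                       # resource nearly exhausted
        return min(runnable, key=lambda t: node_delta[t.node])
    return runnable[0]                            # no pressure: plain FIFO

delta = {"alloc_stage": +10, "free_stage": -10}   # e.g. bytes per pass
threads = [Thread("producer", "alloc_stage"), Thread("consumer", "free_stage")]
print(pick_next(threads, delta, usage=95, limit=100).name)   # consumer
print(pick_next(threads, delta, usage=10, limit=100).name)   # producer
```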
Resources



CPU
Memory (malloc)
File descriptors (open, close)
Pitfalls

Tricky to determine the maximum
capacity of a resource

Thrashing depends on the workload


Resources interact


The disk can handle more sequential requests than random ones
VM vs. disk
Applications may manage memory
themselves
Yield Profiling



User-level threads are problematic if a thread fails to yield
Such failures are easy to detect, since the run times between yields are orders of magnitude larger than normal
Yield profiling identifies places where
programs fail to yield sufficiently often
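A sketch of the detection step (Python for illustration; the outlier factor is invented): collect the run lengths between yields per program location, and flag locations far above the typical interval.

```python
# Yield-profiling sketch: find locations that run far too long between yields.
def find_bad_yields(intervals, factor=100):
    """intervals: list of (location, seconds_run_between_yields)."""
    times = sorted(t for _, t in intervals)
    median = times[len(times) // 2]
    return sorted({loc for loc, t in intervals if t > factor * median})

samples = [("request_loop", 0.0001)] * 9 + [("parse_config", 0.5)]
print(find_bad_yields(samples))   # ['parse_config']
```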
Web Server Performance





4x500 MHz Pentium server
2GB memory
Intel e1000 Gigabit Ethernet card
Linux 2.4.20
Workload: requests for 3.2 GB of
static file data
Web Server Performance



Request frequencies match those of
the SPECweb99
Each client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses
Apache’s performance improved by
15% with Capriccio
Resource-Aware Admission Control




Producer-consumer application
The producer loops, allocating memory into a shared pool and randomly touching pages
The consumer loops, removing memory from the pool and freeing it
Fast producer may run out of virtual
address space
Resource-Aware Admission Control


Touching pages too quickly will cause
thrashing
Capriccio can quickly detect the
overload conditions and limit the
number of producers
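Admission control of this kind can be sketched as hysteresis on resource usage (Python for illustration; the watermark values are invented): stop admitting new producers above a high-water mark and resume below a low-water mark, so admission does not oscillate at the boundary.

```python
# Resource-aware admission control sketch with hysteresis.
class AdmissionControl:
    def __init__(self, high=0.9, low=0.7):
        self.high, self.low = high, low
        self.admitting = True

    def update(self, used, capacity):
        frac = used / capacity
        if frac >= self.high:
            self.admitting = False    # overloaded: stop admitting new work
        elif frac <= self.low:
            self.admitting = True     # recovered: admit again
        return self.admitting

ac = AdmissionControl()
print(ac.update(95, 100))   # False: overloaded
print(ac.update(80, 100))   # False: still between the watermarks
print(ac.update(60, 100))   # True: recovered
```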
Programming Models for High Concurrency

Event


Application-specific optimization
Thread

Efficient thread runtimes
User-Level Threads

Capriccio is unique




Blocking graph
Resource-aware scheduling
Targets a large number of blocking threads
POSIX compliant
Application-Specific Optimization


Most approaches require programmers
to tailor their application to manage
resources
Nonstandard APIs, less portable
Stack Management

No garbage collection
Future Work


Multi-CPU machines
Profiling tools for system tuning