
Seven O’Clock: A New Distributed GVT Algorithm using Network Atomic Operations

David Bauer, Garrett Yaun, Christopher Carothers (Computer Science)

Murat Yuksel, Shivkumar Kalyanaraman (ECSE)

Global Virtual Time

Defines a lower bound on any unprocessed event in the system.

Defines the point beyond which events should not be reclaimed.

! Imperative that GVT computation operate as efficiently as possible.

Key Problems

Simultaneous Reporting Problem: arises “because not all processors will report their local minimum at precisely the same instant in wallclock time”.

Transient Message Problem: a message is delayed in the network and neither the sender nor the receiver considers that message in their respective GVT calculation.

Asynchronous Solution: create a synchronization, or “cut”, across the distributed simulation that divides events into two categories: past and future.

Consistent Cut: a cut where there is no message scheduled in the future of the sending processor, but received in the past of the destination processor.
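To make the cut test concrete, here is a minimal C sketch (the message fields and the per-processor cut_time array are illustrative assumptions, not the paper’s data structures): a message breaks consistency exactly when it is sent after the sender’s cut point but received before the destination’s cut point.

    /* Hypothetical sketch: checking whether a single message violates a cut.
     * cut_time[p] is the wallclock instant at which processor p observed the
     * cut; the struct and array are illustrative only. */
    #include <stdbool.h>

    struct message {
        int    src, dst;        /* sending and receiving processors      */
        double sent_at;         /* wallclock time the event was sent     */
        double received_at;     /* wallclock time the event was received */
    };

    /* True if the message was scheduled in the future of the sending
     * processor but received in the past of the destination processor,
     * i.e. the cut is not consistent. */
    bool violates_cut(const struct message *m, const double *cut_time)
    {
        return m->sent_at     > cut_time[m->src] &&
               m->received_at < cut_time[m->dst];
    }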

Mattern’s GVT Algorithm

Construct cut via message-passing

Cost: O(log N) if tree, O(N) if ring

! With a large number of processors, the free event pool may be exhausted while waiting for the GVT computation to complete

Fujimoto’s GVT Algorithm

Construct cut using shared memory flag

Cost: O(1)

Sequentially consistent memory model ensures proper causal order

! Limited to shared-memory architectures

Memory Model

Sequentially consistent does not mean instantaneous

Memory events are only guaranteed to be causally ordered

Is there a method to achieve sequentially consistent shared memory in a loosely coordinated, distributed environment?

GVT Algorithm Differences

                              Fujimoto             7 O’Clock          Mattern              Samadi
Cost of Cut Calculation       O(1)                 O(1)               O(N) or O(log N)     O(N) or O(log N) *
Parallel / Distributed        P                    P+D                P+D                  P+D
Global Invariant              Shared Memory Flag   Real Time Clock    Message Passing      Message Passing
Independent of Event Memory   N                    Y                  N                    N

* cost of algorithm much higher

Network Atomic Operations

Goal: each processor observes the “start” of the GVT computation at the same instant of wall clock time.

Definition: An NAO is an agreed upon frequency in wall clock time at which some event is logically observed to have happened across a distributed system.

Network Atomic Operations

[Figure: wall-clock timeline of processors alternating “Update Tables” and “Compute GVT” at each NAO, contrasted with the possible operations provided by a complete sequentially consistent memory model.]
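As a rough illustration of how an NAO can drive the cut, here is a minimal C sketch of the trigger loop. The period nao_period, the synchronized wallclock_now() source and the other helpers are assumptions for illustration, not the authors’ implementation.

    /* Hypothetical sketch of an NAO-driven trigger loop.
     * nao_period is the agreed-upon NAO frequency in wallclock seconds;
     * the extern helpers stand in for the simulator's own routines. */
    extern double wallclock_now(void);           /* synchronized wallclock time */
    extern int    simulation_done(void);
    extern void   process_some_events(void);     /* optimistic event processing */
    extern double compute_gvt(double cut_time);  /* sketched later in the talk  */

    void run_with_nao(double nao_period)
    {
        double next_nao = wallclock_now() + nao_period;

        while (!simulation_done()) {
            process_some_events();

            if (wallclock_now() >= next_nao) {
                /* Every processor crosses this boundary at (nearly) the same
                 * wallclock instant, so all of them logically observe the
                 * "start" of the GVT computation for the same cut. */
                compute_gvt(next_nao);
                next_nao += nao_period;
            }
        }
    }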

Clock Synchronization

• Assumption: all processors share a highly accurate, common view of wall clock time.

• Basic building block: CPU timestamp counter

– Measures time in terms of clock cycles, so a gigahertz CPU clock has a granularity of 10^-9 secs

– Sending events across the network has much larger granularity, depending on technology: ~10^-6 secs on 1000base/T
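A minimal sketch of the timestamp-counter building block, assuming an x86 CPU and the GCC/Clang __rdtsc() intrinsic (the Itanium machines used later in the talk would read their interval time counter instead); cycles_per_sec is an assumed calibration value, not something the hardware reports directly.

    /* Hypothetical sketch: wallclock time derived from the CPU timestamp counter. */
    #include <stdint.h>
    #include <x86intrin.h>                  /* __rdtsc() on GCC/Clang, x86 only */

    static double cycles_per_sec = 1.0e9;   /* calibrate once, e.g. against gettimeofday() */

    /* Roughly nanosecond granularity on a GHz-class CPU; drift between CPUs
     * still has to be absorbed elsewhere (see Max Send Delta-t). */
    static inline double wallclock_now(void)
    {
        return (double)__rdtsc() / cycles_per_sec;
    }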

Clock Synchronization

• Issues: clock synchronization, drift and jitter

• Ostrovsky and Patt-Shamir:

– provably optimal clock synchronization

– clocks have drift and the message latency may be unbounded

• Well researched problem in distributed computing

– we used a simplified approach

– the simplified approach is helpful in determining whether the system is working properly

Max Send Δt

• Definition: max_send_delta_t is the maximum of:

– worst case bound on the time to send an event through the network

– twice synchronization error

– twice max clock drift over simulation time

• add a small amount of time to the NAO expiration

– Similar to sequentially consistent memory

• Overcomes:

– Transient message problem, clock drift/jitter and clock synchronization error

Max Send Δt: clock drift

• Clock drift causes CPU clocks to become unsynchronized

– Long running simulations may require multiple synchs

– Or, we account for it in the NAO

• Max Send Δt overcomes clock drift by ensuring no event “falls between the cracks”

Max Send Δt

• What if clocks are not well synched?

– Let ΔD_max be the maximum clock drift.

– Let ΔS_max be the maximum synchronization error.

• Solution: Re-define Δt_max as

Δt'_max = max( Δt_max, 2*ΔD_max, 2*ΔS_max )

• In practice both ΔD_max and ΔS_max are very small in comparison to Δt_max.

[Figure: wallclock timeline for LP1 and LP2 showing successive GVT computations, each window of length Δt_max padded by ΔD_max on either side.]
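A minimal sketch of how this padding could be computed and applied, assuming the three bounds are measured or estimated elsewhere (all names are illustrative):

    /* Hypothetical sketch: Max Send Delta-t and the padded NAO expiration. */
    static double max3(double a, double b, double c)
    {
        double m = a > b ? a : b;
        return c > m ? c : m;
    }

    double compute_max_send_dt(double delta_t_max,   /* worst-case network send time  */
                               double delta_D_max,   /* maximum clock drift           */
                               double delta_S_max)   /* maximum synchronization error */
    {
        return max3(delta_t_max, 2.0 * delta_D_max, 2.0 * delta_S_max);
    }

    /* The NAO then expires slightly after the agreed boundary, so that delayed
     * (transient) messages, drift and synchronization error cannot let an event
     * "fall between the cracks":
     *     nao_expires_at = next_nao + compute_max_send_dt(...);                  */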

Transient Message Problem

• Max Send Δt: worst-case bound on the time to send an event through the network

– guarantees events are accounted for by either sender or receiver

Simultaneous Reporting Problem

• Problem arises when processors do not start the GVT computation simultaneously

• Seven O’Clock does start simultaneously across all CPUs; therefore, the problem cannot occur

Simulation:

[Figure: example with LPs A–E and pending events at timestamps 5, 7, 9 and 10; processors report local minima (LVT: 7, LVT: min(5,9), LVT: 5) and the cut yields GVT: min(5,7).]

Seven O’Clock GVT Algorithm

• Assumptions:

– Each processor has a highly accurate clock

– A message passing interface w/o ack is available

– The worst-case bound Δt_max on the time to transmit a message through the network is known

• Properties:

– a clock-based algorithm for distributed processors

– creates a sequentially consistent view of distributed memory

[Figure: LP1–LP4 on a wallclock timeline with NAO cut points; events at timestamps 5, 7, 9, 10 and 12 give LVT=min(5,9) and LVT=min(7,9), and GVT=min(5,7) at the cut point; GVT #1 and GVT #2 each complete within Δt_max of their NAO.]
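Combining the pieces above, a condensed sketch of one GVT round under these assumptions. MPI_Allreduce is used here only as one plausible way to take the global minimum; the algorithm itself assumes only a message passing interface without acknowledgements, and the extern helpers are placeholders.

    /* Hypothetical sketch of one Seven O'Clock GVT round. */
    #include <mpi.h>

    extern double wallclock_now(void);
    extern double max_send_dt;                  /* Max Send Delta-t, see earlier slides */
    extern void   process_some_events(void);
    extern double local_virtual_time_minimum(void);
    extern void   fossil_collect(double gvt);

    double compute_gvt(double cut_time /* NAO boundary in wallclock time */)
    {
        /* 1. Let the padded cut expire, so every transient message sent before
         *    the cut has had time to arrive and be counted by its receiver.    */
        while (wallclock_now() < cut_time + max_send_dt)
            process_some_events();

        /* 2. Local minimum over unprocessed and in-flight events (the LVT).    */
        double lvt = local_virtual_time_minimum();

        /* 3. Global minimum across all processors.                             */
        double gvt;
        MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

        fossil_collect(gvt);     /* events older than GVT may now be reclaimed  */
        return gvt;
    }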

Limitations

• NAOs cannot be “forced”

– agreed upon intervals cannot change

• Simulation End Time

– worst case: a complete NAO passes with only one event remaining to process

– amortized over the entire run-time, the cost is O(1)

• Exhausted Event Pool

– requires tuning to ensure enough optimistic memory available

Uniqueness

• Only real-time based GVT algorithm

• Zero-cost consistent cut → truly scalable

– O(1) cost → optimal

• Only algorithm which is entirely independent of available event memory

– Event memory loosely tied to GVT algorithm

Performance Analysis: Models

r-PHOLD

• PHOLD with reverse computation

• Modified to control the percent of remote events (normally 75%)

• Destinations still decided using a uniform random number generator → all LPs are possible destinations (see the sketch after this slide)

TCP-Tahoe

• TCP-Tahoe ring of campus networks topology

• Same topology design as used by PDNS in MASCOTS ’03

• Model limitations required us to increase the number of LAN routers in order to simulate the same network
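A minimal sketch of one way the remote-event percentage might be controlled in r-PHOLD while keeping destination choice uniform; the remote_fraction parameter, the LP ownership layout and the rand() usage are illustrative assumptions, not the model’s actual code.

    /* Hypothetical sketch: choosing an r-PHOLD destination so that roughly
     * remote_fraction of events (e.g. 0.75) land on another processor. */
    #include <stdlib.h>

    int pick_destination(int my_first_lp, int my_num_lps, int total_lps,
                         double remote_fraction)
    {
        int want_remote = total_lps > my_num_lps &&
                          (double)rand() / RAND_MAX < remote_fraction;

        if (want_remote) {
            /* remote: uniform among LPs owned by other processors */
            int dest;
            do {
                dest = rand() % total_lps;
            } while (dest >= my_first_lp && dest < my_first_lp + my_num_lps);
            return dest;
        }
        /* local: uniform among this processor's own LPs */
        return my_first_lp + rand() % my_num_lps;
    }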

Performance Analysis: Clusters

              Itanium Cluster          NetSim Cluster              Sith Cluster
Location      RPI                      RPI                         Georgia Tech
Total Nodes   4                        40                          30
Total CPU     16                       80                          60
Total RAM     64GB                     20GB                        180GB
CPU           Quad Itanium-2 1.3GHz    Dual Intel 800MHz           Dual Itanium-2 900MHz
Network       Myrinet 1000base/T       ½ 100base/T, ½ 1000base/T   Ethernet 1000base/T

Itanium Cluster: r-PHOLD, CPUs allocated round-robin

Maximize distribution (round robin among nodes) VERSUS

Maximize parallelization (use all CPUs before using additional nodes)

NetSim Cluster: Comparing 10- and 25% remote events

(using 1 CPU per node)


TCP Model Topology

[Figure: a single campus network and 10 campus networks connected in a ring.]

Our model contained 1,008 campus networks in a ring, simulating > 540,000 nodes.

Itanium Cluster: TCP results using 2- and 4-nodes

Sith Cluster: TCP Model using 1 CPU per node and 2 CPUs per node

Future Work & Conclusions

• Investigate “power” of different models by computing spectral analysis

– GVT now in frequency domain

– Determine max length of rollbacks

• Investigate new ways of measuring performance

– Models too large to run sequentially

– Account for hardware effects (even in a NOW there are fluctuations in HW performance)

– Account for model LP mapping

– Account for different cases, i.e., 4 CPUs distributed across 1, 2, and 4 nodes
