Digital Video Cluster Simulation
Martin Milkovits
CS699 – Professional Seminar
April 26, 2005
Goal of Simulation

• Build an accurate performance model of the interconnecting fabrics in a Digital Video cluster
Assumptions

• The RAID controller would follow a triangular distribution of I/O interarrival times
• The Gigabit Ethernet IP edge card would not impose any backpressure on the I/Os
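As a quick sketch of how such an interarrival process could be generated (the min/mode/max values below are placeholders for illustration, not parameters taken from the model):

```python
import random

def triangular_interarrival_times(n, low_ms=6.0, mode_ms=8.0, high_ms=10.0, seed=1):
    """Draw n I/O interarrival times (ms) from a triangular distribution.

    The low/mode/high parameters are illustrative placeholders only; the
    slides do not give the actual values used in the model.
    """
    rng = random.Random(seed)
    return [rng.triangular(low_ms, high_ms, mode_ms) for _ in range(n)]

print([round(t, 2) for t in triangular_interarrival_times(5)])
```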
Fabrics Simulated

Table 1: Connection Technologies

Fabric                    Type                 Per-Link/Bus Actual   Hardware Device                Buffers               Ports
                                               Bandwidth (Gbps)
PCI 2.2 (66 MHz/64-bit)   Parallel – bridged   3.934                 n/a                            n/a                   n/a
StarFabric (SF)           Full-duplex serial   1.77                  StarGen 2010 Bridge            Per SF port and PCI   2 – StarFabric, 1 – 64/66 PCI
                                                                     StarGen 1010 Switch            Per SF port           6 – StarFabric
InfiniBand (IB)           Full-duplex serial   2.0                   Mellanox 21108 Bridge/Switch   Per IB port and PCI   8 – 1X InfiniBand, 1 – 64/66 PCI
Digital Video Cluster
Digital Video Node
Modules, Connections and Messages

• Messages represent data packets AND are used to control the model
  – Data packets have a non-zero length parameter
  – Contain routing and source information
• Modules handle message processing and routing
  – By and large represent hardware in the system
  – PCI Bus module – not actual hardware, but necessary to simulate a bus architecture
• Connections allow messages to flow between modules
  – Represent links/busses
  – Independent connections for data vs. control messages
  – May be configured with a data rate value to simulate transmission delay
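The module/connection/message vocabulary matches the style of discrete-event network simulators; the minimal, framework-free sketch below illustrates how a connection's data rate would turn a message's length into a transmission delay. The class and field names are invented for illustration, not taken from the model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    """Simulated message; control messages carry zero length."""
    source: str
    destination: str
    length_bits: int = 0                    # non-zero only for data packets (RWMs)

@dataclass
class Connection:
    """A link/bus between two modules, optionally rate-limited."""
    data_rate_bps: Optional[float] = None   # None -> control connection, no delay

    def transmission_delay(self, msg: Message) -> float:
        """Seconds needed to clock the message onto the link."""
        if self.data_rate_bps is None or msg.length_bits == 0:
            return 0.0
        return msg.length_bits / self.data_rate_bps

# Example: a 1024-byte data message over a 1.77 Gbps StarFabric link
rwm = Message(source="RAID0", destination="GigE0", length_bits=1024 * 8)
sf_link = Connection(data_rate_bps=1.77e9)
print(f"{sf_link.transmission_delay(rwm) * 1e6:.3f} us")
```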
Managing Buffer/Bus Access

• Before transferring a data message (RWM), a module needs to gain access to the transfer link/bus and to the destination buffer

[Flowchart: when an RWM arrives, the module sends a rqst message to the Buffer module if its queue is empty, otherwise it enqueues the RWM; when the rqst is returned, the RWM is popped from the queue and sent.]
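A small sketch of that request-and-grant flow, with the behavior read off the flowchart. The class name, the callback names, and the re-request after a grant are my own interpretation for illustration, not taken from the model.

```python
from collections import deque

class SenderPort:
    """Sender-side access protocol: hold RWMs until the Buffer module grants access."""

    def __init__(self, send_rqst, send_rwm):
        self.queue = deque()        # RWMs waiting for link/bus and buffer access
        self.send_rqst = send_rqst  # callback: ask the Buffer module for access
        self.send_rwm = send_rwm    # callback: transfer the data message

    def on_rwm(self, rwm):
        """A data message (RWM) is ready to be transferred."""
        if not self.queue:
            self.send_rqst(rwm)     # queue empty: request access right away
        self.queue.append(rwm)      # hold the RWM until access is granted

    def on_rqst_returned(self, rqst):
        """The Buffer module returned the rqst, i.e. access was granted."""
        rwm = self.queue.popleft()  # pop RWM from queue and send it
        self.send_rwm(rwm)
        if self.queue:              # assumed: re-request for the next waiting RWM
            self.send_rqst(self.queue[0])

# Tiny usage example with print callbacks
port = SenderPort(send_rqst=lambda m: print("rqst ->", m),
                  send_rwm=lambda m: print("RWM  ->", m))
port.on_rwm("RWM-1")
port.on_rwm("RWM-2")
port.on_rqst_returned("granted")
port.on_rqst_returned("granted")
```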
PCI Bus Challenges

• Maintain bus fairness
• Allow multiple PCI bus masters to interleave transactions (account for retry overhead)
• Allow bursting if only one master
PCI Bus Module Components

• Queue – pending RWMs
• pciBus[maxDevices] array – utilization key
• reqArray[maxDevices] – pending rqst messages
• Work area – manages the RWM actually being transferred by the PCI bus
• Three message types to handle:
  – rqst messages from PCI bus masters
  – RWM messages
  – qCheck self-messages
Handling rqst and RWM Messages

[Flowchart: for an incoming rqst, if pciRes[devNum] is not utilized it is marked utilized and the rqst is returned; otherwise the rqst is stored in reqArray[devNum]. For an incoming RWM, if the work area is not busy the RWM moves into the work area and a qCheck message is scheduled; otherwise the RWM is enqueued.]

• When an RWM finally hits the work area:
  – Set RWM.transfer = length of message (1024)
  – Schedule a qCheck self-message to fire in 240 ns (time to transfer 128 bits)

(A consolidated code sketch of these handlers follows the qCheck slide below.)
Handling qCheck Messages

[Flowchart: each qCheck message decrements the work-area RWM's transfer count. If transfer reaches 0, the RWM is sent, any pending rqst in reqArray[devNum] is returned, and, if the queue is not empty, the next RWM moves from the queue to the work area and the qCheck restarts 240 ns later; with nothing left to do, the qCheck message is deleted. If transfer has not reached 0, pciBus[devNum] is incremented and, if other RWMs are queued, the work-area RWM is swapped with a queued one (adding PCI retry overhead) before the qCheck restarts 240 ns later.]
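Tying the last three slides together, here is a minimal, self-contained Python sketch of the PCI Bus module's state (queue, pciBus/pciRes/reqArray arrays, work area) and its rqst/RWM/qCheck handling. The real model presumably runs inside a discrete-event simulation framework; this sketch is framework-free, every name not on the slides is invented, and the branch ordering in the qCheck handler reflects my reading of the flowcharts rather than the original code.

```python
import heapq
import itertools
from collections import deque

QCHECK_PERIOD_NS = 240          # slide value: time to transfer 128 bits

class RWM:
    """A data (read/write) message crossing the PCI bus."""
    def __init__(self, dev_num, length=1024):
        self.dev_num = dev_num
        self.length = length
        self.transfer = 0        # remaining count, set on entering the work area

class PciBusModule:
    def __init__(self, max_devices, send, schedule):
        self.queue = deque()                     # pending RWMs
        self.pci_bus = [0] * max_devices         # per-device utilization counters
        self.req_array = [None] * max_devices    # pending rqst messages
        self.pci_res = [False] * max_devices     # per-device "utilized" flags
        self.work = None                         # RWM currently being transferred
        self.send = send                         # callback: deliver a message onward
        self.schedule = schedule                 # callback: schedule(delay_ns, fn)

    def handle_rqst(self, rqst, dev_num):
        if not self.pci_res[dev_num]:
            self.pci_res[dev_num] = True
            self.send(("rqst returned", dev_num))    # grant bus access immediately
        else:
            self.req_array[dev_num] = rqst           # hold until the device frees up

    def handle_rwm(self, rwm):
        if self.work is None:
            self._start(rwm)
        else:
            self.queue.append(rwm)

    def _start(self, rwm):
        self.work = rwm
        rwm.transfer = rwm.length                    # slide: transfer = message length (1024)
        self.schedule(QCHECK_PERIOD_NS, self.handle_qcheck)

    def handle_qcheck(self):
        rwm = self.work
        rwm.transfer -= 1                            # one unit per 240 ns qCheck tick
        if rwm.transfer == 0:                        # transfer complete
            self.send(("RWM", rwm.dev_num))
            if self.req_array[rwm.dev_num] is not None:
                self.send(("rqst returned", rwm.dev_num))
                self.req_array[rwm.dev_num] = None
            if self.queue:
                self._start(self.queue.popleft())    # next RWM takes the work area
            else:
                self.work = None                     # nothing left: qCheck dies here
        else:                                        # still transferring
            self.pci_bus[rwm.dev_num] += 1
            if self.queue:                           # interleave with another master
                self.queue.append(rwm)               # (retry overhead not modeled here)
                nxt = self.queue.popleft()
                if nxt.transfer == 0:                # first time in the work area
                    nxt.transfer = nxt.length
                self.work = nxt
            self.schedule(QCHECK_PERIOD_NS, self.handle_qcheck)

# Minimal event loop to exercise the sketch
events, counter, now = [], itertools.count(), [0]
def schedule(delay_ns, fn):
    heapq.heappush(events, (now[0] + delay_ns, next(counter), fn))

bus = PciBusModule(4, send=lambda m: print(f"{now[0]:>5} ns: {m}"), schedule=schedule)
bus.handle_rqst("rqst", dev_num=0)
bus.handle_rwm(RWM(dev_num=0, length=4))             # tiny length keeps the trace short
while events:
    t, _, fn = heapq.heappop(events)
    now[0] = t
    fn()
```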
Determining Max Bandwidth

[Chart: RAID bandwidth comparison under a target bandwidth range of 90–130 MBps; delivered bandwidth (MBps) for RAID0 and RAID1 on Node0–Node6 is plotted against the target bandwidth (MBps).]
Simulation Ramp-up

[Chart: ramp-up calculation graph with window w = 10; the smoothed output (roughly 118–122 MBps) for RAID0 and RAID1 on Node0–Node6 is plotted over the first ~106 observations.]
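The w = 10 window suggests a Welch-style moving-average warm-up analysis; the sketch below shows one common way such a smoothed ramp-up curve is computed. The method and the toy data are assumptions for illustration; the slides only show the resulting graph.

```python
def welch_moving_average(series, w=10):
    """Smooth a series with a window of up to w points on each side,
    shrinking the window near the start (Welch-style warm-up smoothing)."""
    smoothed = []
    for i in range(len(series)):
        half = min(w, i)                          # smaller window near the start
        window = series[i - half : i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Toy observations settling near 120 MBps (not the simulation's data)
obs = [110, 114, 117, 119, 120, 121, 119, 120, 120, 121, 120, 119, 120]
print([round(x, 1) for x in welch_moving_average(obs, w=3)])
```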
105-Second @ 120 MBps Results

Node   RAID   Sample Mean (MBps)   90% Confidence Interval
0      0      119.997              0.002
0      1      119.999              0.002
1      0      120.001              0.004
1      1      119.998              0.002
2      0      120.000              0.005
2      1      120.002              0.004
3      0      120.005              0.005
3      1      120.007              0.009
4      0      119.999              0.002
4      1      120.000              0.002
5      0      119.998              0.004
5      1      120.002              0.002
6      0      119.999              0.005
6      1      120.001              0.004
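For reference, a sketch of how a sample mean and 90% confidence interval like those above are typically computed from independent replications (the replication values here are made up, not the simulation's data):

```python
import math
import statistics
from scipy import stats   # Student t critical value

def mean_and_ci90(replication_means):
    """Sample mean and 90% confidence-interval half-width."""
    n = len(replication_means)
    mean = statistics.mean(replication_means)
    std_err = statistics.stdev(replication_means) / math.sqrt(n)
    t_crit = stats.t.ppf(0.95, df=n - 1)      # two-sided 90% -> 0.95 quantile
    return mean, t_crit * std_err

# Made-up replication means (MBps)
reps = [119.996, 119.999, 119.995, 119.998, 119.997]
m, hw = mean_and_ci90(reps)
print(f"{m:.3f} +/- {hw:.3f} (90% CI)")
```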
Contention / Utilization / Capacity

[Chart: utilization, contention, and capacity on the data path from Node0 RAID0 to Node1 GigE0, with percent utilized/contended on one axis and bandwidth (MBps) on the other for each hop along the path (RAID, StarGen, InfiniBand, and GigE links).]
Learning Experiences

• PCI contention
  – First modeled as a link like any other, maintained by the StarGen chip; eventually handled by the dedicated PCI Bus module described earlier
• Buffer contention and access
  – Originally used retry loops, like the actual system – way too much processing time!
  – Retry messages that are returned are a natural design given the language of messages and connections
Conclusion / Future Work

• Simulation performed within 7% of actual system performance
• PCI bus between IB and StarGen is a potential hotspot
• Complete more iterations with minor system modifications (dualDMA, scheduling)
• Submitted paper to the Winter Simulation Conference