Digital Video Cluster Simulation
Martin Milkovits
CS699 – Professional Seminar
April 26, 2005

Goal of Simulation
- Build an accurate performance model of the interconnecting fabrics in a Digital Video cluster
- Assumptions:
  - The RAID controller follows a triangular distribution of I/O interarrival times
  - The Gigabit Ethernet IP edge card exerts no backpressure on the I/Os

Fabrics Simulated (Table 1: Connection Technologies)
- PCI 2.2 (66 MHz / 64-bit): parallel, bridged; 3.934 Gbps actual bandwidth per bus; hardware device, buffers, ports: n/a
- StarFabric (SF): full duplex serial; 1.77 Gbps actual bandwidth per link
  - StarGen 2010 Bridge – buffers per SF port and PCI; ports: 2 StarFabric, 1 64/66 PCI
  - StarGen 1010 Switch – buffers per SF port; ports: 6 StarFabric
- InfiniBand (IB): full duplex serial; 2.0 Gbps actual bandwidth per link
  - Mellanox 21108 Bridge/Switch – buffers per IB port and PCI; ports: 8 1X InfiniBand, 1 64/66 PCI

Digital Video Cluster
- (cluster topology diagram)

Digital Video Node
- (node block diagram)

Modules, Connections and Messages
- Messages represent data packets and are also used to control the model
  - Data packets have a non-zero length parameter
  - Messages carry routing and source information
- Modules handle message processing and routing
  - By and large they represent hardware in the system
  - The PCI Bus module is not actual hardware, but is necessary to simulate a bus architecture
- Connections allow messages to flow between modules and represent links/busses
  - Independent connections carry data vs. control messages
  - A connection may be configured with a data rate to simulate transmission delay

Managing Buffer/Bus Access
- Before transferring a data message (RWM), the sender must gain access to the transfer link/bus and to the destination buffer
- When an RWM is ready: if the queue is empty, send a rqst message to the Buffer module; otherwise enqueue the RWM
- When the rqst is returned (access granted): pop the next RWM from the queue and send it

PCI Bus Challenges
- Maintain bus fairness
- Allow multiple PCI bus masters to interleave transactions (and account for retry overhead)
- Allow bursting when only one master is active

PCI Bus Module Components
- Queue – pending RWMs
- pciBus[maxDevices] array – per-device utilization key
- reqArray[maxDevices] – pending rqst messages
- Work area – manages the RWM actually being transferred by the PCI bus
- Three message types to handle: rqst messages from PCI bus masters, RWM messages, and qCheck self-messages

Handling rqst and RWM Messages
- rqst: if pciBus[devNum] is not utilized, mark it utilized and return the rqst (granting access); otherwise store the rqst in reqArray[devNum]
- RWM: if the work area is free, move the RWM into the work area and schedule a qCheck self-message; otherwise enqueue the RWM
- When an RWM finally reaches the work area: set RWM.transfer to the message length (1024 bytes) and schedule a qCheck self-message to fire in 240 ns, the time to transfer one 128-byte chunk across the 64-bit / 66 MHz bus

Handling qCheck Messages
- On each qCheck, decrement the work-area RWM's transfer count by one chunk
- If RWM.transfer == 0 (transfer complete):
  - Send the RWM on to its destination
  - If a rqst is pending in reqArray[devNum], return it (granting that master the bus); otherwise increment pciBus[devNum], releasing the reservation
  - If the queue is not empty, move the next RWM from the queue into the work area and restart the qCheck self-message 240 ns out; otherwise delete the qCheck (the bus goes idle)
- If the transfer is not complete:
  - If the queue is not empty, swap the work-area RWM with a queued RWM (adding PCI retry overhead) and restart the qCheck 240 ns out
  - If the queue is empty, the lone master keeps bursting: restart the qCheck 240 ns out
- (A code sketch of this handling follows.)
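The deck does not name the simulation framework, but its vocabulary of modules, connections, messages, and self-messages matches an OMNeT++-style discrete-event simulator. Purely to make the rqst / RWM / qCheck control flow above concrete, here is a minimal, framework-neutral C++ sketch; the class name, device count, retry penalty, and stubbed print/schedule hooks are illustrative assumptions rather than details of the actual model, while the 128-byte chunk per 240 ns qCheck follows the figures on the slides.

```cpp
// Framework-neutral sketch of the PCI Bus module logic (illustrative names).
#include <cstdio>
#include <deque>
#include <vector>

struct RWM  { int devNum = 0; int transfer = 0; };  // data message: bytes left to move
struct Rqst { int devNum = 0; };                    // control message: bus-access request

constexpr int    kMaxDevices  = 4;      // illustrative device count
constexpr int    kChunkBytes  = 128;    // data moved per qCheck interval
constexpr double kChunkTimeNs = 240.0;  // ~128 bytes on a 64-bit / 66 MHz bus
constexpr double kRetryNs     = 60.0;   // illustrative PCI retry overhead

class PciBusModule {
public:
    // rqst from a bus master: grant at once if the device's slot is free,
    // otherwise park it until the in-flight transfer for that device finishes.
    void handleRqst(const Rqst& r) {
        if (!pciBus[r.devNum]) { pciBus[r.devNum] = true; returnRqst(r); }
        else                   { reqArray[r.devNum].push_back(r); }
    }

    // RWM from a granted master: transfer it now if the work area is free,
    // otherwise queue it behind the transfer already on the bus.
    void handleRWM(const RWM& m) {
        if (!busy) startTransfer(m);
        else       queue.push_back(m);
    }

    // qCheck self-message: one chunk has crossed the bus.
    void handleQCheck() {
        if (!busy) return;                        // defensive guard for this sketch
        work.transfer -= kChunkBytes;
        if (work.transfer <= 0) {                 // transfer complete
            sendRWM(work);
            if (!reqArray[work.devNum].empty()) {          // grant a waiting rqst
                returnRqst(reqArray[work.devNum].front());
                reqArray[work.devNum].pop_front();
            } else {
                pciBus[work.devNum] = false;               // release the reservation
            }
            if (queue.empty()) { busy = false; return; }   // "delete qCheck": bus idle
            startTransfer(popQueue());                     // next queued RWM
        } else if (!queue.empty()) {
            // Fairness: swap the in-flight RWM with a queued one, paying retry overhead.
            queue.push_back(work);
            startTransfer(popQueue(), /*retryOverhead=*/true);
        } else {
            scheduleQCheck(kChunkTimeNs);         // lone master keeps bursting
        }
    }

private:
    void startTransfer(RWM m, bool retryOverhead = false) {
        busy = true;
        work = m;
        if (work.transfer == 0) work.transfer = 1024;      // fresh RWM: full payload
        scheduleQCheck(kChunkTimeNs + (retryOverhead ? kRetryNs : 0.0));
    }
    RWM popQueue() { RWM m = queue.front(); queue.pop_front(); return m; }

    // Simulator plumbing, stubbed so the sketch stands alone.
    void returnRqst(const Rqst& r) { std::printf("grant bus to device %d\n", r.devNum); }
    void sendRWM(const RWM& m)     { std::printf("RWM from device %d transferred\n", m.devNum); }
    void scheduleQCheck(double ns) { std::printf("qCheck scheduled in %.0f ns\n", ns); }

    bool busy = false;
    RWM  work;
    std::deque<RWM> queue;                                   // pending RWMs
    std::vector<std::deque<Rqst>> reqArray =
        std::vector<std::deque<Rqst>>(kMaxDevices);          // pending rqsts per device
    std::vector<bool> pciBus = std::vector<bool>(kMaxDevices, false);  // utilization key
};

int main() {
    PciBusModule bus;
    bus.handleRqst({0});   // master 0 asks for the bus -> granted
    bus.handleRWM({0});    // its 1024-byte RWM starts transferring
    bus.handleRqst({1});   // master 1 asks -> granted (its slot is free)
    bus.handleRWM({1});    // queued; the two transfers will interleave
    for (int i = 0; i < 20; ++i) bus.handleQCheck();  // stand-in for the event scheduler
}
```

Driving it with two masters, as in main(), shows the behaviour the PCI Bus Challenges slide asks for: the two transfers alternate one chunk at a time, each swap paying the retry overhead, while a lone master simply keeps bursting.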
Determining Max Bandwidth
- Chart: RAID bandwidth comparison under a target bandwidth range of 90–130 MBps (achieved bandwidth in MBps vs. target bandwidth in MBps, one series per node/RAID controller, with the target bandwidth line for reference)

Simulation Ramp-up
- Chart: ramp-up calculation graph with moving window w = 10 (windowed bandwidth in MBps vs. simulated second, one series per node/RAID controller)
- (A sketch of the window-average and confidence-interval calculations appears after the final slide.)

105 Seconds @ 120 MBps – Results (bandwidth in MBps)
Node  RAID  Sample Mean  90% Confidence Interval
0     0     119.997      ±0.002
0     1     119.999      ±0.002
1     0     120.001      ±0.004
1     1     119.998      ±0.002
2     0     120.000      ±0.005
2     1     120.002      ±0.004
3     0     120.005      ±0.005
3     1     120.007      ±0.009
4     0     119.999      ±0.002
4     1     120.000      ±0.002
5     0     119.998      ±0.004
5     1     120.002      ±0.002
6     0     119.999      ±0.005
6     1     120.001      ±0.004

Contention / Utilization / Capacity
- Chart: utilization, contention, and capacity on each link of the data path from Node0 RAID0 to Node1 GigE0 (percent utilized/contended on the left axis, link capacity in MBps on the right axis); the hops run from the RAID through the StarGen 2010 and 1010 devices and the InfiniBand link to the GigE edge card

Learning Experiences
- PCI contention: the PCI bus was first modeled as a link like any other, maintained by the StarGen chip; a dedicated PCI Bus module proved necessary to simulate the bus architecture
- Buffer contention and access: originally modeled with retry loops, like the actual system, which cost far too much processing time; returning the request message once access is available is a much more natural design given the language of messages and connections

Conclusion / Future Work
- The simulation performed within 7% of actual system performance
- The PCI bus between the IB and StarGen devices is a potential hotspot
- Complete more iterations with minor system modifications (dualDMA, scheduling)
- A paper has been submitted to the Winter Simulation Conference
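For reference, the ramp-up graph (w = 10) and the 90% confidence intervals in the results table imply two small calculations: a moving-window average over the per-second bandwidth samples and a confidence half-width around the sample mean. The deck does not show how these were computed, so the sketch below is only one plausible way to do it, with made-up sample data, a centred Welch-style window, and the normal 90% critical value 1.645 standing in for whatever procedure was actually used.

```cpp
// Illustrative ramp-up smoothing and 90% confidence interval (assumed method).
#include <cmath>
#include <cstdio>
#include <vector>

// Moving-window average: y[i] is the mean of the 2w+1 samples centred on sample i
// (a Welch-style smoothing; the deck only states w = 10).
std::vector<double> windowAverage(const std::vector<double>& x, int w) {
    std::vector<double> y;
    for (int i = w; i + w < static_cast<int>(x.size()); ++i) {
        double sum = 0.0;
        for (int j = i - w; j <= i + w; ++j) sum += x[j];
        y.push_back(sum / (2 * w + 1));
    }
    return y;
}

// Sample mean and 90% confidence half-width, using the normal critical value 1.645.
void meanAndCI90(const std::vector<double>& x, double& mean, double& half) {
    double sum = 0.0;
    for (double v : x) sum += v;
    mean = sum / x.size();
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    double se = std::sqrt(ss / (x.size() - 1)) / std::sqrt(static_cast<double>(x.size()));
    half = 1.645 * se;
}

int main() {
    // Made-up per-second bandwidth samples (MBps) after discarding the ramp-up period.
    std::vector<double> bw;
    for (int t = 0; t < 105; ++t) bw.push_back(120.0 + 0.01 * std::sin(0.3 * t));

    std::vector<double> smooth = windowAverage(bw, 10);
    std::printf("first smoothed point: %.3f MBps\n", smooth.front());

    double mean = 0.0, half = 0.0;
    meanAndCI90(bw, mean, half);
    std::printf("sample mean %.3f MBps, 90%% CI +/- %.3f MBps\n", mean, half);
}
```

With real model output, the smoothed curve is what shows where steady state begins, and the half-width corresponds to the 90% Confidence Interval column in the results table.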