LHC-Optical Private Network at GridKa: 10Gbit LAN/WAN evaluations
Bruno Hoeft, Marc Garcia Marti, Forschungszentrum Karlsruhe, Germany

Abstract
Besides a brief overview of the GridKa private and public LAN, the integration into the LHC-OPN as well as the links to the Tier-2 sites is presented, covering both the physical network layout and the higher protocol layer implementation. Results of the feasibility discussion on dynamic routing for all connections of FZK, including the different link types used for the LHC network (lightpath, MPLS tunnels, routed IP), are part of the presentation. An evaluation shows the quality and quantity of the current 10GE link from GridKa to CERN, which traverses a multi-NREN backbone structure via an MPLS tunnel. The equipment of the first 10GE test setup is based on IBM/Intel hardware at GridKa and HP/Chelsio at CERN. A study of the capabilities of the nodes at GridKa is presented, revealing their limitations, and the benefit of TCP offload engines (TOE) is discussed.

A brief introduction to the network installation of GridKa
The description starts at the core and moves to the edge of the GridKa local area network installation. The address space of the inner core network is taken from the private IP address range. There are two reasons for this approach. The first is security: communication initiated by the worker nodes inside the private network towards the public address space is realised via NAT (network address translation), while communication in the other direction, from the outside towards the worker nodes, is not possible. This arrangement provides controlled access to the worker nodes. The second reason is cost, since private addresses are free of charge.
The GridKa core network is organised around one backbone to which the worker nodes are aggregated through switches. 36 dual-processor worker nodes per rack are aggregated by a switch and connected to the backbone via trunked Ethernet channels (2 x 1GE uplinks). Each rack forms its own VLAN, which at least keeps broadcast/multicast messages away from the backbone. In contrast to the worker nodes, each internal fileserver is connected directly to the backbone in order to allow full utilization of its 1GE interface.

Figure 1: GridKa-LAN. The private core network (compute nodes behind rack switches, internal fileservers, NAS, SAN, dCache pools and head node, login servers for ALICE/Atlas/CMS/LHCb and Babar/CDF/DZero/Compass, LCG service nodes) and the external edge network behind the GridKa and FZK internet routers and the firewall.

Some central service nodes, such as the dCache head node, are connected to the core network only, but the majority have one interface in the core and one in the edge network (LCG [ce, se, ui, bdii, mon, rb, ...], VO boxes, ...). The dCache fileservers can be grouped into two classes: core dCache and edge dCache. The core dCache servers are connected to the internal network and, through the FC (fibre channel) network, to the storage (SAN, GPFS filesystem). The edge dCache servers have an additional link to the external network. Policy based routing (PBR) was set up to bypass the firewall, offering fast external data access to the WLCG sites within the LHC-OPN as well as to connected Tier-2 centres. This bypass overcomes the firewall limitation of 320 Mbit/s (or 2.5 Gbit/s after the planned upgrade).
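To make the firewall-bypass idea more concrete, the following minimal sketch shows how such a source-based policy routing rule could be expressed on a Linux host with iproute2. It is illustrative only: the interface name, addresses and routing table number are hypothetical and not taken from the GridKa setup.

# Illustrative sketch only (not the GridKa configuration): a policy based
# routing (PBR) firewall bypass on a Linux host using iproute2.
# Interface name, addresses and table number are hypothetical.
import subprocess

BYPASS_TABLE = "110"               # hypothetical routing table id
EDGE_IF = "eth1"                   # hypothetical interface on the LHC-OPN link
EDGE_GW = "192.0.2.1"              # hypothetical next hop behind the bypass
EDGE_POOL_NET = "192.0.2.128/25"   # hypothetical edge dCache address range

def ip(*args):
    """Run one iproute2 command and echo it."""
    cmd = ["ip", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Traffic sourced from the edge dCache pools is resolved in a dedicated
# routing table whose default route points at the firewall bypass; all
# other traffic keeps using the normal (firewalled) default route.
ip("route", "add", "default", "via", EDGE_GW, "dev", EDGE_IF, "table", BYPASS_TABLE)
ip("rule", "add", "from", EDGE_POOL_NET, "lookup", BYPASS_TABLE)

Whether such rules live on the routers or on the edge dCache hosts themselves is a site implementation detail; the sketch only illustrates the source-based route selection that a PBR bypass relies on.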
Besides this production network, a completely independent network is operated for administration purposes. Through this network the status of each server is collected. In case of an incident the administrators are notified (mail/SMS) and, on the server side, automatic actions are taken to resolve the problem.

Figure 2: GridKa-WAN (lightpath). Planned WAN connectivity of the FZK and GridKa internet routers via the DWDM equipment of the DFN X-WiN: the T0 link to CERN and the T0' backup, T1 lightpaths to SARA and CNAF, 1/10 Gbit links (2006-2008) to Tier-2 sites (DESY, GSI, FZU Prague, PSNC Poland and German university sites such as RWTH Aachen, Dresden, Freiburg, MPI and Uni München, Wuppertal) and HEP links towards FNAL and SLAC; the CERN path is annotated with "2007 10Gbit, 240 MByte netto".

LHC-OPN network part at GridKa
On one side the edge router is connected to the internet backbone of DFN, on the other side to the LHC-OPN. The LHC-OPN is a layer-2 tunnel network spanning all Tier-1 centres within the LHC project and the Tier-0 centre at CERN. At GridKa the LHC-OPN will comprise, besides the T0 lightpath between GridKa and CERN, additional "point to point" layer-2 lightpaths directly connecting GridKa to two Tier-1 sites (SARA and CNAF). These lightpaths will be established during 2006. The T0', the backup link between GridKa and CERN, will be added in 2007. For each link to one of these sites, a 10GE LAN-PHY port of the edge router at GridKa is connected to a port of matching capacity on the DWDM equipment of DFN. 1/10GE "point to point" lightpaths to Tier-2 centres (DESY, FZU (Prague), Poznan (Poland)) are planned.

WAN Routing
The routing of the two edge routers of Forschungszentrum Karlsruhe is already BGP based. The edge router of GridKa will join the autonomous system of FZK. The communication between the edge routers of Forschungszentrum Karlsruhe will be iBGP based, while the communication with the edge routers of other autonomous systems will be eBGP based. This allows the routing to be altered dynamically in case one link is not available.
The evaluation of the 10GE link between GridKa and CERN showed a usable bandwidth of 7 Gbit/s [3]. A higher throughput was prevented by packet losses. This evaluation took place on a 10GE shared multi-NREN internet link (DFN/GEANT/Switch; see [2] for the GEANT topology), shown in Fig. 3.

Fig. 3: 10GE WAN network layout. The path from the GridKa r-grid2 router (Cisco Catalyst 6509) with its attached worker nodes (WN01 ... WN09) and fileservers (FS5-101 ... FS10-136) via DFN Karlsruhe (ar-karlsruhe1) and DFN Frankfurt (cr-frankfurt1) over the GEANT routers in Frankfurt, Paris and Geneva to the cernh7-openlab router and the oplapro71 ... oplapro80 nodes at CERN.

All injected packets had to be marked as LBE (Less than Best Effort). This meant that our packets were the first to be dropped in case of congestion or when an interface queue at a router ran out of space. With a different type of service (BE, best effort) we could have reached a higher throughput, but we could not convince the NRENs to test this at that time. The evaluation over the lightpath connecting GridKa with CERN is still outstanding and will be carried out as soon as we have first light; at that time the eBGP setup for the LHC-OPN will be implemented as well.
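The LBE marking mentioned above amounts to setting a non-default DSCP on the test traffic. The sketch below shows how this could be done for a single TCP sender on Linux; the codepoint used (DSCP CS1, TOS byte 0x20) is the commonly used scavenger/LBE value and the endpoint name is hypothetical, i.e. neither is taken from the actual measurement setup.

# Minimal sketch: mark outgoing TCP test traffic as "Less than Best Effort".
# Assumptions: Linux, IPv4, DSCP CS1 (TOS byte 0x20) as the LBE codepoint.
import socket

LBE_TOS = 0x20   # DSCP CS1 shifted into the TOS byte (assumed LBE marking)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, LBE_TOS)

# Every segment of this connection now carries the LBE marking and is the
# first to be dropped when a router queue along the path congests.
sock.connect(("iperf-server.example.org", 5001))   # hypothetical test endpoint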
1GE versus 10GE
The present study, 1GE versus 10GE, aims at picturing the state of the art of 10Gbps Ethernet technology and is meant, together with a financial analysis, to help decide whether to upgrade to 10Gbps in order to fulfil the extensive requirements of Forschungszentrum Karlsruhe, the German Tier-1 centre for the LHC experiments at CERN. The evaluation aimed at working out the pros and cons of migrating the existing 1Gbps infrastructure to 10Gbps Ethernet as an intermediate step towards a fully supported 10Gbps environment. Such an environment would keep up with newer LHC requirements as well as with other experiments in the future.
This document complements the "Experimental 10Gbps technology results at Forschungszentrum Karlsruhe" [1] in picturing the current state of the art by adding results gathered with different equipment. Where the former used an IBM Xeon and Intel PRO/10GbE LR combination, here we report on an HP Itanium with a Chelsio T210.
Despite the intensive requirements inherent to the TCP protocol [1], the first steps with 10Gbps technology already reached "high throughput", albeit by playing with parameters that, although permissible, are quite impracticable in production. Parameters like interrupt coalescence or the maximum transfer unit [1] had to be tuned to send beyond a few Gigabits per second, with both end systems completely overwhelmed by the huge computing demands of the network subsystem [1]. In our first 10Gbps testbed this "high throughput" was actually around one fourth of a 10Gbps full-duplex communication, with both systems working at 99% of their capacity; an Intel Xeon based 10Gbps system therefore did not seem to be a suitable approach for a migration.
On the one hand, the use of jumbo frames was not desired because of the resulting heterogeneous, incompatible environment, besides their impact on WAN performance in lossy networks, which still has to be studied: at the same loss probability, larger frames lead to worse performance. On the other hand, the extreme CPU usage described in [1], and also demonstrated on this poster, was not desired either. A system totally overwhelmed with I/O or context switches due to networking tasks cannot cope with CPU-demanding applications, nor will it keep up with future requirements.

Chart: iperf throughput with standard MTU, 8180 Byte MTU and 9216 Byte MTU. PCI-X bus: theoretical limit approx. 8.5 Gbps; tested maximum rate 7.5 Gbps (Linux kernel UDP packet generator).

The HP Itanium with Chelsio T210 approach, as this poster demonstrates, seems much more attractive and effective for the Grid world. The benefit of this hardware combination can be summarized as "throughput close to the hardware limit with surprisingly low CPU usage". Indeed, the iperf TCP goodput reached slightly over 7 Gbps, completely independent of the MTU setting and with a CPU usage of around 20%. The raw hardware capabilities were tested with the Linux kernel packet generator, showing a throughput slightly below 7.5 Gbps, whereas the theoretical hardware limit of the PCI-X bus is around 8.5 Gbps (64 bit times 133 MHz). It is not entirely fair to compare the results already presented in [1] directly with the ones reported in this document: it would not make much sense to state that Itanium is faster than Xeon, because that is public knowledge.
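To illustrate why the MTU and interrupt coalescence settings discussed above matter so much at these rates, the following back-of-the-envelope sketch estimates the packet rate an end system has to handle at 10 Gbps for the three MTU values in the chart, and the interrupt rate that remains once several packets are coalesced per interrupt. The coalescing factor is an assumed example value, not a measured one.

# Rough illustration (assumed numbers, not measurements): packet and interrupt
# rates an end system faces at 10 Gbps for different MTU sizes.
def packets_per_second(rate_bit_s, mtu_bytes):
    # Ignores framing overhead; good enough for an order-of-magnitude estimate.
    return rate_bit_s / (mtu_bytes * 8)

RATE = 10e9        # 10 Gbps line rate
COALESCE = 32      # assumed packets handled per interrupt

for mtu in (1500, 8180, 9216):   # standard MTU and the two jumbo settings tested
    pps = packets_per_second(RATE, mtu)
    print(f"MTU {mtu:>4}: {pps / 1e6:4.2f} Mpps, "
          f"~{pps / COALESCE / 1e3:5.1f}k interrupts/s with coalescing")

With a standard 1500 byte MTU the host has to process roughly six times as many packets per second as with a 9216 byte MTU, which is where jumbo frames, interrupt coalescence and TCP offload engines save CPU cycles.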
Some tests have been carried out with the same Xeon systems presented in [1] equipped with these Chelsio cards: the throughput increased to around 6.5 Gbps, but the CPU usage rose to 99% as well. This combination therefore, although increasing the overall performance considerably, would still not do well enough to be worth the investment.
Another aspect interesting for Grid computing is latency. While a more in-depth study is required, the preliminary results gathered so far suggest that the new Chelsio hardware shows a slightly higher latency than the Intel PRO/10GbE LR. Since this was not the aim of this study, a latency comparison is proposed as future work in order to reach a complete conclusion.
Our environment unfortunately has no Opteron equipment, but it would be very interesting to compare all three technologies face to face in an even wider test and then take a decision reflecting both performance and cost. It may also be worth finding a trade-off between 32 bit and 64 bit architectures. A comparison of the bus architectures PCI-X (2.0) and PCI-Express should be included as well.

References:
[1] GEANT2 PERTKB, http://kb.pert.switch.ch/cgi-bin/twiki/view/PERTKB/TestsTenGbps
[2] GEANT Topology, http://www.geant.net/upload/pdf/Topology_Oct_2004.pdf
[3] The Obstacles of a 10Gbit Connection between GridKa and Openlab, slide 21 (7 Gbit/s via multi NREN), http://www2.twgrid.org/event/isgc2005/PPT/04272005/0427%20Panel%20Discussion%20Infrastructure%20Interoperation/04_GridKa-isgc2005.ppt