LHC OPTICAL PRIVATE NETWORK (LHC-OPN) AT GRIDKA
10Gbit LAN/WAN evaluations
Bruno Hoeft, Marc Garcia Marti, Forschungszentrum Karlsruhe, Germany
Abstract

Besides a brief overview of the GridKa private and public LAN, the integration into the LHC-OPN as well as the links to the T2 sites will be presented, both in terms of the physical network layout and of the higher protocol layers. Results of the feasibility discussion of dynamic routing for all connections of FZK, covering all LHC-network link types (lightpath, MPLS tunnels, routed IP), will be part of the presentation. An evaluation will show the quality and quantity of the current 10GE link from GridKa to CERN, which traverses a multi-NREN backbone via an MPLS tunnel. The equipment of the first 10GE test setup is based on IBM/Intel hardware at GridKa and HP/Chelsio at CERN. A study of the capabilities of the nodes at GridKa will be presented, revealing their limitations, and the benefit of TOE (TCP offload engines) will be discussed.

A brief introduction to the network installation of GridKa

The introduction starts from the core and works towards the edge of the GridKa local area network installation. The address space of the inner core network is taken out of the private IP address range. There are two reasons for this approach. The first is security: communication initiated by the worker nodes inside the private network towards the public address space is realised via NAT (network address translation), whereas communication in the other direction, from the outside towards the worker nodes, is not possible. This address arrangement provides controlled access to the worker nodes. The second reason is cost, since private addresses are free of charge.

The GridKa core network is organised around one backbone to which the worker nodes are aggregated through switches. 36 dual-processor worker nodes per rack are aggregated by a switch whose trunked Ethernet-channel uplink (2 x 1GE) connects to the backbone. Each rack forms a separate VLAN, which keeps at least the broadcast/multicast messages off the backbone. Unlike the worker nodes, each internal fileserver is connected directly to the backbone in order to allow full utilisation of its 1GE interface through the backbone.

Figure 1: GridKa-LAN (FZK and GridKa internet routers, firewall, core backbone with fileservers, NAS and SAN storage, compute-node racks, dCache pools and head node, login servers for ALICE/Atlas/CMS/LHCb and Babar/CDF/DZero/Compass, and LCG service nodes [se/ce/ui, RGMA/MON/BDII, RB/LFC/DC-SRM])
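As a small illustration of the address split described above, a minimal sketch using Python's standard ipaddress module; the first three addresses are simply the generic RFC 1918 example ranges, not GridKa's actual assignment, while 192.108.46.1 is the public GridKa router address that appears later in the WAN layout.

# Illustrative only: private (behind NAT) versus public address space.
import ipaddress

for addr in ("10.1.2.3", "172.16.5.9", "192.168.100.42", "192.108.46.1"):
    ip = ipaddress.ip_address(addr)
    side = "private (NAT, worker nodes)" if ip.is_private else "public (routable)"
    print(f"{addr:15s} -> {side}")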
Some central service nodes, like the dCache head node, are connected to the core network only, but the majority are connected with one interface to the core and one interface to the edge network (LCG [ce, se, ui, bdii, mon, rb, …], VO boxes, …). The dCache fileservers can be grouped into two classes:
- core dCache and
- edge dCache.
The core dCache is connected to the internal network and, through the FC (fibre channel) network, to the storage (SAN, GPFS filesystem). The edge dCache has an additional link to the external network. PBR (policy-based routing) was set up to provide a firewall bypass, which offers fast external data access to the WLCG sites within the LHC-OPN as well as to the connected Tier-2 centres. This bypass overcomes the firewall limitation of 320 Mbit/s (or 2.5 Gbit/s after the planned upgrade).
Besides this production network, a completely independent network is built for administration purposes. Through this network the status of each server is collected. In case of an incident the administrators are informed (mail/SMS) and automatic actions are taken on the server side to resolve the problem.
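The administration network tooling itself is not described further in this paper; purely as an illustration of the collect-status, alert and auto-repair pattern, here is a minimal sketch with hypothetical host names, service and repair command, not GridKa's actual monitoring system.

# Illustrative sketch only: check a service, mail the admins, try an automatic fix.
import smtplib, socket, subprocess
from email.message import EmailMessage

ADMIN = "admin@example.org"                 # hypothetical contact address

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def notify(subject, body):
    """Mail the administrators (an SMS gateway would be addressed the same way)."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = "monitor@example.org", ADMIN, subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

def check_and_repair(host, service, port):
    """Collect status; on failure alert the admins and attempt an automatic action."""
    if port_open(host, port):
        return
    notify(f"{service} down on {host}", f"TCP port {port} unreachable.")
    # hypothetical automatic action on the server side
    subprocess.run(["ssh", host, "service", service, "restart"], check=False)

if __name__ == "__main__":
    check_and_repair("fileserver01.example.org", "nfs", 2049)  # made-up node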
Figure 2: GridKa-WAN (lightpath). The GridKa and FZK internet routers connect via 10GE to the DWDM equipment of the DFN X-WiN; shown are the links to the T0 at CERN and its T0' backup, to the T1 centres SARA and CNAF, to T2 centres (GSI, FZU Prague, DESY, PSNC Poland, RWTH Aachen, Uni Dresden, Uni Freiburg, MPI and Uni München, Uni Wuppertal) and to the HEP sites FNAL and SLAC, with planned capacities of 1/10 Gbit during 2006-2008 (10 Gbit, approx. 240 MByte netto, in 2007).
LHC-OPN network part at GridKa
On one side the edge router is connected to the internet backbone of DFN, and on the other side to the LHC-OPN. The LHC-OPN is a spanned layer-2 tunnel network between all Tier-1 centres within the LHC project and the Tier-0 centre at CERN. At GridKa this LHC-OPN will be equipped, besides the T0 lightpath between GridKa and CERN, with additional lightpath "point to point" layer-2 links directly connecting GridKa to two Tier-1 sites (SARA and CNAF). These lightpaths will be established during 2006. The T0', the backup link between GridKa and CERN, will be added in 2007. For each link to one of these sites, a 10GE LAN-PHY port of the edge router at GridKa is connected to a port of matching capacity on the DWDM equipment of DFN. 1/10GE lightpath "point to point" links to Tier-2 centres (DESY, FZU Prague, Poznan (Poland)) are planned.
WAN Routing
The routing of the two edge routers of Forschungszentrum Karlsruhe is already BGP based. The edge router of GridKa will join the autonomous system of FZK. The communication between the edge routers of Forschungszentrum Karlsruhe will be iBGP based, while the communication with the edge routers of other autonomous systems will be eBGP based. This allows the routing to be altered dynamically in case one link becomes unavailable.
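The actual policy lives in the router configurations; purely as a conceptual sketch of the resulting behaviour, with made-up next hops and preferences rather than FZK's real BGP policy, the dynamic failover amounts to selecting the best still-available path per prefix.

# Toy model of BGP-style best-path selection, for illustration only.
from dataclasses import dataclass

@dataclass
class Path:
    next_hop: str      # neighbouring edge router
    local_pref: int    # higher is preferred
    up: bool           # link state

def best_path(paths):
    """Return the preferred usable path, or None if every link is down."""
    usable = [p for p in paths if p.up]
    return max(usable, key=lambda p: p.local_pref) if usable else None

# Hypothetical example: prefer the direct LHC-OPN lightpath, fall back to the
# routed internet path via DFN when the lightpath is unavailable.
paths_to_cern = [
    Path(next_hop="lhcopn-lightpath", local_pref=200, up=False),
    Path(next_hop="dfn-internet",     local_pref=100, up=True),
]
print(best_path(paths_to_cern).next_hop)   # -> dfn-internet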
The evaluation of the 10GE link between GridKa and CERN showed a usable bandwidth of 7 Gbit/s [3]. A higher throughput was prevented by packet losses. This evaluation took place on a 10GE shared multi-NREN internet link (DFN/GEANT/Switch, [2] GEANT Topology), shown in Fig. 3.

Figure 3: 10GE WAN network layout. Worker nodes (WN01 … WN09) and fileservers (FS5-101 … FS5-107, FS10-107 … FS10-108, FS10-135 … FS10-136) sit behind the GridKa router r-grid2 (a Cisco Catalyst 6509), the openlab nodes oplapro71 … oplapro80 behind cernh7-openlab at CERN. The routers along the path are:

Router                              IP address     Network   Location
R-Grid2                             192.108.46.1   GridKa    Karlsruhe
ar-karlsruhe1-ge0-0.g-win.dfn.de    188.1.38.209   DFN       Karlsruhe
cr-frankfurt1-po11-0.g-win.dfn.de   62.40.105.2    DFN       Frankfurt
dfn.de1.de.geant.net                62.40.105.1    GEANT     Frankfurt
de.fr1.fr.geant.net                 62.40.96.50    GEANT     Paris
fr.ch1.ch.geant.net                 62.40.96.29    GEANT     Geneva
cernh7-openlab                      192.16.160.1   CERN      Geneva
All injected packets had to be marked as LBE (Less than Best Effort). This meant that our packets were the first to be dropped in case of congestion, or whenever an interface queue at a router ran out of space. With a different type of service (BE, best effort) we could have reached a higher throughput, but we could not convince the NRENs to test this at that time.
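For illustration, a minimal sketch of how a sending application can set such a marking on Linux; the mapping of LBE to DSCP CS1 (decimal 8) is an assumption here, the code point actually agreed with the NRENs is not specified in this paper.

# Minimal sketch: marking outgoing traffic with a DSCP value on Linux.
import socket

DSCP_CS1 = 8                  # commonly used scavenger/LBE class (assumption)
TOS_VALUE = DSCP_CS1 << 2     # DSCP occupies the upper 6 bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # -> 32 (0x20)
# The socket would then be connected and used as usual; routers along the path
# may drop this traffic first when queues fill up, as described above.
sock.close()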
The evaluation over the lightpath connecting GridKa with CERN is still outstanding and will be carried out as soon as we have first light. At that time the eBGP setup for the LHC-OPN will be implemented as well.
1GE versus 10GE
The present study, 1GE versus 10GE, aims at picturing the state of the art of 10Gbps Ethernet technology and is meant, together with a financial analysis, to help in deciding whether to upgrade to 10Gbps in order to fulfil the extensive requirements of Forschungszentrum Karlsruhe, the German Tier-1 centre for the LHC experiments at CERN.
This evaluation aimed at establishing the pros and cons of migrating the existing 1Gbps infrastructure to 10Gbps Ethernet technology as an intermediate step towards a fully supported 10Gbps environment. A 10Gbps environment would keep up with newer LHC requirements as well as with other experiments in the future.
This document is a complement to "Experimental 10Gbps technology results at Forschungszentrum Karlsruhe" [1], picturing the current state of the art by adding results gathered with different equipment. While the earlier work was carried out with an IBM Xeon and Intel PRO/10GbE LR combination, here we report on an HP Itanium with a Chelsio T210.
Despite the intensive requirements inherent to the TCP protocol [1], the first steps with the 10Gbps technology reached "high throughput" only by playing with parameters that, although permissible, are actually quite impracticable. Parameters like interrupt coalescence or the maximum transfer unit [1] had to be tuned to send beyond a few Gigabits per second, with both end systems completely overwhelmed by the huge computing demands of the network subsystem [1]. In our first 10Gbps testbed this "high throughput" was actually around one fourth of a 10Gbps full-duplex communication, with both systems working at 99% of their capacity; a Xeon/Intel 10Gbps system therefore did not seem to be a suitable approach for a migration.
On the one hand, the use of jumbo frames was not desired because of the resulting heterogeneous/incompatible environment, besides their impact on WAN performance in lossy networks, which still has to be studied; the same loss probability would lead to a worse performance.
On the other hand, the extreme cpu usage described in [1], and also demonstrated on this poster, was not desired either. A system totally overwhelmed by IO handling or context switches due to networking tasks would not be able to cope with cpu-demanding applications, nor would it keep up with future requirements.
Figure 4: Throughput with standard, 8180-byte and 9216-byte MTU settings (PCI-X bus: theoretical limit approx. 8.5 Gbps; tested max. rate 7.5 Gbps with the Linux kernel UDP packet generator).
The HP Itanium and Chelsio T210 approach, as this poster demonstrates, seems much more attractive and effective for the Grid world. The benefits of this hardware combination can be summarised as "throughput close to the hardware limit with surprisingly low cpu usage". Indeed, the iperf TCP goodput got slightly above 7Gbps, completely independent of the MTU setting and with a cpu usage of around 20%. The real hardware capabilities were tested with the Linux kernel packet generator, showing a throughput slightly below 7.5Gbps, whereas the theoretical hardware limit is around 8.5Gbps (64 bit times 133 MHz).
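The quoted limits are easy to verify with back-of-the-envelope arithmetic, taking only the raw PCI-X bus rate into account and ignoring bus protocol overhead.

# Back-of-the-envelope check of the numbers quoted above (raw bus rate only).
bus_width_bits = 64
bus_clock_hz = 133e6                                  # PCI-X at 133 MHz
raw_limit_gbps = bus_width_bits * bus_clock_hz / 1e9
print(f"PCI-X raw limit : {raw_limit_gbps:.2f} Gbit/s")                       # ~8.51
print(f"pktgen measured : 7.5 Gbit/s = {7.5 / raw_limit_gbps:.0%} of the bus limit")
print(f"iperf goodput   : 7.0 Gbit/s = {7.0 / raw_limit_gbps:.0%} of the bus limit")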
It is quite unfair to compare the results already presented in [1] with the ones reported in this document; it would not make much sense to state that Itanium is faster than Xeon, because that is public knowledge. Some tests have been carried out with the same Xeon systems presented in [1] equipped with these Chelsio cards: the throughput increased to around 6.5Gbps, but the cpu usage rose to 99% as well. This combination, therefore, although improving the overall performance considerably, would still not perform well enough to be worth the investment.
One other aspect of interest to Grid computing is latency. While a more in-depth study is required, the preliminary results gathered so far suggest that this new Chelsio hardware presents a slightly higher latency than the Intel PRO/10GbE LR. Since this was not the aim of this study, a latency comparison is proposed as future work in order to reach a complete conclusion.
Unfortunately, our environment does not include Opteron equipment, but it would be very interesting to compare all three technologies face to face in an even wider test and then make a decision that reflects both performance and cost. It may be worth finding a trade-off between 32- and 64-bit architectures. A comparison of the bus architectures PCI-X (2.0) and PCI-Express should be included as well.
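For orientation in such a comparison, the raw per-direction bandwidths of the candidate buses can be put side by side; these are the standard specification figures, not measurements from this study, and protocol overhead is ignored.

# Raw per-direction bus bandwidths for orientation only (specification values).
buses_gbps = {
    "PCI-X 1.0 (64 bit, 133 MHz)":    64 * 133e6 / 1e9,            # ~8.5
    "PCI-X 2.0 (64 bit, 266 MHz)":    64 * 266e6 / 1e9,            # ~17.0
    "PCIe 1.x x8 (2.5 GT/s, 8b/10b)": 8 * 2.5e9 * 8 / 10 / 1e9,    # 16.0
}
for bus, gbps in buses_gbps.items():
    print(f"{bus:32s} {gbps:5.1f} Gbit/s")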
References:
[1] GEANT2 PERT KB, http://kb.pert.switch.ch/cgi-bin/twiki/view/PERTKB/TestsTenGbps
[2] GEANT Topology, http://www.geant.net/upload/pdf/Topology_Oct_2004.pdf
[3] The Obstacles of a 10Gbit Connection between GridKa and Openlab, slide 21 ("7 Gbit/s via Multi NREN"), http://www2.twgrid.org/event/isgc2005/PPT/04272005/0427%20Panel%20Discussion%20Infrastructure%20Interoperation/04_GridKa-isgc2005.ppt