Israel ATLAS Tier-2
Status
April 2011
Lorne Levinson
NL Cloud Meeting, 5 April 2011
Israel ATLAS Tier2 Status
Israel HEP community
• ATLAS is the only LHC experiment in which we participate
– also Phenix (Heavy Ion @BNL), ILC, ZEUS
– Israel is “1.35% of ATLAS” (MoU pledge, authors, common fund)
– 25-30 people doing physics analysis
• 3 sites:
– Tel Aviv University, Tel Aviv (1956)
• a university
– The Technion Israel Institute of Technology, Haifa (1924)
• a university
– Weizmann Institute of Science, Rehovot (1934)
• a research institute (Biology, Chemistry, Physics, Math & CS)
with graduate school (no undergrads)
• longest travel is Weizmann ↔ Technion: 2 hours office-to-office
Organization
• we are a distributed Tier2/Tier3
• each site combines Tier2 and Tier3 resources in the same cluster
– all resources shared flexibly between T2 and T3 (Lustre/Storm)
• single management and budget, single purchasing
• three sites as identical as possible
• Steering Committee for overall policy
• Management & Operations team for the three sites
• stable funding approved until 2012
Storage
Continues to be the biggest reliability issue.
• Our hardware is now stable:
– replaced DDN 6620’s with DDN 9900
• Fully redundant, 300 disk slots, 8 x 8Gb/s FC ports → 5GB/s
– two Lustre “OSS” servers
– WI servers have 10Gb/s to the cluster;
TAU and Technion will install 10G in April
• Gave up on Thumpers+Lustre and Thumpers+iSCSI+Lustre.
– We NFS mount Thumpers with Solaris+ZFS for extra "archive"
storage, home directories or /opt/exp_soft
• Lustre + Storm
– problem: the Storm team does not test new Storm releases on Lustre
– Storm-Lustre community must solve this
Storm/Lustre
• Storm allows LCG SRM storage and our local global file name
space to share the same physical storage.
– No rigid boundary
– Jobs in cluster can do Linux file io to read SRM files
• Storm can run over Lustre (open source) or GPFS (IBM)
• Lustre:
– Object Storage Targets serve (stripes of) file data
– Meta-Data Server holds directories
• redundant failover of MDS’s will soon be supported
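How OSTs serve stripes of a file can be sketched roughly as follows. Lustre spreads file data round-robin (RAID-0 style) over the OSTs assigned to a file. The function name, the 1 MiB stripe size and the stripe count of 4 are illustrative assumptions, not settings taken from our clusters.

```python
def locate_stripe(offset, stripe_size=1 << 20, stripe_count=4):
    """Return (ost_index, offset_in_object) for a byte offset in a file.

    Assumes plain round-robin striping over `stripe_count` OSTs.
    ost_index is relative to the file's own list of OSTs.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    stripe_number = offset // stripe_size      # which stripe of the file
    ost_index = stripe_number % stripe_count   # OST holding that stripe
    # Each OST object stores every stripe_count-th stripe back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return ost_index, object_offset


# Example offsets: start of file, start of second stripe, inside sixth stripe.
for off in (0, 1 << 20, 5 * (1 << 20) + 10):
    print(off, locate_stripe(off))
```

With these assumed values, consecutive 1 MiB chunks of a file land on consecutive OSTs, so a large sequential read is served by several OSTs in parallel.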
Storage – installed SRM + local capacity (Net TB)

                     TAU   Technion   Weizmann   Total
2010                 240      192        288       720
2011 purchase         96      144        144       384
Total 2011           336      336        432      1104
Heavy Ion 3Q2011                                   +48
Total with Heavy Ion                              1152
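As a cross-check of the capacity figures, the per-site and overall totals can be recomputed from the 2010 and 2011-purchase numbers. This is only a small arithmetic sketch; the variable names are ours and the values are copied from the slide.

```python
# Net TB per site, copied from the storage slide.
capacity_2010 = {"TAU": 240, "Technion": 192, "Weizmann": 288}
purchase_2011 = {"TAU": 96, "Technion": 144, "Weizmann": 144}
heavy_ion_3q2011 = 48  # extra Heavy Ion capacity expected in 3Q2011

# Per-site capacity after the 2011 purchase.
total_2011 = {site: capacity_2010[site] + purchase_2011[site]
              for site in capacity_2010}

grand_total_2011 = sum(total_2011.values())
with_heavy_ion = grand_total_2011 + heavy_ion_3q2011

print(total_2011)
print(grand_total_2011, with_heavy_ion)
```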
Group disks
• We are hosting four ATLASGROUPDISK areas
– Muon performance (Technion)
– Top (Weizmann)
– Heavy Ion (Weizmann)
– Standard Model (TAU) (empty)
CPU
• Last purchase was dual Intel E5520 quad core
• May delivery purchase is dual Intel X5650 hex-core
– again 4 motherboards per 2U box with redundant power supply
cores      Tel Aviv   Technion   Weizmann   Total
Now           192        272        448       944
May           336        464        640      1440
We benefit a lot from other groups placing some of their cores in our clusters:
* Weizmann: ATLAS+Phenix/Heavy-Ion, HEP Theory, Condensed matter
* Technion: HEP Theory and Bio-informatics
* TAU: HEP Theory
Service nodes
Virtualize most services
• Two 8-core servers, 48GB
• Failover
• Easier management
– VM images
– Roll-back
– Image sharing
– Easier testing: temp machines
• May delivery of HW
• Deciding among: VMware, Xen, Citrix, KVM
• SE not included

Service                                            Where
gLite CE                                           per site
gLite site-BDII                                    per site
gLite MON                                          per site
gLite APEL                                         per site
ELOG electronic log book                           WI
Zenoss fabric monitoring                           per site
LDAP, DNS, DHCP, syslog                            per site
Frontier DB cache                                  per site
VOMS (for Israel)                                  TAU
gLite WMS, LB (for Israel)                         WI
gLite myproxy (for Israel)                         WI
gLite Top-BDII (for Israel)                        WI
gLite NAGIOS for Israel grid service monitoring    Tech
Mantis issue tracker                               WI
Managers’ Wiki pages                               Tech
Networking
Our networking is not good
• Geant connection is 2 x 1.5G (subscribed on 2 x 2.5G infrastructure)
• “Political” limits: TAU 500M, Technion 350M, WI 400M
– Because a 1G line is shared with institute traffic and the shared
router is not really able to do 1G duplex
• We suspect that the gross mismatch with SARA/NIKHEF’s 10G
causes failed connections due to dropped packets.
– Lowering the # of files & streams to avoid dropped packets leaves
us with even worse net BW
• Expensive because it is an undersea fiber and one (Italian) company
owns the fibers.
– An Israeli competitor is installing another fiber now
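To give a feel for what the per-site limits mean in practice, the sketch below estimates how long it takes to move 1 TB at each cap, compared with a 10G link. It assumes the full cap is sustained for the whole transfer, ignores protocol overhead, and uses decimal terabytes; the function name is ours.

```python
def transfer_hours(terabytes, megabits_per_second):
    """Hours needed to move `terabytes` (decimal TB) at a sustained rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


# Per-site limits quoted on the slide, in Mb/s.
limits_mbps = {"TAU": 500, "Technion": 350, "WI": 400}

for site, rate in limits_mbps.items():
    print(f"{site}: {transfer_hours(1, rate):.1f} h per TB")
print(f"10G reference: {transfer_hours(1, 10_000):.2f} h per TB")
```

Real transfers will be slower than these figures, since the limits are shared with institute traffic and dropped packets reduce the usable rate further.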
[Figure slide: Networking]
[Figure slide: GEANT]
Networking plans
May 2011(?):
• Increase international connection: from 3Gb/s to 4Gb/s.
– 5G might be possible later this year, but not budgeted.
• Replace old routers at entrances to institutes with 10G capable
equipment.
– This should increase our throughput and reliability, and allow us to
actually use a major share of the 1G bandwidth to the sites
• Negotiating 10G academic backbone
• Could have 10G to Geant in spring 2012
SAM/NAGIOS
• Our NGI did not take on the SAM/NAGIOS monitoring responsibility
• After the new NAGIOS tests replaced SAM tests, we received no
alerts on failed tests.
• This was a severe problem
• Finally, in December, EGI, our NGI and we agreed that we would
deploy a NAGIOS test service for Israel until our NGI is able to
do so.
– The only functioning grid sites in Israel are our 3 ATLAS sites
• Our NAGIOS service was up and running in January.
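For readers unfamiliar with how a NAGIOS test reports a failure, here is a minimal sketch of a plugin-style check. It follows the standard NAGIOS plugin convention (one status line plus exit code 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The TCP probe itself, the function names and the script name are illustrative assumptions; this is not one of the grid probes our service actually runs.

```python
import socket

# Exit codes defined by the NAGIOS plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Probe a TCP port and return (exit_code, one_line_status)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:  # refused, unreachable, timed out, DNS failure
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """Plugin entry point: argv is [program, host, port]."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_tcp.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code
```

A wrapper script would finish with `sys.exit(main(sys.argv))`, so that the monitoring server sees the exit code and can raise an alert on a non-zero result.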
Upcoming work
• Deploy Zenoss fabric and service monitor on all three clusters
– currently in-test at Weizmann
• Deploy Puppet configuration system on all three clusters
– We gave up on Quattor after finally getting it to run
• It was clearly unsustainable
– Puppet currently manages the worker nodes at Weizmann
– Needs to include gLite nodes
• Virtualization of services (excl SE)
• Address Storm “untested new version” problem
End