National Research Grid Initiative (NAREGI)
Sub-Project Leader, NAREGI Project
Visiting Professor, National Institute of Informatics
Professor, GSIC, Tokyo Institute of Technology
Satoshi Matsuoka

Inter-university Computer Centers (excl. National Labs), circa 2002
• Hokkaido University: HITACHI SR8000, HP Exemplar V2500, HITACHI MP5800/160, Sun Ultra Enterprise 4000
• University of Tsukuba: FUJITSU VPP5000, CP-PACS 2048 (SR8000 prototype)
• Kyoto University: FUJITSU VPP800, FUJITSU GP7000F model 900/32, FUJITSU GS8000
• Tohoku University: NEC SX-4/128H4 (soon SX-7), NEC TX7/AzusA
• Kyushu University: FUJITSU VPP5000/64, HP GS320/32, FUJITSU GP7000F model 900/64
• University of Tokyo: HITACHI SR8000, HITACHI SR8000/MPP, others in institutes
• Tokyo Institute of Technology (Titech): NEC SX-5/16, Origin2K/256, HP GS320/64
• Osaka University: NEC SX-5/128M8, HP Exemplar V2500/N
• Nagoya University: FUJITSU VPP5000/64, FUJITSU GP7000F model 900/64, FUJITSU GP7000F model 600/12

Q: How can the Grid become a ubiquitous national research computing infrastructure?
• Simply extend the Campus Grid?
  – 100,000 users/machines, networking over 1,000 km, PetaFlops/Petabytes … problems!
• Grid software stack deficiencies
  – Large-scale resource management
  – Large-scale Grid programming
  – User support tools – PSE, visualization, portals
  – Packaging, distribution, troubleshooting
  – High-performance networking vs. firewalls
  – Large-scale security management
  – "Grid-enabling" applications
  – Manufacturer experience and support

National Research Grid Initiative (NAREGI) Project: Overview
- A new Japanese MEXT national Grid R&D project: ~US$17M in FY2003 (similar through FY2007), plus US$45M for computer resources
- One of two major Japanese government Grid projects (c.f. "BusinessGrid")
- Collaboration of national labs, universities, and major computing and nanotechnology industries
- Acquisition of computer resources underway (FY2003)
- MEXT: Ministry of Education, Culture, Sports, Science and Technology

National Research Grid Infrastructure (NAREGI) 2003–2007
• Petascale Grid infrastructure R&D for future deployment
  – $45M (US) + $16M × 5 years (2003–2007) = $125M total
  – Hosted by the National Institute of Informatics (NII) and the Institute for Molecular Science (IMS)
  – PL: Ken Miura (Fujitsu → NII)
    • SLs: Sekiguchi (AIST), Matsuoka (Titech), Shimojo (Osaka-U), Hirata (IMS), …
  – Participation by multiple (>= 3) vendors
  – Resource contributions by university centers as well
[Diagram: focused "grand challenge" Grid application areas – Nanotech Grid Apps ("NanoGrid", IMS, ~10 TF), (Biotech Grid Apps: BioGrid, RIKEN), (other apps: other institutes) – layered over the Grid middleware R&D and the Grid R&D infrastructure (15 TF–100 TF) on SuperSINET, with national research Grid and network management and various partners: NEC, Fujitsu, Hitachi, AIST, Titech, Osaka-U, U-Tokyo, U-Kyushu.]

National Research Grid Initiative (NAREGI) Project: Goals
(1) R&D in Grid middleware → a Grid software stack for "petascale" nation-wide "Research Grid" deployment
(2) A testbed validating a 100+ TFlops (2007) Grid computing environment for nanoscience applications on the Grid
    - Initially a ~17 TFlops, ~3,000-CPU dedicated testbed
    - SuperSINET (> 10 Gbps research AON backbone)
(3) International collaboration with similar projects (U.S., Europe, Asia-Pacific incl. Australia)
(4) Standardization activities, especially within the GGF
NAREGI Research Organization and Collaboration
[Organization chart: MEXT funds the Center for Grid Research & Development at NII, guided by a Grid R&D Advisory Board and a Grid R&D Program Management Committee. Project Leader: K. Miura (NII). Grid Middleware and Upper-Layer R&D group leaders and a Grid Networking R&D group coordinate with network research, SuperSINET operations (technical requirements, operations technology development, utilization of the network, network technology refinement), and the national supercomputing centers. Joint research and deployment with universities and research labs (Titech, Osaka-U, Kyushu-U, etc.), the ITBL Project (JAERI), and the Computational Nano-science Center at IMS (Nano-science Applications Director: Dr. Hirata, IMS), which utilizes the computing resources. R&D of grand-challenge Grid applications is carried out with ISSP, Tohoku-U, AIST, etc., industrial partners, and the Consortium for Promotion of Grid Applications in Industry.]

Testbed Resources (acquisition in FY2003)
• NII: ~5 TFlop/s
• IMS: ~11 TFlop/s

Participating Organizations
• National Institute of Informatics (NII) (Center for Grid Research & Development)
• Institute for Molecular Science (IMS) (Computational Nano-science Center)
• Universities and national labs (joint R&D): AIST Grid Technology Research Center, Titech GSIC, Osaka-U Cybermedia Center, Kyushu-U, Kyushu Institute of Technology, etc.
• Project collaborations (ITBL Project, supercomputer center Grid deployment projects, etc.)
• Participating vendors (IT and nanotech)
• Consortium for Promotion of Grid Applications in Industry

NAREGI R&D Assumptions & Goals
• Future Research Grid metrics
  – Tens of institutions/centers, various project VOs
  – > 100,000 users, > 100,000 CPUs/machines
  – Machines very heterogeneous: supercomputers, clusters, desktops
  – 24/7 usage, production deployment
  – Server Grid, Data Grid, metacomputing…
• Do not reinvent the wheel
  – Build on, collaborate with, and contribute to the "Globus, Unicore, Condor" trilogy
  – Scalability and dependability are the key
• Win the support of users
  – Application and experimental deployment are essential
  – However, do not let the apps get a "free ride"
  – R&D for production-quality (free) software

NAREGI Work Packages
• WP-1: National-Scale Grid Resource Management: Matsuoka (Titech), Kohno (ECU), Aida (Titech)
• WP-2: Grid Programming: Sekiguchi (AIST), Ishikawa (AIST)
• WP-3: User-Level Grid Tools & PSE: Miura (NII), Sato (Tsukuba-U), Kawata (Utsunomiya-U)
• WP-4: Packaging and Configuration Management: Miura (NII)
• WP-5: Networking, National-Scale Security & User Management: Shimojo (Osaka-U), Oie (Kyushu Tech.)
• WP-6: Grid-Enabling Nanoscience Applications: Aoyagi (Kyushu-U)

NAREGI Software Stack (targeting a 100 TFlops-class science Grid environment)
• WP6: Grid-enabled applications
• WP3: Grid PSE, Grid Workflow, Grid Visualization
• WP2: Grid Programming (GridRPC, GridMPI)
• WP4: Packaging
• WP1: SuperScheduler, Grid Monitoring & Accounting, Grid VM (on Globus, Condor, UNICORE → OGSA)
• WP5: Grid PKI, High-Performance Grid Networking

WP-1: National-Scale Grid Resource Management
• Build on UNICORE ↔ Condor ↔ Globus
  – Bridge their gaps as well: Condor-U and Unicore-C bridges, alongside Condor-G, the Condor Globus universe, and the EU GRIP work
  – OGSA in the future
• SuperScheduler
• Monitoring & Auditing/Accounting
• Grid Virtual Machine
• PKI and Grid account management (WP5)

WP1: SuperScheduler (Fujitsu)
• Hierarchical super-scheduling structure, scalable to 100,000s of users, nodes, and jobs across 20+ sites
• Fault tolerance
• Workflow engine
• NAREGI resource schema (joint with Hitachi)
• Resource brokering with resource policies and advance reservation (NAREGI Broker)
• Initially prototyped on UNICORE AJO/NJS/TSI (OGSA in the future)

WP1: SuperScheduler (Fujitsu) (cont'd)
[Architecture diagram. (U) = UNICORE (Uniform Interface to Computing Resources); (G) = GRIP (Grid Interoperability Project). Workflow descriptions from the WP3 PSE (converted to UNICORE DAGs) enter through the UNICORE Gateway over UPL (Unicore Protocol Layer) on SSL across the Internet, authenticated against the WP5 NAREGI PKI [NEC] and the UNICORE UUDB. Inside the intranet, Network Job Supervisors (NJS) perform resource discovery, selection, and reservation through NAREGI BROKER-S [Fujitsu] (c.f. the EuroGrid broker from Manchester University), which issues CheckQoS and SubmitJob requests to execution NJSs; these drive Target System Interfaces (TSI) via FNTP (the Fujitsu European Laboratories NJS-to-TSI protocol), with GRIP providing Globus interoperability and Condor access possibly via DRMAA (Imperial College London). OGSI portTypes are under consideration. NAREGI BROKER-L [Fujitsu] handles local brokering against a policy DB (repository), mapping resource requirements expressed in RSL (or JSDL) onto CIM and evaluating them with the "Ponder" policy description language as a management application. Monitoring [Hitachi] feeds analysis and prediction: resource information is gathered by CIM providers for Condor (ClassAds), Globus (MDS/GARA), and batch queues (an NQS provider is planned), aggregated by a CIMOM (TOG OpenPegasus, derived from the SNIA CIMOM; commercial counterparts include MS WMI, IBM Tivoli, and Sun WBEM Services), and delivered as CIM in XML over HTTP or via CIM-to-LDAP, with CIM indications (e.g., queue-change events) exported as GMA sensors. This setup was used in the CGS-WG demo at GGF7.]

WP1: Grid Virtual Machine (NEC & Titech)
• A "portable", thin VM layer for access control and virtualization on the Grid
• Various VM functions: access control, secure resource access, resource access transparency, node virtualization, fault-tolerance and checkpoint support, job control and migration, resource usage and rate control
• Also provides co-scheduling and co-allocation across clusters
• Respects Grid standards, e.g., GSI and (in the future) OGSA
• Various prototypes on Linux

WP1: Grid Monitoring & Auditing/Accounting (Hitachi & Titech)
• Scalable Grid monitoring, accounting, and logging
• Define a CIM-based unified resource schema (an illustrative sketch of such a record follows below)
• Distinguish end users vs. administrators
• Prototype based on the GT3 Index Service, CIMOM, etc.
[Information service architecture: user-dependent presentation layers for administrators (detailed resource information, searching, fault analysis, administrative operations such as an account-mapping service, admin policy backed by an RDB and directory service) and for end users (real-time monitoring). The Grid middleware information service serves the SuperScheduler and a secure large-scale data management service. Underneath, a unified schema over UNICORE, Condor, and Globus is populated through a CIMOM (Pegasus) and GMA by GridVM providers, resource information, performance monitors and predictions, user/OS/event logs, and batch systems. Self-configuring monitoring (Titech).]
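To make the idea of a unified resource schema concrete, the sketch below shows the kind of record such a schema might normalize across Globus, Condor, and UNICORE resources. This is a minimal illustration only; the field names are assumptions, not the actual NAREGI/CIM schema.

```c
#include <time.h>

/* Illustrative only: NOT the actual NAREGI resource schema. */
typedef struct {
    char   hostname[256];     /* fully qualified host name            */
    char   site[64];          /* owning center / virtual organization */
    char   local_system[32];  /* e.g. "Condor", "NQS", "Globus GRAM"  */
    char   os_name[32];
    char   cpu_arch[32];      /* e.g. "SPARC64", "IA-64", "x86"       */
    int    cpus_total;
    int    cpus_free;
    double load_average;
    long   memory_mb_total;
    long   memory_mb_free;
    int    jobs_queued;       /* local batch queue length             */
    time_t last_update;       /* timestamp of the monitoring sample   */
} resource_record_t;
```

Whatever its exact shape, such a record is what would let the SuperScheduler and accounting services compare otherwise heterogeneous resources with a single query interface.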
WP-2: Grid Programming
• Grid Remote Procedure Call (RPC) – Ninf-G2
• Grid message-passing programming – GridMPI

WP-2: Grid Programming – GridRPC/Ninf-G2 (AIST/GTRC)
• GridRPC: a programming model using RPC on the Grid
  – High-level, tailored for scientific computing (c.f. SOAP-RPC)
  – GridRPC API standardization by the GGF GridRPC WG
• Ninf-G Version 2: a reference implementation of the GridRPC API
  – Implemented on top of Globus Toolkit 2.0 (3.0 experimental)
  – Provides C and Java APIs
[Call flow: an IDL file describing the numerical library is run through the IDL compiler to generate the remote executable and its interface information (an LDIF file registered with MDS). The client (1) sends an interface request, (2) receives the interface reply via MDS, (3) invokes the executable via GRAM, which forks the remote executable on the server side, and (4) the executable connects back to the client.]
http://ninf.apgrid.org/ – a demo is available at the AIST/Titech booth.
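To illustrate the programming model, here is a minimal GridRPC client sketch in C, following the GGF GridRPC API that Ninf-G implements. The configuration file name, server name, and remote routine ("mylib/mmul") are hypothetical placeholders, not part of any shipped example; consult the Ninf-G documentation for the exact setup.

```c
#include <stdio.h>
#include <stdlib.h>
#include "grpc.h"   /* GridRPC API header as shipped with Ninf-G */

#define N 512

int main(int argc, char *argv[])
{
    grpc_function_handle_t handle;
    double *a = malloc(N * N * sizeof(double));
    double *b = malloc(N * N * sizeof(double));
    double *c = malloc(N * N * sizeof(double));
    for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Read the client configuration (hypothetical file name). */
    if (grpc_initialize("client.conf") != GRPC_NO_ERROR) {
        fprintf(stderr, "grpc_initialize failed\n");
        return 1;
    }

    /* Bind a handle to a remote executable registered under a
       hypothetical module/function name on a hypothetical server. */
    grpc_function_handle_init(&handle, "server.example.org", "mylib/mmul");

    /* Synchronous remote call; arguments are marshalled per the IDL. */
    if (grpc_call(&handle, N, a, b, c) != GRPC_NO_ERROR)
        fprintf(stderr, "remote call failed\n");

    grpc_function_handle_destruct(&handle);
    grpc_finalize();
    free(a); free(b); free(c);
    return 0;
}
```

The point of the model is that the client stays this small: MDS lookup, GRAM invocation, and the connect-back are all hidden behind the function handle, so the same code can target any server on which the routine has been registered.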
WP-2: Grid Programming – GridMPI (AIST and U-Tokyo)
• Provides users with an environment to run MPI applications efficiently on the Grid
• Flexible and heterogeneous process invocation on each compute node (RIM, SSH, RSH, GRAM)
• GridADI and a latency-aware communication topology, optimizing communication over non-uniform latencies and hiding the differences between lower-level communication libraries
• Extremely efficient implementation based on MPI on SCore (not MPICH-PM)
[Layer diagram: the MPI core sits on the Grid ADI, with the latency-aware communication topology and point-to-point communication mapped onto IMPI, vendor MPI libraries, TCP/IP, PMv2, and others. A coupled-simulation sketch that runs over plain MPI/GridMPI appears after the WP-6 application example below.]

WP-3: User-Level Grid Tools & PSE
• Grid workflow
  – Workflow language definition
  – GUI (task-flow representation)
• Visualization tools
  – Real-time volume visualization on the Grid
• PSE / portals
  – Multiphysics / coupled simulation
  – Application pool
  – Collaboration with the nanotech applications group
[Diagram: a simulation or storage on the server produces raw data, which is turned into 3D objects and rendered either on the server (images sent to the client) or on the client; the problem-solving environment comprises a PSE portal, PSE toolkit, PSE appli-pool, information service, workflow, super-scheduler, application server, and UI.]

WP-4: Packaging and Configuration Management
• Collaboration with WP1 management
• Issues
  – Selection of packagers to use (RPM, GPTK?)
  – Interface with autonomous configuration management (WP1)
  – Test procedure and harness
  – Testing infrastructure (c.f. NSF NMI packaging and testing)

WP-5: Grid High-Performance Networking
• Traffic measurement on SuperSINET
• Optimal routing algorithms for Grids
• Robust TCP/IP control for Grids
• Grid CA / user Grid account management and deployment
• Collaboration with WP-1

WP-6: Adaptation of Nano-science Applications to the Grid Environment
• Analysis of typical nanoscience applications
  – Parallel structure
  – Granularity
  – Resource requirements
  – Latency tolerance
• Development of coupled simulation
• Data exchange format and framework
• Collaboration with IMS

WP6 and Grid Nano-Science and Technology Applications: Overview
• Participating organizations: Institute for Molecular Science, Institute for Solid State Physics, AIST, Tohoku University, Kyoto University, industry (materials, nano-scale devices), Consortium for Promotion of Grid Applications in Industry
• Research topics and groups: electronic structure; magnetic properties; functional nano-molecules (CNT, fullerene, etc.); bio-molecules and molecular electronics; simulation software integration platform; etc.

Example: WP6 and IMS Grid-Enabled Nanotechnology
• IMS RISM-FMO Grid coupled simulation
  – RISM: Reference Interaction Site Model
  – FMO: Fragment Molecular Orbital method
• WP6 will develop the application-level middleware, including the "Mediator" component
[Diagram: the RISM and FMO solvers, one on an SMP supercomputer and one on a cluster Grid, exchange solvent distributions, solute structures, and in-sphere correlations through Mediator components over GridMPI etc.]
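To make the coupling pattern concrete, here is a minimal MPI sketch in C of two solver groups exchanging data through their group leaders each iteration, in the spirit of the RISM-FMO Mediator described above. It uses only standard MPI, which is the point of GridMPI: the same program could be mapped across two clusters joined by the latency-aware topology. The group split, field size, and exchange schedule are illustrative assumptions, not the actual NAREGI Mediator implementation.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NGRID 1024   /* illustrative size of the exchanged field */

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* run with >= 2 processes */

    /* Split the job into two solver groups: color 0 plays "RISM", color 1
       plays "FMO".  Under GridMPI the two halves could sit on different
       clusters reached over the wide-area network. */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm solver;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &solver);

    int solver_rank;
    MPI_Comm_rank(solver, &solver_rank);

    double *field = malloc(NGRID * sizeof(double));
    for (int i = 0; i < NGRID; i++)
        field[i] = (double)color;                /* dummy data */

    for (int step = 0; step < 10; step++) {
        /* ... each group would run its own solver on `field` here ... */

        /* Group leaders act as a simple "Mediator": they swap the solvent
           distribution / solute structure across the inter-group link. */
        if (solver_rank == 0) {
            int peer = (color == 0) ? world_size / 2 : 0;
            MPI_Sendrecv_replace(field, NGRID, MPI_DOUBLE,
                                 peer, 0, peer, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* Distribute the received data within each solver group. */
        MPI_Bcast(field, NGRID, MPI_DOUBLE, 0, solver);
    }

    free(field);
    MPI_Comm_free(&solver);
    MPI_Finalize();
    return 0;
}
```

In the actual system the Mediator is application-level middleware that also handles data-exchange formats between the two codes, and co-allocation of the two groups across sites is the job of the WP-1 services rather than a static rank split as above.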
SuperSINET: AON Production Research Network (separately funded)
■ 10 Gbps general backbone
■ GbE bridges for peer connections for Grid applications
■ Very low latency: Titech–Tsukuba 3–4 ms round trip
■ Operation of photonic cross-connects (PXC) for fiber/wavelength switching
■ 6,000+ km of dark fiber, 100+ lambdas, and 300+ Gb/s
■ Operational from January 2002 until March 2005
[Map: connected sites include Hokkaido U., Tohoku U., Tsukuba U., KEK, NIFS, NAO, Waseda U., U. of Tokyo, Tokyo Institute of Technology, NII, ISAS, NIG, Nagoya U., the Okazaki research institutes, Kyoto U., Osaka U., Doshisha U., and Kyushu U.; application overlays include nanotechnology for the Grid, OC-48+ transmission for radio telescopes, DataGrid for high-energy science, middleware for computational Grids, bio-informatics, and NII Grid R&D and operations.]

SuperSINET: Network Topology (10 Gbps photonic backbone network)
[Topology map, as of October 2002 (source: National Institute of Informatics): a Tokyo hub (NII Hitotsubashi, NII Chiba, U Tokyo, Titech, Waseda U, KEK, Tsukuba U, NAO, ISAS, NIG, Hokkaido U, Tohoku U), a Nagoya hub (Nagoya U, NIFS, IMS Okazaki), and an Osaka hub (Osaka U, Kyoto U, Kyoto U Uji, Doshisha U, Kyushu U), hosting the NAREGI Grid R&D sites.]

The NAREGI Phase 1 Testbed ($45M, 1Q2004)
• Total ~6,500 processors, ~30 TFlops, connected via SuperSINET (10 Gbps MPLS) over ~400 km
• Center for Grid R&D at NII (Tokyo): software testbed, ~5 TFlops
• Computational Nano-science Center at IMS (Okazaki): application testbed, ~11 TFlops
  – Together ~3,000 processors, ~17 TFlops of dedicated resources
• Plus the Titech Campus Grid (~1.8 TFlops), AIST SuperCluster (~11 TFlops), Osaka-U BioGrid, U-Tokyo, and six small test application clusters
• Note: NOT a production Grid system (c.f. TeraGrid)

NAREGI Software R&D Grid Testbed (Phase 1)
• Under procurement; installation March 2004
  – 3 SMPs, 128 processors total (64 + 32 + 32): Sparc V + IA-64 + Power4
  – 6 PC clusters of 128 processors each
    • 2.8 GHz dual Xeon + GbE (blades)
    • 3.06 GHz dual Xeon + InfiniBand
  – 10 + 37 TB file server
  – Multi-gigabit networking to simulate a Grid environment
  – NOT a production system (c.f. TeraGrid)
  – > 5 TFlops, with WAN simulation
• To form a Grid with the IMS NAREGI application testbed infrastructure (> 10 TFlops, March 2004) and other national centers via SuperSINET

NAREGI R&D Grid Testbed @ NII
[System configuration diagram (original labels in Japanese, "Grid infrastructure software development system configuration"): a SuperSINET uplink through external network connection devices (64+ GbE ports, 10GbE-capable, high-speed packet filtering); a file server (Unix SMP, 8 CPUs, 16+ GB memory, 10 TB RAID5 disk plus 20 TB backup); two high-performance distributed-parallel compute servers (Linux plus a management node, 128+ processors each, 0.65–0.75+ TF, 65–130+ GB memory, 1.2–2.3+ TB disk, interconnects of 1–8+ Gbps); four distributed-parallel compute servers (Linux plus a management node, 128+ processors each, 0.65+ TF, 65+ GB memory, 1.2+ TB disk, 1+ Gbps interconnects); three shared-memory compute servers (Unix/Linux, 32–64 CPUs each, 0.17–0.33+ TF, 32–64 GB memory); all tied together by GbE L2 switches (75+ ports each) and an internal GbE L3 switch, with GbE trunks to the office network and the file server.]

AIST (National Institute of Advanced Industrial Science and Technology) SuperCluster
• Challenge: huge computing power to support various research within AIST, including life science and nanotechnology
• Solution: a Linux cluster of IBM eServer 325 nodes
  – P32: 2,116 AMD Opteron CPUs
  – M64: 520 Intel Madison CPUs
  – Myrinet networking
  – SCore cluster OS
  – Globus Toolkit 3.0 to allow shared resources
• World's most powerful Linux-based supercomputer: more than 11 TFLOPS, ranked as the third most powerful supercomputer in the world
• Operational March 2004
[Diagram: Grid technology connecting life science and nanotechnology research with academia, government, and corporations, through the AIST Advanced Computing Center LAN, the Internet, and other research institutes.]

NII Center for Grid R&D (Jinbo-cho, Tokyo)
[Map: Mitsui Office Building, 14th floor, near Jinbo-cho; shown relative to Akihabara, the Imperial Palace, and Tokyo Station.]
700 m² of office space (100 m² machine room).

Towards a Petascale Grid – a Proposal
• Resource diversity (松竹梅, "Shou-Chiku-Bai"); the aggregate arithmetic is summarized at the end of this document
  – 松 ("shou", pine): ES-like centers, 40–100 TeraFlops each × a few, 100–300 TeraFlops in aggregate
  – 竹 ("chiku", bamboo): medium-sized machines at university supercomputer centers, 5–10 TeraFlops × 5 per center, 25–50 TeraFlops aggregate per center, 250–500 TeraFlops total across the centers
  – 梅 ("bai", plum): small clusters and PCs spread throughout campus in a campus Grid, 5k–10k machines per campus, 50–100 TeraFlops per center, 500 TeraFlops–1 PetaFlop total
• Division of labor between "big" centers like the ES and the university centers, across large, medium, and small resources
• Utilize the Grid software stack developed by NAREGI and other Grid projects

Collaboration Ideas with the ES
• Data (Grid)
  – NAREGI deliberately does not handle data
• UNICORE components
  – "Unicondore" (Condor-U, Unicore-C)
• NAREGI middleware
  – GridRPC, GridMPI
  – Networking
  – Resource management (e.g., the CIM resource schema)
• International testbed
• Other ideas?
  – Application areas as well
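For reference, the tier arithmetic behind the 松竹梅 proposal above works out as follows, under the assumption (implied by the quoted totals, not stated explicitly) of roughly ten participating university centers:

```latex
\begin{align*}
\text{Pine (ES-like centers):}   &\quad 40\text{--}100~\mathrm{TF} \times \text{a few centers} \approx 100\text{--}300~\mathrm{TF} \\
\text{Bamboo (center machines):} &\quad 5\text{--}10~\mathrm{TF} \times 5 = 25\text{--}50~\mathrm{TF\ per\ center}; \quad \times \sim 10 \approx 250\text{--}500~\mathrm{TF} \\
\text{Plum (campus Grids):}      &\quad 50\text{--}100~\mathrm{TF\ per\ center} \times \sim 10 \approx 500~\mathrm{TF}\text{--}1~\mathrm{PF}
\end{align*}
```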