ARDA

Catalogue Access on the Grid
Birger Koblitz for the ARDA project
Grid Performance Workshop, Edinburgh, June 22nd, 2005

Overview
● Characteristics of Grid Catalogue Access
● How to Access Databases on the Grid
● File catalogues: Comparing LFC and FiReMan
● AMGA, the ARDA metadata server
● SOAP vs. TCP text streaming
● Conclusions

Grid Catalogues
The most prominent catalogues on the grid are:
● File catalogues
● Metadata catalogues
Both catalogue types normally have (relational) database back ends.
Special access pattern on the grid (for HEP):
● Write once, read many times
● Distinction between writers (production jobs) and readers (analysis jobs)
● Readers frequently read large amounts of catalogue data in HEP (O(1k - 100k) entries)
➔ Need fast, bulk read access to DBs

DB Access on a Grid
There are 2 ways to access a DB remotely:
“Traditional” way: ODBC, RAL, ...
[Diagram: application, API and client talk SQL via ODBC, JDBC or proprietary protocols directly to the SQL-DB server]
+ Performance
+ Simple implementation
− Security, monitoring
− How do you authenticate?
“Service”: RLS, AMI, RefDB, ...
[Diagram: application, API and client talk SOAP, XML-RPC or text to a DB-Service, which talks SQL to the SQL-DB server]
+ Lightweight client
+ Security: GSI, x509
− Performance
− Implementation: state

LFC and FiReMan
Both are 2nd generation catalogues by CERN
● Support ACLs, GUIDs
LFC is the LCG-2 file catalogue for EGEE:
● C server with proprietary, binary RPCs
● Uses transactions and DB cursors via sessions
● No bulk operations
FiReMan is the gLite catalogue for EGEE:
● Uses SOAP, Java service in Axis
● No DB cursor → no data consistency between calls
● One-call bulk operations as transactions, no sessions
Test setup:
● Server: 2x Xeon @2.4GHz, 2GB RAM (DB + service)
● Client: PIII @800MHz, multi-threaded client

FC: Performance
[Figure, two panels. Insertion: inserts / second vs. number of threads. Reading: entries returned / second vs. number of threads. Curves: Fireman - Single Entry, Fireman - Bulk 100, LFC. Several points are marked "timeouts".]
➔ LFC faster for single ops, slower for many (bulk operations missing)
➔ FiReMan has problems with many clients
with C. Munro

FC: Protocol Analysis
Study of protocols with authentication enabled:
[Figure, two panels: number of packets vs. data transferred (bytes). Protocol steps annotated in the plots include AUTHENTICATE, GET STAT, READ DIR, GET NEXT, RESPONSE, GET INTERFACE VERSION, CHECK ENTRY EXISTS and GET SERVICE METADATA.]
Both protocols have large overhead:
➔ Several RPCs needed
➔ Authentication not persistent
➔ SOAP blows up messages by a factor of 5
with C. Munro

Tools
Main tool for network tracing: Ethereal
System tracing: strace, gdb

LFC & FiReMan Summary
LFC:
● Fast for single entries
● Relatively small protocol overhead
● Transactions → consistency
● No bulk operations → slow for many entries
FiReMan:
● Bulk operations → fast ops on many entries
● No transactions → no consistent reading
● Large protocol overhead
● Timeouts
Both catalogues could reduce protocol overhead
LFC should implement bulk operations

AMGA
Implement a metadata server from what we learned:
● Multi-threaded C++ server for text streaming & SOAP
● Uses ODBC as RDBMS abstraction: Oracle, PostgreSQL, SQLite
● Sessions supported in SOAP & streaming
● DB cursors
● Streams responses asynchronously
● Iterators for SOAP
[Architecture diagram; labels: Application, Client, Java-API, C++-API, Python, Firewall, Security wrapper (SSL, GSI), TEXT, SOAP, Server, MD-Server, Command, Asynchr. Buffer, ODBC, PostgreSQL, File]

Interface: Retrieving data
Bulk transfers to the client are done through iterators on the back end; resending the query allows statelessness:
● int query(string query, MDResult &result)
● int nextQuery(string query, string token)
● int endQuery(string token)
struct MDResult {
    Boolean last;
    String token;
    String query;
    DataChunk chunk;
}
AMGA has streamed versions:
● int getAttr(string pattern, list<string> keys)
  Returns values for all keys of the entries matching pattern
  ➔ Client knows semantics
● int find(string pattern, string query, Handler &handle)
  Returns all entries (no collections) matching pattern and fulfilling query
with gLite DM team

Performance
Extensive performance tests done on LAN:
[Figure, two panels. Ping: 1000 ping operations, average throughput [calls/sec] vs. # clients; curves: TCP-S, no KA; TCP-S, KA; gSOAP, no KA; gSOAP, KA (KA: keep-alive); one point is labelled "out of Sockets". getAttr: read 60 attributes of 1000 entries, average throughput [entries/sec] vs. # clients; curves: TCP-S, Single; TCP-S, Bulk; gSOAP, Single; gSOAP, Bulk.]
No sessions used, no SSL
➔ TCP streaming in general 2-5 times faster than SOAP
➔ Performance very promising
➔ Importance of bulk transfers evident
with N. Santos

LAN and WAN
Comparisons of gSOAP and TCP streaming on LAN and WAN
[Figure, two panels: execution time [s] for ping, add, get and get Bulk. Panels: 1000 ops, LAN (0.8ms latency) and 1000 ops, WAN (300ms latency). Bars: Raw no KA, Raw KA, gSOAP no KA, gSOAP KA. The get Bulk values are shown multiplied by 5 (x5).]
Single clients:
➔ TCP streaming always fastest, but SOAP not bad
➔ Times on WAN dominated by latency
➔ Streaming dramatically faster on WAN
with N. Santos

SOAP Toolkit Performance
[Figure: 1000 ping operations on LAN, execution time [s] for C++, Java and Python. Bars: TCP-S no KA, TCP-S KA, SOAP no KA, SOAP KA.]
SOAP toolkits:
● C++: gSOAP
● Java: Axis
● Python: ZSI
SOAP toolkit quality varies widely:
● Took 2 weeks to write SOAP clients in 3 languages
● Toolkits incompatible, only hand-written WSDL works
● SOAP APIs differ ↔ BSD sockets standard
with N. Santos

Summary
Existing catalogues can still be improved
SOAP is a problematic protocol for DB access:
● No sessions → no DB cursors
● Large overhead
● No streaming
Protocol should be tailored to the task
● Streaming very promising for DB access, necessary on WAN
● Statefulness needed?
LFC & FiReMan are 2nd generation FCs
● Still several issues with bulk ops, sessions, DB cursors, large protocol overhead...

SSL and Sessions
SSL can dramatically reduce performance if no sessions are used:
[Figure: pings per second [1/s], 10 clients, 100 pings each, for TCP-S, TCP-S w. SSL, gSOAP and gSOAP w. SSL. Bars: Connection, Session, Mult. Connections.]
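
Sketch: chunked retrieval with query / nextQuery / endQuery
The iterator interface from the "Interface: Retrieving data" slide can be sketched as a toy, in-memory model. This is not the AMGA implementation: the class ToyCatalogue, the helper fetch_all, the prefix-match "query" and the choice to encode the read offset in the token are all assumptions made for illustration. The only things taken from the slide are the three calls and the fields of MDResult. Because the client resends the query together with the token, the toy server needs no per-client state between calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MDResult:
    """Mirrors the MDResult struct on the slide: one chunk plus iterator state."""
    last: bool
    token: str
    query: str
    chunk: List[str] = field(default_factory=list)


class ToyCatalogue:
    """Hypothetical in-memory stand-in for a metadata server (illustration only)."""

    def __init__(self, entries: List[str], chunk_size: int = 3) -> None:
        self._entries = list(entries)
        self._chunk_size = chunk_size

    def _chunk(self, query: str, offset: int) -> MDResult:
        # Toy "query": a plain prefix match on the entry name.
        rows = [e for e in self._entries if e.startswith(query)]
        end = offset + self._chunk_size
        # Assumption: the token simply encodes the next offset, so the
        # server can rebuild the iterator from (query, token) alone.
        return MDResult(last=end >= len(rows), token=str(end),
                        query=query, chunk=rows[offset:end])

    def query(self, query: str) -> MDResult:
        """Start a query and return the first chunk."""
        return self._chunk(query, 0)

    def next_query(self, query: str, token: str) -> MDResult:
        """Return the next chunk; the client resends the query with the token."""
        return self._chunk(query, int(token))

    def end_query(self, token: str) -> None:
        """No-op here; a real server would release its back-end cursor."""


def fetch_all(server: ToyCatalogue, query: str) -> List[str]:
    """Client-side loop: pull chunks until the server flags the last one."""
    result = server.query(query)
    rows = list(result.chunk)
    while not result.last:
        result = server.next_query(result.query, result.token)
        rows.extend(result.chunk)
    server.end_query(result.token)
    return rows


if __name__ == "__main__":
    catalogue = ToyCatalogue([f"/grid/file{i}" for i in range(7)], chunk_size=3)
    print(fetch_all(catalogue, "/grid/"))
```

With seven matching entries and a chunk size of three, the client makes three calls in total. Each call is one round trip, which is why the chunk size matters so much on a high-latency link.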
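
Sketch: why latency dominates on the WAN
The remark on the "LAN and WAN" slide that WAN times are dominated by latency can be illustrated with rough arithmetic. The model below is an assumption, not a result from the talk: it treats the quoted latencies (0.8 ms and 300 ms) as round-trip times and charges each operation a fixed number of round trips. The function name and the round-trip counts are hypothetical parameters, and the outputs are not the measured values from the plots.

```python
def estimated_time_s(n_ops: int, latency_s: float,
                     round_trips_per_op: float = 1.0,
                     per_op_cost_s: float = 0.0) -> float:
    """Rough execution time: every operation waits for its round trips.

    Assumption: latency_s is a round-trip time and operations are issued
    one after another by a single client.
    """
    return n_ops * (round_trips_per_op * latency_s + per_op_cost_s)


if __name__ == "__main__":
    lan, wan = 0.0008, 0.3  # latencies quoted on the slide, in seconds

    # 1000 single operations, one round trip each (hypothetical count).
    print("LAN, 1000 single ops:", estimated_time_s(1000, lan), "s")
    print("WAN, 1000 single ops:", estimated_time_s(1000, wan), "s")

    # The same data fetched with one bulk or streamed request.
    print("WAN, 1 bulk request: ", estimated_time_s(1, wan), "s")

    # A protocol that needs extra round trips per call (for example a
    # fresh connection each time) scales the WAN cost proportionally.
    print("WAN, 3 round trips/op:", estimated_time_s(1000, wan, 3), "s")
```

Under these assumptions the per-operation network wait grows by the ratio of the latencies (300 ms / 0.8 ms = 375), while a single bulk or streamed request pays the latency only once.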