Team 4: IJAMS
17-654: Analysis of Software Artifacts
18-846: Dependability Analysis of Middleware
IJAMS (Integrated Job Applicants Management System)
Daehyun Jung // Inhong Kim // Jungjin Suh // Hyoungiel Park // Huiyeong Choi
Electrical & Computer Engineering

Team 4 Members
• DaeHyun Jung: djung@andrew.cmu.edu
• InHong Kim: ihk@andrew.cmu.edu
• JungJin Suh: jjsuh@andrew.cmu.edu
• Hyoungiel Park: hip@andrew.cmu.edu
• Huiyeong Choi: hchoi1@andrew.cmu.edu
Team 4 Homepage: http://www.ece.cmu.edu/~ece846/team4/index.html

Application Overview
What does the application do?
• Integrated Job Applicants Management System (IJAMS)
• IJAMS connects registered headhunting agencies to an integrated HR database
What makes it interesting?
• LDAP (Lightweight Directory Access Protocol) server as the database
• Heavy data transmission per transaction (about 100 applicant entries returned per query)
• Remote LDAP (in Korea) and local LDAP (in the MSE Cave)
LDAP vs. RDBMS
• Read operations are faster than with an RDBMS
• Fewer resources are used (memory)
• Connection to an external DB is easier
• http://www.openldap.org/

Development Environment
• Middleware: CORBA 2.3.1 (embedded in JDK 1.4.2, idlj 3.1)
  – Lightweight: no administrative privilege is required to run the server, restart is faster, and less runtime resource is consumed than with an EJB server
• Language: Java 1.4.2
• API: Netscape Directory SDK 4.0 for Java
• Platform: Linux ECE cluster (ssh is used in building the replication manager)
• Main database: SunOne LDAP 5.1 (back-end data tier, high performance for data retrieval)
• Checkpointing database: MySQL 4.0 on Sun Solaris 2.9 (for FT-Baseline)

Baseline Architecture
[Diagram: headhunter agency clients 1..n look up the CORBA server in the Naming Service (to which the server binds), send requests to the CORBA server, and the server issues LDAP queries to the LDAP back end. Legend: client, Naming Service, CORBA server, LDAP; arrows for look-up, bind, request, and LDAP query.]

Fault-Tolerance Strategies
Passive replication
– Replicate the middle tier on different machines in the ECE cluster
– State information in the CORBA servant is saved to MySQL for checkpointing (user id/password, user level, transaction id, operation flag)
– On the sacred machine: CORBA Naming Service, LDAP server, Replication Manager
Elements of the fault-tolerance framework
– Replication Manager: main process
– Fault detector and automatic recovery: thread
– Fault injector: thread

FT-Baseline Architecture (1): Data Retrieval Operation
[Diagram: the client resolves the primary replica through the Naming Service and sends the retrieval request; the primary replica queries LDAP and checkpoints to MySQL; the Replication Manager process, with its fault detector and fault injector threads, manages the backup replica.]
• No fault-injection point is fixed for this scenario: whenever an error occurs during retrieval, the operation is simply retried from scratch.
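The data-retrieval path above goes through the Netscape Directory SDK 4.0 for Java to the SunOne LDAP back end. The following is a minimal sketch of such a query; the host name, base DN, and filter are placeholders, not the values IJAMS actually used.

```java
import netscape.ldap.LDAPConnection;
import netscape.ldap.LDAPEntry;
import netscape.ldap.LDAPException;
import netscape.ldap.LDAPSearchResults;

/**
 * Sketch of the data-retrieval path: the CORBA servant queries the LDAP
 * back end through the Netscape Directory SDK. Host, base DN, and filter
 * below are assumptions for illustration only.
 */
public class ApplicantSearchSketch {
    public static void main(String[] args) throws LDAPException {
        LDAPConnection ld = new LDAPConnection();
        ld.connect("ldap.example.com", LDAPConnection.DEFAULT_PORT);  // remote or local LDAP

        // One query may return on the order of 100 applicant entries.
        String baseDN = "ou=applicants,o=ijams";                      // hypothetical base DN
        String filter = "(objectclass=person)";
        LDAPSearchResults res = ld.search(baseDN, LDAPConnection.SCOPE_SUB,
                                          filter, null /* all attributes */, false);

        while (res.hasMoreElements()) {
            LDAPEntry entry = res.next();                             // may throw LDAPException
            System.out.println(entry.getDN());
        }
        ld.disconnect();
    }
}
```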
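The passive-replication strategy checkpoints the servant state (user id/password, user level, transaction id, operation flag) to MySQL. A minimal JDBC sketch of such a checkpoint write is shown below; the table name, column names, and connection URL are assumptions, not the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/**
 * Sketch of passive-replication checkpointing: the primary replica writes the
 * servant state to MySQL so that the backup replica can take over. Table and
 * column names and the JDBC URL are placeholders.
 */
public class CheckpointSketch {
    public static void saveCheckpoint(String userId, String password,
                                      int userLevel, long txId, int opFlag) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");                  // MySQL Connector/J driver
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://db.example.com/ijams_ckpt", "ijams", "secret");
        try {
            PreparedStatement ps = con.prepareStatement(
                "REPLACE INTO checkpoint (user_id, password, user_level, tx_id, op_flag) " +
                "VALUES (?, ?, ?, ?, ?)");
            ps.setString(1, userId);
            ps.setString(2, password);
            ps.setInt(3, userLevel);
            ps.setLong(4, txId);
            ps.setInt(5, opFlag);
            ps.executeUpdate();                                   // one checkpoint row per user
            ps.close();
        } finally {
            con.close();
        }
    }
}
```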
FT-Baseline Architecture (2): Data Update Operation
[Diagram: the client reaches the primary replica through the Naming Service; the primary checkpoints to MySQL and updates LDAP; the Replication Manager process, with its fault detector and fault injector, manages the backup replica. Each update is simplified into a set of <transaction> steps across client, server, and back end.]
• Since a data update changes data on the LDAP server, the operation must not be duplicated.
• Recovery is possible because the new primary replica continues processing from the checkpoint instead of starting from scratch.

Mechanisms for Fail-Over
Fault detection
– The client obtains the replica references by name when it starts
– We use one of the CORBA exceptions: COMM_FAILURE
– The client gets a new CORBA replica reference from the Naming Service
Fail-over
– The backup replica waits to take over
– If a fault occurs, the backup replica takes over using the saved checkpointing information
– The client retries the operation with the new replica
– The user of the client is re-authenticated on the new replica with the user data in the checkpointing DB
(a client-side sketch follows the Performance Strategy slide)

Fail-Over Measurements
[Charts: Phase I response time (msec) over roughly 1,000 trials, showing the fail-over spike; pie chart of the fail-over contribution factors (fault detection, get replica location, reconnection, LDAP query, parse) with shares of 49%, 22%, 15%, 14%, and 0%; reconnection is the largest contributor.]

RT-FT-Performance Strategy
• The primary reason for the fail-over "spike": TCP/IP reconnection between client and server
• The RT-FT strategy is to reduce the TCP/IP reconnection time:
  – The client pre-establishes connections with the replicas
  – 3 replicas (primary and 2 backups)
  – Locator: choose the replica that is most likely to have established its connection
    » Send dummy requests to the replicas to increment a number representing each replica's age
    » The oldest replica is prepared for fail-over
  – CORBA Object Factory: returns a CORBA object reference to the client, namely the reference of the oldest replica

RT-FT Baseline Architecture
[Diagram: the client process interacts with the CORBA Object Factory thread and the Locator (replica age checker) thread; the Replication Manager process runs the fault detector and fault injector; primary and backup replicas checkpoint state and query LDAP; dummy requests increase replica age.]
★ Locator: decides which replica is more likely to have established its connection; increases a replica's age by periodically sending it a dummy request.
★ CORBA Object Factory: returns the CORBA object reference to the client.

RT-FT-Performance Measurements
[Chart: Phase II response time (msec) over roughly 1,000 trials.]
Contribution factors for fail-over (msec): 91.2, 604, 881.2, 527.6, 262.4 (fault detection, get replica location, reconnection, LDAP query, parse)

Bounded "Real-Time" Fail-Over Measurements
[Bar chart: process time (msec) for fault detection, get replica location, reconnection, LDAP query, and parse, comparing Phase I and Phase II.]
Highlights of RT-FT performance:
• Reconnection time is reduced by 49.5%
• LDAP query time: differs because of the LDAP server location (Phase I: in Korea, Phase II: local)
• Fault detection time: client-side load caused by the background TCP/IP connections
• Parsing time: affected by the variability of the system environment

Performance Strategy
Load balancing based on:
– The number of clients connected to a server
– The CPU load of a server
Test objects:
1. CPU-based load balancing
2. Request-based load balancing
[Diagram: concurrent client worker threads receive load-balanced connection allocations from the Naming Service and load-balancing process and are spread across several CORBA servers, each of which queries LDAP. Legend: load-balancing process, data flow, client process, client worker threads.]
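The fail-over mechanism above (catch COMM_FAILURE, resolve another replica from the Naming Service, re-authenticate, retry) can be sketched on the client side as follows. IJAMSServer and IJAMSServerHelper stand for the idlj-generated stubs of the project's IDL interface, and the replica names and operations (login, search) are assumptions; the real re-authentication uses the user data checkpointed in MySQL on the server side.

```java
import org.omg.CORBA.COMM_FAILURE;
import org.omg.CORBA.ORB;
import org.omg.CosNaming.NamingContextExt;
import org.omg.CosNaming.NamingContextExtHelper;

/**
 * Sketch of the client-side fail-over loop: on COMM_FAILURE the client
 * resolves another replica from the Naming Service, re-authenticates, and
 * retries the operation. IJAMSServer, "ReplicaA"/"ReplicaB", login(), and
 * search() are placeholders for the project's IDL-generated types.
 */
public class FailoverClientSketch {
    private static final String[] REPLICA_NAMES = { "ReplicaA", "ReplicaB" };

    public static String searchWithFailover(ORB orb, String user, String pw, String query)
            throws Exception {
        NamingContextExt naming = NamingContextExtHelper.narrow(
                orb.resolve_initial_references("NameService"));

        for (int i = 0; i < REPLICA_NAMES.length; i++) {
            IJAMSServer server = IJAMSServerHelper.narrow(
                    naming.resolve_str(REPLICA_NAMES[i]));     // look up the next replica
            try {
                server.login(user, pw);                         // re-authentication on the new replica
                return server.search(query);                    // retry the operation
            } catch (COMM_FAILURE e) {
                // This replica is gone: fall through and try the next one.
            }
        }
        throw new Exception("all replicas unreachable");
    }
}
```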
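The Locator (replica age checker) described in the RT-FT architecture can be sketched as a thread that periodically sends a dummy request to each replica and counts successful rounds as that replica's age. The ping() operation and the bookkeeping shown here are assumptions about the implementation, not the project's code.

```java
/**
 * Sketch of the Locator thread: each successful dummy request increments a
 * replica's "age"; the CORBA Object Factory then hands the client the
 * reference of the oldest (longest-connected) replica. IJAMSServer and
 * ping() are placeholders for the project's IDL-generated interface.
 */
public class LocatorSketch extends Thread {
    private final IJAMSServer[] replicas;   // pre-established replica references
    private final int[] age;                // age[i] = successful dummy requests to replica i

    public LocatorSketch(IJAMSServer[] replicas) {
        this.replicas = replicas;
        this.age = new int[replicas.length];
    }

    public void run() {
        while (true) {
            for (int i = 0; i < replicas.length; i++) {
                try {
                    replicas[i].ping();     // dummy request keeps the TCP connection warm
                    age[i]++;               // connection is established and still alive
                } catch (org.omg.CORBA.SystemException e) {
                    age[i] = 0;             // replica unreachable: reset its age
                }
            }
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }

    /** Returns the index of the oldest replica, i.e. the fail-over candidate. */
    public int oldestReplica() {
        int best = 0;
        for (int i = 1; i < age.length; i++) {
            if (age[i] > age[best]) best = i;
        }
        return best;
    }
}
```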
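Request-based load balancing from the Performance Strategy slide can be summarized as: track how many clients are bound to each server and allocate the next connection to the least-loaded one. The sketch below illustrates that selection rule; the server names and bookkeeping are assumptions about how the load-balancing process might be implemented.

```java
/**
 * Sketch of request-based load balancing: the load-balancing process keeps a
 * per-server client count and allocates each new connection to the server
 * with the fewest active clients.
 */
public class RequestBasedBalancerSketch {
    private final String[] serverNames;     // names registered in the Naming Service
    private final int[] activeClients;      // current client count per server

    public RequestBasedBalancerSketch(String[] serverNames) {
        this.serverNames = serverNames;
        this.activeClients = new int[serverNames.length];
    }

    /** Called when a client asks for a server: returns the least-loaded name. */
    public synchronized String allocate() {
        int best = 0;
        for (int i = 1; i < activeClients.length; i++) {
            if (activeClients[i] < activeClients[best]) best = i;
        }
        activeClients[best]++;
        return serverNames[best];
    }

    /** Called when a client disconnects from the given server. */
    public synchronized void release(String serverName) {
        for (int i = 0; i < serverNames.length; i++) {
            if (serverNames[i].equals(serverName)) { activeClients[i]--; return; }
        }
    }
}
```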
Performance Measurements
[Charts: average total response time (msec) versus number of client threads for a single server, two servers, and three servers; a scalability-trend chart versus the number of servers with a linear fit y = 4.7571x, R² = 0.9947.]

Superimposed Performance Data
[Chart: data for 1, 2, and 3 servers; average response time (msec) versus number of client threads, superimposed for one server, two servers (CPU-based and request-based), and three servers (CPU-based and request-based).]

Insights from Measurements
Fault tolerance
– The TCP/IP connection is the most significant factor in fault-recovery latency
RT-FT
– Background connection pre-establishment helps to reduce fail-over time
Performance
– The thread pool and the bottleneck tier must be considered

Open Issues
Issues
– What is the exact source of the bottleneck? We suspect the back-end database.
Additional features if we had more time
– FT
  • Replication of sacred services such as the Replication Manager or the Naming Service
  • Active replication
– RT-FT
  • Checkpointing the LDAP connection object to save re-authentication time
  • Optimization of the CORBA server
  • CORBA persistent references to reduce the load on the Naming Service
– Performance
  • Reducing the delay of checking CPU load through the "ssh" command (a probe sketch follows the Conclusions slide)

Conclusions
Accomplishments
– FT: passive replication strategy and selection criteria
– RT-FT: background connection pre-establishment strategy
– Performance: load-balancing strategy (CPU load and number of connections)
– RT and performance analysis
Lessons learned
– Identifying the exact points of the bottleneck
– Checkpointing strategy
Considerations if we restarted afresh
– Set the replication point after a complete performance analysis of each tier
– Analyze the WAN vs. LAN impact
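CPU-based load balancing, and the open issue about the delay of checking CPU load over "ssh", can be illustrated with a small probe that runs uptime on a cluster node and parses the 1-minute load average. Passwordless ssh to the node and the parsing details are assumptions; the host name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
 * Sketch of a CPU-load probe for CPU-based load balancing: run "ssh <host>
 * uptime" and parse the 1-minute load average. The ssh round trip is the
 * delay the Open Issues slide would like to reduce.
 */
public class CpuLoadProbeSketch {
    public static double loadAverage(String host) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "ssh", host, "uptime" });
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line = in.readLine();            // e.g. "... load average: 0.42, 0.30, 0.25"
        in.close();
        p.waitFor();
        if (line == null) return Double.MAX_VALUE;       // unreachable host: treat as fully loaded
        int idx = line.indexOf("load average:");
        if (idx < 0) return Double.MAX_VALUE;
        String rest = line.substring(idx + "load average:".length()).trim();
        String first = rest.split(",")[0].trim();        // 1-minute load average
        return Double.parseDouble(first);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadAverage("cluster-node-1"));   // hypothetical node name
    }
}
```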