Scalability Test Results for Phase 1 SSS Resource Management Suite

Target Environment:

For scalability testing, the following target hardware, interconnect, and environment are envisioned for the first half of 2006:

Test Architecture: x86, RH 7.3
Workstation:       10 GHz processor, 10 Gbit/s network (comparable latency), 16 GB RAM, 1 TB disk
Compute Node:      10 GHz processor, 10 Gbit/s network (comparable latency), 16 GB RAM, 1 TB disk
Supercomputer:     4096 nodes, 32768 processors
Batch Environment: 1024 active jobs, 1024 idle jobs, 1024 users

Actual Test Environment:

The tests were performed on the ORNL test cluster, with Xtorcsss.csm.ornl.gov as the head node and 64 compute nodes named node1 through node64. The head node had 1 GB RAM, a 100 GB disk, a single 1.7 GHz processor, and a Gigabit interconnect. The compute nodes had 750 MB RAM, 36 GB disks, a single 2 GHz processor each, and a Gigabit interconnect. Testing started 29 JAN 2003 and ended 26 FEB 2003.

Three tiers of testing were performed: component-level tests, simulation tests, and system tests.

Component Tests: Accounting and Allocation Manager

For both QBank and Gold, the tests measured the time taken to respond to 1000 requests each for withdrawals, reservations, quotations (QBank only), balance checks, and default account lookups. QBank tests were performed by running the appropriate QBank perl clients serially in a for loop. Gold tests were performed by generating a Gold command script and invoking the Gold client with stdin taken from that script; this was done to pay the Java startup overhead in the client only once, making the timings more representative of what will likely occur over the SSS wire-level interface.
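As a rough illustration, the harness looked something like the following sketch. The qwithdraw options and the Gold request syntax shown here are placeholders rather than documented interfaces; the real arguments depend on the local QBank and Gold configuration.

    # Time 1000 serial withdrawals through the QBank perl client
    # (the -u/-a options are illustrative placeholders).
    time for i in `seq 1 1000`; do
        qwithdraw -u testuser -a 1 > /dev/null
    done

    # For Gold, write all 1000 requests to a script first, then run the
    # Java client once with stdin redirected from it, so the JVM startup
    # cost is paid once rather than per request.
    for i in `seq 1 1000`; do
        echo "charge job$i 1"    # hypothetical Gold request syntax
    done > gold.script
    time gold < gold.script > /dev/null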
Test                       Variation                            Component  Result (seconds)  Interpretation
1000 Withdrawals           local db, qwithdraw                  QBank      138               .138 s/withdrawal
1000 Withdrawals           local db, qbank (make_withdrawal)    QBank      108               .108 s/withdrawal
1000 Withdrawals           local db, gold (job charge)          Gold       124.6667          .125 s/withdrawal
1000 Withdrawals           remote db, qwithdraw                 QBank      143               .143 s/withdrawal
1000 Reservations          local db, qmkres                     QBank      145               .145 s/reservation
1000 Reservations          local db, qbank (make_reservation)   QBank      119               .119 s/reservation
1000 Reservations          local db, gold (job reserve)         Gold       97.66667          .098 s/reservation
1000 Quotations            local db, qmkquote                   QBank      131               .131 s/quotation
1000 Balance Checks        local db, local client, qbalance     QBank      111               .111 s/balance
1000 Balance Checks        local db, remote client, qbalance    QBank      93                .093 s/balance
1000 Balance Checks        local db, gold (allocation balance)  Gold       110               .110 s/balance
1000 Default Acct Lookups  local db, qbank get_users            QBank      85                .085 s/lookup
1000 Default Acct Lookups  local db, gold (user query)          Gold       30.33333          .030 s/lookup

As the results show, times were on the order of 1/8th of a second per call. The times for QBank (written in Perl) and Gold (written in Java) were quite comparable, with the Gold commands generally running a little faster. On the QBank side, a speedup was seen when the client had less parsing to do and the lower-level subroutines were invoked directly via the qbank client command. Withdrawals and reservations take the longest, since they perform many more database transactions and require greater synchronization. Using a remote database increased the time by about .005 seconds per call (143 versus 138 seconds per 1000 withdrawals), a small difference considering how many database transactions occur during a single withdrawal. Issuing the qbank clients from a host remote from the server unexpectedly yielded quicker times, perhaps because CPU activity was balanced across the two hosts; in any case, it showed that the performance impact of a remote client is not large.

Note that these results were obtained on a fresh install of QBank and Gold. The backend databases were vacuumed between each test run in order to reclaim storage; I discovered that when this was not done, the timings increased incrementally each time I ran the tests.

Simulation Tests: Accounting and Allocation Manager

Tests were carried out in Maui's simulation mode to measure the effect of using the allocation manager function within the resource management system. In effect, an endless supply of 1-second jobs was submitted, and the number of jobs completing in the first 5 minutes was counted, both for Maui alone and for Maui with QBank providing accounting and allocation management. With only Maui running, jobs were scheduled and completed at a rate of about 18 jobs per second. When allocation management was added, the rate dropped to about 6 jobs per second; this includes a reservation and a withdrawal interaction for each job. Using the newly built-in interface statistical diagnosis capabilities in Maui, we measured that the average allocation management connection took 0.11 seconds for QBank and 0.135 seconds for Gold.

Test                                 Variation                    Components  Result
Jobs scheduled in 5 minutes          4096x8 nodes, 5500 jobs      maui        5375 jobs
Jobs scheduled in 5 minutes          4096x8 nodes, 5500 jobs      maui,qbank  1790 jobs
Allocation manager transaction time  64x1 nodes, 205 jobs, qbank  maui,qbank  0.11 seconds
Allocation manager transaction time  64x1 nodes, 205 jobs, gold   maui,gold   0.135 seconds
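For reference, a simulation run of this kind is driven by a maui.cfg fragment along the following lines. This is a sketch based on Maui's documented simulation parameters; the trace file paths are placeholders, with the resource trace defining the simulated 4096x8 node pool and the workload trace supplying the stream of 1-second jobs.

    # maui.cfg fragment -- simulation mode (paths are placeholders)
    SERVERMODE            SIMULATION
    SIMRESOURCETRACEFILE  traces/resource.trace   # simulated node pool
    SIMWORKLOADTRACEFILE  traces/workload.trace   # stream of 1-second jobs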
System Tests: Accounting and Allocation Manager

In these system tests, real jobs ran on a cluster of 64 nodes. The jobs were simple hostname jobs designed to finish immediately, in order to expose the overhead associated with scheduling, accounting, and resource management under various scenarios and component combinations. The metric was how long the test took to run, compared across the various configurations. Two tests were conducted. In the first, 1000 single-node /bin/hostname jobs were submitted in a for loop to the 64-node cluster. In the second, 5 64-way jobs were submitted in a for loop to the same cluster. The first test allowed some parallelism to occur, while the second necessarily ran the jobs serially and measured the scheduling overhead as well as the predominant overhead of initializing and cleaning up a parallel communication environment using mpiexec from mpich2. Both submission loops are sketched below.
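A minimal sketch of the two submission loops, assuming stock OpenPBS qsub behavior (a job script read from stdin or given as an argument); job64.sh is a hypothetical wrapper for the MPI case.

    # Test 1: 1000 single-node /bin/hostname jobs submitted in a loop
    for i in `seq 1 1000`; do
        echo /bin/hostname | qsub
    done

    # Test 2: 5 64-way jobs, each running mpiexec from mpich2.
    # job64.sh (hypothetical wrapper script):
    #   #PBS -l nodes=64
    #   mpiexec -n 64 hostname
    for i in `seq 1 5`; do
        qsub job64.sh
    done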
Test                      Variation                     Components                       Result (seconds)
1000 1-way hostname jobs  /bin/hostname                 openpbs-oscar,pbs_sched          174.33
1000 1-way hostname jobs  /bin/hostname                 openpbs-oscar,maui               168.2
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,pbs_sched            181.67
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui                 164.33
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,pbs_xml_server  1024
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,qbank           165
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,gold            230
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS                 1044
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS,nmd             --
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS,nmd,gold        --
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  openpbs-oscar,pbs_sched          38
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  openpbs-oscar,maui               42
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,pbs_sched            39
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui                 41.75
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,pbs_xml_server  165.33
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,qbank           42.667
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,gold            44.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS                 79.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS,nmd             79.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS,nmd,gold        --

For the 1000 1-way jobs: there was considerable variability in the PBS timings, because PBS would sometimes hang or jobs would linger in the Exiting state for a period. The times recorded are strictly for runs that did not exhibit these problems, so that the comparison measures overhead. See the Excel document for more detailed notes. When Maui was used as the scheduler, jobs were scheduled more efficiently than with the PBS FIFO scheduler. Using the patched version of PBS (OpenPBS_sss) did not significantly improve performance at this stage. Using the pbs_xml_server to front-end Maui, so that PBS could speak the SSS wire-level protocol, suffered severely in this test: with the front end in place, Maui was not able to take advantage of the events from PBS, since this is not yet supported at this stage of the design (though it will be later). For a more typical job load with longer-running jobs this effect, essentially 1 added second per job, will be negligible. Adding QBank accounting and allocation management had no noticeable effect, while Gold for some reason added about .065 seconds per job ((230 - 165) / 1000 jobs), which is still quite small. Using the new Queue Management system brought us back to about 1 second per job; we later found a leftover sleep 1 in Maui which, once removed, might bring these timings significantly closer to the others. There is certainly a lot of improvement to be achieved on all fronts. In general, very poor parallelization was achieved: at most we saw 4-6 jobs running simultaneously on the 64-node cluster, and in the cases without events (and with the sleep), only 1.

For the 5 64-way jobs: this scenario was quite different because each job spanned many more nodes, jobs could only run one at a time, and, because MPI was used, there was substantial startup and cleanup time per job. Here the performance winner between Maui and pbs_sched was reversed: the PBS FIFO scheduler was faster. I attribute this to Maui requiring more CPU time for fairness and packing algorithm optimizations that could not yield improvements in this simple case. Again, the patched PBS did not appear to provide any performance gain. Inserting the pbs_xml_server introduced severe degradation, which will have to be investigated and corrected. Adding allocation accounting added about half a second per job, with QBank the better performer. Changing the system to use the prototype QM and PM components sharply reduced performance (partly related to the absence of events), though not as badly as the pbs_xml_server did. Including the cluster monitor (nmd) did not significantly affect the timings, but this is largely because the timings were already heavily skewed by other, as yet unidentified, factors.

These tests give us a starting point from which improvements can be made and enhancements can be compared, and were a very worthwhile exercise.