Scalability Test Results: SSS Resource Management Suite, Phase 1

Target Environment:
For scalability testing, the following target hardware, interconnect, and environment are envisioned by the first half of 2006:

Test architecture: x86, RH 7.3
Workstation: 10 GHz processor, 10 Gbit/s network (comparable latency), 16 GB RAM, 1 TB disk
Compute node: 10 GHz processor, 10 Gbit/s network (comparable latency), 16 GB RAM, 1 TB disk
Supercomputer: 4096 nodes, 32768 processors
Batch environment: 1024 active jobs, 1024 idle jobs, 1024 users
Actual Test Environment:
The tests were performed on the ORNL test cluster, with Xtorcsss.csm.ornl.gov as the head node and 64 compute nodes named node1 through node64.

Head node: 1 GB RAM, 100 GB disk, single 1.7 GHz processor, Gigabit interconnect
Compute nodes: 750 MB RAM, 36 GB disk, single 2 GHz processor, Gigabit interconnect

Testing started 29 JAN 2003 and ended 26 FEB 2003.
Three tiers of testing were performed: component-level tests, simulation tests, and system tests.
Component Tests:
Accounting and Allocation Manager:
For both QBank and Gold, the tests measured the time it took to respond to 1000 requests each for withdrawals, reservations, quotations (QBank only), balance checks, and default account lookups. QBank tests were performed by running the appropriate qbank Perl clients serially in a for loop. Gold tests were performed by generating a gold command script and invoking the gold client with stdin taken from the script; this was done to reduce the initial Java startup overhead in the client, making it more representative of the timings that will likely occur over the SSS wire-level interface.
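
To make the measurement method concrete, the sketch below times n back-to-back client invocations. It is an illustration only (the actual runs used shell for loops and the qbank Perl clients, not Python), and the qwithdraw arguments shown are hypothetical placeholders, since the exact flags used in these runs are not recorded here.

    import subprocess
    import time

    def time_serial_calls(cmd, n=1000):
        """Run a client command n times back-to-back; return total elapsed seconds."""
        start = time.time()
        for _ in range(n):
            # Each invocation pays the full client startup cost, as in the QBank runs.
            subprocess.run(cmd, check=True, capture_output=True)
        return time.time() - start

    # Hypothetical arguments; the exact flags used in these runs are not recorded here.
    elapsed = time_serial_calls(["qwithdraw", "-u", "user1", "-p", "project1", "100"])
    print("%.1f s total, %.3f s per call" % (elapsed, elapsed / 1000))

For the Gold runs, the analogous step was to write all 1000 commands into a single script and pipe it through one gold client invocation, so that the Java startup cost is paid only once.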
Test                       Variation                             Component  Result (seconds)  Interpretation
1000 Withdrawals           local db, qwithdraw                   QBank      138               .138 s/withdrawal
1000 Withdrawals           local db, qbank (make_withdrawal)     QBank      108               .108 s/withdrawal
1000 Withdrawals           local db, gold (job charge)           Gold       124.6667          .125 s/withdrawal
1000 Withdrawals           remote db, qwithdraw                  QBank      143               .143 s/withdrawal
1000 Reservations          local db, qmkres                      QBank      145               .145 s/reservation
1000 Reservations          local db, qbank (make_reservation)    QBank      119               .119 s/reservation
1000 Reservations          local db, gold (job reserve)          Gold       97.66667          .098 s/reservation
1000 Quotations            local db, qmkquote                    QBank      131               .131 s/quotation
1000 Balance Checks        local db, local client, qbalance      QBank      111               .111 s/balance
1000 Balance Checks        local db, remote client, qbalance     QBank      93                .093 s/balance
1000 Balance Checks        local db, gold (allocation balance)   Gold       110               .110 s/balance
1000 Default Acct Lookups  local db, qbank get_users             QBank      85                .085 s/lookup
1000 Default Acct Lookups  local db, gold (user query)           Gold       30.33333          .030 s/lookup
As the results show, times were on the order of one eighth of a second per call. The times for QBank (written in Perl) and Gold (written in Java) were quite comparable, with Gold commands generally running a little faster. On the QBank side, a speedup was seen when the client had less parsing to do and the lower-level subroutines were invoked directly via the qbank client command. Withdrawals and reservations take the longest, since they must perform many more database transactions and require greater synchronization. Using a remote database was shown to increase the time by about 0.02 seconds per transaction, a small difference considering all of the transactions that occur during a simple withdrawal. Issuing the qbank clients from a host remote from the server unexpectedly yielded quicker times, perhaps because it balanced CPU activity between the hosts; regardless, it showed that the performance impact of a remote client is not large.
Note that these results were obtained on a fresh install of QBank and Gold. The backend databases were vacuumed frequently (between each test run) in order to reclaim storage; I found that if this was not done, the timings would increase incrementally with each test run.
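
The report does not name the backend database, but vacuuming is PostgreSQL's storage-reclamation step, so the between-run cleanup may have looked roughly like the sketch below (database names are hypothetical):

    import subprocess

    # Reclaim dead-row storage between test runs so timings stay comparable.
    # Database names are hypothetical; the actual backend is not named in this report.
    for db in ("qbank", "gold"):
        subprocess.run(["psql", "-d", db, "-c", "VACUUM;"], check=True)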
Simulation Tests:
Accounting and Allocation Manager
Tests were carried out in Maui's simulation mode to measure the effect of using the allocation manager function within the resource management system. In effect, an endless supply of 1-second jobs was submitted, and the number of jobs completing in the first 5 minutes was counted, both for Maui alone and for the case where QBank was included to provide accounting and allocation management. With only Maui running, jobs were scheduled and completed at a rate of about 18 jobs per second. When allocation management was added, jobs were scheduled and completed at a rate of about 6 jobs per second; this includes a reservation and withdrawal interaction for each job. Using the newly built-in interface statistical diagnosis capabilities in Maui, we measured that the average allocation management connection took 0.11 seconds for QBank and 0.135 seconds for Gold.
Test                                  Variation                    Components  Result         Interpretation
Jobs scheduled in 5 minutes           4096x8 nodes, 5500 jobs      maui        5375 jobs      about 18 jobs/second
Jobs scheduled in 5 minutes           4096x8 nodes, 5500 jobs      maui,qbank  1790 jobs      about 6 jobs/second
Allocation Manager Transaction Times  64x1 nodes, 205 jobs, qbank  maui,qbank  0.11 seconds   .11 s/transaction
Allocation Manager Transaction Times  64x1 nodes, 205 jobs, gold   maui,gold   0.135 seconds  .135 s/transaction
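
The rates in the interpretation column follow directly from dividing jobs completed by the 300-second window:

    # Jobs completed in the 5-minute (300 s) simulation window, from the table above.
    for label, jobs in (("maui alone", 5375), ("maui + qbank", 1790)):
        print("%s: %.1f jobs/second" % (label, jobs / 300))
    # maui alone: 17.9 jobs/second
    # maui + qbank: 6.0 jobs/second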
System Tests:
Accounting and Allocation Manager
In these system tests, real jobs ran on a cluster of 64 nodes. The jobs were simple hostname jobs designed to finish immediately and to expose the overhead associated with scheduling, accounting, and resource management under various scenarios and component combinations. In general, the metric was how long the test took to run, compared across the various configurations.
Two tests were conducted. In the first, 1000 single-node /bin/hostname jobs were submitted in a for loop to the 64-node cluster. In the second, 5 64-way jobs were submitted in a for loop to the same cluster. The first test allowed some parallelism to occur, while the second necessarily ran the jobs serially and measured the scheduling overhead, as well as the predominant overhead of initializing and cleaning up a parallel communication environment using mpiexec from mpich2.
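
The driver scripts themselves are not included in this report; a minimal sketch of the first test, assuming a standard PBS qsub/qstat front end (the submission flags are illustrative, not the recorded ones), would look something like this:

    import subprocess
    import time

    start = time.time()

    # Test 1: 1000 single-node /bin/hostname jobs submitted back-to-back.
    for _ in range(1000):
        subprocess.run(["qsub", "-l", "nodes=1"], input=b"/bin/hostname\n", check=True)

    # Test 2 (for comparison) submitted 5 jobs of the form:
    #   echo "mpiexec -n 64 /bin/hostname" | qsub -l nodes=64

    # Poll until the queue drains; the recorded result is total wall-clock time.
    while subprocess.run(["qstat"], capture_output=True).stdout.strip():
        time.sleep(1)

    print("elapsed: %.2f s" % (time.time() - start))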
Test                      Variation                     Components                       Result (seconds)
1000 1-way hostname jobs  /bin/hostname                 openpbs-oscar,pbs_sched          174.33
1000 1-way hostname jobs  /bin/hostname                 openpbs-oscar,maui               168.2
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,pbs_sched            181.67
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui                 164.33
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,pbs_xml_server  1024
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,qbank           165
1000 1-way hostname jobs  /bin/hostname                 OpenPBS_sss,maui,gold            230
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS                 1044
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS,nmd             -
1000 1-way hostname jobs  /bin/hostname                 maui,QM,PM,EM,DS,nmd,gold        -
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  openpbs-oscar,pbs_sched          38
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  openpbs-oscar,maui               42
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,pbs_sched            39
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui                 41.75
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,pbs_xml_server  165.33
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,qbank           42.667
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  OpenPBS_sss,maui,gold            44.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS                 79.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS,nmd             79.333
5 64-way hostname jobs    mpich2-0.91,mpiexec hostname  maui,QM,PM,EM,DS,nmd,gold        -
For the 1000 1-way jobs:
With the PBS jobs there was considerable variability in the timings, because PBS would hang or jobs would linger in the Exiting state for a period. The times recorded are strictly for the runs that did not exhibit these problems, so that an overhead comparison could be made. See the Excel document for more detailed notes.
When Maui was used as the scheduler, the jobs were scheduled more efficiently than with the PBS FIFO scheduler. Using the patched version of PBS did not significantly improve performance at this stage. Using the pbs_xml_server to front-end Maui, so that PBS could speak the SSS wire-level protocol, suffered severely in this test: with the front end in place, Maui was not able to take advantage of the events from PBS, since this is not yet supported at this stage of the design (though it will be later). For a more regular job load with longer-running jobs the effect will be negligible (it essentially added 1 second per job). The effect of adding QBank accounting and allocation management to the mix was unnoticeable. For some reason, using Gold added about 0.065 seconds per job, which is still quite small. Using the new Queue Management system brought us back to about 1 second per job. We later found a sleep 1 left in Maui that, had it been removed, might have brought these timings significantly closer to the others. There is certainly a lot of improvement that can be achieved on all fronts. In general, very poor parallelization was achieved: at most 4-6 jobs ran simultaneously on the 64-node cluster, and in the cases without events (and with the sleep), only 1.
For the 5 64-way jobs:
This scenario was quite different: each job spanned many more nodes, jobs could only run one at a time, and because MPI was used there was substantial startup and cleanup time per job.
In this test the Maui and pbs_sched rankings were reversed, with the PBS FIFO scheduler performing faster. I attribute this to Maui requiring more CPU time for its fairness and packing optimizations, which in this simple case could not yield improvements. Again, using the patched PBS did not seem to result in any performance gains. Inserting the pbs_xml_server into the mix introduced severe degradation, which will have to be investigated and corrected. Adding allocation accounting added about half a second per job, with QBank being the better performer. Changing the system to use the prototype QM and PM components severely reduced performance (partly related to the absence of events), though not as badly as with the pbs_xml_server. Including the cluster monitor (nmd) in the mix did not significantly affect the timings, but this is largely because the timings were already heavily skewed by other, unknown factors.
These tests give us a starting point from which improvements can be made and enhancements can be compared, and they were a very worthwhile exercise.