Optimizing Virtual CPU Configurations for
Multithreading in VMware ESXi
Arseniy Avery
Evan Hutto
Joshua Kim
ABSTRACT – We propose methods for
optimizing virtual CPU layouts in a VMware
ESXi environment for running multithreaded
workloads.
I. INTRODUCTION
VMware ESXi is a bare-metal hypervisor
that facilitates running multiple virtual servers on one
physical machine. The VMware environment can
present the CPU in a configuration that either matches
the physical hardware specifications or in a virtual
layout where many combinations are possible. On a
server with one octa-core processor (that is, a server
with one physical CPU containing eight processing
cores), users can set up a virtual machine that
virtualizes the CPU in configurations ranging from
one socket with eight cores to eight sockets with a
single core each, or any configuration in between.
This range of socket configurations raises the question
of which configuration is optimal for a given workload.
II. TESTING ENVIRONMENT
Our testing was done on a server with an
Intel Xeon E3-1245 CPU with 8 processing cores and
32 GB of memory. We used VMware ESXi 6.0 as the
bare-metal hypervisor on the machine, and there were
a total of four CentOS virtual machines with identical
configurations on the server, with the exception of
the CPU configurations, which are as follows:
NAME    SOCKETS    CORES/SOCKET
VM1     1          8
VM2     2          4
VM3     4          2
VM4     8          1
Our test program was written in C++ and
computed the prime numbers from 1 to 80 million.
The program used the Miller-Rabin primality test,
which can determine the primality of a number
without prior knowledge of previously computed
primes. While this is not the fastest way to compute
prime numbers, we chose this method because the
workload could be more easily distributed between
threads.
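The paper does not reproduce the primality routine itself; a minimal sketch of a deterministic Miller-Rabin check, which is exact for the tested range (witnesses 2, 3, 5, and 7 cover all n below roughly 3.2 billion), might look like the following. The `__uint128_t` intermediate is a GCC/Clang extension used here only to keep the modular multiplication overflow-free; the function names are ours.

```cpp
#include <cstdint>
#include <initializer_list>

// (base^exp) mod m, with a 128-bit intermediate to avoid overflow.
static uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = (__uint128_t)result * base % m;
        base = (__uint128_t)base * base % m;
        exp >>= 1;
    }
    return result;
}

// Deterministic Miller-Rabin for n < 3,215,031,751 using witnesses 2, 3, 5, 7.
bool is_prime(uint64_t n) {
    if (n < 2) return false;
    for (uint64_t p : {2ULL, 3ULL, 5ULL, 7ULL}) {
        if (n == p) return true;
        if (n % p == 0) return false;
    }
    uint64_t d = n - 1;                 // write n - 1 = d * 2^r with d odd
    int r = 0;
    while ((d & 1) == 0) { d >>= 1; ++r; }
    for (uint64_t a : {2ULL, 3ULL, 5ULL, 7ULL}) {
        uint64_t x = pow_mod(a, d, n);
        if (x == 1 || x == n - 1) continue;
        bool composite = true;
        for (int i = 1; i < r; ++i) {
            x = (__uint128_t)x * x % n;
            if (x == n - 1) { composite = false; break; }
        }
        if (composite) return false;
    }
    return true;
}
```

Because each call depends only on its own argument, such a test needs no shared table of earlier primes, which is what makes the workload easy to split across threads.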
Because of the way virtual servers are often
implemented (many virtual servers running
simultaneously on one physical machine), we ran
three types of tests. The first test ran two virtual
machines at a time, running the same workload with
the same start times. The second test shut off all but
one virtual machine and ran the workload on it
individually. Finally, the last test ran all four virtual
machines at the same time with identical start times.
In addition, we tested two types of applications: one
with multithreading enabled, the other with
multithreading disabled.
III. APPLICATION DESIGN
Our application was designed with speed in
mind, and a design was therefore implemented with
the goal of eliminating the need for semaphores. The
program consisted of eight CPU-specific arrays,
each with a size of ten million, and an individual
method for every array that would loop over the
array, calling a function to determine the primality of
each number. If a number in an array was determined
to be prime, it would be added to the CPU-specific
array storing that thread's results. At the end of
computation, the total number of primes would be
tallied from the individual arrays and reported to the user.
The program ran in two sections: an
overhead section, which performed all the tasks
outside of the actual computation, such as generating
the numbers, adding them to the correct array, and
performing the post-computation work of summing
the prime counts of the individual arrays; and a
computational section, where each thread was
assigned an array and the prime numbers in that
array were computed by the CPU.
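The authors' source is not included; a minimal sketch of the lock-free structure described above, with per-thread result vectors standing in for the eight CPU-specific arrays, could look like this (names and the trial-division stand-in for the primality check are ours):

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in primality check; the paper's program used Miller-Rabin instead.
bool is_prime_trial(uint64_t n) {
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Each thread scans its own input slice and appends primes to its own result
// vector, so no semaphores or mutexes are needed.
std::size_t count_primes(uint64_t limit, unsigned num_threads) {
    std::vector<std::vector<uint64_t>> results(num_threads); // one array per thread
    std::vector<std::thread> workers;
    uint64_t chunk = limit / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        uint64_t lo = t * chunk + 1;
        uint64_t hi = (t == num_threads - 1) ? limit : (t + 1) * chunk;
        workers.emplace_back([lo, hi, &results, t] {
            for (uint64_t n = lo; n <= hi; ++n)
                if (is_prime_trial(n)) results[t].push_back(n); // thread-local write
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;          // post-computation: sum the per-array counts
    for (const auto& r : results) total += r.size();
    return total;
}
```

Because every thread writes only to its own vector, the workers never contend for shared state, which is the synchronization-free property the design above aims for.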
To give us perspective on the benefit of
multithreading such a workload, we also built a very
similar program that ran without multithreading;
instead of computing from eight arrays, the program
used one central array for computation. It should be
noted that the computation in this program was
identical to that of the former, multithreaded
program.
All tests were timed in two ways: one
measurement was based on CPU time, and the other
was measured in real time, with an external script
reporting the start and end times of the test program.
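The distinction between the two measurements can be sketched in standard C++: `std::clock` accumulates CPU time across all of a process's threads, while `std::chrono::steady_clock` measures elapsed wall-clock time, which is what the external script observed (the helper name is ours):

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Runs f() and returns {cpu_seconds, wall_seconds}. CPU time is summed over
// all threads, so for a multithreaded workload it can exceed wall time.
template <typename F>
std::pair<double, double> time_both(F f) {
    std::clock_t c0 = std::clock();               // process CPU time
    auto w0 = std::chrono::steady_clock::now();   // wall-clock time
    f();
    double cpu = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double wall = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - w0).count();
    return {cpu, wall};
}
```

This also previews the findings below: a multithreaded run can show a larger CPU time yet a smaller wall time than a single-threaded run of the same computation.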
IV. FINDINGS
Beginning with CPU times, during the first
test (running two virtual machines), the fastest
computation was by the virtual machine with a
virtual CPU configuration of eight sockets and one
core per socket*. The rest ranked by the number of
sockets they had relative to processing cores (four
sockets with two cores, then two sockets with four
cores, then one socket with eight cores)*. In the
subsequent tests, running only one virtual machine at
a time and running all four concurrently, the same
results were achieved in terms of which virtual CPU
configuration ran the fastest**.
One interesting caveat was that when the
same tests were run with the non-multithreaded
application substituted for the multithreaded one,
CPU times dropped drastically. In the case of the
virtual CPU configuration of eight sockets and one
core, CPU time was cut in half when running the
non-multithreaded application, and the CPU time
seemed to favor virtual CPU configurations at either
extreme: eight sockets with one core per socket, or
the opposite, one socket with eight cores per socket.
This was true for all tests except the last one, where
all four virtual machines ran side by side; in that case
the two fastest CPU configurations were eight
sockets with one core per socket and four sockets
with two cores per socket.
When testing in real time (having an
external program keep track of run times), our test
results yielded information* that could not be used to
determine the optimal CPU socket configuration, due
to precision discrepancies. However, this information
was not useless. In our testing we found a large
disparity between the multithreaded program and its
non-multithreaded counterpart: with multithreading,
the program was able to process the prime numbers
substantially faster, in some cases upwards of a
minute quicker than the non-multithreaded
counterpart*. While the data was not as precise as
the CPU time measurements, the time difference was
large enough that we could safely assume the
discrepancy did not affect the outcome.
V. CONCLUSION
From our findings we can conclude that
configuring virtual CPUs with more virtual sockets
and fewer cores per socket will result in better
CPU time than arranging virtual CPUs with
fewer sockets and more cores per socket. This finding
does come with some caveats, however.
While we attempted to be very
comprehensive in our testing, there are still some
unanswered questions and more testing that can be
done on this topic. One in particular is the correlation
between CPU time and actual time. We found that
the single-threaded application performed better in
CPU time but suffered in actual time. With more
research, we would like to establish a correlation
between the two and explain the discrepancy.
In our testing there were several variables
that we think might have affected the testing process.
One aspect was our virtual machine configuration:
while the machines were identical in configuration,
we could have done more to configure those virtual
machines to run better in the ESXi environment.
Another aspect was the physical environment: we
noticed that the room was heating up due to the heat
put out by the physical server, and although unlikely,
there is a possibility that performance was impacted
towards the end of our testing by this heat. In future
testing, these are two areas we think we should
address more carefully.
Additionally, we would like to expand our
research in the future to include other processor
configurations, testing in non-virtualized
environments, other processor architectures, and
running the tests again with different types of
workloads.
*Detailed time information can be found on pages 3-6.
**They ran in the same order as the first test in terms of speed; actual times varied.