Optimizing Virtual CPU Configurations for Multithreading in VMware ESXi

Arseniy Avery, Evan Hutto, Joshua Kim

ABSTRACT – We propose methods for optimizing virtual CPU layouts in a VMware ESXi environment for running multithreaded workloads.

I. INTRODUCTION

VMware ESXi is a bare-metal hypervisor that facilitates running multiple virtual servers on one physical machine. The VMware environment can present the CPU in a configuration that either matches the physical hardware specification or in a virtual layout where many combinations are possible. On a server with one octa-core processor (that is, a server that has one physical CPU with eight processing cores), users can set up a virtual machine that virtualizes the CPU in configurations ranging from one socket with eight cores to eight sockets with a single core each, or any configuration in between. This range of socket configurations poses the question of which configuration is optimal for a given workload.

II. TESTING ENVIRONMENT

Our testing was done on a server with an Intel Xeon E3-1245 CPU with 8 processing cores and 32 GB of memory. We used VMware ESXi 6.0 as the bare-metal hypervisor, and the server hosted a total of four CentOS virtual machines with identical configurations, with the exception of their CPU configurations, which are as follows:

NAME   SOCKETS   CORES/SOCKET
VM1    1         8
VM2    2         4
VM3    4         2
VM4    8         1

Our test program was written in C++ and computed the prime numbers from 1 to 80 million. The program used the Miller-Rabin primality test, which can determine the primality of a number without prior knowledge of primes computed beforehand. While this is not the fastest way to compute prime numbers, we chose this method because the workload could be more easily distributed between threads.

Because of the way virtual servers are often implemented (many virtual servers running simultaneously on one physical machine), we ran three types of tests. The first test ran two virtual machines at a time, running the same workload with the same start times. The second test shut off all but one virtual machine and ran the workload individually. Finally, the last test ran all four virtual machines at the same time with identical start times. In addition, we tested two types of applications: one with multithreading enabled, the other with multithreading disabled.

III. APPLICATION DESIGN

Our application was designed with speed in mind, so we implemented a design with the goal of eliminating the need for semaphores. The program consisted of eight CPU-specific arrays, each of size ten million, with an individual method for every array that called a function to determine the primality of each number in the array in a loop. If a number in an array was determined to be prime, it was added to the CPU-specific array that stored results for that thread. At the end of computation, the total number of primes was tallied from the individual arrays and reported to the user. The program ran in two sections: an overhead section, which performed all tasks outside of actual computation, such as generating the numbers, adding them to the correct arrays, and performing post-computation work by summing the prime counts of the individual arrays; and a computational section, where each thread was assigned an array and the prime numbers for that array were computed by the CPU. A sketch of this design follows.
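The listing below is a minimal sketch of this design, not our original source: it splits 1 to 80 million into eight fixed ranges of ten million, gives each thread its own result vector so that no semaphores or locks are needed, and tests primality with a deterministic Miller-Rabin check (bases 2, 3, 5, and 7 are sufficient for all values below 3,215,031,751, which covers our range). All names are illustrative.

// build: g++ -std=c++11 -O2 -pthread primes.cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Deterministic Miller-Rabin for n < 3,215,031,751 using bases 2, 3, 5, 7.
// Values in our range (<= 80 million) fit comfortably in 64-bit arithmetic.
bool is_prime(uint64_t n) {
    if (n < 2) return false;
    for (uint64_t p : {2ULL, 3ULL, 5ULL, 7ULL})
        if (n % p == 0) return n == p;
    uint64_t d = n - 1;
    int r = 0;
    while (d % 2 == 0) { d /= 2; ++r; }
    for (uint64_t a : {2ULL, 3ULL, 5ULL, 7ULL}) {
        // Compute a^d mod n by square-and-multiply.
        uint64_t x = 1, base = a, e = d;
        while (e > 0) {
            if (e & 1) x = x * base % n;
            base = base * base % n;
            e >>= 1;
        }
        if (x == 1 || x == n - 1) continue;
        bool composite = true;
        for (int i = 0; i < r - 1; ++i) {
            x = x * x % n;
            if (x == n - 1) { composite = false; break; }
        }
        if (composite) return false;
    }
    return true;
}

int main() {
    const uint64_t kLimit = 80000000;   // primes from 1 to 80 million
    const int kThreads = 8;             // one range of ten million per thread
    const uint64_t kChunk = kLimit / kThreads;

    // Each thread writes only to its own result vector, so no locking is needed.
    std::vector<std::vector<uint64_t>> primes(kThreads);
    std::vector<std::thread> workers;

    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            uint64_t lo = t * kChunk + 1, hi = (t + 1) * kChunk;
            for (uint64_t n = lo; n <= hi; ++n)
                if (is_prime(n)) primes[t].push_back(n);
        });
    }
    for (auto& w : workers) w.join();

    // Post-computation overhead: sum the per-thread counts.
    std::size_t total = 0;
    for (const auto& v : primes) total += v.size();
    std::cout << "Primes up to " << kLimit << ": " << total << "\n";
}

Because each thread writes only to its own vector, the computational section runs without any synchronization; the only serial work is the overhead of setting up the ranges and summing the per-thread counts at the end.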
To give us perspective on the benefit of multithreading such a workload, we also built a very similar program that ran without multithreading; instead of computing from eight arrays, the program used one central array for computation. It should be noted that the computation in this program was identical to that of the multithreaded program. All tests were timed in two ways: one based on CPU time and the other in real time, with an external script reporting the start and end time of each run.

IV. FINDINGS

Beginning with CPU times, during the first test (running two virtual machines), the fastest computation was by the virtual machine with a virtual CPU configuration of eight sockets and one core per socket*. The remaining machines ranked in order of sockets versus cores per socket (four sockets with two cores, then two sockets with four cores, then one socket with eight cores)*. In the subsequent tests, running only one virtual machine at a time and running all four concurrently, the same results were achieved in terms of which virtual CPU configuration ran the fastest**.

One interesting caveat was that when the same tests were run with the non-multithreaded application substituted for the multithreaded one, CPU times dropped drastically; in the case of the eight-socket, one-core configuration, CPU time was cut in half when running the non-multithreaded application. CPU time also favored the configurations at either extreme: eight sockets with one core per socket, or the opposite, one socket with eight cores per socket. This was true for all tests except the last one, where all four virtual machines were running side by side; in that case the two fastest CPU configurations were eight sockets with one core per socket and four sockets with two cores per socket.

When testing in real time (having an external program keep track of run times), our test results* could not be used to determine the optimal socket configuration due to precision discrepancies. However, this information was not useless. In our testing we found a large disparity between the multithreaded program and its non-multithreaded counterpart: with multithreading, the program processed the prime numbers dramatically faster, in some cases upwards of a minute quicker than the non-multithreaded counterpart*. While this data was not as precise as the CPU times, the time difference was large enough that we can safely assume the imprecision did not affect the outcome.

V. CONCLUSION

From our findings we conclude that configuring virtual CPUs with more virtual sockets and fewer cores per socket results in better CPU times than arranging virtual CPUs with fewer sockets and more cores per socket. This finding does come with some caveats, however. While we attempted to be comprehensive in our testing, some questions remain unanswered, and more testing can be done on this topic. One in particular is the correlation between CPU time and actual time: we found that the single-threaded application performed better in CPU time but suffered in actual time. With more research we would like to correlate the two and explain the discrepancy; the sketch below illustrates the distinction between the two measurements.
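To make the distinction concrete, the following is a minimal sketch (assuming a POSIX system; this is not our actual test harness, and run_workload is a stand-in for the prime computation): std::clock accumulates processor time across every thread of the process, while std::chrono::steady_clock measures elapsed wall-clock time, which is what an external script observes.

#include <chrono>
#include <ctime>
#include <iostream>

// Stand-in for the prime-computation workload; illustrative only.
void run_workload() {
    volatile unsigned long long sink = 0;
    for (unsigned long long i = 0; i < 100000000ULL; ++i) sink = sink + i;
}

int main() {
    std::clock_t cpu_start = std::clock();               // process CPU time
    auto wall_start = std::chrono::steady_clock::now();  // real (wall-clock) time

    run_workload();

    double cpu_secs = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    double wall_secs = std::chrono::duration<double>(
                           std::chrono::steady_clock::now() - wall_start)
                           .count();

    // On POSIX systems, std::clock sums CPU time across all threads of the
    // process, so a multithreaded run can report more CPU seconds than
    // wall-clock seconds even though it finishes sooner in real time.
    std::cout << "CPU time: " << cpu_secs << " s, "
              << "wall time: " << wall_secs << " s\n";
}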
In our testing there were several variables that we think might have affected the testing process. One aspect was our virtual machine configuration: while the machines were identical in configuration, we could have done more to tune them to run better in the ESXi environment. Another aspect was the physical environment: we noticed that the room was heating up from the heat put out by the physical server, and although unlikely, there is a possibility that performance was impacted towards the end of our testing by the heat the machine produced. In future testing these are two areas we would address more carefully. Additionally, we would like to expand our research in the future to include other processor configurations, testing in non-virtualized environments, using other processor architectures, and running the tests again with different types of workloads.

*Detailed time information can be found on pages 3-6.
**They ran in the same order as the first test in terms of speed; actual times varied.