HOW ADDING PROCESSORS AND ENABLING HYPER-THREADING AFFECT USER CAPACITY OF METAFRAME XP SERVERS

By John D’Agati and Gagan Singh
Citrix Systems Inc.

Notice

The information in this publication is subject to change without notice.

THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. CITRIX SYSTEMS, INC. (“CITRIX”), SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR DIRECT, INCIDENTAL, CONSEQUENTIAL, OR ANY OTHER DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS PUBLICATION, EVEN IF CITRIX HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES IN ADVANCE.

This publication contains information protected by copyright. Except for internal distribution, no part of this publication may be photocopied or reproduced in any form without prior written consent from Citrix.

The exclusive warranty for any Citrix products discussed in this publication, if any, is stated in the product documentation accompanying such products. Citrix does not warrant products other than its own. Product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

© 2003 Citrix Systems, Inc. All rights reserved. Printed in the U.S.A.

Version History

January 2003    John D’Agati    Revisions
March 2003      John D’Agati    Revisions

Contents

Overview
Citrix ICAMark
Number of CPUs’ Effect on User Capacity
    Performance Monitor Statistics for Single processor system:
    Performance Monitor Statistics for Dual processor system:
    Performance Monitor Statistics for Quad processor system:
Hyper-Threading’s Effect on User Capacity
Dell 2650 Results
    Performance Monitor Statistics for Dell 2650 Hyper-Threading Enabled:
    Performance Monitor Statistics for Dell 2650 Hyper-Threading Disabled:
Dell 6650 Results
    Performance Monitor Statistics for Dell 6650 Hyper-Threading Enabled:
    Performance Monitor Statistics for Dell 6650 Hyper-Threading Disabled:
Summary

Overview

The number of users that a server can support depends on several factors including:

- The MetaFrame server’s hardware specifications
- The applications that are being run (because of the applications’ CPU and memory requirements)
- The amount of user input being processed by the applications
- The maximum desired resource usage on the server, for example, 90% CPU usage or 80% memory usage

This section discusses the increase in user capacity when more CPUs are added, as well as the effect of Hyper-Threading in the processor. First, the Citrix benchmarking test for user capacity, known as ICAMark, is described.

Citrix ICAMark

Citrix ICAMark is an internal tool based on the Citrix Server Test Kit (CSTK). Citrix Engineering uses it for benchmarking, to quantify the optimal number of simulated client sessions that can be connected to a MetaFrame server with acceptable performance. Increasing the number of concurrent simulated users beyond the optimal result decreases performance and may degrade the end-user experience.

The test simulates users constantly typing and performing actions in Microsoft Excel 97, Microsoft Access 97, and Microsoft PowerPoint 97. Other applications can use more or less memory and CPU than Microsoft Office 97 and could therefore produce different results. Note that the simulated users in this test type into these applications constantly at 40 words per minute, so the load may be considered more “rigorous” than that of normal users.

In this test, the step size is 5 users. After the first 5 users are logged in, ICAMark launches simulated user scripts on all 5 sessions. Each script opens Microsoft Excel and simulates the creation of a spreadsheet, including calculations and charts. Once the Excel phase is complete, Excel is closed and Microsoft Access is opened. The script then simulates the creation of an Access database, including a table, query, and form, with data manipulation. Once the Access phase is complete, the script creates a 6-slide Microsoft PowerPoint presentation, including spell checking, font changes, and slide copies and deletions. When a script finishes, it remains idle until the scripts on all sessions are complete. The next iteration is then launched, adding 5 more sessions to the test, and the process begins again.

Based on how long the scripts take to complete, an ICAMark score is calculated. For this test, a score of 80 has been determined as the optimal load for a server.
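
The stepping procedure described above can be sketched as a small calculation. This is a minimal illustration, not the ICAMark implementation: the function name `optimal_users`, the sample scores, and the use of linear interpolation between two 5-user steps are all assumptions made here to show how a per-step score series and the threshold of 80 could yield a single user count.

```python
from typing import Sequence

STEP_SIZE = 5         # users added per iteration, as in the test description
TARGET_SCORE = 80.0   # ICAMark score treated as the optimal load


def optimal_users(scores: Sequence[float],
                  step: int = STEP_SIZE,
                  target: float = TARGET_SCORE) -> float:
    """Estimate the user count at which the score crosses `target`.

    scores[i] is the score measured with (i + 1) * step users logged in.
    Linear interpolation between steps is an assumption of this sketch.
    """
    prev_users = None
    prev_score = None
    for i, score in enumerate(scores):
        users = (i + 1) * step
        if score < target:
            if prev_score is None:
                return 0.0  # even the first step is already below target
            # Interpolate between the last passing step and the first failing one.
            fraction = (prev_score - target) / (prev_score - score)
            return prev_users + fraction * step
        prev_users, prev_score = users, score
    # The score never dropped below target within the measured range.
    return float(prev_users) if prev_users is not None else 0.0


if __name__ == "__main__":
    # Hypothetical scores for 5, 10, 15, 20, 25 and 30 users (not measured data).
    sample_scores = [100, 98, 95, 90, 84, 76]
    print(optimal_users(sample_scores))
```

With the hypothetical series shown, the score falls from 84 at 25 users to 76 at 30 users, so the estimated crossing point lies halfway between the two steps.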
The ICAMark score is calculated by comparing a calibration value of the script time with the time measured during the iteration. The calibration value was determined by running the scripts on a calibration machine, which is considered to perform at the level expected from a stand-alone workstation. Each script was run locally on the calibration machine, and the data was recorded.

An ICAMark score of 80 means that the server has enough additional CPU and memory resources to handle spikes in load. Once the score for a test iteration drops below 80, additional users added to the server consume more resources, producing lower test scores and slower performance.

Number of CPUs’ Effect on User Capacity

The benchmark test was performed with the following:

Server:

- Dell PowerEdge 6650
- Quad Processor - 1.6 GHz Xeon with 256 KB L2 and 1 MB L3 Cache
- 400 MHz Front Side Bus
- Hyper-Threading enabled
- 35 GB HDD with Dell PERC 3/DC RAID Controller
- 3.5 GB RAM
- 4 GB Page File
- Citrix MetaFrame XP Feature Release 2/Service Pack 2
- Microsoft Windows 2000 Advanced Server with Service Pack 2
- Microsoft Office 97

Clients:

- Dual Pentium III 667 MHz with 256 KB Cache
- 256 MB RAM
- 9 GB HDD with Adaptec SCSI Controller
- Citrix ICA Program Neighborhood Client version 6.30.1050
- Microsoft Windows 2000 Service Pack 2

Tests were performed by keeping the hardware static and disabling processors on the server. Results were collected on the following configurations:

- Dell 6650 with 1 processor enabled
- Dell 6650 with 2 processors enabled
- Dell 6650 with 4 processors enabled

The following results were collected:

# of CPUs   # of Simulated Users   % Performance Increase (over previous row)   # of Users per Processor
1           70 ± 1                 N.A.                                         70 ± 1
2           126 ± 1                80%                                          63 ± 1
4           160 ± 1                27%                                          40 ± 1

[Chart: User Capacity Benchmark. ICAMark Score versus # of Simulated Users, with one curve each for 1 Processor, 2 Processors, and 4 Processors.]

The results show that the performance of the Dell PowerEdge 6650 with 4 processors enabled and 160 concurrent simulated users is equivalent to its performance with 2 processors enabled and 126 concurrent simulated users, which in turn is equivalent to its performance with 1 processor enabled and 70 concurrent simulated users. In other words, the user experience on each configuration is equivalent at the number of simulated users shown above. The increase in performance is not linear as CPUs are added to the server, so each additional processor supports fewer users than the one before it.

At optimal load, the following counters were noted:

# of Processors                1           2           4
% Processor Utilization        65%         65%         60%
Average PQL                    70          120         55
Available Memory in KBytes     2,332,816   1,531,280   1,174,824
Current DQL                    1           3           1

When the benchmark test ended, the following PTE information was noted:

# of Processors     1        2        4
# of Users at End   100      150      200
Available PTEs      87,554   55,544   23,941

On the single and dual processor systems, the bottleneck is the processor. This is evident from the sustained 65% processor utilization and the high processor queue lengths. On the quad processor system, the bottleneck is believed to be the system bus. The graphs for the quad system show processor utilization of 60% and a processor queue length of 55 at the failure point, both lower than on the single and dual systems. As more processors are added to the system, contention for the bandwidth of the system bus increases, resulting in the non-linear scalability shown here. A bottleneck in the system bus would cause contention among the processors and the threads in memory, resulting in slower script execution.
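
The percentage increases and users-per-processor figures in the results table follow directly from the measured user counts. The short sketch below reproduces that arithmetic; the names `capacity`, `percent_increase`, and `users_per_processor` are chosen here for illustration, and the ± 1 measurement tolerance is ignored.

```python
# Optimal simulated-user counts from the results table (number of CPUs -> users).
capacity = {1: 70, 2: 126, 4: 160}


def percent_increase(old: float, new: float) -> float:
    """Increase from `old` to `new`, expressed as a percentage of `old`."""
    return (new - old) / old * 100.0


def users_per_processor(cpus: int) -> float:
    """Optimal user count divided evenly across the enabled processors."""
    return capacity[cpus] / cpus


if __name__ == "__main__":
    cpu_counts = sorted(capacity)
    # Compare each configuration with the next larger one.
    for prev, cur in zip(cpu_counts, cpu_counts[1:]):
        gain = percent_increase(capacity[prev], capacity[cur])
        print(f"{prev} -> {cur} CPUs: {gain:.0f}% more users")
    for n in cpu_counts:
        print(f"{n} CPU(s): {users_per_processor(n):.0f} users per processor")
```

Doubling from 2 to 4 processors adds 34 users, far fewer than the 56 users gained from the first doubling, which is the non-linear scaling the text describes.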

Performance Monitor Statistics for Single processor system:

[Charts: Single Processor System. Processor Queue Length, Processor Utilization, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Performance Monitor Statistics for Dual processor system:

[Charts: Dual Processor System. Processor Queue Length, % Processor Time, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Performance Monitor Statistics for Quad processor system:

[Charts: Quad Processor System. Processor Queue Length, % Processor Time, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Hyper-Threading’s Effect on User Capacity

Hyper-Threading is a technology developed by Intel that enables a single physical processor to appear as two logical processors. It was introduced in Intel’s Xeon and Pentium 4 processor lines. Hyper-Threading allows multi-threaded programs to take advantage of extra execution units on the processor, resulting in as much as a 30% performance increase for some applications. The benefit is seen only with applications that are multi-threaded. MetaFrame and the applications it hosts can also benefit from Hyper-Threading, although the increase in performance is highly dependent on the type of application running on the server.

The benchmark test was performed on a Dell 2650 dual processor system and a Dell 6650 quad processor system, both with Hyper-Threading-capable Xeon processors.

Dell 2650 Results

This benchmark was performed on the following:

Server Hardware Configurations:

- Dell PowerEdge 2650
- Dual Processor - 2.2 GHz Xeon with 512 KB L2 Cache
- 400 MHz Front Side Bus
- 16 GB HDD with Dell PERC 3/Di RAID Controller
- 4 GB RAM
- 4 GB Page File
- Citrix MetaFrame XP Feature Release 2/Service Pack 2
- Microsoft Windows 2000 Advanced Server with Service Pack 2
- Microsoft Office 97

Clients:

- Dual Pentium III 667 MHz with 256 KB Cache
- 256 MB RAM
- 9 GB HDD with Adaptec SCSI Controller
- Citrix ICA Program Neighborhood Client version 6.30.1050
- Microsoft Windows 2000 Service Pack 2

Hyper-Threading   # of Simulated Users   % Performance Increase
Off               116 ± 1                N.A.
On                131 ± 1                11.5%

[Chart: User Capacity Benchmark. ICAMark Score versus # of Simulated Users, with one curve each for Hyper-Threading Enabled and Hyper-Threading Disabled.]

The results show that the performance of the Dell PowerEdge 2650 with Hyper-Threading enabled and servicing 131 concurrent simulated users is equivalent to its performance with Hyper-Threading disabled and servicing 116 concurrent simulated users. The lower Processor Queue Length on the system with Hyper-Threading enabled allows for faster execution of the scripts and accounts for the reported 11.5% increase in capacity (15 additional users, which is about 12.9% when expressed relative to the 116-user baseline).
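
The size of the Hyper-Threading gain on the Dell 2650 depends on which user count is used as the base. The sketch below computes it both ways from the 116 and 131 user counts in the table; the function names are illustrative only. The reported 11.5% appears to match the second calculation, while the CPU-scaling percentages elsewhere in this document match the first (increase relative to the smaller configuration).

```python
USERS_HT_OFF = 116  # optimal simulated users, Hyper-Threading disabled
USERS_HT_ON = 131   # optimal simulated users, Hyper-Threading enabled


def gain_relative_to_baseline(off: int, on: int) -> float:
    """Extra users as a percentage of the Hyper-Threading-disabled count."""
    return (on - off) / off * 100.0


def gain_relative_to_enabled(off: int, on: int) -> float:
    """Extra users as a percentage of the Hyper-Threading-enabled count."""
    return (on - off) / on * 100.0


if __name__ == "__main__":
    print(f"Extra users: {USERS_HT_ON - USERS_HT_OFF}")
    print(f"Relative to HT off: "
          f"{gain_relative_to_baseline(USERS_HT_OFF, USERS_HT_ON):.1f}%")
    print(f"Relative to HT on:  "
          f"{gain_relative_to_enabled(USERS_HT_OFF, USERS_HT_ON):.1f}%")
```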

At optimal load, the following counters were noted:

Hyper-Threading                On          Off
% Processor Utilization        70%         65%
Average PQL                    70          115
Available Memory in KBytes     1,876,108   1,680,124
Current DQL                    3           1

When the benchmark test ended, the following PTE information was noted:

Hyper-Threading     On       Off
# of Users at End   150      150
Available PTEs      52,201   52,608

Performance Monitor Statistics for Dell 2650 Hyper-Threading Enabled:

[Charts: Dell 2650 Hyper-Threading On. Processor Queue Length, % Processor Time, Available Memory KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Performance Monitor Statistics for Dell 2650 Hyper-Threading Disabled:

[Charts: Dell 2650 Hyper-Threading Off. Processor Queue Length, % Processor Time, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Dell 6650 Results

This benchmark was performed on the following:

Server Hardware Configurations:

- Dell PowerEdge 6650
- Quad Processor - 1.6 GHz Xeon with 256 KB L2 and 1 MB L3 Cache
- 400 MHz Front Side Bus
- 35 GB HDD with Dell PERC 3/DC RAID Controller
- 3.5 GB RAM
- 4 GB Page File
- Citrix MetaFrame XP Feature Release 2/Service Pack 2
- Microsoft Windows 2000 Advanced Server with Service Pack 2
- Microsoft Office 97

Clients:

- Dual Pentium III 667 MHz with 256 KB Cache
- 256 MB RAM
- 9 GB HDD with Adaptec SCSI Controller
- Citrix ICA Program Neighborhood Client version 6.30.1050
- Microsoft Windows 2000 Service Pack 2

Hyper-Threading   # of Simulated Users   % Performance Increase
Off               158 ± 1                N.A.
On                160 ± 1                1%

[Chart: User Capacity Benchmark - Dell 6650. ICAMark Score versus # of Simulated Users, with one curve each for Hyper-Threading Enabled and Hyper-Threading Disabled.]

The results show that the performance of the Dell PowerEdge 6650 with Hyper-Threading enabled and servicing 160 concurrent simulated users is equivalent to its performance with Hyper-Threading disabled and servicing 158 concurrent simulated users.

At optimal load, the following counters were noted:

Hyper-Threading                On          Off
% Processor Utilization        60%         65%
Average PQL                    55          115
Available Memory in KBytes     1,174,824   1,122,072
Current DQL                    1           3

When the benchmark test ended, the following PTE information was noted:

Hyper-Threading     On       Off
# of Users at End   200      180
Available PTEs      23,941   37,168

With Hyper-Threading enabled on a quad processor system, there are eight logical CPUs. In this instance, the benefit of Hyper-Threading is no longer evident.
As the processing capacity of the server increases, the bottleneck shifts from the processor’s execution resources to the speed of the system bus, which results in only a 1% increase in performance with Hyper-Threading enabled.

Performance Monitor Statistics for Dell 6650 Hyper-Threading Enabled:

[Charts: Dell 6650 Hyper-Threading On. Processor Queue Length, % Processor Time, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Performance Monitor Statistics for Dell 6650 Hyper-Threading Disabled:

[Charts: Dell 6650 Hyper-Threading Off. Processor Queue Length, % Processor Time, Memory Available KBytes, and Current Disk Queue Length, each plotted against # of Simulated Users.]

Summary

In conclusion, user capacity does not scale linearly when processors are added to the system. Performance increases by 80% when moving from a single to a dual processor system, and by 27% when moving from a dual to a quad processor system.

The increase in performance from Hyper-Threading is highly dependent on the type of applications running on the server. The true benefit of Hyper-Threading is seen on the dual processor system, where the processor is the bottleneck: enabling it allows an additional 11.5% increase in capacity. On the quad processor system, where the system bus is believed to be the bottleneck, the increase is only 1%.

Note that when sizing MetaFrame XP servers, the actual number of users per server varies with the type of applications deployed.