Improving the Scalability of a Multi-core Web Server

R. Hashemian (1), D. Krishnamurthy (1), M. Arlitt (2), N. Carlsson (3)
(1) University of Calgary, (2) HP Labs, (3) Linköping University
by: Raoufehsadat Hashemian
The 4th ACM/SPEC International Conference on Performance Engineering (ICPE 2013)

OUTLINE
• Introduction
• Scalability Evaluation
• Scalability Enhancement Approach
• Validation
• Conclusion

INTRODUCTION: PROBLEM DESCRIPTION
• Enterprise applications
  • Performance: improving QoS
    • e.g., lower response times
  • Cost: less money spent on hardware
    • e.g., improving effective utilization
[Figure: response time vs. CPU utilization (%)]
• Goal: higher utilization and acceptable response time
• How to achieve this goal for Web servers running on multi-core hardware?

INTRODUCTION: BACKGROUND
• Web servers before multi-core
  • Mature topic, wide-ranging discussions
• Multi-core architecture
  • Most research is on batch (non-interactive) workloads
• Web servers running on multi-core
  • Bus problem in UMA systems (Veal et al. '07)
  • Multiple Web server instances: 1 instance per processor (Scogland et al. '09, Boyd et al. '10, Gaud et al. '11)

SCALABILITY EVALUATION: SCALABILITY MEASUREMENT
• Measure Web server scalability for two workloads
• Evaluate the effectiveness of the multiple Web server approach for the server's scalability
• Scalability
  • Maximum Achievable Throughput (MAT)

SCALABILITY EVALUATION: EXPERIMENTAL SETUP
• 2 x 4-core Intel Xeon E5620 processors, NUMA architecture

  Microarch.     Nehalem
  Frequency      2.4 GHz
  L1 Cache       32 KB IC, 32 KB DC
  L2 Cache       256 KB
  L3 Cache       12 MB (inclusive)
  Inter-conn.    QPI, 5.86 GT/s
  Memory         16 GB, DDR3-1333

[Diagram: Processor 0 and Processor 1, four cores each, with per-core L1 and L2 caches, one L3 cache per processor, and Memory Bank 0 / Memory Bank 1]
• OS: Linux, kernel 3, Ubuntu
• Web server: Lighttpd
• Application server: PHP (FastCGI module)

SCALABILITY EVALUATION: WORKLOADS
• TCP/IP Intensive workload
  • High TCP connection rate
  • Processing: low user level & high kernel level
  • 1 KB static file, up to 155,000 requests/second
• SPECweb Support workload
  • Both static requests and PHP requests
  • Wider range of request types
  • Processing: high user level & moderate kernel level

SCALABILITY EVALUATION: CONFIGURATION TUNING
• Change default lighttpd recommendation (1 lighttpd worker process per core)
• Disable default Linux scheduling (use affinity)
• Distribute interrupt handling load
• Improved MAT by up to 69%
• Balanced utilization levels for the eight cores
• Fully utilized the server

SCALABILITY EVALUATION: RESULTS
• TCP/IP Intensive workload
  • Sub-linear scalability, 146,000 req/sec
• SPECweb Support workload
  • Almost linear scalability, 23,000 req/sec
[Figures: Maximum Achievable Throughput and scalability vs. number of cores, one pair per workload]

SCALABILITY EVALUATION: RESPONSE DISTRIBUTION ANALYSIS
• Response time vs. core count
• "Low response time" requests
  • Static requests
  • Performance degrades
• "High response time" requests
  • Dynamic requests
  • Performance improves
Knowing this behavior, how can we improve the scalability?
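The low/high response time split on the response distribution slide is the kind of observation one reads off an empirical CDF, P[X <= x]. The following is a minimal sketch of that computation, not the tooling used in the study. The function names and the sample values are hypothetical and are not measurements from the experiments.

```python
import math
from bisect import bisect_right


def empirical_cdf(samples_ms):
    """Return a function F with F(x) = P[X <= x] over the given samples."""
    ordered = sorted(samples_ms)
    if not ordered:
        raise ValueError("need at least one response time sample")
    n = len(ordered)

    def cdf(x):
        # bisect_right returns how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return cdf


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples_ms)
    if not ordered:
        raise ValueError("need at least one response time sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds (illustrative only).
static_ms = [0.4, 0.5, 0.5, 0.6, 0.8]        # assumed static-file requests
dynamic_ms = [12.0, 15.0, 20.0, 35.0, 90.0]  # assumed PHP requests
all_ms = static_ms + dynamic_ms

cdf = empirical_cdf(all_ms)
print("fraction served within 1 ms:", cdf(1.0))
print("99.9th percentile (ms):", percentile(all_ms, 99.9))
```

Computing one CDF per core count (or per configuration) and comparing them at the low and high ends separates the behavior of short static requests from that of long dynamic requests, which a single mean would hide.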
[Figure: CDF of response times, P[X <= x] against x = response time (msec), one curve per core count; 80% CPU utilization, SPECweb Support workload]

SCALABILITY ENHANCEMENT: MULTIPLE WEBSITE REPLICAS
• Approach: use 1 Web server instance per processor
• Goal: reduce inter-processor data migration
[Diagram: original configuration with one replica (single replica process; NIC 1 and NIC 2 queues; Processor 0 and Processor 1) and alternative configuration with two replicas (Replica 1 and Replica 2 processes; NIC 1 and NIC 2 queues; Processor 0 and Processor 1)]

SCALABILITY ENHANCEMENT: EVALUATING NEW CONFIGURATION
• TCP/IP Intensive workload
  • Scalability improvement
  • MAT increment: 12.3%
• SPECweb Support workload
  • Scalability degradation
  • MAT decrement: 10%
[Figures: response time (ms) vs. request rate (req/sec), one per workload]

SCALABILITY ENHANCEMENT: EVALUATING NEW CONFIGURATION
• The response time inflation for dynamic requests dominates the improvement achieved for static requests
• Mean and 99.9th percentile response times increase with 2 replicas
• Hypothesis:
  • Cache contention with 2 replicas due to the larger working set size of dynamic requests
[Figure: CDF of response times, P[X <= x] against response time (ms); 80% CPU utilization, 22,000 req/sec, SPECweb Support workload]

VALIDATION: INTER-CONNECT TRAFFIC
• TCP/IP Intensive workload
  • Inter-connect traffic decreased significantly
  • Improved performance
• SPECweb Support workload
  • No significant decrement
  • Improved performance for static requests
[Figures: inter-connect traffic (bytes/sec) vs. request rate (req/sec), one per workload]

VALIDATION: LAST LEVEL CACHE
• Last level cache (LLC) hit ratio degrades with the 2-replica configuration
• Confirms the cache contention hypothesis
[Figure: L3 cache hit ratio vs. request rate (req/sec), SPECweb Support workload]

CONCLUSIONS
• Multi-core Web server: scalable after tuning
  • 80% utilization with acceptable response time
• Multiple Website Replicas
  • The effect on scalability is workload dependent
  • Dynamic requests trigger LLC contention
  • Contention may be architecture and application dependent
• Future plan:
  • Design and develop an automatic, workload-adaptive technique that decides on the best configuration

Raoufeh Hashemian
University of Calgary, Canada
rhashem@ucalgary.ca
This work is financially supported by: [sponsor logos not preserved]

REFERENCES
• Cherkasova et al. '00: Characterizing Temporal Locality and its Impact on Web Server Performance, International Conference on Computer Communications and Networks '00, Cherkasova; Ciardo; HP Labs
• Elnozahy et al. '03: Energy Conservation Policies for Web Servers, USITS '03, Elnozahy; Kistler; Ramakrishnan; IBM
• Majo et al. '12: Matching Memory Access Patterns and Data Placement for NUMA Systems, GC '12, Majo; Gross; ETH
• Blagodurov et al. '11: A Case for NUMA-aware Contention Management on Multicore Systems, USENIX ATC '11, Blagodurov; Zhuravlev; Dashti; Fedorova; SFU
• Veal et al. '07: Performance Scalability of a Multi-core Web Server, ACM/IEEE ANCS '07, Veal; Foong; Intel
• Scogland et al. '09: Asymmetric Interactions in Symmetric Multi-core Systems: Analysis, Enhancements and Evaluation, ACM/IEEE SC '08, Scogland; Balaji; Feng; Narayanaswamy
• Boyd et al. '10: An Analysis of Linux Scalability to Many Cores, USENIX OSDI '10, Boyd-Wickizer; Clements; Mao; Pesterev; Kaashoek; Morris; Zeldovich; MIT
• Gaud et al. '11: Application-level Optimizations on NUMA Multicore Architectures: The Apache Case Study, RRLIG-011, Gaud; Lachaize; Lepers; Muller; Quema

SCALABILITY EVALUATION: CONFIGURATION TUNING
• Network interrupt handling
  • 4 RSS queues per NIC port
  • Each queue bound to one core
[Figure: response time (msec) vs. rate (req/sec), before and after distributing the interrupt load]

SCALABILITY EVALUATION: CONFIGURATION TUNING
• OS scheduling
  • Binding each lighttpd process to 1 core
[Figure: response time (msec) vs. rate (req/sec), no affinity vs. with affinity]

SCALABILITY EVALUATION: WEB TIER VS. APPLICATION TIER
• Static: requests with lower response time
  • Processed only in the Web tier (lighttpd)
• Dynamic: requests with higher response time
  • Processed in both the Web and application tiers (lighttpd and PHP)
[Figure: response time (ms) vs. file size (bytes)]
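The OS-scheduling step in the configuration tuning slides (binding each lighttpd process to 1 core) and the one-replica-per-processor layout can be sketched as a small affinity plan. This is an illustration under stated assumptions, not the scripts used in the study. The function names are hypothetical, and the core numbering in the example is an assumed layout; the real mapping of core IDs to processors has to be read from the machine (for example with `lscpu`). `os.sched_setaffinity` is Linux-only.

```python
import os


def plan_worker_affinity(core_ids):
    """Map worker index -> the single core that worker is pinned to."""
    if len(set(core_ids)) != len(core_ids):
        raise ValueError("each worker needs its own core")
    return {worker: core for worker, core in enumerate(core_ids)}


def plan_replica_affinity(cores_by_processor):
    """One replica per processor; each replica's workers stay on that processor's cores."""
    return {
        processor: plan_worker_affinity(cores)
        for processor, cores in cores_by_processor.items()
    }


def pin_current_process(core):
    """Pin the calling process to one core (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process


# Assumed (hypothetical) core numbering for a 2 x 4-core machine.
example_layout = {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
for processor, plan in plan_replica_affinity(example_layout).items():
    print(f"replica on processor {processor}: worker -> core {plan}")
```

Each worker process would call `pin_current_process` with its planned core after it starts. Keeping the plan as plain data, separate from the call that applies it, makes it easy to switch between the single-replica and two-replica layouts when comparing them.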