Malaysia Open Source Conference 2016
25 - 27 MAY . UKM . BANGI

VERTICAL SCALING & PERFORMANCE TUNING
Adzmely Mansor

ABOUT ME
Adzmely Mansor
CONSULTANT/FOUNDER OF NEXOPRIMA SDN. BHD.

AGENDA
1. INTRO TO SCALING
2. THE BASICS
  ▸ CPU
  ▸ MEMORY
  ▸ NETWORK
3. EXPERIENCES SHARING

1 INTRO TO SCALING

SCALABILITY? (WIKIPEDIA)
▸ handle a growing amount of work
▸ in a capable manner
▸ accommodate maximum growth

PERFORMANCE TUNING? (WIKIPEDIA)
▸ tune a system to handle a higher load

VERTICALLY? (WIKIPEDIA)
▸ add more resources to a single node

PERFORMANCE TUNING - VERTICALLY?
▸ fully optimise all available resources for the maximum possible load

HORIZONTALLY?
▸ when vertical scaling / performance tuning is already maximised
▸ why?

VERTICAL SCALING
▸ how do you determine that your servers actually require more:
  ▸ CPU / processing power
  ▸ RAM
  ▸ etc.

YOU NEED TO KNOW & UNDERSTAND THE BASICS
▸ how the various components work
▸ preferences in processes
▸ how IO interrupts are handled
▸ how memory management works
▸ how the network layer is implemented
▸ the meaning of the information given
▸ basic tools

IN GENERAL - FOUR SUBSYSTEMS THAT NEED TO BE MONITORED
▸ CPU
▸ Memory
▸ IO
▸ Network

2 THE BASICS

CPU
▸ four things to be monitored:
  ▸ run queue
  ▸ context switches
  ▸ CPU utilisation
  ▸ load average

CPU
▸ CPU utilisation depends on which resources are being accessed
▸ the Linux kernel has a scheduler, and the scheduler gives different priorities to different resources
▸ it schedules two kinds of resources:
  ▸ interrupts
  ▸ threads

CPU - INTERRUPT REQUEST (IRQ) HANDLING
▸ an IRQ is a signal sent from hardware to the processor requesting immediate attention
▸ each device is assigned one or more IRQ numbers
  ▸ allowing it to send unique interrupts
▸ a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the request

CPU - SCHEDULER
▸ the smallest unit of process execution is called a “thread”
▸ the system scheduler:
  ▸ determines which processor runs a thread
  ▸ and for how long the thread runs
▸ however, the scheduler has priorities

CPU - SCHEDULER PRIORITIES
▸ Scheduler priorities, highest first:
  ▸ Hardware interrupts (highest priority)
    ▸ raised by hardware on the system to process data
    ▸ eg:
      ▸ by a disk when it has completed an IO transaction
      ▸ by a “NIC” when a packet has been received
  ▸ Soft interrupts (softirq) - related to maintenance of the kernel itself
  ▸ Real Time Threads - parallel processing / real time programming
  ▸ Kernel Threads - all kernel processing
  ▸ User Threads - a.k.a. “user space”
    ▸ all applications run in user space / lowest priority of all

CPU - CORES
▸ Linux views each core of an n-way Hyper-Threaded processor as an:
  ▸ INDEPENDENT processor
  ▸ eg: dual core processor = two independent processors

CPU - CONTEXT SWITCHES
▸ each thread is allotted a time quantum to spend on the processor
▸ once it has used up its allotted time, or is pre-empted by something of higher priority, the thread:
  ▸ is placed back in the queue
▸ the higher priority / next-in-queue thread is placed on the processor:
  ▸ this switching of threads = Context Switch

CPU - THE RUN QUEUE
▸ each CPU maintains a RUN QUEUE of threads
▸ process threads are either:
  ▸ runnable (in the run queue)
  ▸ sleeping (blocked and waiting for IO - not in the run queue)
▸ when the CPU is heavily utilised:
  ▸ the run queue gets longer
  ▸ the longer it takes for process threads to execute

CPU - THE RUN QUEUE
▸ “Load”
  ▸ describes the state of the run queue
▸ “System Load”
  ▸ equal to the number of process threads currently executing “+” the number of threads in the CPU run queue
▸ the “top” command reports “load averages” over the last 1, 5 and 15 minutes

CPU - UTILISATION
▸ defined as the percentage of usage of a CPU
▸ CPU utilisation mostly falls under the following categories:
  ▸ User Time: percentage of time the CPU spends executing threads in user space
  ▸ System Time: percentage of time the CPU spends executing kernel threads and interrupts
  ▸ Wait IO: percentage of time the CPU spends idle because all process threads are blocked waiting for IO requests to complete
  ▸ Idle: percentage of time the processor spends in a completely idle state

CPU - TIME SLICING
▸ a numeric value that represents how long a task can run before it is pre-empted
▸ the scheduler policy dictates the default time slice
▸ time slice too long = poor interactive performance
▸ time slice too short = a significant amount of processor time is wasted on the overhead of switching from one process to another (context switching)

CPU - PERFORMANCE MONITORING
▸ a matter of interpreting the performance of:
  ▸ run queue
  ▸ utilisation
  ▸ context switching

CPU - PERFORMANCE MONITORING
▸ General Expectations:
  ▸ Run Queues: a run queue should have no more than 1 - 3 threads queued per CPU
  ▸ eg: a dual processor should not have more than 2 threads in queue (ideally), or 6 at the most

CPU - PERFORMANCE MONITORING
▸ General Expectations:
  ▸ CPU Utilisation: if a CPU is fully utilised, ideally the following balance of utilisation should be achieved:
    ▸ 65% - 70%: User Time
    ▸ 30% - 35%: System Time
    ▸ 0% - 5%: Idle Time

CPU - PERFORMANCE MONITORING
▸ General Expectations:
  ▸ Context Switches:
    ▸ a high number of context switches is acceptable if:
      ▸ CPU utilisation stays within the previously mentioned balance
▸ commands to check context switches:
  ▸ grep ctxt /proc/$pid/status
  ▸ /usr/bin/time -v ls 2>&1 | grep context
  ▸ vmstat

CPU - PERFORMANCE MONITORING TOOLS
▸ must be a low overhead tool
▸ still practical to keep running on a heavily loaded system
▸ able to show the health of the system at a glance

CPU - PERFORMANCE MONITORING
▸ top
▸ vmstat
▸ mpstat

MEMORY
▸ Physical Memory
▸ Virtual Memory

VIRTUAL MEMORY
▸ Virtual Memory = SWAP space on disk + RAM / physical memory
▸ virtual memory is divided into pages
▸ on the x86 architecture a VM page = 4 KB
▸ memory is written to disk in units of “pages”

VIRTUAL MEMORY
▸ when an application starts, it requests a Virtual Memory Size (VSZ)
▸ the kernel either grants or denies the VSZ request
▸ as the application uses the requested memory, that memory is mapped into physical memory (RSS)
▸ RSS (Resident Set Size) is the amount of physical memory a task is using
▸ in most cases an application uses less RSS than what it requested (VSZ)

VIRTUAL MEMORY - PAGES
▸ when “pages” in memory are modified by a running process, they become “dirty”
▸ when dirty pages reach a defined percentage (vm.dirty_ratio) they are written to disk

VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_ratio (kernel param)
  ▸ defines the maximum percentage of memory that can be filled with dirty pages before a writing process has to flush them to disk itself
  ▸ this flushing blocks the IO of the writing process
  ▸ higher value = dirty pages remain longer in memory
    ▸ better performance but higher risk
  ▸ lower value = dirty pages get flushed to disk more often
    ▸ slower performance but less risk
  ▸ default value = 20

VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_background_ratio (kernel param)
  ▸ defines the maximum percentage of memory that can be filled with dirty pages before they get flushed to disk by the kernel flusher threads
    ▸ pdflush/flush/kdmflush
  ▸ default value = 10
  ▸ eg: with 64G of RAM, only 6.4G of dirty data can sit in RAM before the kernel flusher daemon kicks in
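The 64G example above is plain percentage arithmetic. A minimal sketch in Python, assuming the ratios are applied to total RAM as on the slide (the kernel actually applies them to available memory, so real thresholds are somewhat lower). `dirty_thresholds` is a hypothetical helper name for illustration, not a kernel or library API.

```python
GIB = 1024 ** 3


def dirty_thresholds(mem_bytes, dirty_ratio=20, dirty_background_ratio=10):
    """Return (background_bytes, blocking_bytes) for a given memory size.

    Assumption: the ratios are taken against mem_bytes directly, which
    matches the slide's 64G -> 6.4G approximation but not the kernel's
    exact accounting.
    """
    background = mem_bytes * dirty_background_ratio // 100
    blocking = mem_bytes * dirty_ratio // 100
    return background, blocking


if __name__ == "__main__":
    bg, hard = dirty_thresholds(64 * GIB)
    print(f"flusher threads start at about {bg / GIB:.1f} GiB of dirty pages")
    print(f"writing processes must flush at about {hard / GIB:.1f} GiB")
```

With the default values of 10 and 20, a 64G machine gives roughly 6.4G for the background threshold and 12.8G for the threshold at which writers are made to flush.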
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_bytes & vm.dirty_background_bytes
  ▸ same as vm.dirty_ratio and vm.dirty_background_ratio but in bytes
  ▸ setting a ratio value resets the matching bytes param to 0, and vice versa

VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_expire_centisecs
  ▸ how long something can stay cached before it needs to be written
  ▸ default value = 3000 centisecs = 30 seconds
  ▸ dirty pages older than this value are written asynchronously to disk
  ▸ a safeguard against data loss

VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_writeback_centisecs
  ▸ how often the pdflush/flush/kdmflush process wakes up and checks whether work needs to be done
  ▸ default value = 500 centisecs = 5 seconds

VIRTUAL MEMORY - SWAPPINESS
▸ vm.swappiness
  ▸ how aggressive Linux should be when swapping active pages in memory to disk
  ▸ default value = 60
    ▸ commonly described as: once about 40% of memory is used, the kernel starts to consider swapping (a rule of thumb; the kernel actually uses the value as a relative weight between swapping out pages and reclaiming page cache)
  ▸ a lower value postpones swapping until memory is closer to full - it discourages Linux from swapping
  ▸ 60 is recommended for most desktop use
  ▸ 10 is recommended to improve performance in general

VIRTUAL MEMORY - CACHES
▸ /proc/sys/vm/drop_caches
  ▸ 0 - do nothing
  ▸ 1 - free the page cache
  ▸ 2 - free reclaimable slab objects
    ▸ slab allocation keeps memory that held a data object of a certain type so that it can be reused by a later allocation of the same type (deallocation/allocation)
  ▸ 3 - free both types
  ▸ this is a non-destructive operation; dirty objects are not freed

MEMORY - PERFORMANCE MONITORING
▸ vmstat

NETWORK
▸ NIC Ring Buffer
▸ Hard IRQ
▸ Soft IRQ
▸ Networking Tools

NETWORK - NIC RING BUFFER
[Figure: applications (App 1 to App N) and packets to forward feed the IP stack; SKBs pass through the queuing disciplines into the NIC driver queue, a.k.a. ring buffer]

NETWORK - NIC RING BUFFER
▸ Ring Buffer
  ▸ implemented as a First In First Out (FIFO) ring buffer
  ▸ does not contain the packet data
  ▸ holds descriptors that point to other data structures called Socket Kernel Buffers (SKB)
  ▸ the input source for the ring buffer is the “IP Stack”
  ▸ entries are dequeued by the hardware driver and sent to the NIC via the data bus

NETWORK - NIC RING BUFFER
▸ “ethtool” - command to show / change the values of the ring buffer
  ▸ -g : display
  ▸ -G : change
▸ with the introduction of Byte Queue Limits (BQL) there is no longer any need to modify the driver queue size - it is a self tuning algorithm

NETWORK - QUEUING DISCIPLINES (QDISC)
▸ QDISC
  ▸ Linux abstraction layer for traffic queues
  ▸ carries out complex queue management behaviours
    ▸ traffic classification
    ▸ prioritisation
    ▸ rate shaping
  ▸ configured through the “traffic control - tc” command

NETWORK - HARD IRQ
▸ Hardware IRQ
  ▸ also known as top half interrupts
  ▸ when the NIC receives incoming data:
    ▸ it copies the data to the RX ring buffers
    ▸ the NIC notifies the kernel of this incoming data by raising a hardware interrupt
  ▸ cat /proc/interrupts | grep em1

NETWORK - SOFT IRQ
▸ Software IRQ
  ▸ also known as bottom half interrupts
  ▸ its purpose is to drain the NIC receive ring buffers
  ▸ can be seen in process monitoring tools such as ps and top
    ▸ ksoftirqd/cpu-num

NETWORK - MONITORING TOOLS
▸ ss/netstat - dump / show socket statistics
▸ dropwatch - monitors packets freed from memory (dropped) by the kernel
▸ ip - for managing and monitoring routes, devices, policy routing and tunnels
▸ ethtool - for displaying and changing NIC settings

NETWORK - SOME TUNING PARAMETERS
▸ SoftIRQ Misses
  ▸ net.core.netdev_budget
  ▸ if the softirq does not run long enough, the rate of incoming data can exceed the kernel’s capability to drain the buffer fast enough
  ▸ default value = 300
    ▸ this allows the softirq to process / drain only 300 packets from the NIC before being booted off the CPU

NETWORK - SOME TUNING PARAMETERS
▸ SoftIRQ Misses
  ▸ when to increase the “net.core.netdev_budget” value?
    ▸ monitor the third column of /proc/net/softnet_stat
    ▸ if the third column keeps increasing - double the value

NETWORK - SOME TUNING PARAMETERS
▸ increase the maximum number of open file descriptors - default 1024
  ▸ /etc/security/limits.conf
    ▸ * soft nofile 100000
    ▸ * hard nofile 100000
  ▸ ulimit -n : to view the current file descriptor limit

NETWORK - SOME TUNING PARAMETERS
▸ decrease the time a socket stays in the TIME_WAIT state
  ▸ by lowering tcp_fin_timeout
    ▸ default 60
    ▸ strictly, this value governs how long orphaned sockets stay in FIN_WAIT_2; the TIME_WAIT length itself is a compile-time constant in the stock kernel
    ▸ setting it too low can cause socket close errors on networks with lots of jitter
  ▸ set tcp_tw_reuse = 1 : this tells the kernel it can reuse sockets in the TIME_WAIT state

NETWORK - TCP TIMEWAIT HACK by EA Faisal - NexoPrima
▸ patch the kernel to introduce a tcp_timewait_len kernel parameter
▸ new entry in the /proc FS:
  ▸ /proc/sys/net/ipv4/tcp_timewait_len
▸ able to use sysctl for configuration:
  ▸ net.ipv4.tcp_timewait_len
▸ https://github.com/efaisal/linuxtcptw

NETWORK - SOME TUNING PARAMETERS
▸ increase the port range for ephemeral outgoing ports
  ▸ cat /proc/sys/net/ipv4/ip_local_port_range
  ▸ default minimum port 32768
    ▸ change to 10000
  ▸ default maximum port 61000
    ▸ change to 65000

NETWORK - TCP CONGESTION WINDOW
▸ the throughput of a connection is limited by two windows:
  ▸ congestion window (CWIN) - the number of maximum-size segments (MSS) the sender may have in flight on that connection
    ▸ maintained by the sender
  ▸ receive window (RWIN) - the maximum amount of data the receiver accepts before it must acknowledge the sender
    ▸ maintained by the receiver
▸ MSS = MTU - (IP header + TCP header) = 1500 - 40 = 1460 bytes on a typical Ethernet link
[Figure: browser/client and server/web exchange; labels: SYN; SYN, ACK; RWIN 65k; 3 * MSS; ACK; GET /…; 3 data packets; ACK …; CWIN = 3]

NETWORK - TCP CONGESTION WINDOW
▸ TCP uses a mechanism called “slow start” to grow the congestion window after a connection is initialised and after a timeout
▸ it starts with a window of a small multiple of the maximum segment size (MSS), e.g. two times, depending on the initial CWIN value
  ▸ starting from kernel 2.6.39 the default CWIN value was increased to 10
▸ for every packet acknowledged, the congestion window increases by 1 MSS
▸ when it exceeds the ssthresh threshold it enters congestion avoidance mode
▸ the ssthresh value is updated automatically at the end of every “slow start”

NETWORK - TCP CONGESTION WINDOW
▸ a server does not necessarily adhere to the client’s RWIN (receiver advertised window size)
▸ if the CWIN size is a lot smaller than the receiver’s RWIN - the initial transfer might not be optimal
[Figure: same browser/client and server/web exchange with CWIN = 5; labels: SYN; SYN, ACK; RWIN 65k; ACK; GET /…; 5 data packets; ACK …]

NETWORK - TCP CONGESTION WINDOW
▸ net.ipv4.tcp_slow_start_after_idle
  ▸ controls whether an existing TCP connection that has been idle for too long falls back to the default congestion window size
  ▸ on persistent connections you will likely end up in this state
  ▸ default value = 1
  ▸ recommended value for performance - change to 0

NETWORK - TCP CONGESTION WINDOW
▸ changing the CWIN and RWIN values in Linux
  ▸ ip route show
  ▸ ip route change default via 192.168.1.1 dev em1 proto static initcwnd 10
  ▸ ip route change default via 192.168.1.1 dev em1 proto static initrwnd 10

NETWORK - SOME TUNING PARAMETERS
▸ tuned : adaptive system tuning daemon
  ▸ applies tuning settings that enable the most desirable performance

3 EXPERIENCES SHARING

LOAD STRESS TEST INFRA - ONLINE UNIV APPLICATION
[Figure: load stress test infrastructure]

LOAD STRESS TEST - ONLINE UNIV APPLICATION
[Figure labels: untuned, tuned, INTERNAL]

DB CONNECTION LATENCY

Q&A

THANK YOU
adzmely@nexoprima.com
https://www.nexoprima.com

ANNEX

REFERENCES:
▸ https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide
▸ https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
▸ https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vmdirty_ratio/
▸ https://www.kernel.org/doc/Documentation/sysctl/vm.txt
▸ https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
▸ http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
▸ https://www.wikipedia.org/ : for various related topics
▸ others that I might have forgotten
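
As a supplement to the SoftIRQ Misses slides, the check on the third column of /proc/net/softnet_stat can be scripted. A minimal sketch in Python, assuming the usual layout of one line per CPU with hexadecimal columns, where the first three are packets processed, packets dropped and the "time squeeze" counter. The function names and the sample values are hypothetical and only illustrate the comparison of two samples.

```python
def parse_softnet_stat(text):
    """Parse softnet_stat text into one dict per line (assumed: one line per CPU)."""
    rows = []
    for cpu, line in enumerate(text.strip().splitlines()):
        fields = [int(value, 16) for value in line.split()]  # columns are hex
        rows.append({
            "cpu": cpu,
            "processed": fields[0],
            "dropped": fields[1],
            "time_squeeze": fields[2],  # the third column from the slide
        })
    return rows


def squeezed_cpus(before, after):
    """Return CPUs whose third column grew between two samples."""
    return [a["cpu"] for b, a in zip(before, after)
            if a["time_squeeze"] > b["time_squeeze"]]


# Made-up samples standing in for two reads taken a few seconds apart.
SAMPLE_BEFORE = "0000272d 00000000 00000001 00000000\n000034d9 00000000 00000003 00000000\n"
SAMPLE_AFTER = "00002801 00000000 00000001 00000000\n00003600 00000000 00000009 00000000\n"

if __name__ == "__main__":
    growing = squeezed_cpus(parse_softnet_stat(SAMPLE_BEFORE),
                            parse_softnet_stat(SAMPLE_AFTER))
    print("CPUs with a growing third column:", growing)
```

On a Linux host the two texts would come from reading /proc/net/softnet_stat twice; if the list stays non-empty across samples, that is the situation in which the slides suggest doubling net.core.netdev_budget.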