Uploaded by Adzmely Mansor

VerticalPerfTuning-MOSC2016

Malaysia Open Source Conference 2016
25 - 27 . MAY . UKM . BANGI .
VERTICAL SCALING &
PERFORMANCE TUNING
Adzmely Mansor
ABOUT ME
CONSULTANT/FOUNDER OF
NEXOPRIMA SDN. BHD.
Adzmely Mansor
AGENDA
1. INTRO TO SCALING
2. THE BASICS
‣ CPU
‣ MEMORY
‣ NETWORK
3. EXPERIENCES SHARING
1. INTRO TO SCALING
SCALABILITY?
WIKIPEDIA
▸ the ability to handle a growing amount of work
▸ in a capable manner
▸ or to be enlarged to accommodate that growth
PERFORMANCE TUNING?
WIKIPEDIA
▸ tune a system to handle a higher load
VERTICALLY?
WIKIPEDIA
▸ add more resources to a single node
PERFORMANCE TUNING - VERTICALLY?
▸ ???????
Performance Tuning - Vertically
▸ fully optimise all available resources for the maximum possible load
HORIZONTALLY?
▸ when vertical scaling / performance tuning is already maximised
▸ why?
How do you determine that your servers actually require more CPU / processing power, RAM, etc. (vertical scaling)?
You Need to Know & Understand The Basics
▸ how the various components work
▸ priorities between processes
▸ how IO interrupts are handled
▸ how memory management works
▸ how the network layer is implemented
▸ the meaning of the information the tools give you
▸ the basic tools
In general - four subsystems need to be monitored
▸ CPU
▸ Memory
▸ IO
▸ Network
2. THE BASICS
CPU - four areas to be monitored:
- run queue
- context switch
- CPU utilisation
- load average
CPU
▸ CPU utilisation depends on which resources are being accessed
▸ the Linux kernel has a scheduler, and the scheduler gives priorities to the different resources
▸ it schedules two kinds of resources:
▸ interrupts
▸ threads
CPU - INTERRUPT REQUEST (IRQ) HANDLING
▸ an IRQ is a signal for immediate attention sent from hardware to the processor
▸ each device is assigned one or more IRQ numbers, allowing it to send unique interrupts
▸ a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the request
CPU - SCHEDULER
▸ the smallest unit of process execution is called a “thread”
▸ the system scheduler:
▸ determines which processor runs a thread
▸ and for how long the thread runs
▸ however, the scheduler has priorities
CPU - SCHEDULER PRIORITIES
▸ Scheduler priorities, from highest to lowest:
▸ Hardware interrupts (highest priority) - raised by hardware on the system to process data, eg by a disk when it has completed an IO transaction, or by a “NIC” when a packet has been received
▸ Soft interrupts (softirq) - related to maintenance of the kernel itself
▸ Real Time Threads - parallel processing / real time programming
▸ Kernel Threads - all kernel processing
▸ User Threads - a.k.a. “user space”; all applications run in user space, at the lowest priority of all
CPU - CORES
▸ Linux views each core of a multi-core or n-way Hyper-Threaded processor as an INDEPENDENT processor
▸ eg: dual core processor = two independent processors
CPU - CONTEXT SWITCHES
▸ each thread is allotted a time quantum to spend on the processor
▸ once it has used up its allotted time, or is pre-empted by something of higher priority, the thread is placed back in the queue
▸ the higher priority / next in queue thread is then placed on the processor
▸ this switch of threads = Context Switch
CPU - THE RUN QUEUE
▸ each CPU maintains a RUN QUEUE of threads
▸ process threads are either:
▸ runnable (in the run queue)
▸ in a sleep state (blocked and waiting for IO - not in the run queue)
▸ when the CPU is heavily utilised, the run queue gets longer and process threads take longer to execute
CPU - THE RUN QUEUE
▸ “Load” describes the state of the run queue
▸ “System Load” is equal to the number of process threads currently executing “+” the number of threads in the CPU run queue
▸ on Linux, the reported load average also counts threads in uninterruptible sleep (typically waiting for disk IO)
▸ the “top” command reports “load averages” over the course of 1, 5 and 15 minutes
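As a rough illustration of reading the load averages programmatically, the sketch below parses text in the format of /proc/loadavg. The sample string and the helper names (parse_loadavg, load_per_cpu) are made up for this example; on a live Linux system the text would come from reading /proc/loadavg.

```python
def parse_loadavg(text):
    """Parse text in /proc/loadavg format into a dict."""
    fields = text.split()
    # Fourth field is "<runnable entities>/<total entities>"
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total_threads": int(total),
    }


def load_per_cpu(load, cpu_count):
    """Normalise a load average by the number of CPUs."""
    return load / cpu_count


# Hypothetical sample, not taken from a real machine
sample = "2.10 1.45 0.98 3/512 12345"
info = parse_loadavg(sample)
print(info["load1"], load_per_cpu(info["load1"], 2))
```

Dividing by the CPU count makes the figure comparable between machines of different sizes.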
CPU - UTILISATION
▸ defined as the percentage of usage of a CPU
▸ CPU utilisation mostly falls under the following categories:
▸ User Time: percentage of time the CPU spends executing threads in user space
▸ System Time: percentage of time the CPU spends executing kernel threads and interrupts
▸ Wait IO: percentage of time the CPU spends idle because all process threads are blocked waiting for IO requests to complete
▸ Idle: percentage of time the processor spends in a completely idle state
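One possible way to derive these four percentages is to take two samples of the aggregate "cpu" line from /proc/stat and compare them. The sketch below assumes the first eight counters are user, nice, system, idle, iowait, irq, softirq and steal; the sample lines and function names are invented for illustration.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn an aggregate 'cpu ...' line from /proc/stat into a dict of counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(p) for p in parts[1:9])))


def utilisation(before, after):
    """Percentages per category between two samples (assumes counters advanced)."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())

    def pct(*names):
        return 100.0 * sum(delta[n] for n in names) / total

    return {
        "user": pct("user", "nice"),
        "system": pct("system", "irq", "softirq"),
        "iowait": pct("iowait"),
        "idle": pct("idle"),
    }


# Hypothetical samples taken some seconds apart
first = parse_cpu_line("cpu 1000 0 500 8000 500 0 0 0")
second = parse_cpu_line("cpu 1650 0 800 8030 520 0 0 0")
usage = utilisation(first, second)
print(usage)
```

Steal time is counted in the total but not reported separately here, a simplification chosen for this sketch.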
CPU - TIME SLICING
▸ the time slice is a numeric value representing how long a task can run until it is pre-empted
▸ the scheduler policy dictates the default time slice
▸ time slice too long = poor interactive performance
▸ time slice too short = a significant amount of processor time is wasted on the overhead of switching from one process to another (context switching)
CPU - Performance Monitoring - a matter of interpreting the performance of:
- run queue
- utilisation
- context switching
CPU - PERFORMANCE MONITORING
▸ General Expectations:
▸ Run Queues: a run queue should have no more than 1 - 3 threads queued per CPU
▸ eg: a dual processor system should ideally have no more than 2 threads in the queue, or 6 at the most
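The rule of thumb above can be written down as a small check. This is only a sketch of the guideline on this slide; the function name and the three labels are invented for the example.

```python
def run_queue_status(runnable_threads, cpu_count):
    """Classify a run queue length using the 1 - 3 threads per CPU guideline."""
    per_cpu = runnable_threads / cpu_count
    if per_cpu <= 1:
        return "ideal"
    if per_cpu <= 3:
        return "acceptable"
    return "overloaded"


# Dual processor example from the slide: 2 is ideal, 6 is the maximum
print(run_queue_status(2, 2), run_queue_status(6, 2), run_queue_status(7, 2))
```

The runnable thread count could come, for instance, from the "r" column of vmstat.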
CPU - PERFORMANCE MONITORING
▸ General Expectations:
▸ CPU Utilisation: if a CPU is fully utilised, ideally the following balance of utilisation should be achieved:
▸ 65% - 70%: User Time
▸ 30% - 35%: System Time
▸ 0% - 5%: Idle Time
CPU - PERFORMANCE MONITORING
▸ General Expectations:
▸ Context Switches:
▸ a high number of context switches is acceptable if CPU utilisation stays within the previously mentioned balance
grep ctxt /proc/$pid/status
/usr/bin/time -v ls 2>&1 | grep context
vmstat
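For scripted checks, the same counters that "grep ctxt" shows can be pulled out of the status text. The sketch below works on a made-up excerpt in the /proc/<pid>/status format; the function name is hypothetical, and on a live system the text could be read from /proc/self/status.

```python
def parse_ctxt_switches(status_text):
    """Extract the context switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value)
    return counts


# Hypothetical excerpt of a status file
sample_status = (
    "Name:\tbash\n"
    "Threads:\t1\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t545\n"
)
switches = parse_ctxt_switches(sample_status)
print(switches)
```

A high non-voluntary count suggests the thread is often pre-empted, while voluntary switches usually mean it gave up the CPU to wait for something.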
CPU - PERFORMANCE MONITORING TOOLS
▸ must be a low overhead tool
▸ still practical to keep running on a heavily loaded system
▸ able to show the health of the system at a glance
CPU - PERFORMANCE MONITORING
▸ top
▸ vmstat
▸ mpstat
Memory
- Physical Memory
- Virtual Memory
VIRTUAL MEMORY
▸ Virtual Memory = SWAP space on disk + RAM / physical memory
▸ virtual memory is divided into pages
▸ on x86 architecture a VM page = 4 KB
▸ when writing from memory to disk, the kernel writes memory in “Pages”
VIRTUAL MEMORY
▸ when an application starts, it requests virtual memory; the total requested is its Virtual Memory Size (VSZ)
▸ the kernel either grants or denies the request
▸ as the application uses the requested memory, that memory is mapped into physical memory
▸ RSS (Resident Set Size) is the amount of physical memory a task is using
▸ in most cases an application's RSS is less than what it requested (VSZ)
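To see the VSZ / RSS gap for a process, one option is to read the VmSize and VmRSS lines of its status file. The sketch below parses a made-up excerpt; the function name and sample values are assumptions for illustration, and the 4 KB page size is the x86 figure from the earlier slide.

```python
def memory_usage_kb(status_text):
    """Return VSZ and RSS in kB from /proc/<pid>/status text."""
    wanted = {"VmSize": "vsz_kb", "VmRSS": "rss_kb"}
    usage = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            # Value looks like "  221504 kB"; keep only the number
            usage[wanted[key]] = int(value.split()[0])
    return usage


# Hypothetical excerpt of a status file
sample_status = "Name:\tdemo\nVmSize:\t  221504 kB\nVmRSS:\t    5120 kB\n"
mem = memory_usage_kb(sample_status)
resident_pages = mem["rss_kb"] // 4  # assuming 4 KB pages
print(mem, resident_pages)
```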
VIRTUAL MEMORY - PAGES
▸ when “Pages” in memory are modified by a running process, they become “dirty”
▸ when dirty pages reach a defined percentage of memory (vm.dirty_ratio), they are written to disk
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_ratio (kernel param)
▸ defines the maximum percentage of memory that can be filled with dirty pages before a process generating writes is forced to flush them to disk itself
▸ new write IO blocks until the dirty pages have been flushed
▸ higher value = dirty pages remain longer in memory: better performance but higher risk
▸ lower value = pages get flushed to disk more often: slower performance but less risk
▸ default value = 20
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_background_ratio (kernel param)
▸ defines the maximum percentage of memory that can be filled with dirty pages before they get flushed to disk in the background by the kernel flusher threads
▸ pdflush/flush/kdmflush
▸ default value = 10
▸ eg: with 64G of RAM, roughly 6.4G of dirty data can sit in RAM before the kernel flusher threads kick in
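The 64G example can be reproduced with a few lines of arithmetic. This sketch follows the slide in applying the ratios to total RAM; the kernel itself applies them to available (free plus reclaimable) memory, so real thresholds are somewhat lower. The function name is made up for the example.

```python
GIB = 1024 ** 3


def dirty_thresholds(memory_bytes, dirty_ratio=20, dirty_background_ratio=10):
    """Approximate dirty page thresholds in bytes for a given memory size."""
    return {
        # Background flusher threads start writing at this point
        "background_bytes": memory_bytes * dirty_background_ratio // 100,
        # Writing processes are throttled at this point
        "blocking_bytes": memory_bytes * dirty_ratio // 100,
    }


limits = dirty_thresholds(64 * GIB)
print(round(limits["background_bytes"] / GIB, 1),
      round(limits["blocking_bytes"] / GIB, 1))
```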
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_bytes & vm.dirty_background_bytes
▸ same as dirty_ratio and dirty_background_ratio, but expressed in bytes
▸ setting a ratio value resets its bytes counterpart to 0, and vice versa
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_expire_centisecs
▸ how long dirty data can stay cached before it needs to be written
▸ default value = 3000 centisecs = 30 seconds
▸ dirty pages older than this value are written asynchronously to disk
▸ a safeguard against data loss
VIRTUAL MEMORY - DIRTY PAGES
▸ vm.dirty_writeback_centisecs
▸ how often the pdflush/flush/kdmflush threads wake up and check whether work needs to be done
▸ default value = 500 centisecs = 5 seconds
VIRTUAL MEMORY - SWAPPINESS
▸ vm.swappiness
▸ how aggressive Linux should be when swapping pages from memory to disk
▸ default value = 60
▸ often described as: once about 40% of memory is in use, the kernel starts to consider swapping; strictly, the value is a relative weight for reclaiming application memory versus page cache, not a fixed threshold
▸ a lower value discourages Linux from swapping until memory is closer to full
▸ 60 is recommended for most desktop use
▸ 10 is recommended to improve performance in general
VIRTUAL MEMORY - CACHES
▸ /proc/sys/vm/drop_caches
▸ 0 - do nothing
▸ 1 - free page cache
▸ 2 - free reclaimable slab objects
▸ slab allocation - retains memory allocated for data objects of a certain type so that it can be reused by later allocations of the same type (deallocation/allocation)
▸ 3 - free both types
▸ this is a non-destructive operation: it will not free any dirty objects
MEMORY - PERFORMANCE MONITORING
vmstat
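When vmstat output is collected by a script, the columns can be mapped to their names for easier checks (for example non-zero si/so indicating swapping). The sketch below assumes the usual procps layout, with a group header line, a column name line and data lines; the sample output and function name are invented.

```python
def parse_vmstat(output):
    """Map the last data line of vmstat output to its column names."""
    lines = [line for line in output.splitlines() if line.strip()]
    names = lines[1].split()    # second line holds the column names
    values = lines[-1].split()  # last line holds the latest sample
    return dict(zip(names, (int(v) for v in values)))


# Hypothetical vmstat output
sample_output = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812340  10240 402312    0    0     5    12  110  250 65 30  3  2  0
"""
stats = parse_vmstat(sample_output)
swapping = stats["si"] > 0 or stats["so"] > 0
print(stats["r"], stats["cs"], swapping)
```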
Network
- NIC Ring Buffer
- Hard IRQ
- Soft IRQ
- Networking Tools
NETWORK - NIC RING BUFFER
[Diagram: packets from App 1 ... App N and packets to forward enter the IP STACK, pass as SKBs through the Queuing Disciplines, then through the Driver Queue (a.k.a. Ring Buffer) to the NIC]
NETWORK - NIC RING BUFFER
▸ Ring Buffer
▸ implemented as a First In First Out (FIFO) ring buffer
▸ does not contain the packet data
▸ holds descriptors that point to other data structures called Socket Kernel Buffers (SKB)
▸ the input source for the ring buffer is the “IP Stack”
▸ entries are dequeued by the hardware driver and sent to the NIC via the data bus
NETWORK - NIC RING BUFFER
▸ “ethtool” - command to show/change the values of the ring buffer
▸ -g : display
▸ -G : change
▸ with the introduction of Byte Queue Limits (BQL), a self tuning algorithm, there is no longer any need to modify the driver queue size
NETWORK - QUEUING DISCIPLINES (QDISC)
▸ QDISC
▸ the Linux abstraction layer for traffic queues
▸ carries out complex queue management behaviours:
▸ traffic classification
▸ prioritisation
▸ rate shaping
▸ configured through the “tc” (traffic control) command
NETWORK - HARD IRQ
▸ Hardware IRQ
▸ also known as top half interrupts
▸ when the NIC receives incoming data:
▸ it copies the data to the RX ring buffers
▸ the NIC notifies the kernel of this incoming data by raising a hardware interrupt
▸ cat /proc/interrupts | grep em1
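To go one step beyond the grep, the per-CPU interrupt counts of a NIC can be summed to see whether its interrupts are spread across CPUs. The sketch below assumes the usual /proc/interrupts layout (a CPU header line, then one line per IRQ ending in the device name); the sample text and function name are made up, and em1 is the interface name used on the slide.

```python
def irq_counts(interrupts_text, device):
    """Sum per-CPU interrupt counts for all IRQ lines of one device."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()  # header: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        if not fields or device not in fields[-1]:
            continue
        # fields[0] is the IRQ number, then one counter per CPU
        for cpu, count in zip(cpus, fields[1:1 + len(cpus)]):
            totals[cpu] += int(count)
    return totals


# Hypothetical /proc/interrupts excerpt
sample_interrupts = """\
           CPU0       CPU1
  8:          1          0   IO-APIC   8-edge      rtc0
 24:     100200     300400   PCI-MSI 524288-edge      em1-TxRx-0
 25:         10         20   PCI-MSI 524289-edge      em1-TxRx-1
"""
nic_irqs = irq_counts(sample_interrupts, "em1")
print(nic_irqs)
```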
NETWORK - SOFT IRQ
▸ Software IRQ
▸ also known as bottom half interrupts
▸ its purpose is to drain the NIC receive ring buffers
▸ can be seen in process monitoring tools such as ps and top
▸ ksoftirqd/cpu-num
NETWORK - MONITORING TOOLS
▸ ss/netstat - dump/show socket statistics
▸ dropwatch - monitors packets freed from memory by the kernel
▸ ip - for managing IP and monitoring routes, devices, policy routing and tunnels
▸ ethtool - for displaying and changing NIC settings
NETWORK - SOME TUNING PARAMETERS
▸ SoftIRQ Misses
▸ net.core.netdev_budget
▸ if the softirq does not run long enough, the rate of incoming data can exceed the kernel’s capability to drain the buffer fast enough
▸ default value = 300
▸ this allows the softirq to process/drain only 300 packets from the NIC before being booted off the CPU
NETWORK - SOME TUNING PARAMETERS
▸ SoftIRQ Misses
▸ when to increase the “net.core.netdev_budget” value?
▸ monitor the third column of /proc/net/softnet_stat
▸ if the third column keeps increasing - double the value
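The "keeps increasing" check can be scripted by comparing two snapshots of the third column. The sketch below assumes one line per CPU with hexadecimal counters, as /proc/net/softnet_stat provides; the sample snapshots and function names are invented for illustration.

```python
def third_column(softnet_text):
    """Return the third (hex) counter of each CPU line as integers."""
    return [int(line.split()[2], 16)
            for line in softnet_text.splitlines() if line.strip()]


def budget_exhausted(before_text, after_text):
    """True if the third column grew on any CPU between two snapshots."""
    before = third_column(before_text)
    after = third_column(after_text)
    return any(b > a for a, b in zip(before, after))


# Hypothetical snapshots for a two-CPU machine
snapshot_1 = "0000a1b2 00000000 00000005 00000000\n0000c3d4 00000000 00000010 00000000\n"
snapshot_2 = "0000a9ff 00000000 00000009 00000000\n0000c8aa 00000000 00000010 00000000\n"
print(third_column(snapshot_1), budget_exhausted(snapshot_1, snapshot_2))
```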
NETWORK - SOME TUNING PARAMETERS
▸ increase the maximum number of open file descriptors - default 1024
▸ /etc/security/limits.conf
▸ * soft nofile 100000
▸ * hard nofile 100000
▸ ulimit -n : to view the current file descriptor limit
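From inside a program, the limits that "ulimit -n" reports can be read with the standard library resource module (available on Unix-like systems only). A minimal sketch, with a made-up helper name for formatting:

```python
import resource


def describe(limit):
    """Format a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


# Soft limit is what the process gets; hard limit is the ceiling it may raise to
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft nofile:", describe(soft))
print("hard nofile:", describe(hard))
```

A service that expects many concurrent sockets could log these values at start-up to confirm that the limits.conf change took effect.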
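NETWORK - SOME TUNING PARAMETERS
▸ reduce the time sockets linger after a connection is closed
▸ by lowering tcp_fin_timeout
▸ default 60
▸ note: tcp_fin_timeout controls how long orphaned sockets stay in FIN_WAIT_2; in stock kernels the TIME_WAIT length itself is a compile-time constant (60 seconds)
▸ setting it too low can cause socket close errors on networks with lots of jitter
▸ set tcp_tw_reuse = 1 : this tells the kernel it can reuse sockets in the TIME_WAIT state for new outgoing connections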
NETWORK - TCP TIMEWAIT HACK
by EA Faisal - NexoPrima
▸ patches the kernel to introduce a tcp_timewait_len kernel parameter
▸ new entry in the /proc FS:
▸ /proc/sys/net/ipv4/tcp_timewait_len
▸ can be configured with sysctl:
▸ net.ipv4.tcp_timewait_len
▸ https://github.com/efaisal/linuxtcptw
NETWORK - SOME TUNING PARAMETERS
▸ increase the port range for ephemeral outgoing ports
▸ cat /proc/sys/net/ipv4/ip_local_port_range
▸ default minimum port 32768: change to 10000
▸ default maximum port 61000: change to 65000
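To put numbers on the change, the sketch below counts the ephemeral ports available for the default and the suggested range. It takes text in the format of ip_local_port_range (two numbers separated by whitespace); the function name is made up.

```python
def ephemeral_port_count(range_text):
    """Number of ports in an 'ip_local_port_range' style 'low high' string."""
    low, high = (int(value) for value in range_text.split())
    return high - low + 1


default_ports = ephemeral_port_count("32768\t61000")
tuned_ports = ephemeral_port_count("10000 65000")
print(default_ports, tuned_ports)
```

With the values from the slide, the pool of outgoing ports roughly doubles.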
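NETWORK - TCP CONGESTION WINDOW
▸ the throughput of a connection is limited by two windows:
▸ congestion window (CWIN) - the amount of data, counted in maximum segment sizes (MSS), that may be in flight unacknowledged on that connection
▸ maintained by the sender
▸ receive window (RWIN) - the maximum amount of data the receiver accepts before the sender must wait for an acknowledgement
▸ maintained by the receiver
▸ MSS = MTU - (IP header + TCP header) ~= 1460 bytes for a 1500 byte MTU
[Diagram: Browser/Client and Server/Web exchange SYN; SYN, ACK; ACK; GET /...; the advertised windows shown are RWIN 65k and RWIN 3*MSS; with CWIN = 3 the server sends 3 data packets, followed by ACKs]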
NETWORK - TCP CONGESTION WINDOW
▸ TCP uses a mechanism called “slow start” to grow the congestion window after a connection is initialised and after a timeout
▸ it starts with a window of a small multiple of the maximum segment size (MSS), given by the initial CWIN value
▸ starting from kernel 2.6.39 the default initial CWIN value was increased to 10
▸ for every packet acknowledged, the congestion window increases by 1 MSS
▸ when it exceeds the ssthresh threshold, the connection enters congestion avoidance mode
▸ the ssthresh value is updated automatically at the end of every “slow start”
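The effect of the initial congestion window can be estimated with a simplified slow start model. The sketch below assumes no packet loss, that ssthresh is never reached, that the receive window is not the limit, and 20-byte IP and TCP headers; under those assumptions the window doubles every round trip. Function names are invented for the example.

```python
import math


def mss(mtu=1500, ip_header=20, tcp_header=20):
    """Maximum segment size for a given MTU, assuming option-free headers."""
    return mtu - ip_header - tcp_header


def round_trips(payload_bytes, initcwnd, segment_size):
    """Round trips needed to send a payload during idealised slow start."""
    segments_left = math.ceil(payload_bytes / segment_size)
    cwnd, trips = initcwnd, 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window
        cwnd *= 2              # +1 MSS per ACKed segment doubles the window
        trips += 1
    return trips


size = 64 * 1024  # a 64 KB response
print(mss(), round_trips(size, 3, mss()), round_trips(size, 10, mss()))
```

In this model a 64 KB response needs one round trip fewer with an initial window of 10 than with 3, which matters most on high latency links.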
NETWORK - TCP CONGESTION WINDOW
▸ a server does not necessarily adhere to the client’s RWIN (receiver advertised window size)
▸ if the CWIN size is a lot smaller than the receiver’s RWIN, the initial transfer might not be optimal
[Diagram: the same handshake (SYN; SYN, ACK; ACK; GET /...) with RWIN 65k and RWIN 3*MSS advertised; here CWIN = 5 and the server sends 5 data packets]
NETWORK - TCP CONGESTION WINDOW
▸ net.ipv4.tcp_slow_start_after_idle
▸ controls whether the congestion window is reset to its default size for existing TCP connections that have been idle for too long
▸ persistent connections are likely to end up in this state
▸ default value = 1
▸ recommended value for performance - change to 0
NETWORK - TCP CONGESTION WINDOW
▸ changing CWIN and RWIN values in Linux
▸ ip route show
▸ ip route change default via 192.168.1.1 dev em1 proto static initcwnd 10
▸ ip route change default via 192.168.1.1 dev em1 proto static initrwnd 10
NETWORK - SOME TUNING PARAMETERS
▸ tuned : adaptive system tuning daemon
▸ applies tuning settings that give the most desirable performance
3. EXPERIENCES SHARING
LOAD STRESS TEST INFRA - ONLINE UNIV APPLICATION
LOAD STRESS TEST - ONLINE UNIV APPLICATION
[Charts; labels: untuned, tuned, INTERNAL]
DB CONNECTION LATENCY
Q&A
THANK YOU
adzmely@nexoprima.com
https://www.nexoprima.com
ANNEX
REFERENCES:
▸ https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide
▸ https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
▸ https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vmdirty_ratio/
▸ https://www.kernel.org/doc/Documentation/sysctl/vm.txt
▸ https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
▸ http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
▸ https://www.wikipedia.org/ : for various related topics
▸ others that I might have forgotten