CloudStack Tuning - The Linux Foundation

CloudStack Tuning
whoami
•  Name: Sudhansu Sahu
•  Current Role: Working as an SDE at Citrix R&D
India
•  5 years of experience in the cloud space,
since the cloud.com days
•  At Citrix I was a developer on the CPBM product,
then worked as a solution developer in worldwide
cloud services
•  Associated with Apache CloudStack for 6
months
Goal
•  To understand the various configurations (OS/
MySQL/Tomcat/Java/Management Server)
that have a direct impact on CloudStack
performance.
•  What will be the right value for these
configurations?
•  What is not configurable?
Overview
•  OS configurations
•  Management Server DB Configuration
•  Management Server Direct Agent Configurations
•  Management Server Indirect Agent Configurations
•  Secondary Storage Scalability Configuration and tuning
•  Management Server Jobs and their frequency
OS Configurations
"When I try to create 50 VMs for 50 accounts using cloudmonkey
async requests my installation after some operations end up with
stuck management-server - seems like it's working (logs are filling
with new rows), but at the same time it's doing nothing - no VM
creations, UI acting weird, API returns internal server error, routers
stuck in "starting" state, etc. Also I can't restart it in "normal" way
only with killing the java process. When I add delay for a minute
between VMs deployment CS is doing much better and all routers
+ VMs are created successfully."
"java.lang.OutOfMemoryError: unable to create new native thread"
To fix this, add the following lines to
/etc/security/limits.conf
cloud hard nofile 4096
cloud soft nofile 4096
and the following lines to
/etc/security/limits.d/90-nproc.conf
cloud soft nproc 8192
cloud hard nproc 8192
root soft nproc -1
root hard nproc -1
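After editing the limits files, it can help to confirm what limits a process actually sees. The following is a minimal sketch, assuming Linux and Python's standard `resource` module; the thresholds are the values from the lines above, and the helper names `meets` and `check_limits` are made up for illustration. It reports the limits of the process it runs in, so it would have to be run as the `cloud` user to be meaningful.

```python
import resource

# Thresholds taken from the limits.conf / 90-nproc.conf lines above.
REQUIRED = {"nofile": 4096, "nproc": 8192}
RLIMITS = {"nofile": resource.RLIMIT_NOFILE, "nproc": resource.RLIMIT_NPROC}

def meets(limit, required):
    """True if a limit satisfies the requirement (unlimited always does)."""
    return limit == resource.RLIM_INFINITY or limit >= required

def check_limits(required=REQUIRED):
    """Return {name: (soft_limit, ok)} for the current process."""
    report = {}
    for name, minimum in required.items():
        soft, _hard = resource.getrlimit(RLIMITS[name])
        report[name] = (soft, meets(soft, minimum))
    return report

if __name__ == "__main__":
    for name, (soft, ok) in check_limits().items():
        print(f"{name}: soft={soft} {'OK' if ok else 'TOO LOW'}")
```

The soft limit is checked because that is the limit actually enforced on the running process.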
Management Server DB Configuration
“A cloudstack Simulator based environment with 4K hosts, 4K
accounts, 12K VMs, 8K router VMs, 2 management server nodes,
8G heap size started showing hosts in alert and disconnected
state after 20 minutes of operation.”
“Caused by: org.apache.commons.dbcp.SQLNestedException:
Cannot get a connection, pool error Timeout waiting for idle
object”
The root cause was found to be too few active database
connections.
The configuration that determines the maximum number of active
connections is 'db.cloud.maxActive'.
The default value for this is 250. To address this issue we
set it to 1000 (in the db.properties file).
MySQL default configuration ‘max_connections’ is 214.
The total of the ‘maxActive’ settings across all management
servers should not exceed MySQL's max_connections value.
If you have 2 management server nodes with ‘maxActive’ set to 250
then max_connections should be at least 500.
If the maximum number of open files allowed is too small (the
default is 1024), then the my.cnf change (max_connections=500)
will be ignored.
Fix:
/etc/security/limits.conf
mysql hard nofile 4096
mysql soft nofile 4096
my.cnf
[mysqld]
open_files_limit = 4096
max_connections = 500
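The sizing rule above (the sum of ‘maxActive’ over all management servers must fit within MySQL's max_connections) can be sketched as a small check. This is an illustrative calculation only; the function names are hypothetical and it does not read db.properties or my.cnf.

```python
def required_max_connections(max_active_per_node):
    """Sum of db.cloud.maxActive over all management server nodes."""
    return sum(max_active_per_node)

def fits(max_active_per_node, mysql_max_connections):
    """True if MySQL's max_connections covers every node's pool."""
    return required_max_connections(max_active_per_node) <= mysql_max_connections

if __name__ == "__main__":
    # Two nodes at the default of 250 need at least 500 connections.
    print(required_max_connections([250, 250]))
    # Two nodes raised to 1000 each no longer fit in max_connections = 500.
    print(fits([1000, 1000], 500))
```

Under this rule, raising ‘maxActive’ to 1000 on both nodes would call for max_connections of at least 2000.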
Management Server Direct Agent Configurations
Direct Agent Configurations
•  direct.agent.load.size
•  direct.agent.scan.interval
•  direct.agent.pool.size
•  direct.agent.thread.cap
“Host reconnect taking too long. Takes 30 min to reconnect 300
hosts”
direct.agent.load.size
Default : 16
Purpose: Used for handling connect/disconnect for direct
agents. This is used when a new host (managed by direct
agents) gets added or removed and also when MS is
restarted
direct.agent.scan.interval
Default : 90
Purpose: Interval between scans to load agents
Every 90 sec, the agent scan task looks for 16 unmanaged hosts and
tries to reconnect them. To make this faster we have 2 options:
•  Decrease the scan interval
•  Increase the batch size
90 sec is a decent scan interval, so it is better to increase the
batch size. ‘direct.agent.load.size’ should be increased to enable
faster reconnection of hosts on restarts.
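A rough estimate shows why the defaults produce the 30 minute reconnect time quoted earlier. The sketch below assumes a simplified model that is not taken from the CloudStack source: every scan picks up exactly ‘direct.agent.load.size’ hosts, and scans run once per ‘direct.agent.scan.interval’. The function name is hypothetical.

```python
import math

def reconnect_time_seconds(hosts, load_size=16, scan_interval=90):
    """Estimated time to reconnect all hosts under the simplified model."""
    scans = math.ceil(hosts / load_size)  # number of scan rounds needed
    return scans * scan_interval

if __name__ == "__main__":
    for load_size in (16, 50, 100):
        minutes = reconnect_time_seconds(300, load_size) / 60
        print(f"load.size={load_size}: about {minutes:.1f} min for 300 hosts")
```

With the defaults, 300 hosts need 19 scans, or about 28.5 minutes, which is close to the observed 30 minutes. A batch size of 50 brings the estimate down to 9 minutes.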
Set this to a higher value as the number of hosts increases.
direct.agent.load.size    Number of Hosts
16 (Default)              <50
50                        100-500
100                       500-1000
500                       1000-5000
1000                      5000-20000
“Agents falling behind on ping for large number of hosts”
direct.agent.pool.size
Default : 500
Purpose: Used for sending commands directly to the HVs like
start/stop VMs etc. Also used for sending commands to other
resources like network providers etc
‘direct.agent.pool.size’ is the parameter to determine the thread
pool size for two thread pools used internally by MS.
•  direct agent thread pool
•  cron job thread pool
If you have 2K hosts then these thread pools will be exhausted,
resulting in delays and undesired behavior.
This needs to be configured appropriately based on the load on
the management server.
1K for 1K hosts. Increasing beyond 1K depends on the OS
configuration.
“1 cluster issue is propagating time out across the entire cloud”
direct.agent.thread.cap
Default : 1
Purpose: Percentage (as a value between 0 and 1) of
direct.agent.pool.size to be used as upper thread cap for a
single direct agent to process requests
With default value ‘1’
•  No restriction on the number of threads a direct agent can
use.
Fine as long as the host is responding to requests in a reasonable
amount of time.
If there is a delay in getting a response then
•  Threads remain blocked until the MS gets a response
•  More commands to the slow host block more
threads
•  Commands for healthy hosts do not get processed, as
there are not enough threads.
The solution is to localize the impact of bad hosts, so the entire
management server is not affected.
‘direct.agent.thread.cap’ is the fraction (between 0 and 1) of the
thread pool that can be used by a single direct agent.
If ‘direct.agent.pool.size’ = 1000
and ‘direct.agent.thread.cap’ = 0.1, then 100 threads can be used
by a direct agent.
How to find the proper value for thread cap:
‘direct.agent.pool.size’ = 3000. Assume 8 hosts per cluster.
•  First determine an upper cap on the direct agent threads for a
given cluster. Say 1000 (1/3 of direct.agent.pool.size).
•  Based on the number of hosts in the cluster, find the per-host/direct
agent limit. In this case 1000/8 = 125 threads per direct agent.
•  Based on the above, the thread cap comes to 125/3000 ≈
0.04.
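The steps above can be written as a short calculation. This is a sketch of the arithmetic only; the function names and the `cluster_thread_budget` parameter are hypothetical and are not CloudStack settings.

```python
def thread_cap(pool_size, cluster_thread_budget, hosts_per_cluster):
    """Fraction of direct.agent.pool.size allowed per direct agent."""
    per_agent = cluster_thread_budget // hosts_per_cluster  # threads per host
    return per_agent / pool_size

def threads_per_agent(pool_size, cap):
    """Number of threads a single direct agent may use for a given cap."""
    return round(pool_size * cap)

if __name__ == "__main__":
    cap = thread_cap(pool_size=3000, cluster_thread_budget=1000, hosts_per_cluster=8)
    print(f"direct.agent.thread.cap is about {cap:.2f}")
    print(threads_per_agent(1000, 0.1), "threads per agent at pool 1000, cap 0.1")
```

Because the cap applies per direct agent, a cluster with more hosts needs a smaller cap to stay within the same per-cluster thread budget.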
Management Server Indirect Agent Configurations
workers
Default : 5
Purpose: Number of worker threads handling remote agent
connections.
The size of the pool is based on the "workers" configuration
parameter.
The formula used for arriving at the thread pool size is: size =
(workers + 10) * 5. The default value of the "workers" parameter
is 5, so the default pool size is 75.
For large deployments, increase ‘workers’.
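To see how the pool grows with ‘workers’, the formula above can be evaluated for a few values. This is a sketch of the stated formula only; the function name is hypothetical and the sample values other than the default of 5 are arbitrary.

```python
def indirect_agent_pool_size(workers=5):
    """Thread pool size for remote agent connections: (workers + 10) * 5."""
    return (workers + 10) * 5

if __name__ == "__main__":
    for workers in (5, 10, 20, 50):
        print(f"workers={workers}: pool size {indirect_agent_pool_size(workers)}")
```

Each additional worker adds 5 threads to the pool, so the pool size grows linearly with ‘workers’.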
Secondary Storage Scalability Configuration and tuning
At least one SSVM per zone
CloudStack uses two different configuration options to determine
when to scale up and add more SSVMs.
•  secstorage.capacity.standby
Minimum number of command execution sessions that the
system is able to serve immediately.
•  secstorage.session.max
Maximum number of command execution sessions that an
SSVM can handle.
Management Server Jobs and their frequency
Stats Collector
The management server runs different StatsCollector tasks to
collect host, VM, router, and storage stats. Their frequency
depends on the following configurations:
vm.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) when vm stats are
retrieved from agents.
storage.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) when storage stats
(per host) are retrieved from agents.
router.stats.interval
Default: 300
Purpose: Interval (in seconds) to report router statistics.
host.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) when host stats are
retrieved from agents.
Q&A