CloudStack Tuning

whoami
• Name: Sudhansu Sahu
• Current role: SDE at Citrix R&D India
• 5 years of experience in the cloud space, since the cloud.com days
• At Citrix, I was a developer on the CPBM product, then worked as a solution developer in worldwide cloud services
• Associated with Apache CloudStack for the past 6 months

Goal
• To understand the various configurations (OS/MySQL/Tomcat/Java/Management Server) that have a direct impact on CloudStack performance.
• What are the right values for these configurations?
• What is not configurable?

Overview
• OS configurations
• Management Server DB Configuration
• Management Server Direct Agent Configurations
• Management Server Indirect Agent Configurations
• Secondary Storage Scalability Configuration and tuning
• Management Server Jobs and their frequency

OS Configurations

"When I try to create 50 VMs for 50 accounts using cloudmonkey async requests my installation after some operations end up with stuck management-server - seems like it's working (logs are filling with new rows), but at the same time it's doing nothing - no VM creations, UI acting weird, API returns internal server error, routers stuck in "starting" state, etc. Also I can't restart it in "normal" way only with killing the java process.
When I add delay for a minute between VMs deployment CS is doing much better and all routers + VMs are created successfully."

"java.lang.OutOfMemoryError: unable to create new native thread"

To fix this, add the following lines to /etc/security/limits.conf:

    cloud hard nofile 4096
    cloud soft nofile 4096

and the following lines to /etc/security/limits.d/90-nproc.conf:

    cloud soft nproc 8192
    cloud hard nproc 8192
    root soft nproc -1
    root hard nproc -1

Management Server DB Configuration

“A cloudstack Simulator based environment with 4K hosts, 4K accounts, 12K VMs, 8K router VMs, 2 management server nodes, 8G heap size started showing hosts in alert and disconnected state after 20 minutes of operation.”

“Caused by: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object”

The root cause was that too few active database connections were allowed.
The maximum number of active connections is set by 'db.cloud.maxActive'. The default value is 250.
To address this issue, we set it to 1000 (in the db.properties file).

The MySQL default for ‘max_connections’ is 214.
The total of the ‘maxActive’ settings across all management servers should not exceed MySQL's max_connections value.
If you have 2 management server nodes with ‘maxActive’ set to 250, then max_connections should be at least 500.
If the maximum number of open files allowed is too small (the default is 1024), the my.cnf change (max_connections=500) will be ignored.

Fix:

/etc/security/limits.conf

    mysql hard nofile 4096
    mysql soft nofile 4096

my.cnf

    [mysqld]
    open_files_limit = 4096
    max_connections = 500

Management Server Direct Agent Configurations

Direct Agent Configurations
• direct.agent.load.size
• direct.agent.scan.interval
• direct.agent.pool.size
• direct.agent.thread.cap

“Host reconnect taking too long.
Takes 30 min to reconnect 300 hosts”

direct.agent.load.size
Default: 16
Purpose: Used for handling connect/disconnect for direct agents. It applies when a new host (managed by direct agents) is added or removed, and also when the MS is restarted.

direct.agent.scan.interval
Default: 90
Purpose: Interval (in seconds) between scans to load agents.

Every 90 sec, the agent scan task looks for 16 unmanaged hosts and tries to reconnect them. To make this faster, we have 2 options:
• Decrease the scan interval
• Increase the batch size

90 sec is a decent scan interval, so it is better to increase the batch size.
‘direct.agent.load.size’ should be increased to enable faster reconnection of hosts on restarts. Set it to a higher value as the number of hosts increases.

direct.agent.load.size    Number of Hosts
16 (default)              <50
50                        100-500
100                       500-1000
500                       1000-5000
1000                      5000-20000

“Agents falling behind on ping for large number of hosts”

direct.agent.pool.size
Default: 500
Purpose: Used for sending commands directly to the HVs, such as start/stop VM. Also used for sending commands to other resources, such as network providers.

‘direct.agent.pool.size’ determines the size of two thread pools used internally by the MS:
• direct agent thread pool
• cron job thread pool

If you have 2K hosts, these thread pools will be exhausted, resulting in delays and undesired behavior.
Configure this value based on the load on the management server: 1K for 1K hosts. Increasing it beyond 1K depends on the OS configuration.

“1 cluster issue is propagating time out across the entire cloud”

direct.agent.thread.cap
Default: 1
Purpose: Fraction (a value between 0 and 1) of direct.agent.pool.size to be used as the upper thread cap for a single direct agent to process requests.

With the default value ‘1’:
• No restriction on the number of threads a direct agent can use. This is fine as long as the host responds to requests in a reasonable amount of time.
If there is a delay in getting a response, then:
• Threads remain blocked until the MS gets a response
• More commands to the slow host block more threads
• As a result, commands for healthy hosts are not processed, as there are not enough threads

The solution is to localize the impact of bad hosts, so that the entire management server is not affected.

‘direct.agent.thread.cap’ is the fraction of the thread pool that can be used by a single direct agent.
If ‘direct.agent.pool.size’ = 1000 and ‘direct.agent.thread.cap’ = 0.1, then up to 100 threads can be used by a direct agent.

How to find the proper value for the thread cap:
Assume ‘direct.agent.pool.size’ = 3000 and 8 hosts per cluster.
• First, determine an upper cap on the direct agent threads for a given cluster. Say 1000 (1/3 of direct.agent.pool.size).
• Based on the number of hosts in the cluster, find the per-host (per direct agent) limit. In this case, 1000/8 = 125 threads per direct agent.
• Based on the above, the thread cap comes to 125/3000 ≈ 0.04.

Management Server Indirect Agent Configurations

workers
Default: 5
Purpose: Number of worker threads handling remote agent connections.

The size of the thread pool is based on the "workers" configuration parameter. The formula for the thread pool size is: size = (workers + 10) * 5.
The default value of "workers" is 5, so the default pool size is 75.
For large deployments, increase ‘workers’.

Secondary Storage Scalability Configuration and tuning

At least one SSVM per zone.
CloudStack uses two configuration options to determine when to scale up and add more SSVMs:
• secstorage.capacity.standby: the minimum number of command execution sessions that the system is able to serve immediately.
• secstorage.session.max: the maximum number of command execution sessions that an SSVM can handle.

Management Server Jobs and their frequency

Stats Collector
The management server runs different StatsCollector tasks to collect host, VM, router, and storage stats.
Frequency depends on the following configurations:

vm.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) at which VM stats are retrieved from agents.

storage.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) at which storage stats (per host) are retrieved from agents.

router.stats.interval
Default: 300
Purpose: The interval (in seconds) at which router statistics are reported.

host.stats.interval
Default: 60000
Purpose: The interval (in milliseconds) at which host stats are retrieved from agents.

Q&A
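
Appendix: sizing arithmetic

The sizing rules from the DB, direct agent, and indirect agent slides can be collected into a few small helpers. This is only a minimal sketch in Python: the function names are hypothetical and not part of CloudStack, and the reconnect estimate assumes each scan picks up exactly direct.agent.load.size hosts, as the slides describe.

```python
import math


def required_max_connections(max_active_per_node):
    """Smallest MySQL max_connections that covers the sum of
    db.cloud.maxActive across all management server nodes."""
    return sum(max_active_per_node)


def indirect_agent_pool_size(workers):
    """Thread pool size formula quoted in the slides: (workers + 10) * 5."""
    return (workers + 10) * 5


def direct_agent_thread_cap(pool_size, cluster_thread_budget, hosts_per_cluster):
    """direct.agent.thread.cap as a fraction of direct.agent.pool.size.

    cluster_thread_budget is the number of threads you are willing to let
    one cluster consume (an operator's choice, e.g. 1/3 of the pool)."""
    threads_per_agent = cluster_thread_budget / hosts_per_cluster
    return threads_per_agent / pool_size


def reconnect_seconds(hosts, load_size, scan_interval):
    """Rough time to reconnect all hosts, assuming every scan
    (one per scan_interval seconds) reconnects load_size hosts."""
    scans = math.ceil(hosts / load_size)
    return scans * scan_interval


if __name__ == "__main__":
    # Numbers taken from the slides.
    print("max_connections >=", required_max_connections([250, 250]))
    print("indirect agent pool size:", indirect_agent_pool_size(5))
    print("thread cap:", round(direct_agent_thread_cap(3000, 1000, 8), 2))
    print("reconnect time (s):", reconnect_seconds(300, 16, 90))
```

With the numbers from the slides this gives 500 connections, a pool size of 75, a thread cap of about 0.04, and 1710 seconds (about 28.5 minutes) to reconnect 300 hosts at the default load size, which is in line with the reported 30 minutes.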