Performance Management - vCenter and vRealize Operations

Performance Management
Iwan ‘e1’ Rahabok VCAP-DCD, TOGAF Certified, vExpert
Staff SE (Strategic Accounts) & CTO Ambassador
e1@vmware.com | 9119-9226 | Linkedin.com/in/e1ang | Twitter: e1_ang
https://www.facebook.com/groups/vmware.users/
http://virtual-red-dot.info
© 2014 VMware Inc. All rights reserved.
Warm-up exercise
 You got an email from the app team, saying the main Intranet application was slow.
• The email was 1 hour ago. The email stated that it was slow for 1 hour, and it was ok after that.
• So it was slow between 1-2 hours ago, but ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
• You are not familiar with the applications. You do not know which apps run on each VM as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 500 VM, 40 datastores, 1 midrange array, 10 GE, iSCSI storage
Test your vSphere knowledge!
How do you solve/approach this with just vSphere?
What do you do?
 A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE 
 B: No sweat, you’re VCDX + CCIE + ITIL Master. You’re born for this.
 C: SMS your wife, “Honey, I’m staying overnight at the datacenter  “
 D: Take a blood pressure medicine so it won’t shoot up.
 E: Buy the app team very nice dinner, and tell them to keep quiet.
Performance: How do you know it’s optimised?
• What do you measure?
– Utilisation?
• Utilisation of 100% means it’s performing…?
• Utilisation of 5% means it’s performing…?
• Utilisation of 50% means it’s performing…? Really? 
– Something else?
• What is that something else? 
To understand this “something else”, we need to go back to fundamentals.
CONFIDENTIAL
What do we care about at each layer?
VM
1. We care if it is being served well by the platform. Other VMs are irrelevant from the VM owner's point of view. Make sure it is not contending for resources.
2. We check if it is sized properly. If too small, increase its configuration. If too big, right-size it for better performance.
SDDC
1. We care if it is serving everyone well. Make sure there is no contention for resources among all the VMs in the platform.
2. We check for overall utilisation. Too low, we are not investing wisely in hardware. Too high, we need to buy more hardware.
Take Away: Contention and Utilisation
• Unlike physical DC, in virtual infrastructure….
– we use Contention, not Utilisation, as the primary counter for Performance Management
– we use Utilisation (short range) as the secondary counter for Performance Management
– we use Utilisation (long range) for Capacity Management
• Contention is how you measure that the platform is performing well.
• Sounds good! But how do you measure “Contention”?
Performance: The counters
• What counters prove that it is optimised?
– You need a technical fact to assure yourself
• Either that, or take a sleeping pill at night 
– You need a technical fact to show to your customers
• Your SLA must be based on something concrete, not subject to interpretation or “feeling of the day”
– If you can’t prove it, how does anyone know it is optimised? ;-)
Optimized Infrastructure Performance*
• CPU
• RAM
• Storage
• Network
* While keeping Cost in mind
How a VM gets its resources
[Figure: a scale from 0 up to Provisioned, with markers labelled Reservation, Demand, Usage, Entitlement, Contention and Limit]
VM CPU: The 4 States
VM CPU: What do you monitor?
• Contention
– Ready (ms)?
– Co-Stop (ms)?
– Latency (%)?
– Max Limited (ms)?
– Overlap (ms)?
– Swap Wait (ms)?
• Utilisation
– Used (ms)?
– Usage (%)?
– Demand (MHz)?
Quiz Time! 
What’s the difference between Average, Summation and Latest?
How does timeline impact the value?
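One way to see why the timeline matters: a Summation counter such as Ready (ms) is a total over the sampling interval, so the same number of milliseconds means very different things on different charts. The sketch below is illustrative only. The function name is made up for this example, and the interval lengths (20 seconds for the real-time chart, 300 seconds for the past-day chart) are assumed to be the vCenter defaults in your environment.

```python
def summation_ms_to_percent(value_ms: float, interval_seconds: int) -> float:
    """Express a summation counter (ms) as a percentage of its sampling interval."""
    interval_ms = interval_seconds * 1000
    return value_ms / interval_ms * 100


# The same 1000 ms of Ready time, read from two different charts.
# Assumed intervals: 20 s (real-time chart) and 300 s (past-day chart).
realtime_pct = summation_ms_to_percent(1000, 20)
past_day_pct = summation_ms_to_percent(1000, 300)

print(f"Real-time chart: {realtime_pct:.2f}%")
print(f"Past-day chart:  {past_day_pct:.2f}%")
```

A value that looks alarming on the real-time chart can be negligible on the past-day chart, which is why a percentage counter is easier to put in an SLA than a millisecond one.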
VM CPU: What you should monitor
vCenter Operations
• Contention:
– Contention (%)
• Utilisation
– Workload (%)
vCenter
• Contention
– Latency (%)
– Max Limited (if applicable)
• Utilisation
– Usage (%)
– Demand (MHz)
Discussion Time!
What should the value be for an optimized environment?
One more thing…
• Hypervisor does not have visibility inside the Guest OS.
• There is 1 particular CPU counter that you should get. It tells you that there is not enough CPU to meet demand.
• vRealize Operations (via Hyperic) does not collect this counter
• Which counter is that?
Enough about CPU.
Let’s move to RAM!
Quiz Time! 
• Which of the following sentences are True:
– Ballooning is bad. If you see a VM has balloon, that VM has a memory performance problem.
– Ballooning happens before Compression, which happens before Swapping. If you see a VM has Compressed memory but not Ballooned memory, that vCenter is buggy, or your eyes are just tired.
– If all the VMs in the ESXi host have a low Usage counter, then the ESXi must also be low.
– Turn on Large Page, and there goes all your TPS.
– To check if a VM has memory contention, check its CPU Swap Wait counter.
– Why are all the questions difficult?!
• Answer
– Ballooning indicates the ESXi has memory pressure. It does not mean the VM has a memory performance issue.
– Pages remain compressed or swapped if they are not accessed.
– Usage counter is different in VM and ESXi! In VM, it is Active. In ESXi, it is Consumed. This is due to the 2-level memory concept.
– Yes, unless your ESXi is under heavy memory constraint.
2 levels of Memory Hierarchy
OS
Hypervisor
• New hierarchy in VMware’s memory overcommit technology
• Transparent Page Sharing
• Ballooning
• Memory Compression
• Swap to Host Cache (SSD)
• Disk swapping
• Decompression is sub-ms compared to swap (15-20 ms)!
vSphere Memory Management
• 2 types of Memory Management
– Guest OS level
• Balloon
– Hypervisor level
• TPS
• Compression, Swap to disk, Swap to cache (SSD)
Volunteer Time!
Explain Balloon, TPS, Compression.
VM RAM: What do you monitor?
• Contention
– Swapped?
– Balloon?
– Compressed?
– Latency?
– CPU Swap Wait?
• Utilisation
– Active?
– Usage?
– Consumed?
VM RAM: What you should monitor
vCenter Operations
• Contention:
– RAM Contention (%)
• Utilisation
– Workload (%)
– Consumed (KB)
vCenter
• Contention
– Latency (%)
– CPU Swap Wait (ms)
• Utilisation
– Usage (%)
– Consumed (KB)
Discussion Time!
What should the value be for an optimized environment?
One more thing…
• Hypervisor does not have visibility inside the Guest OS.
• There is 1 particular RAM counter that you should get. It tells you that there is not enough RAM to meet demand.
• Which counter is that?
• You can monitor Guest OS paging activity by separating the page file into its own vmdk.
– You can then use vC Ops to analyse the pattern.
Enough about RAM.
Let’s move to Storage!
Quiz Time! 
• Which of the following sentences are True:
– The latency counter is the (Write Latency + Read Latency) / 2
– If you have RDM, vCenter does not track the latency.
– If the VM virtual disk counter shows 1000 IOPS, but the VM datastore counter shows 2x IOPS, something is seriously wrong. Time to call your TAM!
– If all your VMs are experiencing high latency, the first thing you do is check the VMkernel queue
• Answer
– It is not. It takes into account the number of commands issued. It’s a weighted average.
– It only tracks the latency at the latest data point. It does not include other data during the collection period.
– Check for snapshots. Snapshot IOPS is transparent to the virtual disk.
– The first thing you do is check the physical device queue and your storage array. VMkernel queue latency rarely exceeds 1 ms.
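To illustrate the first answer, here is a minimal sketch of how an average weighted by the number of commands differs from the simple (Read + Write) / 2. The function names and the sample numbers are made up for illustration. This shows the arithmetic of a weighted average, not the exact formula vCenter uses internally.

```python
def simple_average(read_ms: float, write_ms: float) -> float:
    """The naive (Read Latency + Write Latency) / 2."""
    return (read_ms + write_ms) / 2


def weighted_latency(read_ms: float, reads: int, write_ms: float, writes: int) -> float:
    """Average latency weighted by the number of read and write commands."""
    total = reads + writes
    if total == 0:
        return 0.0  # no commands issued, so there is nothing to average
    return (read_ms * reads + write_ms * writes) / total


# Hypothetical sample: 900 reads at 2 ms and 100 writes at 20 ms.
naive = simple_average(2.0, 20.0)
weighted = weighted_latency(2.0, 900, 20.0, 100)

print(f"Simple average:   {naive:.1f} ms")
print(f"Weighted average: {weighted:.1f} ms")
```

With a read-heavy workload the weighted value sits much closer to the read latency, which is what most commands actually experienced.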
VM Storage: Where and what do you monitor?
Virtual Disk
Datastore
Disk
VM Storage: where to monitor
• For vmdk, use Datastore metric groups.
• For RDM, use Disk metric groups
• Disk metric group is naturally not relevant for NFS (files)
[Figure: a VM with Disk 1, Disk 2 and Disk 3 presented as vDisks scsi0:0, scsi0:1 and scsi0:2, backed by a VMFS datastore, an NFS datastore and an RDM, with the underlying disks]
VM Storage: What do you monitor?
vCenter Operations
• Contention
– Latency (ms)
• Utilisation
– Commands per second
– Usage (KBps)
– Workload (%)
vCenter
• Contention
– Latency (ms)
• Utilisation
– Commands Issued
– Usage (KBps)
VM Network
• Contention
– Dropped packets
– Packet retransmits
• Utilisation
– Network throughput
• Limitations
– We cannot monitor latency (e.g. between source and destination)
Different Tiers, Different Optimization
• Business Logic:
– Tier 1 is optimised for Performance and Availability
– Tier 3 is optimised for Cost
• Do you allow Tier 1 VM on Tier 3 Storage?
– Or you map the Compute Tier to the Storage Tier?
• What distinguishes Tier 1 from Tier 3?
– Availability
– Performance
– Monitoring
– Cost!
Tiering: Considerations
• Compute
– No of spare host
– No of hosts
– Consolidation Ratio (VM:Host)
– vCPU:pCPU Oversubscribed
– vRAM:pRAM Oversubscribed
– Clustering (e.g. VCS)
• Storage
– IOPS per VM
– Latency
• Monitoring
– Application availability monitoring (e.g. AppHA)
– Application performance monitoring (e.g. vC Ops Enterprise)
• Availability
– Automated DR (SRM)
– RPO
– RTO
3-Tiers Offering: Example
                                       Tier 1       Tier 2       Tier 3
No of spare host                       2            1            1
No of hosts                            6            8            10
Consolidation Ratio (VM:Host)          10:1         20:1         40:1
vCPU:pCPU Oversubscribed               n/a          2.0x         4.0x
vRAM:pRAM Oversubscribed               n/a          1.5x         2.0x
IOPS per VM                            400          200          100
Latency                                <10 ms       15-20 ms     20-25 ms
Clustering (e.g. VCS)                  Yes          Yes          No
Application monitoring (e.g. AppHA)    Yes          Yes          No
Apps                                   Yes          Yes          No
Automated DR (SRM)                     Yes          Yes          Yes
RPO                                    5 minutes    1-2 hour     2-8 hours
RTO                                    1 hour       <2 hours     <4 hours
Demystifying “Peak”
• There are 2 types of “Peak”
– Peak across time
– Peak across objects
• Impacts
– Peak across time can be too high if the burst is high
• VM is low for 24 hours, burst to 100% for 5 minutes, and you get 100% reported.
– Peak across time can be too low if the number of member objects is high.
• Peak of a cluster in the past 1 day is 70%. That means at least 1 host was >70%.
– Peak across objects can be too high if the load is unbalanced
• Happens when cluster utilisation is not high enough to trigger DRS or Storage DRS
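The two kinds of peak are easier to compare with numbers. The sketch below uses made-up utilisation samples (in %) and hypothetical host names. It is only meant to show how the choice of aggregation changes the reported "peak".

```python
# Peak across time: a VM idles at 5% all day and bursts to 100% for one
# 5-minute sample (288 five-minute samples make up one day).
vm_samples = [5] * 287 + [100]
vm_peak = max(vm_samples)
vm_average = sum(vm_samples) / len(vm_samples)

# Peak across objects: three hosts in one cluster, four samples each.
# host-1 is unbalanced and spikes in the third sample.
hosts = {
    "host-1": [20, 25, 90, 30],
    "host-2": [20, 25, 30, 30],
    "host-3": [20, 25, 30, 30],
}
samples_per_time = list(zip(*hosts.values()))

# Cluster value = average of its member hosts at each point in time.
cluster_average = [sum(s) / len(s) for s in samples_per_time]
peak_of_cluster_average = max(cluster_average)

# Busiest single host at each point in time.
peak_across_hosts = max(max(s) for s in samples_per_time)

print(f"VM peak {vm_peak}%, VM average {vm_average:.1f}%")
print(f"Peak of cluster average {peak_of_cluster_average:.0f}%")
print(f"Peak across hosts {peak_across_hosts}%")
```

The cluster-level peak hides the one busy host, while the VM-level peak exaggerates a short burst. Which number you report depends on the question being asked.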
Sample SLA and Internal Threshold
SLA
                  Tier 1    Tier 2    Tier 3
CPU Contention    1%        3%        13%
RAM Contention    0%        5%        10%
Disk Latency      10 ms     20 ms     30 ms
Internal Threshold
                  Tier 1    Tier 2    Tier 3
CPU Contention    0.5%      2%        10%
RAM Contention    0%        2%        8%
Disk Latency      10 ms     15 ms     20 ms
SLA only applies to VM.
VM owner does not care about underlying platform
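A minimal sketch of how the two sets of numbers could be used together. It assumes the first set of values on this slide (1% / 3% / 13% CPU, and so on) is the SLA and the second, stricter set is the internal threshold. The dictionary keys, the function name and the "ok / warning / breach" labels are made up for this example.

```python
# Thresholds copied from this slide.
# Assumption: first set = SLA, second (stricter) set = internal threshold.
SLA = {
    1: {"cpu_pct": 1.0, "ram_pct": 0.0, "disk_ms": 10.0},
    2: {"cpu_pct": 3.0, "ram_pct": 5.0, "disk_ms": 20.0},
    3: {"cpu_pct": 13.0, "ram_pct": 10.0, "disk_ms": 30.0},
}
INTERNAL = {
    1: {"cpu_pct": 0.5, "ram_pct": 0.0, "disk_ms": 10.0},
    2: {"cpu_pct": 2.0, "ram_pct": 2.0, "disk_ms": 15.0},
    3: {"cpu_pct": 10.0, "ram_pct": 8.0, "disk_ms": 20.0},
}


def classify(tier: int, observed: dict) -> str:
    """Compare a VM's contention counters with its tier's thresholds."""
    if any(observed[key] > limit for key, limit in SLA[tier].items()):
        return "breach"   # the SLA promised to the VM owner is broken
    if any(observed[key] > limit for key, limit in INTERNAL[tier].items()):
        return "warning"  # still within SLA, but past the internal threshold
    return "ok"


# Hypothetical Tier 2 VM: CPU contention 2.5%, RAM contention 1%, disk latency 12 ms.
print(classify(2, {"cpu_pct": 2.5, "ram_pct": 1.0, "disk_ms": 12.0}))
```

The gap between the two tables is the early-warning buffer: the infrastructure team can react at "warning" before the VM owner ever sees a breach.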
Where to monitor at the Platform level?
• Compute
– Host?
– Cluster?
– Datacenter?
– vCenter?
• Storage
– Host?
– Cluster?
– Datastore?
– Datastore Cluster?
– Datacenter?
– vCenter?
• Network
– Standard Switch and port group? :-)
– Host?
– Distributed Switch?
– Distributed Port Group?
Where to monitor
Not here
• Compute
– Host
• Storage
– Host
– Cluster
• Network
– Host
Monitor these
• Compute
– Cluster
– Datacenter
• Storage
– Datastore
– Datastore Cluster
• Network
– Distributed Switch
– Distributed Port Group
DRS (and Storage DRS) will balance the cluster
QoS in a shared environment
• QoS is mandatory in a shared environment
• Areas to control
– Compute
– Network
– Storage
• CPU and RAM
– Shares
– Reservation
– Limit?
– Resource Pool?
• Storage I/O Control
• Network I/O Control
QoS: Compute
• When not to use Resource Pool?
• When to use Resource Pool?
• What’s the impact of Reservation?
– HA Slot Size. Unless you use %
– Boot time
– Oversubscribe ability. You cannot go beyond 100% reservation.
QoS: Storage
• A single VM can hog storage throughput
– Just need to run IOmeter 
– Unfairly penalizes VMs on hosts with high consolidation ratios
• Existing resource management only works for VMs on the same host
• SIOC calculates datastore latency to identify contention
– Latency is a normalized, average across VMs
– IO size and IOPS included
[Figure: Without Storage IO Control. Actual disk resources utilized by each VM are not in the correct ratio. Two ESX Servers, each with device queue depth 24, split 25% / 75% on one host and 100% on the other. Storage array queue: 38%, 12%, 50%. Storage congested. Without SIOC, latency is unbounded.]
QoS: Storage
• SIOC enforces fairness when datastore latency crosses threshold
– Dynamic threshold setting
– Fairness enforced by limiting VMs access to queue slots
• What’s the limitation?
– No inter-datastore awareness
– Does not work on RDM
– Non VM workload not included
• Work with your Storage team.
– Auto-tiering array is supported
[Figure: With Storage IO Control. Actual disk resources utilized by each VM are in the correct ratio even across ESX hosts. VM A 1500 shares, VM B 500 shares, VM C 500 shares. Two ESX Servers, device queue depth 24 (split 25% / 75%) and 6 (100%, storage queue throttled). Storage array queue: 60%, 20%, 20%. Storage controlled. With SIOC, latency is controlled.]
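The "correct ratio" on the SIOC slide follows directly from the share values shown there (VM A 1500, VM B 500, VM C 500). The sketch below is only the arithmetic behind that ratio. It is not how SIOC is implemented, and the function name is made up for this example.

```python
def share_ratio(shares: dict) -> dict:
    """Return each VM's proportional entitlement (in %) from its share value."""
    total = sum(shares.values())
    return {vm: value * 100 / total for vm, value in shares.items()}


# Share values as shown on the slide.
entitlement = share_ratio({"VM A": 1500, "VM B": 500, "VM C": 500})

for vm, pct in entitlement.items():
    print(f"{vm}: {pct:.0f}%")
```

This gives 60% / 20% / 20%, which matches the storage array queue split in the "With SIOC" picture, regardless of which host each VM runs on.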
Key Takeaways
• Optimization in SDDC has a lot more components than we normally think
• Contention is 1st. Utilisation is 2nd.
• SLA is at VM level, not Infrastructure level.
• Peak can be too low or too high.
• Anything else?