Christophe Dubois / 3PAR Ninja team
20.03.2013
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A discussion about performance of HP 3PAR StoreServ and basic troubleshooting techniques
RTFM – Read the concept guide!
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c02986000/c02986000.
Or search for "3PAR concept guide" in your favourite search engine
When sizing for or troubleshooting performance, consider the following limits, which apply to any 3PAR system:
• How many IOPS the controller nodes can sustain
• How many IOPS the physical disks can sustain
• Block size matters!
• How much read bandwidth the controller nodes can sustain
• How much write bandwidth the controller nodes can sustain
• Software limitation of the write cache algorithm
• If using AO, the IO locality profile
Depending on the type of IOs (read/write), whether they are random/sequential, the RAID type and the IO size, the ratio between front-end and back-end will vary
Cache hit
• Read cache hit is a read IO to a portion of data that is already in cache
• A write cache hit is a write IO to a portion of data that is already in write cache but has not been de-staged to disk yet
Note that when doing sequential read IOs, the system will report very high read hit ratios (99%+) because the pre-fetching algorithm puts the data in cache
How RAID write overhead is calculated

RAID 1 – Writes
1. Write new data to 1st mirror (1 IO)
2. Write new data to 2nd mirror (1 IO)
Total back-end IOPS per 1 new write = 2

RAID 5 – Writes
1. Read old data block (1 IO)
2. Read old parity block (1 IO)
3. Calculate new parity block (0 IO)
4. Write new data block (1 IO)
5. Write new parity block (1 IO)
Total back-end IOPS per 1 new write = 4

RAID 6 – Writes
A more complicated process: in addition to the data block, two parity blocks must be read, recalculated and rewritten.
Total back-end IOPS per 1 new write = 6.66
Depending on the type of IOs (read/write), whether they are random/sequential, the RAID type and the IO size, the ratio between front-end and back-end will vary
RAID1
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = at least 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 2 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 2 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
RAID5
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = at least 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 4 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 1 KiB/s * (setsize / (setsize – 1)) of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
RAID6
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 6.66 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 1 KiB/s * (setsize / (setsize – 2)) of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
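Putting the ratios above together, a rough estimator can translate a front-end workload into back-end load. This is a minimal sketch in Python, not a 3PAR tool; the function names are invented for illustration and cache hits are assumed to be 0%:

def backend_random_iops(frontend_iops, read_fraction, raid):
    """Estimate back-end IOPS for a random workload (ignores cache hits)."""
    write_penalty = {"RAID1": 2, "RAID5": 4, "RAID6": 6.66}[raid]
    reads = frontend_iops * read_fraction
    writes = frontend_iops * (1 - read_fraction)
    return reads + writes * write_penalty            # 1 back-end IO per random read

def backend_seq_write_kibps(frontend_kibps, raid, setsize=8):
    """Estimate back-end KiB/s for sequential writes, using the formulas above."""
    if raid == "RAID1":
        return frontend_kibps * 2
    if raid == "RAID5":
        return frontend_kibps * setsize / (setsize - 1)
    return frontend_kibps * setsize / (setsize - 2)  # RAID6

# 10,000 front-end IOPS, 70% read, on RAID 5: 7,000 + 3,000 x 4 = 19,000 back-end IOPS
print(backend_random_iops(10_000, 0.70, "RAID5"))
# 500 MiB/s of sequential writes to RAID 6 with a set size of 8 (6+2): ~667 MiB/s on the back end
print(backend_seq_write_kibps(500 * 1024, "RAID6", 8) / 1024)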
V-class
Approximately 120K backend IOPS per node pair
Note that with read IOs on V-class, the number does not scale linearly with the number of node pairs. It does scale linearly for write IOs
T/F-class
Approximately 64K backend IOPS per node pair
These numbers do not include cache IOPS.
Cache IOPS are not characterised, however during the SPC-1 benchmark for the V800 it is estimated that on top of the 120K backend IOPS, each node pair was doing 37K cache IOPS
The following limits are front-end :
V-class
• Reads : approximately 3250 MB/s per node pair
• Writes : 1500 MB/s if only 2 nodes, 2600 MB/s per node pair when using more than 2 nodes
T-class
• Reads : approximately 1400 MB/s per node pair
• Writes : 600 MB/s per node pair
F-class
• Reads : approximately 1300 MB/s per node pair
• Writes : 550 MB/s per node pair
StoreServ 7200     8KB Rand Read IOPS   8KB Rand Write IOPS   Seq. Read MB/sec   Seq. Write MB/sec
15K HDD            drive limited        drive limited         2,500              1,200
SSD                150,000              75,000                2,800              1,200

StoreServ 7400     8KB Rand Read IOPS   8KB Rand Write IOPS   Seq. Read MB/sec   Seq. Write MB/sec
15K HDD            drive limited        drive limited         4,800              2,400
SSD                320,000              160,000               4,800              2,400
Drive Limited = the performance depends on the number of drives and the drive capabilities and is not limited by the controllers or interconnects.
Recommended Max drive IOPS
The following numbers are for small IOs (8-16 KB)
15K RPM disks = 200 backend IOPS per PD recommended. Can do 250 reasonably well
10K RPM disks = 150-170 backend IOPS
7.2K RPM disks (NL) = 75 backend IOPS per PD recommended
SSD = It depends!
• On type of IO
• On type of RAID (yes, even for back-end IOs)
• But 3000 IOPS per disk is a safe assumption
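A hedged sizing sketch combining these per-PD recommendations with the per-node-pair limits quoted earlier; the helper names and the single hard-coded figure per drive type are my own simplifications:

import math

PD_IOPS = {"15K": 200, "10K": 160, "NL": 75, "SSD": 3000}       # recommended back-end IOPS per PD
NODE_PAIR_IOPS = {"V-class": 120_000, "T/F-class": 64_000}      # approximate back-end IOPS per node pair

def pds_needed(backend_iops, drive_type):
    """Minimum number of PDs to sustain a given back-end IOPS load."""
    return math.ceil(backend_iops / PD_IOPS[drive_type])

def node_pairs_needed(backend_iops, node_class="V-class"):
    """Minimum number of node pairs for the same back-end load."""
    return math.ceil(backend_iops / NODE_PAIR_IOPS[node_class])

# The 19,000 back-end IOPS from the earlier RAID example, on 15K FC drives behind V-class nodes:
print(pds_needed(19_000, "15K"))          # 95 PDs
print(node_pairs_needed(19_000))          # 1 node pair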
SSD 100/200 GB – Max IOPS per drive

                RAID 1, Aligned   RAID 1, Unaligned   RAID 5, Aligned   RAID 5, Unaligned
100% Reads      8,000             7,200               7,500             7,000
70% / 30%       6,000             4,500               3,300             1,700
50% / 50%       5,000             4,000               3,000             1,500
30% / 70%       5,000             4,000               2,800             1,400
100% Writes     5,000             4,000               2,800             1,400

We recommend using a figure of 3,000 IOPS per drive when sizing with 100-200 GB SSDs.
What is impacted significantly by DIF?
Most IOPS are not affected significantly by the introduction of DIF (e.g., reads, R5, R6, FC 15K, SSDs, sequential)
RAID 1 random writes on 10K and NL PDs are affected UNLESS the IO size is a multiple of 16K AND the IO is aligned
ALIGNMENT
Usually ignored by 3PAR in the past, as it had only a secondary performance impact (load on PDs)
Increasingly an issue (e.g., with DIF, SSDs)
All random IOPS numbers on 3PAR are given for small IOs: 8 KB
When doing IOs larger than 8 KB, the number of back-end IOs a system can sustain may drop off significantly as the IO size increases
For FC/NL disks, there is virtually no difference between 8 KB and 16 KB IOs.
Above 16 KB, the number of IOs per PD degrades
For SSDs, since a cell is 8 KB, any IO larger than 8 KB will cause a performance degradation
Remember that when using large blocks, the bandwidth limitation can be reached faster than the IOPS limitation!
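To see why bandwidth can become the ceiling before IOPS with large blocks, here is a quick back-of-the-envelope calculation using the approximate V-class per-node-pair figures quoted earlier (the crossover size below is purely illustrative):

read_bw_limit_mb_s = 3250          # approx. read bandwidth per V-class node pair
backend_iops_limit = 120_000       # approx. back-end IOPS per V-class node pair

# IO size at which the bandwidth limit is reached at the same time as the IOPS limit
crossover_kb = read_bw_limit_mb_s * 1000 / backend_iops_limit
print(f"Above ~{crossover_kb:.0f} KB per IO, bandwidth (not IOPS) becomes the bottleneck")

# At 64 KB reads, the bandwidth cap allows only:
print(int(read_bw_limit_mb_s * 1000 / 64), "IOPS")   # ~50,781 - well below the 120K IOPS limit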
With 32x 15K disks, RAID 5, fully allocated VV
Maximum IOPS vs block size (8 KB = 100%)

          100% Reads   50% Read / 50% Write   100% Writes
8 KB      100%         100%                   100%
16 KB     98%          98%                    98%
32 KB     92%          81%                    63%
64 KB     72%          52%                    47%

This graph is not 100% accurate, but is used to show the drop in IOPS when the block size is increased. Values may vary by 5-10%.
With 16x 100 GB SSDs, RAID 5, fully allocated VV
Maximum IOPS vs block size (8 KB = 100%)

          100% Reads   50% Read / 50% Write   100% Writes
8 KB      100%         100%                   100%
16 KB     93%          91%                    –
32 KB     52%          50%                    –
64 KB     50%          30%                    –

This graph is not 100% accurate, but is used to show the drop in IOPS when the block size is increased. Values may vary by 5-10%.
Performance drop-off with block size increase
Rules of thumb (for a mixed workload: 50% read, 50% write, 100% random)

FC & NL disks
- 8 KB: IOPS given by the HP sizer
- 16 KB: ~same as with an 8 KB block size
- 32 KB: ~80% of the 8 KB throughput
- 64 KB: ~50% of the 8 KB throughput

SSDs
- 8 KB: IOPS given by the HP sizer
- 16 KB: ~90% of the 8 KB block size figure
- 32 KB: ~50% of the 8 KB throughput
- 64 KB: ~30% of the 8 KB throughput
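These rules of thumb can be applied mechanically when derating a sizer figure. A minimal sketch, assuming the 50/50 random mixed workload the rules are stated for (the table of factors simply restates the values above):

DERATE = {
    "FC/NL": {8: 1.00, 16: 1.00, 32: 0.80, 64: 0.50},
    "SSD":   {8: 1.00, 16: 0.90, 32: 0.50, 64: 0.30},
}

def derated_iops(sizer_iops, media, block_kb):
    """Apply the block-size derating factor to an 8 KB sizer IOPS figure."""
    return int(sizer_iops * DERATE[media][block_kb])

# The sizer predicts 20,000 IOPS at 8 KB on FC, but the host actually issues 32 KB IOs:
print(derated_iops(20_000, "FC/NL", 32))   # ~16,000 IOPS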
While there is a difference in the number of back-end IOs required for a front-end IO on a Thin VV compared to a full VV, this only applies to the first write
This is usually completely transparent to the user and the application, since the system will acknowledge the IO to the host and write to disk afterwards
Most applications usually “prepare” new capacity before using it
After the first write, there is absolutely no difference between Thin and full VVs
[Chart: the same workload (100% random read, then 100% random write, then 50/50) driven against a full VV and a Thin VV – front-end IOs on the full VV and on the Thin VV, with the same amount of back-end IOs]
Impact of Snapshots
• A snapshot is a point-in-time copy of a source Virtual Volume
− It can use any RAID level
− Reservation-less – it has no "background normalization" process
− Normalization occurs as a result of host writes to the source VV
• The Copy-on-Write can cause increased host IO latencies
− Before "new" data can be written to disk, the "old" data needs to be read off the disks and copied to the snapshot space. This is called a Copy-On-(first)-Write, or COW.
• Copy-on-Writes cause additional back-end disk IOs and increased host IO latency
As long as the system is not maxed out in terms of back-end IOs, snapshots will have a marginal impact
[Chart: back-end IO/s, front-end IO/s, and read/write response times over time, before and after "Create Snapshot"]
3PAR flusher and back-end disks
• 3PAR arrays perform Copy-on-Write operations in the background, allowing hosts to see electronic (cache) latencies
• Write latencies CAN still increase if there is sufficient activity to slow down the cache flush rate
− Busy drives = slower cache flush rate
• The time estimates below are for a system that has enough buffers available to support the workload

Flow of an 8 KB host write to a snapped volume:
1. The host issues the 8 KB SCSI write (t = 0.25 ms); the data lands in 3PAR cache and Status Good is returned (t = 0.25 ms)
2. In the background, the flusher reads the 16 KB of old data from the disk drives (t = 6 ms)
3. The old data is stored into SD space and the update to SA/SD space is flushed (t = 4 ms)
4. The new 8 KB of host data is written to its original location on disk (t = 4 ms)
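A tiny illustration of what the flow above means for latency, using the example timings (which assume free buffers and reasonably idle drives):

host_visible_ms = 0.25 + 0.25        # SCSI write + Status Good, served from cache
background_ms = 6 + 4 + 4            # read old 16 KB + write old data to SA/SD space + write new data
print(f"Host sees ~{host_visible_ms} ms; the flusher does ~{background_ms} ms of back-end work per COW")
# As long as the drives can absorb this background work, snapshots stay cheap for the host;
# if they cannot, the flush rate slows and write latency eventually rises (delayed ack).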
Performance Summary
Base workload: 70/30, 8 KB (values are IOPS / average latency)

                         Base workload     1 Virtual Copy    2 Virtual Copies   4 Virtual Copies   8 Virtual Copies
SS7200, 3.1.2, RAID 1    11,500 / 4.1 ms   11,000 / 4.2 ms   10,900 / 4.3 ms    10,200 / 4.6 ms    8,500 / 5.6 ms
SS7200, 3.1.2, RAID 5     9,000 / 5.2 ms    9,000 / 5.2 ms    8,900 / 5.3 ms     8,100 / 5.5 ms    8,200 / 5.8 ms
SS7200, 3.1.2, RAID 6     8,000 / 6.0 ms    8,000 / 6.0 ms    8,000 / 6.0 ms     7,600 / 6.2 ms    7,300 / 6.5 ms
[Chart: OLTP 70/30 8 KB, queue depth 48, 144x 15K drives, SS7200, 1 snapshot – baseline 11,500 IOPS; average latency (ms) over time as the snapshot is created and then removed]
[Chart: OLTP 70/30 8 KB, queue depth 48, 144x 15K drives, SS7200, 8 snapshots – baseline 11,500 IOPS / 4 ms; average latency (ms) over time as the 8 snapshots are created and then removed]
                           7200 (2-node)       7400 (2-node / 4-node)
Control cache              16 GB               16 GB / 32 GB
Data cache                 (4) 8 GB            (8) 16 GB / 32 GB
Max drives per system      144 (120)           240 (216) / 480 (432)
Max drives per node        72 (60)             120 (108)
Per-node cache buffers     262144 (4 GB)       524288 (8 GB)

Number of drives needed to reach the maximum number of write buffers:
15K  (2400 buffers per drive)   108 drives per node pair    216 drives per node pair
NL   (1200 buffers per drive)   write cache cannot be maximized with only NL drives
SSD  (9600 buffers per drive)   28 drives per node pair     56 drives per node pair
The table above estimates the number of disk drives that need to be distributed in a 3PAR StoreServ 7000 to ensure the maximum number of write buffers is available.
Use caution when taking cache size into account to size for performance: just because the array has 8 GB of cache does not mean a host workload will be able to use the full amount for writes.
With all NL drives, you CANNOT allocate the maximum amount of write buffers.
This is important to understand for small-disk-count 3PAR StoreServ 7000 systems, which may suffer higher response times on writes than is expected for the size of the array's cache.
Upon writing to the 3PAR array, the data will be put in write cache. Each 3PAR controller node only allows a maximum number of pages for a given number and type of disk
When 85% of this maximum number of allowed cache pages is reached, the system will start delaying the acknowledgement of IOs in order to throttle the hosts, until some cache pages have been freed by having their data de-staged to disk (a condition known as "delayed ack")
This de-staging happens at a fixed rate that depends on the number and type of disks
The maximum write bandwidth of the hosts will be limited to the de-staging speed
While undocumented, the following rules of thumb can be used for small systems (less than 100 disks of a type). Note that this is empirical data
De-staging speed for FC 15K disks : 8-10 MB/s (front-end) per FC 15K PD
De-staging speed for NL disks : 4-6 MB/s (front-end) per NL PD
Note that when doing read IOs, this limit does not apply and much higher values can be reached
(35-40 MB/s per FC 15K PD)
Beware of the controller node limits when sizing or troubleshooting bandwidth related issues
Always use the Storage Optimizer when sizing for MB/s
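Based on the empirical de-staging figures above, here is a hedged estimate of the sustained front-end write bandwidth a small configuration can absorb before delayed acks set in (the helper name and the example drive counts are invented for illustration):

DESTAGE_MB_S = {"FC15K": (8, 10), "NL": (4, 6)}     # MB/s of front-end writes per PD (low, high)

def max_sustained_write_mb_s(pd_counts):
    """pd_counts e.g. {'FC15K': 64, 'NL': 32} -> (low, high) front-end MB/s estimate."""
    low = sum(n * DESTAGE_MB_S[t][0] for t, n in pd_counts.items())
    high = sum(n * DESTAGE_MB_S[t][1] for t, n in pd_counts.items())
    return low, high

print(max_sustained_write_mb_s({"FC15K": 64, "NL": 32}))    # (640, 832) MB/s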
“Delayed ack” is a behaviour of HP 3PAR systems when the cache gets filled faster than it can be de-staged to disk (most likely because the physical disks are maxed out)
This is determined by the number of dirty cache pages for a type of disk exceeding 85% of the allowed maximum
If the threshold is reached, the system will reduce the host IO rate by delaying the "ack" sent back on host writes. Throttling is done to reduce the possibility of hitting the maximum allowable dirty CMP limit (cache full).
The host will see this behaviour and naturally slow down the IO rate it sends to the InServ (extreme cases cause host IO timeouts and outages). If the system is continually in delayed-ack mode, the load on the hosts needs to be lowered, or additional nodes/disks added.
The maximum number of cache pages is a function of the number of disks of each type that are connected to a given node:
• SSD : 4800 pages per PD
• FC : 1200 pages per PD
• NL : 600 pages per PD
For example, on a 4-node system with 32 SSDs, 256 FC disks and 64 NL disks (each node will see 16 SSDs, 128 FC and 32 NL):
• Per node : 76800 pages for SSDs, 153600 pages for FC, 19200 pages for NL
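The same arithmetic, together with the 85% delayed-ack threshold mentioned earlier, in a short sketch (the function name is illustrative, not a 3PAR CLI command):

PAGES_PER_PD = {"SSD": 4800, "FC": 1200, "NL": 600}

def max_cache_pages(pds_seen_by_one_node):
    """Maximum dirty cache pages per node, from the disks of each type that node sees."""
    return {t: n * PAGES_PER_PD[t] for t, n in pds_seen_by_one_node.items()}

pages = max_cache_pages({"SSD": 16, "FC": 128, "NL": 32})
print(pages)                                          # {'SSD': 76800, 'FC': 153600, 'NL': 19200}
print({t: int(0.85 * p) for t, p in pages.items()})   # dirty-page level at which delayed ack starts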
The statcmp command shows the number of dirty pages per node and disk type, the maximum number of pages allowed, and the number of delayed acks:
Page Statistics
---------CfcDirty--------- -----------CfcMax------------ -----------DelAck------------
Node FC_10KRPM FC_15KRPM NL SSD FC_10KRPM FC_15KRPM NL SSD FC_10KRPM FC_15KRPM NL SSD
2 0 15997 2 0 0 19200 19200 0 0 53896 16301 0
3 0 18103 1 0 0 19200 19200 0 0 95982 15092 0
Current number of dirty pages for each node for this type of disks (instant)
Max allowed pages per node for this type of disks
Number of delayed acks. This counter is incremented whenever a delayed ack happens
Delayed acks occur when the cache gets filled faster than it can be de-staged to disk
Factors that contribute to delayed-ack :
On workloads using single streams (1 outstanding IO) of small write IOs, the performance of the 3PAR system might be lower than expected. This really matters for block sizes < 128 KB
On InForm OS 3.1.1, the situation can sometimes be improved by disabling the "interrupt coalescing" feature of the front-end ports (the default is disabled in InForm OS 3.1.2)
While interrupt coalescing has a positive effect on most workloads (it offloads the CPUs on the 3PAR controller nodes), it can have a detrimental impact on this specific type of workload. Experience with 3.1.2 is that it gives a 2x performance improvement for single-threaded sequential IO over 3.1.1
Interrupt Coalescing (intcoal) is defined on front-end target ports
It should only be disabled on ports used by hosts that use this type of workload
To disable it use the following command on each port: controlport intcoal disable <N:S:P>
The port will be reset (expect a short host IO interruption during the reset)
A write from a server to a storage array uses a dual round-trip SCSI write protocol to service the request:
① the server sends the WRITE COMMAND
② the array responds with XFER READY
③ the server performs the DATA TRANSFER
④ the array returns the STATUS
[Chart: write latency with intcoal enabled vs. intcoal disabled for block sizes from 1 KB to 512 KB – up to 50% lower latency with intcoal disabled]
All of the previous objects have stat* commands (use “help stat” for complete list)
Stat* commands display average values between 2 iterations
Because the result is an average, a single anomalously long IO might be hidden by a large number of IOs with a good service time.
Prefer a short sampling interval (e.g. 15 seconds or less)
The hist* commands can be used to display buckets of response times and block sizes if required
Use “help hist” to see the list of hist* commands
Watch for
• High vlun latencies
• Large IO Sizes
• Large IO sizes, high latencies
• Delayed Acks
• Heavy Disk IO activity
• Slow PD (high queue/latency)
1) statvlun -rw -ni
2) statport -rw -ni -host
3) statvv -rw -ni
4) statcmp and statcpmvv
5) statpd -rw -ni
6) statport -rw -ni -disk
Example of Debugging Performance Issues
Issue – Customer is seeing service times increase greatly while IOPS stay the same.
They are running on 10K FC drives.
CLI "statport -host" output:
Host "iostat" output:
Example of Debugging Performance Issues (cont.)
CLI "statpd" output:
Example of Debugging Performance Issues (cont.)
If you remember from earlier in the presentation, as IOPS increase, so does the service time.
However, here the IOPS have stopped increasing.
When hardware hits its IOPS limit, additional IOs get queued on the device and have to wait, which adds to the service time.
From earlier slides, we know the following about PD IOP limits:
7K NL = 75 IOPS
10K FC = 150 IOPS
15K FC = 200 IOPS
In our case, the PDs are running at 133% of maximum load and are the bottleneck.
Solution – Additional hardware (PDs) would be needed to reduce the backend load and reduce service times to the application(s).
Example of Debugging Performance Issues
Issue – Customer is expecting higher IOPS, but not getting what was “advertised”.
They are running on 10K FC drives.
CLI "statport -host" output:
Example of Debugging Performance Issues (cont.)
CLI "statpd" output:
Example of Debugging Performance Issues (cont.)
• When 3PAR quotes performance limits for IOPS or throughput, it assumes a given IO size (typically 8 KB for random workloads). If the customer workload diverges from that, the quoted limits will most likely not be achievable.
• Looking at the stats facing the host, we see the block size coming in to the InServ is 32 KB, and the block size to the disks is also 32 KB.
• From earlier slides, we know in order to hit node and PD performance max:
Access pattern = RANDOM
Block Size <= 16KB
• The block size is above the size we require to get the true max performance from the StoreServ. Because of this, they will only get approx 75% of the max.
• The customer can either lower their application request size (if possible), or add additional PDs to sustain their desired IOPS number (taking into consideration the % drop with a larger block size)
Example of Debugging Performance Issues (cont.)
Issue – Customer sees very high write service times for small IO sizes in "statvlun", but "statvv" shows no problem with write service times
CLI "statvlun" output: write service time of 56.1 ms
CLI "statvv" output: write service time of 0.1 ms
Example of Debugging Performance Issues (cont.)
Example of Debugging Performance Issues (cont.)
The page statistics of statcmp showed that delayed ACK is occurring on all nodes for FC & NL drives.
Note: DelAck is a cumulative counter value.
CLI "statcmp" output:
Examine the load on the FC & NL drives to identify the root cause (IO size, high average IOPS per drive, etc.). If applicable, check for Remote Copy related contention as well
3PAR performance counters are available at many levels to help troubleshoot performance issues
Physical objects
Physical disks
CPUs
FC ports, iSCSI ports
Links (Memory, PCI, ASIC-to-ASIC)
Logical objects
Chunklets
Logical disks
Virtual Volumes
VLUNs
Cache
Remote Copy links and VVs
Some options are common to most stat* commands:
-ni : display only non-idle objects
-rw : displays read and write stats separately. Output will have 3 lines per object: read (r), write (w) and total (t)
-iter <X> : only display X iterations. Default : loop continuously
-d <X> : specifies an interval of X seconds between 2 iterations. Default : 2 seconds
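For example, "statvlun -rw -ni -d 15 -iter 20" combines the options above: read/write/total stats for non-idle VLUNs, sampled every 15 seconds for 20 iterations.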
Shows physical disk stats:
• Current/Average/Max IOPS
• Current/Average/Max KB/s
• Current/Average service time
• Current/Average IO size
• Queue length
• Current/Average % idle

cli% statpd -devinfo
17:45:48 08/20/12 r/w I/O per second KBytes per sec Svt ms IOSz KB Idle %
ID Port Cur Avg Max Cur Avg Max Cur Avg Cur Avg Qlen Cur Avg
0 2:2:1 t 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0 100 100
1 3:2:1 t 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0 100 100
Statpd will show :
• Backend IOs caused by host IOs
• IOs caused by data movement, such as DO tunes, AO region moves
• IOs caused by clones
• IOs caused by disk rebuild
Statpd will not show :
• IOs caused by chunklet initialisation. The only way to see that chunklet initialisation is going on is to use "showpd -c"
Useful options :
• -devinfo : displays the type and speed of each disk
• -p -devtype FC/NL/SSD : display only FC/NL/SSD PDs
Things to look for :
• PDs that have too many IOPS (based on the recommended numbers, see "Limits"). Usually these PDs will also have a % idle < 20
• PDs of a given type that have significantly more or fewer IOPS than other PDs of the same type. Usually a sign that PDs are incorrectly balanced.
• PDs with anomalous response times
Statport will show the aggregated stats for all devices (disks or hosts) connected on a port
The totals reported by statport -host are the same as the totals of statvlun
The totals reported by statport -disk are the same as the totals of statpd
Useful options :
• -host/disk/rcfc/peer: displays only host/disk/rcfc/peer ports
Things to look for :
• Host ports that have a higher response time than others for the same hosts. Might indicate a problem on the fabric
• Host ports that have reached their maximum read/write bandwidth
• Host ports that are busy in terms of bandwidth as this can increase the response time of IOs for hosts
Remote Copy synchronous uses a "posted read" method, which consists of the remote system constantly posting read IOs on the source system (384 per port)
When the source system has some data to send to the destination system, it doesn’t need to do a write IO because there’s already a read IO pending from the destination.
This technique is used to save a round-trip between the 2 systems
Because of these posted reads, the average response time and queue of RCFC ports will always be very high, and will actually decrease as more data is being replicated
• When no data is being sent, the average response time on the RCFC ports will be around 60,000 ms (these IOs have a timeout of 60 s) and the queue length will be 384
• When replicating an average of 100 MB/s, the average response time will be 75ms
This is completely normal and no cause for concern.
To find the Remote Copy round-trip latency, use the statrcopy -hb command
Statvlun is the highest level that can be measured and the statistics reported will be the closest to what could be measured on the host.
Statvlun shows :
• All host IOs, including cache hits
Statvlun does not show :
• RAID overhead
• IOs caused by internal data copy/movement, such as clones, DO/AO tasks…
• IOs caused by disk rebuilds
• IOs caused by VAAI copy offload (XCOPY)
Statvlun read service time :
• Excludes interrupt coalescing time
• Includes statvv read time
• Includes additional time spent dealing with the VLUN
Statvlun write service time :
• Excludes the first interrupt coalescing time
• Includes the time spent between telling the host it's OK to send data and the host actually sending data. Because of this, if the host/HBA/link is busy, the statvlun time will increase but the problem will be at the host/SAN level!
• Includes the second interrupt coalescing time when the host sends data
• Includes the time spent writing data to cache + mirroring
• Includes delayed ack time
Useful options :
• -vvsum: displays only 1 line per VV
• -hostsum : displays only 1 line per host
• -v <VV name> : displays only VLUN for specified VV
Things to look for :
• High read/write response times
• Higher response times on some paths only
• Using -hostsum : has the host reached its max read/write bandwidth?
• Single threaded workloads : will have a queue of 1 steadily. Consider disabling interrupt coalescing
• Maximum host/HBA/VM queue length reached for a path/host
Statvv stats represent the IOs done by the array to the VV. They exclude all time spent communicating with the host and all time spent at the FC/iSCSI level.
Statvv includes :
• Cache hits
• IOs caused by the pre-fetching during sequential read IOs. Because of this it is possible to have more KB/s at the VV level than at the VLUN level
• (needs checking) IOs caused by VAAI copy offload (XCOPY)
• IOs caused by cloning operations
• IOs caused by Remote Copy
Things to look for :
• High write response times. Might indicate delayed ack
Useful options :
• -v : shows read/write cache hit/miss stats per VV instead of per node
Things to look for :
• Delayed ack on a device type
• High LockBlock
Things to look for :
• CPUs maxed out
Useful options :
• -hb : shows link heart-beat response time
Things to look for :
• Max write bandwidth reached on a link
• Higher heart-beat round-trip latency on one link than on the others (with -hb)
3PAR performance can be measured / captured using different tools:
- GUI : real-time performance graphs. No historical data. Fine granularity (seconds). On demand only
- System Reporter : historical performance information. Minimum granularity = 1 min, default = 5 min. Continuous
- CLI : real-time performance stats and histograms (buckets). Fine granularity (seconds). On demand only
- Service Processor / STaTS / "Perform" files : very coarse granularity (4 hours). Continuous
- Service Processor / STaTS / Performance Analyzer ("Perfanal" files) : fine granularity (seconds). On demand only
Connect to the Service Processor by pointing a browser to http://<IP_address_of_SP>
Login with the login “spvar” and password “HP3parvar” (SP 2.5.1 MU1 or later) or
“3parvar” (SP 2.5.1 or earlier)
Select “Support” on the left, then “Performance Analyzer”
Click “Select all” and enter the number of iterations to capture
For example, to capture 1 hour of data, enter 360 iterations of 10 seconds
The default of 60 iterations of 10 seconds will correspond to at least 10 minutes of data
Click “Launch Performance Analysis tool”
Once the performance capture is over, the files will be uploaded automatically to the HP 3PAR support center and can be downloaded from STaTS (http://stwebprod.hp.com/)
If the service processor is not configured to send data automatically, the file can be found in /files/<3PAR serial number>/perf_analysis
What is the problem reported by the user?
Is this problem visible at the 3PAR level? If not, it might be a problem higher up the chain
Poor response times on VLUNs:
Is the problem affecting only read IOs or only write IOs?
Is the problem visible on VVs?
High write service time on VLUNs and VVs -> Look for delayed ack with statcmp
What is the queue length at the statvlun level?
• If the queue is steadily at 0 or 1, that is typical of a single-threaded workload. Look for ways of increasing the queue depth at the application level, or review the behaviour of the application
• If the queue is steadily at a value such as 15/16 or 31/32, this indicates that the maximum queue length of the host/HBA/VM… has been reached. Increase the host's HBA queue length
High service time on VLUNs but not on VVs :
• Try disabling interrupt coalescing
• Can be representative of a problem on the host. Statvlun includes some time spent on the host for write IOs
• Have some host ports reached their max bandwidth?
• Are there some host ports that have a higher response time than others?
• For all hosts? Might indicate a problem on the fabric/switch port/SFP…
Delayed ack is happening :
• Look for busy PDs
• Is the maximum write bandwidth of the system reached?
• If using RC synchronous, is the maximum RC bandwidth reached?
• If using RC synchronous, is there delayed ack happening on the remote system?