Christophe Dubois / 3PAR Ninja team
20.03.2013
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A discussion about performance of HP 3PAR StoreServ and basic troubleshooting techniques
RTFM – Read the concept guide!
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c02986000/c02986000.
Or search for "3PAR concept guide" in your favourite search engine
When sizing for or troubleshooting performance, consider the following limits, which apply to any 3PAR system:
• How many IOPS the controller nodes can sustain
• How many IOPS the physical disks can sustain
• Block size matters!
• How much read bandwidth the controller nodes can sustain
• How much write bandwidth the controller nodes can sustain
• Software limitation of the write cache algorithm
• If using AO, the IO locality profile
Depending on the type of IOs (read/write), whether they are random/sequential, the RAID type and the IO size, the ratio between front-end and back-end will vary
Cache hit
• Read cache hit is a read IO to a portion of data that is already in cache
• A write cache hit is a write IO to a portion of data that is already in write cache but has not been de-staged to disk yet
Note that when doing sequential read IOs, the system will report very high read hit ratios (99%+) because the pre-fetching algorithm puts the data in cache
How RAID write overhead is calculated

RAID 1 – Writes
1. Write new data to 1st mirror (1 IO)
2. Write new data to 2nd mirror (1 IO)
Total back-end IOPS per 1 new write = 2

RAID 5 – Writes
1. Read old data block (1 IO)
2. Read old parity block (1 IO)
3. Calculate new parity block (0 IO)
4. Write new data block (1 IO)
5. Write new parity block (1 IO)
Total back-end IOPS per 1 new write = 4

RAID 6 – Writes
A more complicated process: in addition to the data block, two parity blocks must be read, recalculated and rewritten.
Total back-end IOPS per 1 new write = 6.66
Depending on the type of IOs (read/write), whether they are random/sequential, the RAID type and the IO size, the ratio between front-end and back-end will vary
RAID1
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = at least 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 2 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 2 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
RAID5
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = at least 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 4 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 1 KiB/s * (setsize / (setsize – 1)) of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
RAID6
- Random read IO : 1 front-end IO = 1 back-end IO
- Sequential reads : 1 KiB/s of front-end = 1 KiB/s of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
Because of prefetching there will almost always be more KB/s on the backend than on the front-end
- Random write IO : 1 front-end IO = 6.66 back-end IOs
- Sequential writes : 1 KiB/s of front-end = 1 KiB/s * (setsize / (setsize – 2)) of back-end
Do not look at IOPS when doing sequential workloads, as the system will aggregate multiple IOs when going to the backend. Use KiB/s instead
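Putting the ratios above together, a rough estimator can translate a front-end workload into back-end load. This is a minimal sketch in Python, not a 3PAR tool; the function names are invented for illustration and cache hits are assumed to be 0%:

def backend_random_iops(frontend_iops, read_fraction, raid):
    """Estimate back-end IOPS for a random workload (ignores cache hits)."""
    write_penalty = {"RAID1": 2, "RAID5": 4, "RAID6": 6.66}[raid]
    reads = frontend_iops * read_fraction
    writes = frontend_iops * (1 - read_fraction)
    return reads + writes * write_penalty            # 1 back-end IO per random read

def backend_seq_write_kibps(frontend_kibps, raid, setsize=8):
    """Estimate back-end KiB/s for sequential writes, using the formulas above."""
    if raid == "RAID1":
        return frontend_kibps * 2
    if raid == "RAID5":
        return frontend_kibps * setsize / (setsize - 1)
    return frontend_kibps * setsize / (setsize - 2)  # RAID6

# 10,000 front-end IOPS, 70% read, on RAID 5: 7,000 + 3,000 x 4 = 19,000 back-end IOPS
print(backend_random_iops(10_000, 0.70, "RAID5"))
# 500 MiB/s of sequential writes to RAID 6 with a set size of 8 (6+2): ~667 MiB/s on the back end
print(backend_seq_write_kibps(500 * 1024, "RAID6", 8) / 1024)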
V-class
Approximately 120K backend IOPS per node pair
Note that with read IOs on V-class, the number does not scale linearly with the number of node pairs. It does scale linearly for write IOs
T/F-class
Approximately 64K backend IOPS per node pair
These numbers do not include cache IOPS.
Cache IOPS are not characterised, however during the SPC-1 benchmark for the V800 it is estimated that on top of the 120K backend IOPS, each node pair was doing 37K cache IOPS
The following limits are front-end :
V-class
• Reads : approximately 3250 MB/s per node pair
• Writes : 1500 MB/s if only 2 nodes, 2600 MB/s per node pair when using more than 2 nodes
T-class
• Reads : approximately 1400 MB/s per node pair
• Writes : 600 MB/s per node pair
F-class
• Reads : approximately 1300 MB/s per node pair
• Writes : 550 MB/s per node pair
StoreServ 7200     8KB Rand Read IOPS   8KB Rand Write IOPS   Seq. Read MB/sec   Seq. Write MB/sec
15K HDD            drive limited        drive limited         2,500              1,200
SSD                150,000              75,000                2,800              1,200

StoreServ 7400     8KB Rand Read IOPS   8KB Rand Write IOPS   Seq. Read MB/sec   Seq. Write MB/sec
15K HDD            drive limited        drive limited         4,800              2,400
SSD                320,000              160,000               4,800              2,400
Drive Limited = the performance depends on the number of drives and the drive capabilities and is not limited by the controllers or interconnects.
Recommended Max drive IOPS
The following numbers are for small IOs (8-16 KB)
15K RPM disks = 200 backend IOPS per PD recommended. Can do 250 reasonably well
10K RPM disks = 150-170 backend IOPS
7.2K RPM disks (NL) = 75 backend IOPS per PD recommended
SSD = It depends!
• On type of IO
• On type of RAID (yes, even for back-end IOs)
• But 3000 IOPS per disk is a safe assumption
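A hedged sizing sketch combining these per-PD recommendations with the per-node-pair limits quoted earlier; the helper names and the single hard-coded figure per drive type are my own simplifications:

import math

PD_IOPS = {"15K": 200, "10K": 160, "NL": 75, "SSD": 3000}       # recommended back-end IOPS per PD
NODE_PAIR_IOPS = {"V-class": 120_000, "T/F-class": 64_000}      # approximate back-end IOPS per node pair

def pds_needed(backend_iops, drive_type):
    """Minimum number of PDs to sustain a given back-end IOPS load."""
    return math.ceil(backend_iops / PD_IOPS[drive_type])

def node_pairs_needed(backend_iops, node_class="V-class"):
    """Minimum number of node pairs for the same back-end load."""
    return math.ceil(backend_iops / NODE_PAIR_IOPS[node_class])

# The 19,000 back-end IOPS from the earlier RAID example, on 15K FC drives behind V-class nodes:
print(pds_needed(19_000, "15K"))          # 95 PDs
print(node_pairs_needed(19_000))          # 1 node pair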
SSD 100/200 GB – Max IOPS per drive

                RAID 1, Aligned   RAID 1, Unaligned   RAID 5, Aligned   RAID 5, Unaligned
100% Reads      8,000             7,200               7,500             7,000
70% / 30%       6,000             4,500               3,300             1,700
50% / 50%       5,000             4,000               3,000             1,500
30% / 70%       5,000             4,000               2,800             1,400
100% Writes     5,000             4,000               2,800             1,400

We recommend using a figure of 3,000 IOPS per drive when sizing with 100-200 GB SSDs.
What is impacted significantly by DIF?
Most IOPS are not affected significantly by the introduction of DIF (e.g., reads, R5, R6, FC 15K, SSDs, sequential)
RAID 1 random writes on 10K and NL PDs are affected UNLESS the IO size is a multiple of 16K AND the IO is aligned
ALIGNMENT
Usually ignored by 3PAR in the past, as it had only a secondary performance impact (load on PDs)
Increasingly an issue (e.g., with DIF, SSDs)
All random IOPS numbers on 3PAR are given for small IOs: 8 KB
When doing IOs larger than 8 KB, the number of back-end IOs a system can sustain may drop off significantly as the IO size increases
For FC/NL disks, there is virtually no difference between 8 KB and 16 KB IOs.
Above 16 KB, the number of IOs per PD degrades
For SSDs, since a cell is 8 KB, any IO larger than 8 KB will cause a performance degradation
Remember that when using large blocks, the bandwidth limitation can be reached faster than the IOPS limitation!
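To see why bandwidth can become the ceiling before IOPS with large blocks, here is a quick back-of-the-envelope calculation using the approximate V-class per-node-pair figures quoted earlier (the crossover size below is purely illustrative):

read_bw_limit_mb_s = 3250          # approx. read bandwidth per V-class node pair
backend_iops_limit = 120_000       # approx. back-end IOPS per V-class node pair

# IO size at which the bandwidth limit is reached at the same time as the IOPS limit
crossover_kb = read_bw_limit_mb_s * 1000 / backend_iops_limit
print(f"Above ~{crossover_kb:.0f} KB per IO, bandwidth (not IOPS) becomes the bottleneck")

# At 64 KB reads, the bandwidth cap allows only:
print(int(read_bw_limit_mb_s * 1000 / 64), "IOPS")   # ~50,781 - well below the 120K IOPS limit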
With 32x 15K disks, RAID 5, fully allocated VV
Maximum IOPS vs block size (8 KB = 100%)

          100% Reads   50% Read / 50% Write   100% Writes
8 KB      100%         100%                   100%
16 KB     98%          98%                    98%
32 KB     92%          81%                    63%
64 KB     72%          52%                    47%

This graph is not 100% accurate, but is used to show the drop in IOPS when the block size is increased. Values may vary by 5-10%.
With 16x 100 GB SSDs, RAID 5, fully allocated VV
Maximum IOPS vs block size (8 KB = 100%)

          100% Reads   50% Read / 50% Write   100% Writes
8 KB      100%         100%                   100%
16 KB     93%          91%                    –
32 KB     52%          50%                    –
64 KB     50%          30%                    –

This graph is not 100% accurate, but is used to show the drop in IOPS when the block size is increased. Values may vary by 5-10%.
Performance drop-off with block size increase
Rules of thumb (for a mixed workload: 50% read, 50% write, 100% random)

FC & NL disks
- 8 KB: IOPS given by the HP sizer
- 16 KB: ~same as with an 8 KB block size
- 32 KB: ~80% of the 8 KB throughput
- 64 KB: ~50% of the 8 KB throughput

SSDs
- 8 KB: IOPS given by the HP sizer
- 16 KB: ~90% of the 8 KB block size figure
- 32 KB: ~50% of the 8 KB throughput
- 64 KB: ~30% of the 8 KB throughput
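These rules of thumb can be applied mechanically when derating a sizer figure. A minimal sketch, assuming the 50/50 random mixed workload the rules are stated for (the table of factors simply restates the values above):

DERATE = {
    "FC/NL": {8: 1.00, 16: 1.00, 32: 0.80, 64: 0.50},
    "SSD":   {8: 1.00, 16: 0.90, 32: 0.50, 64: 0.30},
}

def derated_iops(sizer_iops, media, block_kb):
    """Apply the block-size derating factor to an 8 KB sizer IOPS figure."""
    return int(sizer_iops * DERATE[media][block_kb])

# The sizer predicts 20,000 IOPS at 8 KB on FC, but the host actually issues 32 KB IOs:
print(derated_iops(20_000, "FC/NL", 32))   # ~16,000 IOPS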
While there is a difference in the number of back-end IOs required for a front-end IO on a Thin VV compared to a full VV, this only applies to the first write
This is usually completely transparent to the user and the application, since the system will acknowledge the IO to the host and write to disk afterwards
Most applications usually “prepare” new capacity before using it
After the first write, there is absolutely no difference between Thin and full VVs
[Chart: the same workload (100% random read, then 100% random write, then 50/50) driven against a full VV and a Thin VV – front-end IOs on the full VV and on the Thin VV, with the same amount of back-end IOs]
Impact of Snapshots
• A snapshot is a point-in-time copy of a source Virtual Volume
− It can use any RAID level
− Reservation-less – it has no "background normalization" process
− Normalization occurs as a result of host writes to the source VV
• The Copy-on-Write can cause increased host IO latencies
− Before "new" data can be written to disk, the "old" data needs to be read off the disks and copied to the snapshot space. This is called a Copy-On-(first)-Write, or COW.
• Copy-on-Writes cause additional back-end disk IOs and increased host IO latency
As long as the system is not maxed out in terms of back-end IOs, snapshots will have a marginal impact
[Chart: back-end IO/s, front-end IO/s, and read/write response times over time, before and after "Create Snapshot"]
3PAR flusher and back-end disks
• 3PAR arrays perform Copy-on-Write operations in the background, allowing hosts to see electronic (cache) latencies
• Write latencies CAN still increase if there is sufficient activity to slow down the cache flush rate
− Busy drives = slower cache flush rate
• The time estimates below are for a system that has enough buffers available to support the workload

Flow of an 8 KB host write to a snapped volume:
1. The host issues the 8 KB SCSI write (t = 0.25 ms); the data lands in 3PAR cache and Status Good is returned (t = 0.25 ms)
2. In the background, the flusher reads the 16 KB of old data from the disk drives (t = 6 ms)
3. The old data is stored into SD space and the update to SA/SD space is flushed (t = 4 ms)
4. The new 8 KB of host data is written to its original location on disk (t = 4 ms)
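A tiny illustration of what the flow above means for latency, using the example timings (which assume free buffers and reasonably idle drives):

host_visible_ms = 0.25 + 0.25        # SCSI write + Status Good, served from cache
background_ms = 6 + 4 + 4            # read old 16 KB + write old data to SA/SD space + write new data
print(f"Host sees ~{host_visible_ms} ms; the flusher does ~{background_ms} ms of back-end work per COW")
# As long as the drives can absorb this background work, snapshots stay cheap for the host;
# if they cannot, the flush rate slows and write latency eventually rises (delayed ack).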
Performance Summary
Base workload: 70/30, 8 KB (values are IOPS / average latency)

                         Base workload     1 Virtual Copy    2 Virtual Copies   4 Virtual Copies   8 Virtual Copies
SS7200, 3.1.2, RAID 1    11,500 / 4.1 ms   11,000 / 4.2 ms   10,900 / 4.3 ms    10,200 / 4.6 ms    8,500 / 5.6 ms
SS7200, 3.1.2, RAID 5     9,000 / 5.2 ms    9,000 / 5.2 ms    8,900 / 5.3 ms     8,100 / 5.5 ms    8,200 / 5.8 ms
SS7200, 3.1.2, RAID 6     8,000 / 6.0 ms    8,000 / 6.0 ms    8,000 / 6.0 ms     7,600 / 6.2 ms    7,300 / 6.5 ms
[Chart: OLTP 70/30 8 KB, queue depth 48, 144x 15K drives, SS7200, 1 snapshot – baseline 11,500 IOPS; average latency (ms) over time as the snapshot is created and then removed]
[Chart: OLTP 70/30 8 KB, queue depth 48, 144x 15K drives, SS7200, 8 snapshots – baseline 11,500 IOPS / 4 ms; average latency (ms) over time as the 8 snapshots are created and then removed]
                           7200 (2-node)       7400 (2-node / 4-node)
Control cache              16 GB               16 GB / 32 GB
Data cache                 (4) 8 GB            (8) 16 GB / 32 GB
Max drives per system      144 (120)           240 (216) / 480 (432)
Max drives per node        72 (60)             120 (108)
Per-node cache buffers     262144 (4 GB)       524288 (8 GB)

Number of drives needed to reach the maximum number of write buffers:
15K  (2400 buffers per drive)   108 drives per node pair    216 drives per node pair
NL   (1200 buffers per drive)   write cache cannot be maximized with only NL drives
SSD  (9600 buffers per drive)   28 drives per node pair     56 drives per node pair
The table above estimates the number of disk drives that need to be distributed in a 3PAR StoreServ 7000 to ensure the maximum number of write buffers is available.
Use caution when taking cache size into account to size for performance: just because the array has 8 GB of cache does not mean a host workload will be able to use the full amount for writes.
With all NL drives, you CANNOT allocate the maximum amount of write buffers.
This is important to understand for small-disk-count 3PAR StoreServ 7000 systems, which may suffer higher response times on writes than is expected for the size of the array's cache.
Upon writing to the 3PAR array, the data will be put in write cache. Each 3PAR controller node only allows a maximum number of pages for a given number and type of disk
When 85% of this maximum number of allowed cache pages is reached, the system will start delaying the acknowledgement of IOs in order to throttle the hosts, until some cache pages have been freed by having their data de-staged to disk (a condition known as "delayed ack")
This de-staging happens at a fixed rate that depends on the number and type of disks
The maximum write bandwidth of the hosts will be limited to the de-staging speed
While undocumented, the following rules of thumb can be used for small systems (less than 100 disks of a type). Note that this is empirical data
De-staging speed for FC 15K disks : 8-10 MB/s (front-end) per FC 15K PD
De-staging speed for NL disks : 4-6 MB/s (front-end) per NL PD
Note that when doing read IOs, this limit does not apply and much higher values can be reached
(35-40 MB/s per FC 15K PD)
Beware of the controller node limits when sizing or troubleshooting bandwidth related issues
Always use the Storage Optimizer when sizing for MB/s
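Based on the empirical de-staging figures above, here is a hedged estimate of the sustained front-end write bandwidth a small configuration can absorb before delayed acks set in (the helper name and the example drive counts are invented for illustration):

DESTAGE_MB_S = {"FC15K": (8, 10), "NL": (4, 6)}     # MB/s of front-end writes per PD (low, high)

def max_sustained_write_mb_s(pd_counts):
    """pd_counts e.g. {'FC15K': 64, 'NL': 32} -> (low, high) front-end MB/s estimate."""
    low = sum(n * DESTAGE_MB_S[t][0] for t, n in pd_counts.items())
    high = sum(n * DESTAGE_MB_S[t][1] for t, n in pd_counts.items())
    return low, high

print(max_sustained_write_mb_s({"FC15K": 64, "NL": 32}))    # (640, 832) MB/s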
“Delayed ack” is a behaviour of HP 3PAR systems when the cache gets filled faster than it can be de-staged to disk (most likely because the physical disks are maxed out)
This is determined by the number of dirty cache pages for a type of disk exceeding 85% of the allowed maximum
If the threshold is reached, the system will reduce the host IO rate by delaying the "ack" sent back on host writes. Throttling is done to reduce the possibility of hitting the maximum allowable dirty CMP limit (cache full).
The host will see this behaviour and naturally slow down the IO rate it sends to the InServ (extreme cases cause host IO timeouts and outages). If the system is continually in delayed-ack mode, the load on the hosts needs to be lowered, or additional nodes/disks added.
The maximum number of cache pages is a function of the number of disks of each type that are connected to a given node:
• SSD : 4800 pages per PD
• FC : 1200 pages per PD
• NL : 600 pages per PD
For example, on a 4-node system with 32 SSDs, 256 FC disks and 64 NL disks (each node will see 16 SSDs, 128 FC and 32 NL):
• Per node : 76800 pages for SSDs, 153600 pages for FC, 19200 pages for NL
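The same arithmetic, together with the 85% delayed-ack threshold mentioned earlier, in a short sketch (the function name is illustrative, not a 3PAR CLI command):

PAGES_PER_PD = {"SSD": 4800, "FC": 1200, "NL": 600}

def max_cache_pages(pds_seen_by_one_node):
    """Maximum dirty cache pages per node, from the disks of each type that node sees."""
    return {t: n * PAGES_PER_PD[t] for t, n in pds_seen_by_one_node.items()}

pages = max_cache_pages({"SSD": 16, "FC": 128, "NL": 32})
print(pages)                                          # {'SSD': 76800, 'FC': 153600, 'NL': 19200}
print({t: int(0.85 * p) for t, p in pages.items()})   # dirty-page level at which delayed ack starts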
The statcmp command shows the number of dirty pages per node and disk type, the maximum number of pages allowed, and the number of delayed acks:
Page Statistics
---------CfcDirty--------- -----------CfcMax------------ -----------DelAck------------
Node FC_10KRPM FC_15KRPM NL SSD FC_10KRPM FC_15KRPM NL SSD FC_10KRPM FC_15KRPM NL SSD
2 0 15997 2 0 0 19200 19200 0 0 53896 16301 0
3 0 18103 1 0 0 19200 19200 0 0 95982 15092 0
Current number of dirty pages for each node for this type of disks (instant)
Max allowed pages per node for this type of disks
Number of delayed acks. This counter is incremented whenever a delayed ack happens
Delayed acks occur when the cache gets filled faster than it can be de-staged to disk
Factors that contribute to delayed-ack :
On workloads using single streams (1 outstanding IO) of small write IOs, the performance of the 3PAR system might be lower than expected. This really matters for block sizes < 128 KB
On InForm OS 3.1.1, the situation can sometimes be improved by disabling the "interrupt coalescing" feature of the front-end ports (the default is disabled in InForm OS 3.1.2)
While interrupt coalescing has a positive effect on most workloads (it offloads the CPUs on the 3PAR controller nodes), it can have a detrimental impact on this specific type of workload. Experience with 3.1.2 is that it gives a 2x performance improvement for single-threaded sequential IO over 3.1.1
Interrupt Coalescing (intcoal) is defined on front-end target ports
It should only be disabled on ports used by hosts that use this type of workload
To disable it use the following command on each port: controlport intcoal disable <N:S:P>
The port will be reset (expect a short host IO interruption during the reset)
A write from a server to a storage array uses a dual round-trip SCSI write protocol to service the request:
① the server sends the WRITE COMMAND
② the array responds with XFER READY
③ the server performs the DATA TRANSFER
④ the array returns the STATUS
[Chart: write latency with intcoal enabled vs. intcoal disabled for block sizes from 1 KB to 512 KB – up to 50% lower latency with intcoal disabled]
All of the previous objects have stat* commands (use “help stat” for complete list)
Stat* commands display average values between 2 iterations
Because the result is an average, a single anomalously long IO might be hidden by a large number of IOs with a good service time.
Prefer a short sampling interval (e.g. 15 seconds or less)
The hist* commands can be used to display buckets of response times and block sizes if required
Use “help hist” to see the list of hist* commands
Watch for
• High vlun latencies
• Large IO Sizes
• Large IO sizes, high latencies
• Delayed Acks
• Heavy Disk IO activity
• Slow PD (high queue/latency)
1) statvlun -rw -ni
2) statport -rw -ni -host
3) statvv -rw -ni
4) statcmp and statcpmvv
5) statpd -rw -ni
6) statport -rw -ni -disk
Example of Debugging Performance Issues
Issue – Customer is seeing service times increase greatly while IOPS stay the same.
They are running on 10K FC drives.
CLI "statport -host" output:
Host "iostat" output:
Example of Debugging Performance Issues (cont.)
CLI "statpd" output:
Example of Debugging Performance Issues (cont.)
If you remember from earlier in the presentation, as IOPS increase, so does the service time.
However, here the IOPS have stopped increasing.
When hardware hits its IOPS limit, additional IOs get queued on the device and have to wait, which adds to the service time.
From earlier slides, we know the following about PD IOP limits:
7K NL = 75 IOPS
10K FC = 150 IOPS
15K FC = 200 IOPS
In our case, the PDs are running at 133% of maximum load and are the bottleneck.
Solution – Additional hardware (PDs) would be needed to reduce the backend load and reduce service times to the application(s).
Example of Debugging Performance Issues
Issue – Customer is expecting higher IOPS, but not getting what was “advertised”.
They are running on 10K FC drives.
CLI "statport -host" output:
Example of Debugging Performance Issues (cont.)
CLI "statpd" output:
Example of Debugging Performance Issues (cont.)
• When 3PAR quotes performance limits for IOPS or throughput, it assumes a given IO size (typically 8 KB for random workloads). If the customer workload diverges from that, the quoted limits will most likely not be achievable.
• Looking at the stats facing the host, we see the block size coming in to the InServ is 32 KB, and the block size to the disks is also 32 KB.
• From earlier slides, we know in order to hit node and PD performance max:
Access pattern = RANDOM
Block Size <= 16KB
• The block size is above the size we require to get the true max performance from the StoreServ. Because of this, they will only get approx 75% of the max.
• The customer can either lower their application request size (if possible), or add additional PDs to sustain their desired IOPS number (taking into consideration the % drop with a larger block size)
Example of Debugging Performance Issues (cont.)
Issue – Customer sees very high write service times for small IO sizes in "statvlun", but "statvv" shows no problem with write service times
CLI "statvlun" output: write service time of 56.1 ms
CLI "statvv" output: write service time of 0.1 ms
Example of Debugging Performance Issues (cont.)
Example of Debugging Performance Issues (cont.)
The page statistics of statcmp showed that delayed ACK is occurring on all nodes for FC & NL drives.
Note: DelAck is a cumulative counter value.
CLI "statcmp" output:
Examine the load on the FC & NL drives to identify the root cause (IO size, high average IOPS per drive, etc.). If applicable, check for Remote Copy related contention as well
3PAR performance counters are available at many levels to help troubleshoot performance issues
Physical objects
Physical disks
CPUs
FC ports, iSCSI ports
Links (Memory, PCI, ASIC-to-ASIC)
Logical objects
Chunklets
Logical disks
Virtual Volumes
VLUNs
Cache
Remote Copy links and VVs
Some options are common to most stat* commands:
-ni : display only non-idle objects
-rw : displays read and write stats separately. Output will have 3 lines per object: read (r), write (w) and total (t)
-iter <X> : only display X iterations. Default : loop continuously
-d <X> : specifies an interval of X seconds between 2 iterations. Default : 2 seconds
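For example, "statvlun -rw -ni -d 15 -iter 20" combines the options above: read/write/total stats for non-idle VLUNs, sampled every 15 seconds for 20 iterations.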
Shows physical disk stats:
• Current/Average/Max IOPS
• Current/Average/Max KB/s
• Current/Average service time
• Current/Average IO size
• Queue length
• Current/Average % idle

cli% statpd -devinfo
17:45:48 08/20/12 r/w I/O per second KBytes per sec Svt ms IOSz KB Idle %
ID Port Cur Avg Max Cur Avg Max Cur Avg Cur Avg Qlen Cur Avg
0 2:2:1 t 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0 100 100
1 3:2:1 t 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0 100 100
Statpd will show :
• Backend IOs caused by host IOs
• IOs caused by data movement, such as DO tunes, AO region moves
• IOs caused by clones
• IOs caused by disk rebuild
Statpd will not show :
• IOs caused by chunklet initialisation. The only way to see that chunklet initialisation is going on is to use "showpd -c"
Useful options :
• -devinfo : displays the type and speed of each disk
• -p -devtype FC/NL/SSD : display only FC/NL/SSD PDs
Things to look for :
• PDs that have too many IOPS (based on the recommended numbers, see "Limits"). Usually these PDs will also have a % idle < 20
• PDs of a given type that have significantly more or fewer IOPS than other PDs of the same type. Usually a sign that PDs are incorrectly balanced.
• PDs with anomalous response times
Statport will show the aggregated stats for all devices (disks or hosts) connected on a port
The totals reported by statport -host are the same as the totals of statvlun
The totals reported by statport -disk are the same as the totals of statpd
Useful options :
• -host/disk/rcfc/peer: displays only host/disk/rcfc/peer ports
Things to look for :
• Host ports that have a higher response time than others for the same hosts. Might indicate a problem on the fabric
• Host ports that have reached their maximum read/write bandwidth
• Host ports that are busy in terms of bandwidth as this can increase the response time of IOs for hosts
Remote Copy synchronous uses a "posted read" method, which consists of the remote system constantly posting read IOs on the source system (384 per port)
When the source system has some data to send to the destination system, it doesn’t need to do a write IO because there’s already a read IO pending from the destination.
This technique is used to save a round-trip between the 2 systems
Because of these posted reads, the average response time and queue of RCFC ports will always be very high, and will actually decrease as more data is being replicated
• When no data is being sent, the average response time on the RCFC ports will be around 60,000 ms (these IOs have a timeout of 60 s) and the queue length will be 384
• When replicating an average of 100 MB/s, the average response time will be 75ms
This is completely normal and no cause for concern.
To find the Remote Copy round-trip latency, use the statrcopy -hb command
Statvlun is the highest level that can be measured and the statistics reported will be the closest to what could be measured on the host.
Statvlun shows :
• All host IOs, including cache hits
Statvlun does not show :
• RAID overhead
• IOs caused by internal data copy/movement, such as clones, DO/AO tasks…
• IOs caused by disk rebuilds
• IOs caused by VAAI copy offload (XCOPY)
Statvlun read service time :
• Excludes interrupt coalescing time
• Includes statvv read time
• Includes additional time spent dealing with the VLUN
Statvlun write service time :
• Excludes the first interrupt coalescing time
• Includes the time spent between telling the host it's OK to send data and the host actually sending data. Because of this, if the host/HBA/link is busy, the statvlun time will increase but the problem will be at the host/SAN level!
• Includes the second interrupt coalescing time when the host sends data
• Includes the time spent writing data to cache + mirroring
• Includes delayed ack time
Useful options :
• -vvsum: displays only 1 line per VV
• -hostsum : displays only 1 line per host
• -v <VV name> : displays only VLUN for specified VV
Things to look for :
• High read/write response times
• Higher response times on some paths only
• Using -hostsum : has the host reached its max read/write bandwidth?
• Single threaded workloads : will have a queue of 1 steadily. Consider disabling interrupt coalescing
• Maximum host/HBA/VM queue length reached for a path/host
Statvv stats represent the IOs done by the array to the VV. They exclude all time spent communicating with the host and all time spent at the FC/iSCSI level.
Statvv includes :
• Cache hits
• IOs caused by the pre-fetching during sequential read IOs. Because of this it is possible to have more KB/s at the VV level than at the VLUN level
• (needs checking) IOs caused by VAAI copy offload (XCOPY)
• IOs caused by cloning operations
• IOs caused by Remote Copy
Things to look for :
• High write response times. Might indicate delayed ack
Useful options :
• -v : shows read/write cache hit/miss stats per VV instead of per node
Things to look for :
• Delayed ack on a device type
• High LockBlock
Things to look for :
• CPUs maxed out
Useful options :
• -hb : shows link heart-beat response time
Things to look for :
• Max write bandwidth reached on a link
• Higher heart-beat round-trip latency on one link than on the others (with -hb)
3PAR performance can be measured / captured using different tools:
- GUI : real-time performance graphs. No historical data. Fine granularity (seconds). On demand only
- System Reporter : historical performance information. Minimum granularity = 1 min, default = 5 min. Continuous
- CLI : real-time performance stats and histograms (buckets). Fine granularity (seconds). On demand only
- Service Processor / STaTS / "Perform" files : very coarse granularity (4 hours). Continuous
- Service Processor / STaTS / Performance Analyzer ("Perfanal" files) : fine granularity (seconds). On demand only
Connect to the Service Processor by pointing a browser to http://<IP_address_of_SP>
Login with the login “spvar” and password “HP3parvar” (SP 2.5.1 MU1 or later) or
“3parvar” (SP 2.5.1 or earlier)
Select “Support” on the left, then “Performance Analyzer”
Click “Select all” and enter the number of iterations to capture
For example, to capture 1 hour of data, enter 360 iterations of 10 seconds
The default of 60 iterations of 10 seconds will correspond to at least 10 minutes of data
Click “Launch Performance Analysis tool”
Once the performance capture is over, the files will be uploaded automatically to the HP 3PAR support center and can be downloaded from STaTS (http://stwebprod.hp.com/)
If the service processor is not configured to send data automatically, the file can be found in /files/<3PAR serial number>/perf_analysis
What is the problem reported by the user?
Is this problem visible at the 3PAR level? If not, it might be a problem higher up the chain
Poor response times on VLUNs:
Is the problem affecting only read IOs or only write IOs?
Is the problem visible on VVs?
High write service time on VLUNs and VVs -> Look for delayed ack with statcmp
What is the queue length at the statvlun level?
• If the queue is steadily at 0 or 1, that is typical of a single-threaded workload. Look for ways of increasing the queue depth at the application level, or review the behaviour of the application
• If the queue is steadily at a value such as 15/16 or 31/32, this indicates that the maximum queue length of the host/HBA/VM… has been reached. Increase the host's HBA queue length
High service time on VLUNs but not on VVs :
• Try disabling interrupt coalescing
• Can be representative of a problem on the host. Statvlun includes some time spent on the host for write IOs
• Have some host ports reached their max bandwidth?
• Are there some host ports that have a higher response time than others?
• For all hosts? Might indicate a problem on the fabric/switch port/SFP…
Delayed ack is happening :
• Look for busy PDs
• Is the maximum write bandwidth of the system reached?
• If using RC synchronous, is the maximum RC bandwidth reached?
• If using RC synchronous, is there delayed ack happening on the remote system?