High Speed Sequential IO on
Windows NT™ 4.0 (sp3)
Erik Riedel (of CMU)
Catharine van Ingen
Jim Gray
http://Research.Microsoft.com/BARC/Sequential_IO/
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary
We Got a Lot of Help
• Brad Waters, Wael Bahaa-El-Din, and Maurice Franklin
  – shared experience, results, tools, and a hardware lab
  – helped us understand NT
  – feedback on our preliminary measurements
• Tom Barclay: iostress benchmark program
• Barry Nolte & Mike Parkes: allocate issues
• Doug Treuting, Steve Mattos + Adaptec: SCSI and Adaptec device drivers
• Bill Courtright, Stan Skelton, Richard Vanderbilt, Mark Regester: loaned us a Symbios Logic array, host adapters, and their expertise
• Will Dahli: helped us understand NT configuration and measurement
• Joe Barrera, Don Slutz & Felipe Cabrera: valuable comments and feedback; helped us understand NTFS internals
• David Solomon: Inside Windows NT 2nd edition draft
The Actors
• Measured & modeled Sequential IO
• Where are the bottlenecks?
• How does it scale with
  – SMP, RAID, new interconnects
• Goals:
  – Balanced bottlenecks
  – Low overhead
  – Scale to many processors (10s)
  – Scale to many disks (100s)
(Diagram: app address space and file cache in memory, memory bus, PCI, host-bus adapter, SCSI bus, disk controller)
PAP (peak advertised Performance) vs
RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
(Diagram: PAP vs RAP at each level of the IO path —
  System Bus: 422 MBps peak vs ~7.2 MBps achieved
  PCI: 133 MBps peak vs ~7.2 MBps achieved
  SCSI: 40 MBps peak vs ~7.2 MBps achieved
  Disk: 10-15 MBps peak vs ~7.2 MBps achieved
  Application Data → File System Buffers → SCSI → Disk)
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary
Two Basic Shapes
• Circle (disk)
– storage frequently returns to same spot
– so less total surface area
• Line (tape)
– Lots more area,
– Longer time to get to the data.
• Key idea: multiplex expensive read/write head
over large storage area:
trade $/GB for access/second
Disk Terms
• Disks are made of platters
• Data is recorded on tracks (circles) on the disk
• Tracks are formatted into fixed-sized sectors
• A pair of read/write heads for each platter
• Heads are mounted on a disk arm
• The client addresses logical blocks; the drive maps them to (cylinder, head, sector)
• Bad blocks are remapped to spare good blocks
Disk Access Time
• Access time = SeekTime
+ RotateTime
+ ReadTime
• Rotate time:
– 5,000 to 10,000 rpm
• ~ 12 to 6 milliseconds per rotation
• ~ 6 to 3 ms rotational latency
• Improved 3x in 20 years
(Diagram: access time components — seek ~6 ms, rotate ~3 ms, read ~1 ms)
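As a worked example (using the 7,200 rpm drives measured later in this deck), the rotational numbers follow directly from the spindle speed:

  $t_{\text{rotation}} = \frac{60\ \text{s/min}}{7200\ \text{rpm}} \approx 8.3\ \text{ms}, \qquad t_{\text{latency}} \approx \tfrac{1}{2}\,t_{\text{rotation}} \approx 4.2\ \text{ms}$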
Disk Seek Time
• Seek time is ~ Sqrt(distance)
  (distance = ½ · acceleration · time²)
• Specs assume the seek is 1/3 of the disk
• Short seeks are common (over 50% are zero length)
• Typical 1/3-stroke seek time: 8 ms
• 4x improvement in 20 years
(Chart: seek time vs seek distance)
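A one-line derivation of the square-root shape, assuming constant acceleration over the whole seek (real seeks also coast and settle, which is why very short seeks flatten out):

  $d = \tfrac{1}{2} a t^2 \;\Rightarrow\; t = \sqrt{2d/a} \;\propto\; \sqrt{d}$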
Read/Write Time: Density
• Time = Size / BytesPerSecond
• Bytes/Second = Speed * Density
– 5 to 15 MBps
• MAD (Magnetic Areal Density)
  – Today: 3 Gbits/inch² (5 Gbpsi in the lab)
  – Rising > 60%/year
  – Paramagnetic limit: 10 Gb/inch²
  – Linear density is the square root: 10x per decade
(Chart: areal density vs year, 1970-2000)
Read/Write Time: Rotational Speed
• Bytes/Second = Speed * Density
• Speed greater at edge of circle
• Speed 3600 -> 10,000 rpm
– 5%/year improvement
• bit rate varies by ~1.5x today
(Chart: throughput (MB/s) vs radial distance (0%-100%) for Fast Wide and Ultra SCSI; annotation: track area πr² grows 4x from r = 1 to r = 2)
Read/Write Time: Zones
• Disks are sectored
– typical: 512 bytes/sector
– Sector is read/write unit
– Failfast: can detect bad sectors.
• Disks are zoned
  – outer zones have more sectors (e.g. 8 sectors/track in inner zones, 14 sectors/track in outer zones)
  – Bytes/second is higher in the outer zones
Disk Access Time
• Access time = SeekTime
+ RotateTime
+ ReadTime
• Other useful facts:
(Diagram: seek ~6 ms improving ~5%/year, rotate ~3 ms improving ~5%/year, transfer ~1 ms improving ~25%/year)
  – Power rises more than size³ (so small is indeed beautiful)
  – Small devices are more rugged
  – Small devices can use plastics (forces are much smaller),
    e.g. bugs fall without breaking anything
The Access Time Myth
The Myth: seek or pick time dominates
The Reality: (1) Queueing dominates
             (2) Transfer dominates BLOBs
             (3) Disk seeks are often short
Implication: many cheap servers are better than one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays
(Pie charts: wait, transfer, rotate, and seek shares of access time)
Storage Ratios Changed
• DRAM/disk media price ratio changed
  – 1970-1990: ~100:1
  – 1990-1995: ~10:1
  – 1995-1997: ~50:1
  – today: ~0.2 $/MB disk vs ~10 $/MB DRAM
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
(Charts: storage price vs time (MB/k$), capacity (GB), disk accesses/second vs time, seeks per second, and bandwidth (MB/s), 1970-2000)
Year 2002 Disks
• Big disk (10 $/GB)
  – 3”
  – 100 GB
  – 150 kaps (k accesses per second)
  – 20 MBps sequential
• Small disk (20 $/GB)
  – 3”
  – 4 GB
  – 100 kaps
  – 10 MBps sequential
• Both running Windows NT™ 7.0?
(see below for why)
Tape & Optical:
Beware of the Media Myth
• Optical is cheap: 200 $/platter, 3 GB/platter
  => 70 $/GB (cheaper than disc)
• Tape is cheap: 30 $/tape, 20 GB/tape
  => 1.5 $/GB (100x cheaper than disc)
The Media Myth
• Tape needs a robot (10 k$ ... 3 m$)
  – 10 ... 1,000 tapes (at 20 GB each) => 10 $/GB ... 150 $/GB (1x…10x cheaper than disc)
• Optical needs a robot (100 k$)
  – 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
• Robots have poor access times
  – Not good for the Library of Congress (25 TB)
  – Data motel: data checks in but it never checks out!
Crazy Disk Ideas
• Disk Farm on a card: surface-mount disks
• Disk (magnetic store) on a chip (micro-machines in silicon)
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM)
The Disk Farm On a Card
• The 100 GB disc card: an array of discs on a 14" card
• Can be used as
  – 100 discs
  – 1 striped disc
  – 10 fault-tolerant discs
  – ....etc
• LOTS of accesses/second and bandwidth
• Life is cheap, it's the accessories that cost ya.
• Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card).
Functionally Specialized Cards
• Storage (ASIC + P-mips processor + M MB DRAM)
• Network (ASIC)
• Display (ASIC)
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a PostScript engine
    • cpu
    • memory
    • software
    • a spooler (soon)
  – and… a print engine.
All Device Controllers will be Cray 1’s
• TODAY
  – Disk controller is a 10-mips risc engine with 2 MB DRAM
  – NIC is similar power
• SOON
  – Will become 100-mips systems with 100 MB DRAM.
• They are nodes in a federation
  (can run Oracle on NT in the disk controller).
• Advantages
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)
(Diagram: central processor & memory and smart devices on a terabyte backplane)
System On A Chip
• Integrate processing with memory on one chip
  – chip is 75% memory now
  – 1 MB cache >> 1960 supercomputers
  – a 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM,… projects abound
• Integrate networking with processing on one chip
  – the system bus is a kind of network
  – ATM, FiberChannel, Ethernet,… logic on chip
  – Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
With Tera Byte Interconnect
and Super Computer Adapters
• Processing is incidental to
– Networking
– Storage
– UI
• Disk Controller/NIC is
– faster than device
– close to device
  – Can borrow device package & power
• So use idle capacity for computation.
• Run app in device.
Implications
Conventional
• Offload device handling to NIC/HBA
• Higher-level protocols: I2O, NASD, VIA…
• SMP and cluster parallelism is important.
Radical
• Move the app to the NIC/device controller
• Higher-higher-level protocols: CORBA / DCOM.
• Cluster parallelism is VERY important.
(Diagram: central processor & memory vs smart devices on a terabyte backplane)
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? DCOM? IIOP? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
(Diagram: Applications on each node talking via RPC / streams / datagrams, over VIAL/VIPL, over the wire(s))
Will He Ever Get to The Point?
• I thought this was about NTFS sequential IO.
• Why is he telling me all this other crap?
• It is relevant background.
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary
The Actors
• Processor - Memory bus
• Memory
  – holds the file cache and app data
• The Disk: writes, stores, reads data
• The Disk Controller:
  – manages the drive (error handling)
  – reads & writes the drive
  – converts SCSI commands to disk actions
  – may buffer or do RAID
• The SCSI bus: carries bytes
• The Host-Bus Adapter:
  – protocol converter to the system bus
  – may do RAID
• Application
  – reads and writes memory
(Diagram: app address space and file cache in memory, memory bus, PCI, host-bus adapter, SCSI bus, disk controller)
Sequential vs Random IO
• Random IO is typically small IO (8KB)
– seek+rotate+transfer is ~ 10 ms
– 100 IO per second
– 800 KB per second
• Sequential IO is typically large IO
– almost no seek (one per cylinder read/written)
– No rotational delay (reading whole disk track)
– Runs at MEDIA speed: 8 MB per second
• Sequential is 10x more bandwidth than random!
Basic File Concepts
• Buffered:
  – File reads/writes go to the file cache
  – The file system does pre-fetch, post-write, and aggregation
  – Unbuffered IO bypasses the file cache
  – Data is written to disk at file close, or by LRU / lazy write
• Overlapped:
– requests are pipelined
– completions via events, completion ports,
– A simpler alternative to multi-threaded IO.
• Temporary Files:
– Files written to cache, not flushed on close.
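A minimal sketch of how these modes map onto Win32 CreateFile flags (file names are placeholders and error checking is omitted; the flags themselves — FILE_FLAG_SEQUENTIAL_SCAN, FILE_FLAG_NO_BUFFERING, FILE_FLAG_OVERLAPPED, FILE_ATTRIBUTE_TEMPORARY — are standard Win32):

#include <windows.h>

/* Sketch: how the concepts above map onto CreateFile flags.
   Real code should check each handle against INVALID_HANDLE_VALUE
   and CloseHandle() it when done. */
void OpenModeExamples(void)
{
    /* Buffered, sequential: goes through the file cache; NTFS
       prefetches and write-behinds in 64 KB units. */
    HANDLE hBuffered = CreateFile("C:\\data.dat", GENERIC_READ, 0, NULL,
        OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    /* Unbuffered: bypasses the file cache; the buffer, the offset,
       and the request size must all be sector aligned. */
    HANDLE hUnbuffered = CreateFile("C:\\data.dat", GENERIC_READ, 0, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);

    /* Overlapped: requests can be pipelined; completions arrive
       via events or IO completion ports. */
    HANDLE hOverlapped = CreateFile("C:\\data.dat", GENERIC_WRITE, 0, NULL,
        OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    /* Temporary: written to the cache and, with luck, never flushed. */
    HANDLE hTemp = CreateFile("C:\\scratch.tmp",
        GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
        FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE, NULL);
}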
Experiment Background
• Used an Intel/Gateway 2000 G6 200 MHz Pentium Pro
• 64 MB DRAM (4x interleave)
• 32-bit PCI
• Adaptec 2940 Fast-Wide (20 MBps) and Ultra-Wide (40 MBps) controllers
• Seagate 4GB SCSI disks (fast and ultra)
– (7200 rpm, 7-15 MBps “internal”)
• NT 4.0 SP3, NTFS
• i.e.: modest 1997 technology.
• Not multi-processor, Not DEC Alpha, Some RAID
Simplest Possible Code
#include <stdio.h>
#include <windows.h>
int main()
{ const int iREQUEST_SIZE = 65536;
  char cRequest[iREQUEST_SIZE];
  unsigned long ibytes;
  HANDLE hFile =
    CreateFile("C:\\input.dat",                              // name
               GENERIC_READ,                                 // desired access
               0, NULL,                                      // share & security
               OPEN_EXISTING,                                // pre-existing file
               FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_SEQUENTIAL_SCAN,
               NULL);                                        // file template
  while( ReadFile(hFile, cRequest, iREQUEST_SIZE, &ibytes, NULL) )   // do read
  { if (ibytes == 0) break;                                  // break on end of file
    /* do something with the data */ };
  CloseHandle(hFile);
  return 0;
}
• Error checking adds some more, but still, it's easy.
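For reference, a hedged sketch of what that error checking might look like (this is not the authors' code; it just illustrates the usual GetLastError pattern):

#include <stdio.h>
#include <windows.h>

int ReadWithChecks(const char *szPath)
{ char cRequest[65536];
  unsigned long ibytes;
  HANDLE hFile = CreateFile(szPath, GENERIC_READ, 0, NULL,
                            OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
  if (hFile == INVALID_HANDLE_VALUE)
  { fprintf(stderr, "CreateFile failed, error %lu\n", GetLastError());
    return 1;
  }
  for (;;)
  { if (!ReadFile(hFile, cRequest, sizeof(cRequest), &ibytes, NULL))
    { fprintf(stderr, "ReadFile failed, error %lu\n", GetLastError());
      break;
    }
    if (ibytes == 0) break;            /* end of file */
    /* do something with the data */
  }
  CloseHandle(hFile);
  return 0;
}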
The Best Case: Temp File, NO IO
• Temp-file reads/writes go to the file system cache.
• The program uses a small (in-cpu-cache) buffer.
• So write/read time is the bus move time (3x better than copy).
• Paradox: the fastest way to move data is to write it, then read it.
• This hardware is limited to ~150 MBps per processor.
(Chart: Temp read ~148 MBps, Temp write ~136 MBps, memcpy() ~54 MBps)
Out of the Box Disk File Performance
• One NTFS disk
• Buffered read
• NTFS does 64 KB read-ahead
  – if you ask for FILE_FLAG_SEQUENTIAL_SCAN
  – or if it thinks you are sequential
• NTFS does 64 KB write-behind
  – under the same conditions
  – aggregates many small IOs into a few big IOs
Synchronous Buffered Read/Write
• Net: default out-of-the-box performance is good.
• Read throughput is GREAT!
• Write throughput is 40% of read
• 20 ms/MB ~ 2 instructions/byte!
• WCE is fast but dangerous
• CPU will saturate at 50 MBps
(Charts: out-of-the-box throughput (MB/s) and overhead (cpu ms/MB) vs request size, 2 KB - 192 KB, for read, write, and write+WCE)
Write Multiples of Cluster Size
• For IOs of less than 4 KB, if OVERWRITING data, the file system
  – reads the 4 KB page
  – overwrites the bytes
  – then writes the page back
• Cuts throughput by 2x - 3x
• So, write in multiples of the cluster size (see the sketch after this slide).
• 2 KB writes are
  – 5x slower than reads
  – 2x or 3x slower than 4 KB writes
(Chart: out-of-the-box throughput vs request size for read, write, and write+WCE)
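One way to find the volume's cluster size at run time, so request sizes can be rounded to a multiple of it, is a sketch like this (GetDiskFreeSpace is a standard Win32 call; the helper name is made up):

#include <windows.h>

/* Returns the cluster size in bytes for the given root, e.g. "C:\\",
   or 0 on failure.  Writing in multiples of this avoids the
   read-modify-write penalty described above. */
DWORD ClusterSize(const char *szRoot)
{
    DWORD dwSectorsPerCluster, dwBytesPerSector, dwFreeClusters, dwTotalClusters;
    if (!GetDiskFreeSpace(szRoot, &dwSectorsPerCluster, &dwBytesPerSector,
                          &dwFreeClusters, &dwTotalClusters))
        return 0;
    return dwSectorsPerCluster * dwBytesPerSector;
}

A request of nBytes would then be rounded up to ((nBytes + cluster - 1) / cluster) * cluster.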
What is WCE?
• Write Cache Enable lets disk controller respond
“yes” before data is on disk.
• Dangerous
  – If power fails, WCE can destroy data integrity
  – Most RAID controllers have non-volatile RAM, which makes WCE safe (invisible) if they do RESET right.
• About 50% of the disks we see have WCE on.
  You can turn it off with 3rd-party SCSI utilities.
• As seen later: 3-deep request buffering gets similar performance.
Synchronous Un-Buffered Read/Write
• Reads do well above 2 KB
• Writes are terrible
• WCE helps writes
• Ultra media is 1.5x faster
• 1/2-power point:
  – Read: 4 KB
  – Write: 64 KB with no WCE, 4 KB with WCE
(Charts: unbuffered throughput and WCE unbuffered write throughput vs request size, for fast and ultra disks)
Cost of Un-Buffered IO
• Saves the buffer memory copy.
• Was 20 ms/MB, now 2 ms/MB
• Cost/request ~ 120 µs (wow)
• Note: unbuffered IO must be sector aligned (see the sketch after this slide).
• Buffered: saturates the CPU at 50 MB/s
• Unbuffered: saturates the CPU at 500 MB/s
(Charts: CPU ms/MB, CPU ms/request, and CPU utilization vs request size; CPU is idle because non-WCE writes are so slow)
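A minimal sketch of an unbuffered sequential read, assuming VirtualAlloc's page alignment (4 KB) satisfies the drive's 512-byte sector alignment and skipping most error handling:

#include <windows.h>

#define REQUEST_SIZE 65536              /* a multiple of the sector size */

/* FILE_FLAG_NO_BUFFERING bypasses the file cache, so the buffer,
   the file offset, and the request size must all be sector aligned.
   VirtualAlloc returns page-aligned memory, which is enough here. */
void UnbufferedRead(const char *szPath)
{
    char  *pBuffer = (char *)VirtualAlloc(NULL, REQUEST_SIZE,
                                          MEM_COMMIT, PAGE_READWRITE);
    HANDLE hFile   = CreateFile(szPath, GENERIC_READ, 0, NULL,
                                OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    unsigned long ibytes;
    while (ReadFile(hFile, pBuffer, REQUEST_SIZE, &ibytes, NULL) && ibytes > 0)
    {   /* do something with the data */
    }
    CloseHandle(hFile);
    VirtualFree(pBuffer, 0, MEM_RELEASE);
}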
Summary
• Out of the box
  – Read RAP ~ PAP (thanks NTFS)
  – Write RAP ~ PAP/10 … PAP/2
• Buffering small IO is great!
• Buffering large IO is expensive
• WCE is a dangerous way out, but frequently used.
• Parallelism tricks:
  – deep requests (async, overlap)
  – striping (raid0, raid5)
  – allocation and other tricks
(Charts: out-of-the-box buffered and unbuffered throughput, and overhead, vs request size)
Bottleneck Analysis
• Drawn to linear scale
(Diagram, to linear scale: Disk R/W ~9 MBps; MemCopy ~50 MBps; Memory read/write ~150 MBps; theoretical bus bandwidth 422 MBps = 66 MHz x 64 bits)
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary
Kinds of Parallel Execution
• Pipeline
• Partition: outputs split N ways, inputs merge M ways
(Diagram: a pipeline of sequential steps; any sequential step partitioned across parallel sequential steps)
Pipeline Requests to One Disk
• Does not help reads much
  – they were already pipelined by the disk controller
• Helps writes a LOT above 16 KB
  – 3-deep matches WCE
• Pipelined (async, overlapped) IO is a BIG win (RAP ~ 85% of PAP); see the sketch after this slide
(Charts: read and write throughput for 1 fast disk at various request depths - 1, 3, and 8 buffers)
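A sketch of how a deep read pipeline might be issued with overlapped IO (depth and request size are illustrative; production code should also distinguish ERROR_IO_PENDING from real ReadFile failures):

#include <string.h>
#include <windows.h>

#define DEPTH        3                  /* requests kept in flight */
#define REQUEST_SIZE 65536

void PipelinedRead(const char *szPath)
{
    char       *apBuf[DEPTH];
    OVERLAPPED  aOv[DEPTH];
    LONGLONG    llOffset = 0;
    int         i;
    HANDLE      hFile = CreateFile(szPath, GENERIC_READ, 0, NULL,
                    OPEN_EXISTING,
                    FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

    /* Prime the pipeline: DEPTH reads at consecutive offsets. */
    for (i = 0; i < DEPTH; i++)
    {   apBuf[i] = (char *)VirtualAlloc(NULL, REQUEST_SIZE,
                                        MEM_COMMIT, PAGE_READWRITE);
        memset(&aOv[i], 0, sizeof(aOv[i]));
        aOv[i].hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);
        aOv[i].Offset     = (DWORD)(llOffset & 0xFFFFFFFF);
        aOv[i].OffsetHigh = (DWORD)(llOffset >> 32);
        ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aOv[i]);
        llOffset += REQUEST_SIZE;
    }
    /* Steady state: as each request completes, consume it and reissue. */
    for (i = 0; ; i = (i + 1) % DEPTH)
    {   DWORD dwBytes;
        if (!GetOverlappedResult(hFile, &aOv[i], &dwBytes, TRUE) || dwBytes == 0)
            break;                      /* end of file or error */
        /* do something with apBuf[i] */
        aOv[i].Offset     = (DWORD)(llOffset & 0xFFFFFFFF);
        aOv[i].OffsetHigh = (DWORD)(llOffset >> 32);
        ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aOv[i]);
        llOffset += REQUEST_SIZE;
    }
    /* Drain any still-outstanding requests before cleanup. */
    for (i = 0; i < DEPTH; i++)
    {   DWORD dwBytes;
        GetOverlappedResult(hFile, &aOv[i], &dwBytes, TRUE);
        CloseHandle(aOv[i].hEvent);
        VirtualFree(apBuf[i], 0, MEM_RELEASE);
    }
    CloseHandle(hFile);
}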
Parallel Access To Data?
• At 10 MB/s, it takes 1.2 days to scan 1 Terabyte.
• 1,000-way parallel (10 GB/s): a 100-second SCAN.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
Pipeline Access: Stripe Across 4 disks
• An 8-deep pipeline matches WCE
• Stripes NEED pipelining
• 3-deep is good enough
• Saturates at ~15 MBps
(Charts: read and write throughput vs request size for 4-disk stripes at request depths of 1, 3, and 8 buffers)
3 Stripes and You're Out!
• 3 disks can saturate the adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk striping is cheap
(Charts: read and write throughput vs request size for 1-4 disk stripes, 3-deep, Fast SCSI; and CPU ms/MB vs request size)
Parallel SCSI Busses Help
• A second SCSI bus nearly doubles (2x) read and WCE-write throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)
(Chart: one vs two SCSI busses - read, write, and WCE throughput vs request size)
File System Buffering & Stripes
(UltraWide Drives)
• FS buffering helps small reads
• FS-buffered writes peak at 12 MBps
• Read peaks at 30 MBps
• Write peaks at 20 MBps
• 3-deep async helps
(Charts: three disks at 1-deep vs 3-deep - FS read, FS write+WCE, read, write+WCE throughput vs request size)
PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE.
(Diagram: PAP vs RAP along the IO path -
  System Bus: 422 MBps vs 142 MBps
  PCI: 133 MBps vs 72 MBps
  SCSI: 40 MBps vs 31 MBps
  Disks: 10-15 MBps vs 9 MBps
  Application Data → File System → SCSI → Disks)
Bottleneck Analysis
• NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI bus:
  – ~ 65 MBps unbuffered read
  – ~ 43 MBps unbuffered write
  – ~ 40 MBps buffered read
  – ~ 35 MBps buffered write
(Diagram: each adapter ~30 MBps, PCI read/write ~70 MBps, memory ~150 MBps)
Hypothetical Bottleneck Analysis
• NTFS read/write with 12 disks, 4 SCSI busses, 2 PCI busses
  (not measured; we had only one PCI bus available, the 2nd one was “internal”):
  – ~ 120 MBps unbuffered read
  – ~ 80 MBps unbuffered write
  – ~ 40 MBps buffered read
  – ~ 35 MBps buffered write
(Diagram: each adapter ~30 MBps, each PCI ~70 MBps, memory read/write ~150 MBps)
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary
Stripes, Mirrors, Parity (RAID 0,1, 5)
• RAID 0: Stripes
  – bandwidth
  – layout: 0,3,6,.. | 1,4,7,.. | 2,5,8,..
• RAID 1: Mirrors, Shadows,…
  – fault tolerance
  – reads faster, writes 2x slower
  – layout: 0,1,2,.. | 0,1,2,..
• RAID 5: Parity
  – fault tolerance
  – reads faster
  – writes 4x or 6x slower (see the note after this slide)
  – layout: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
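For context on the 4x figure: under the usual read-modify-write scheme (not necessarily the exact accounting behind the 6x case), one small logical write costs four disk IOs, because the new parity is computed as

  $P_{\text{new}} = P_{\text{old}} \oplus D_{\text{old}} \oplus D_{\text{new}}$

so the controller must read the old data, read the old parity, write the new data, and write the new parity.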
Where To Do RAID?
• RAID in host (= NT)
– no special hardware
  – FtDisk is responsible for data integrity
– can stripe across multiple busses/adapters
• RAID in Adapter
  – Gets safe WCE if the cache is non-volatile
– Offloads host
– Not good for WolfPack
• RAID in disk controller
  – Gets safe WCE if the cache is non-volatile
– offloads host
– best data integrity for MSCS
NT Host-Based Striping is OK
• 3 Ultra disks per stripe.
• WCE is enabled in all cases.
• Requests are 3-deep.
(Charts: striping read and write throughput vs request size - host-based, controller-based, and array-based striping)
Surprise: Good NT RAID5 Performance
• At 8 KB, performance is similar.
• Write performance is bad in all cases.
• This ignores read performance in the case of a disk fault.
• Above 32 KB requests, CPU write cost is significant.
(Charts: RAID5 throughput vs request size and RAID5 CPU ms/MB - array read, array write, host read, host write)
Controller & Adapters are Complex
• Minimum response time ~300 µs
• Typically 1 ms for 8 KB
• Many strange effects (e.g. the Ultra cache is busted).
(Chart: elapsed time (ms) vs request size - controller cache vs controller prefetch for narrow, fast, and ultra)
Bus Overhead Grows
• Small requests (8KB) are more than 1/2 overhead.
• 3x more disks means 5x more overhead
(Chart: SCSI overhead grows with disks - SCSI bus utilization, data vs overhead, for 1 disk at 8 KB, 1 disk at 64 KB, 2 disks at 64 KB, and 3 disks at 64 KB)
Allocate/Extend
Suppresses Async Writes
• When you allocate space, NT zeros it (both DRAM and disk).
• This prevents others from reading data you “delete”.
• This “kills” pipelined writes: 1-deep equals 8-deep when extending.
• Solution: pre-allocate or reuse files whenever you can (see the sketch after this slide).
• Do VERY large writes.
(Chart: allocate/extend while writing - 4-disk and 1-disk 8-deep no-extend writes vs extending writes, throughput vs request size)
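One way to pre-size a file before the write pipeline starts, so the extend does not happen in the middle of deep asynchronous writes, is a sketch like this (names and sizes are illustrative; how much zeroing it avoids depends on NTFS valid-data-length handling, so reusing an already-written file remains the surer trick):

#include <windows.h>

/* Create a file at its final size up front, then write it sequentially.
   This keeps allocate/extend (and its synchronous zeroing) out of the
   pipelined-write path. */
HANDLE PreallocateFile(const char *szPath, LONGLONG llBytes)
{
    LONG   lHigh = (LONG)(llBytes >> 32);
    HANDLE hFile = CreateFile(szPath, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return hFile;
    SetFilePointer(hFile, (LONG)(llBytes & 0xFFFFFFFF), &lHigh, FILE_BEGIN);
    SetEndOfFile(hFile);                         /* reserve the space */
    SetFilePointer(hFile, 0, NULL, FILE_BEGIN);  /* back to the start */
    return hFile;                                /* now write sequentially */
}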
Stripe Alignment: Chunk vs Cluster
• An unaligned 64 KB read becomes two reads: 4 KB and 60 KB
  – twice as many physical requests.
• A stripe has a chunk size (64 KB here).
• A volume has a cluster size
  – default is 4 KB (for big disks).
(Chart: aligned vs unaligned read and write throughput, 4-disk ultra stripe, 3-deep, vs request size; diagram of a 64 KB request splitting across a chunk boundary)
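A tiny sketch of keeping request offsets on chunk boundaries, so a 64 KB request maps to exactly one disk in the stripe (the 64 KB chunk size is from the slide; the helper is made up):

#define CHUNK_SIZE 65536                /* stripe chunk size */

/* Align a file offset down to the chunk boundary so a CHUNK_SIZE
   request does not split into a 4 KB piece and a 60 KB piece on
   two different disks. */
__int64 AlignToChunk(__int64 llOffset)
{
    return llOffset - (llOffset % CHUNK_SIZE);
}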
Other Issues.
• Multi-processor
• DEC Alpha
• Memory-mapped files
• Fragmentation
• Ultra-2, Merced, FC,…
• NT5
  – Veritas volume manager
  – 64-bit
  – performance improvements
  – I2O,...
Summary
• Read is easy, write is hard
  – SCSI & FS read prefetch works
  – Read RAP ~ 0.8 PAP
  – Write RAP ~ 0.05 PAP to 0.8 PAP
• NTFS buffering is good for small IOs
  – coalesces into 64 KB requests
• Bigger is better: 8 KB is ok, 64 KB is best
• Deep requests help
  – 3-deep is good, 8-deep is better
• WCE is fast but dangerous
  – 3-deep writes approximate WCE for > 8 KB requests
• 3 disks can saturate a SCSI bus,
  both Fast-Wide (15 MBps) and Ultra-Wide (31 MBps)
• Memory speed is the ultimate limit
  – with multiple disks and multiple PCI busses
  – 50 MBps copy, 150 MBps r/w
• Avoid FS buffering above 16 KB
  – costs 20 ms/MB of cpu
• Preallocate & reuse files when possible
  – avoids Allocate/Extend synchronous IO
• Software RAID5 performs well
  – but fault tolerance is a problem
  – writes are expensive in any case
Pitfalls
• Read-before-write: 2 KB buffered IO
• Allocate/Extend: synchronous write
• Zoned disks => 50% speed bump
• RAID alignment => 20% speed bump
More Details at
• Web site has
  – Paper
  – Sample code
  – Test program we used
  – These slides
http://research.Microsoft.com/BARC/Sequential_IO/
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls
• Summary