High Speed Sequential IO on Windows NT™ 4.0 (SP3)
Erik Riedel (CMU), Catharine van Ingen, Jim Gray
http://Research.Microsoft.com/BARC/Sequential_IO/

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary

We Got a Lot of Help
• Brad Waters, Wael Bahaa-El-Din, and Maurice Franklin: shared experience, results, tools, and a hardware lab; helped us understand NT; gave feedback on our preliminary measurements.
• Tom Barclay: iostress benchmark program.
• Barry Nolte & Mike Parkes: allocate issues.
• Doug Treuting, Steve Mattos + Adaptec: SCSI and Adaptec device drivers.
• Bill Courtright, Stan Skelton, Richard Vanderbilt, Mark Regester: loaned us a Symbios Logic array, host adapters, and their expertise.
• Will Dahli: helped us understand NT configuration and measurement.
• Joe Barrera, Don Slutz & Felipe Cabrera: valuable comments and feedback; helped us understand NTFS internals.
• David Solomon: Inside Windows NT, 2nd edition draft.

The Actors
• Measured & modeled sequential IO
• Where are the bottlenecks?
• How does it scale with
  – SMP, RAID, new interconnects
• Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s)
[Diagram: the actors along the IO path: application address space and file cache in memory, memory bus, PCI, host adapter, SCSI bus, disk controller.]

PAP (peak advertised performance) vs RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram: nominal rates along the path: system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk media 10-15 MBps; application data, file system buffers, and the disk all moving ~7.2 MBps.]

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary

Two Basic Shapes
• Circle (disk)
  – storage frequently returns to the same spot
  – so less total surface area
• Line (tape)
  – lots more area
  – longer time to get to the data
• Key idea: multiplex an expensive read/write head over a large storage area: trade $/GB for accesses/second.

Disk Terms
• Disks are called platters.
• Data is recorded on tracks (circles) on the disk.
• Tracks are formatted into fixed-sized sectors.
• A pair of read/write heads for each platter,
• mounted on a disk arm.
• Clients address logical blocks (cylinder, head, sector).
• Bad blocks are remapped to spare good blocks.

Disk Access Time
• Access time = SeekTime + RotateTime + ReadTime
• Rotate time:
  – 5,000 to 10,000 rpm
  – ~12 to 6 milliseconds per rotation
  – ~6 to 3 ms rotational latency
  – improved 3x in 20 years
[Figure: access-time components: seek ~6 ms, rotate ~3 ms, read ~1 ms.]

Disk Seek Time
• Seek time is ~ Sqrt(distance)  (distance = 1/2 × acceleration × time²)
• Specs assume a seek of 1/3 of the disk.
• Short seeks are common (over 50% are zero length).
• Typical 1/3-stroke seek time: 8 ms.
• 4x improvement in 20 years.

Read/Write Time: Density
• Time = Size / BytesPerSecond
• Bytes/Second = Speed × Density
  – 5 to 15 MBps
• MAD (Magnetic Areal Density)
  – today 3 Gbits/inch², 5 Gbits/inch² in the lab
  – rising > 60%/year
  – paramagnetic limit: 10 Gb/inch²
  – linear density is the sqrt: ~10x per decade
[Chart: areal density vs time, 1970-2000, log scale.]

Read/Write Time: Rotational Speed
• Bytes/Second = Speed × Density
• Speed is greater at the edge of the circle.
• Speed 3,600 -> 10,000 rpm
  – 5%/year improvement
• Bit rate varies by ~1.5x today.
[Figure: area grows as r² (r = 1 vs r = 2), so the outer part of the platter holds most of the data; chart: throughput (MB/s) vs radial distance (0-100%) for Fast Wide SCSI and Ultra SCSI drives.]
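Putting these numbers together: a minimal back-of-the-envelope sketch in C, using round values from the slides above (7,200 rpm, 8 ms average seek, 8 MB/s media rate; the parameters are illustrative, not measured):

    #include <stdio.h>

    /* Back-of-the-envelope disk access time, using round numbers from the
       slides above (illustrative, not measured values). */
    int main(void)
    {
        double rpm        = 7200.0;     /* spindle speed             */
        double seek_ms    = 8.0;        /* average (1/3-stroke) seek */
        double media_MBps = 8.0;        /* sequential media rate     */

        double rotation_ms = 60.0 * 1000.0 / rpm;   /* ~8.3 ms per rotation */
        double rot_latency = rotation_ms / 2.0;     /* ~4.2 ms average      */
        double xfer8KB_ms  = (8.0 / 1024.0) / media_MBps * 1000.0;  /* ~1 ms */

        double random_ms   = seek_ms + rot_latency + xfer8KB_ms;
        printf("random 8KB access: %.1f ms -> %.0f IO/s -> %.2f MB/s\n",
               random_ms, 1000.0 / random_ms,
               (8.0 / 1024.0) * 1000.0 / random_ms);
        printf("sequential read:   ~%.0f MB/s (media rate)\n", media_MBps);
        return 0;
    }

A random 8 KB access works out to roughly 13 ms (about 76 IOs/s, ~0.6 MB/s), while a sequential read runs at the ~8 MB/s media rate: the 10x gap the measurement slides return to.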
Read/Write Time: Zones
• Disks are sectored.
  – typical: 512 bytes/sector
  – the sector is the read/write unit
  – failfast: can detect bad sectors
• Disks are zoned.
  – outer zones have more sectors per track
  – bytes/second is higher in the outer zones
[Figure: inner zone with 8 sectors/track, outer zone with 14 sectors/track.]

Disk Access Time
• Access time = SeekTime + RotateTime + ReadTime
[Figure: seek ~6 ms improving ~5%/year, rotate ~3 ms improving ~5%/year, read ~1 ms improving ~25%/year.]
• Other useful facts:
  – Power rises faster than size³ (so small is indeed beautiful).
  – Small devices are more rugged.
  – Small devices can use plastics (forces are much smaller); e.g. bugs fall without breaking anything.

The Access Time Myth
• The Myth: seek or pick time dominates.
• The Reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short.
• Implication: many cheap servers are better than one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is now obvious for disk arrays; it will be obvious for tape arrays.
[Figure: pie charts of access time broken into wait, transfer, rotate, and seek.]

Storage Ratios Changed
• DRAM/disk media price ratio changed
  – 1970-1990: 100:1
  – 1990-1995: 10:1
  – 1995-1997: 50:1
  – today: ~0.2 $/MB disk, 10 $/MB DRAM
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
[Charts: Storage Price vs Time (MB per kilo-dollar), Disk Accesses per Second vs Time, and Disk Performance vs Time (capacity in GB, seeks per second, bandwidth in MB/s), 1980-2000, log scale.]

Year 2002 Disks
• Big disk (10 $/GB)
  – 3"
  – 100 GB
  – 150 kaps (k accesses per second)
  – 20 MBps sequential
• Small disk (20 $/GB)
  – 3"
  – 4 GB
  – 100 kaps
  – 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)

Tape & Optical: Beware of the Media Myth
• Optical is cheap: 200 $/platter, 3 GB/platter => 70 $/GB (cheaper than disc).
• Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).

The Media Myth
• Tape needs a robot (10 k$ ... 3 m$)
  – 10 ... 1,000 tapes (at 20 GB each) => 10 $/GB ... 150 $/GB (1x…10x cheaper than disc)
• Optical needs a robot (100 k$)
  – 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
• Robots have poor access times.
  – Not good for the Library of Congress (25 TB).
  – Data motel: data checks in, but it never checks out!

Crazy Disk Ideas
• Disk farm on a card: surface-mount disks.
• Disk (magnetic store) on a chip: micro-machines in silicon.
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM).

The Disk Farm On a Card
• The 100 GB disc card: an array of discs.
• Can be used as
  – 100 discs
  – 1 striped disc
  – 10 fault-tolerant discs
  – ...etc
• LOTS of accesses/second and bandwidth.
[Figure: a 14" card populated with surface-mount discs.]
• Life is cheap, it's the accessories that cost ya.
• Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card).

Functionally Specialized Cards
• Storage, Network, Display
• Each card: an ASIC plus a P-mips processor and M MB of DRAM
  – Today: P = 50 mips, M = 2 MB
  – In a few years: P = 200 mips, M = 64 MB

It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer.
• You get
  – several network interfaces
  – a PostScript engine
    • cpu, memory, software, a spooler (soon)
  – and… a print engine.

All Device Controllers will be Cray 1's
• TODAY
  – The disk controller is a 10 mips RISC engine with 2 MB DRAM.
  – A NIC is similar power.
• SOON
  – They will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in the disk controller!).
[Figure: central processor & memory connected to smart device controllers over a terabyte backplane.]
• Advantages
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)

System On A Chip
• Integrate processing with memory on one chip
  – the chip is 75% memory now
  – 1 MB cache >> 1960s supercomputers
  – a 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM,… projects abound
• Integrate networking with processing on one chip
  – the system bus is a kind of network
  – ATM, FibreChannel, Ethernet,… logic on chip
  – direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.

With Tera Byte Interconnect and Super Computer Adapters
• Processing is incidental to
  – networking
  – storage
  – UI
• The disk controller/NIC is
  – faster than the device
  – close to the device
  – can borrow the device package & power
• So use the idle capacity for computation.
• Run the app in the device.

Implications
• Conventional
  – Offload device handling to NIC/HBA.
  – Higher-level protocols: I2O, NASD, VIA…
  – SMP and cluster parallelism is important.
• Radical
  – Move the app to the NIC/device controller.
  – Higher-higher-level protocols: CORBA / DCOM.
  – Cluster parallelism is VERY important.
[Figure: central processor & memory and smart devices on a terabyte backplane.]

How Do They Talk to Each Other?
• Each node has an OS.
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? DCOM? IIOP? RMI?
  – one or all of the above
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
[Figure: two protocol stacks: Applications over RPC / streams / datagrams, over VIAL/VIPL, over the wire(s).]

Will He Ever Get to The Point?
• I thought this was about NTFS sequential IO.
• Why is he telling me all this other crap?
• It is relevant background.

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary

The Actors
• Processor - memory bus
• Memory
  – holds the file cache and app data
• The disk: writes, stores, reads data.
• The disk controller:
  – manages the drive (error handling)
  – reads & writes the drive
  – converts SCSI commands to disk actions
  – may buffer or do RAID
• The SCSI bus: carries bytes.
• The host-bus adapter:
  – protocol converter to the system bus
  – may do RAID
• Application: reads and writes memory.
[Diagram: app address space and file cache in memory, memory bus, PCI, adapter, SCSI bus, controller.]

Sequential vs Random IO
• Random IO is typically small IO (8KB)
  – seek + rotate + transfer is ~10 ms
  – 100 IOs per second
  – 800 KB per second
• Sequential IO is typically large IO
  – almost no seek (one per cylinder read/written)
  – no rotational delay (reading the whole disk track)
  – runs at MEDIA speed: 8 MB per second
• Sequential is 10x more bandwidth than random!

Basic File Concepts
• Buffered:
  – File reads/writes go to the file cache.
  – The file system does pre-fetch, post-write, and aggregation.
  – Unbuffered IO bypasses the file cache.
  – Data is written to disk at file close, or by LRU replacement, or by the lazy writer.
• Overlapped:
  – Requests are pipelined.
  – Completions via events or completion ports.
  – A simpler alternative to multi-threaded IO.
• Temporary files:
  – Files are written to the cache, not flushed on close.

Experiment Background
• Used an Intel/Gateway 2000 G6: 200 MHz Pentium Pro
• 64 MB DRAM (4x interleave)
• 32-bit PCI
• Adaptec 2940 Fast-Wide (20 MBps) and Ultra-Wide (40 MBps) controllers
• Seagate 4 GB SCSI disks (Fast and Ultra)
  – (7,200 rpm, 7-15 MBps "internal")
• NT 4.0 SP3, NTFS
• i.e.: modest 1997 technology.
• Not multi-processor, not DEC Alpha; some RAID.
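The measurements that follow cover temp, buffered, unbuffered, and overlapped ("deep") access; these modes differ mainly in how the file is opened. A minimal sketch of the CreateFile flag combinations the modes imply (the exact options used for the measurements are described in the accompanying paper; file names are illustrative, and the fully buffered case appears on the next slide):

    #include <windows.h>

    int main(void)
    {
        /* Buffered sequential: goes through the file cache, 64 KB read-ahead. */
        HANDLE hBuffered = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                                      OPEN_EXISTING,
                                      FILE_FLAG_SEQUENTIAL_SCAN, NULL);

        /* Unbuffered: bypasses the file cache; buffer, offset, and request
           size must be sector aligned.                                       */
        HANDLE hUnbuffered = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                                        OPEN_EXISTING,
                                        FILE_FLAG_NO_BUFFERING, NULL);

        /* Overlapped (async): lets the program keep several requests
           outstanding ("3-deep", "8-deep") without extra threads.            */
        HANDLE hOverlapped = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                                        OPEN_EXISTING,
                                        FILE_FLAG_NO_BUFFERING |
                                        FILE_FLAG_OVERLAPPED, NULL);

        /* Temporary: written to the cache, not flushed on close.             */
        HANDLE hTemp = CreateFile("C:\\temp.dat", GENERIC_READ | GENERIC_WRITE,
                                  0, NULL, CREATE_ALWAYS,
                                  FILE_ATTRIBUTE_TEMPORARY, NULL);

        CloseHandle(hBuffered);   CloseHandle(hUnbuffered);
        CloseHandle(hOverlapped); CloseHandle(hTemp);
        return 0;
    }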
Simplest Possible Code

    #include <stdio.h>
    #include <windows.h>

    int main()
    {
        const int iREQUEST_SIZE = 65536;
        char cRequest[iREQUEST_SIZE];
        unsigned long ibytes;
        HANDLE hFile = CreateFile("C:\\input.dat",        // name
                                  GENERIC_READ,           // desired access
                                  0, NULL,                // share & security
                                  OPEN_EXISTING,          // pre-existing file
                                  FILE_ATTRIBUTE_TEMPORARY |
                                  FILE_FLAG_SEQUENTIAL_SCAN,
                                  NULL);                  // file template
        while( ReadFile(hFile, cRequest, iREQUEST_SIZE, &ibytes, NULL) ) // do read
        {
            if (ibytes == 0) break;   // break on end of file
            /* do something with the data */
        };
        CloseHandle(hFile);
        return 0;
    }

• Error checking adds some more, but still, it's easy.

The Best Case: Temp File, NO IO
• Temp file reads/writes go to the file system cache.
• The program uses a small (in-cpu-cache) buffer.
• So, write/read time is the bus move time (3x better than copy).
• Paradox: the fastest way to move data is to write it, then read it.
• This hardware is limited to 150 MBps per processor.
[Chart: temp file read ~148 MBps, temp file write ~136 MBps, memcpy ~54 MBps.]

Out of the Box Disk File Performance
• One NTFS disk.
• Buffered read.
• NTFS does 64 KB read-ahead
  – if you ask for FILE_FLAG_SEQUENTIAL_SCAN
  – or if it thinks you are sequential.
• NTFS does 64 KB write-behind
  – under the same conditions
  – aggregates many small IOs into a few big IOs.

Synchronous Buffered Read/Write
• Net: default out-of-the-box performance is good.
• Read throughput is GREAT!
• Write throughput is 40% of read.
• 20 ms/MB ~ 2 instructions/byte!
• WCE is fast but dangerous.
• The CPU will saturate at 50 MBps.
[Charts: out-of-the-box throughput (MB/s) and overhead (CPU msec/MB) vs request size (2-192 KB) for read, write, and write + WCE.]

Write Multiples of Cluster Size
• For IOs of less than 4 KB, if OVERWRITING data, the file system reads the 4 KB page, overwrites the bytes, then writes the page back.
• Cuts throughput by 2x - 3x.
• So, write in multiples of the cluster size.
• 2 KB writes are 5x slower than reads, and 2x or 3x slower than 4 KB writes.

What is WCE?
• Write Cache Enable lets the disk controller respond "yes" before the data is on the disk.
• Dangerous:
  – If power fails, WCE can destroy data integrity.
  – Most RAID controllers have non-volatile RAM, which makes WCE safe (invisible) if they do RESET right.
• About 50% of the disks we see have WCE on.
  – You can turn it off with 3rd-party SCSI utilities.
• As seen later: 3-deep request buffering gets similar performance.

Synchronous Un-Buffered Read/Write
• 1/2-power point:
  – Read: 4 KB
  – Write: 64 KB without WCE, 4 KB with WCE
• Reads do well above 2 KB.
• Writes are terrible.
• WCE helps writes.
• Ultra media is 1.5x faster.
[Charts: unbuffered throughput and WCE unbuffered write throughput vs request size for Fast and Ultra drives, with and without WCE.]

Cost of Un-Buffered IO
• Saves the buffered-IO memory copy.
• Buffered: was 20 ms/MB
  – saturates the CPU at 50 MB/s.
• Unbuffered: now 2 ms/MB
  – saturates the CPU at 500 MB/s.
• Cost/request ~ 120 µs (wow).
• Note: unbuffered IO must be sector aligned.
[Charts: CPU ms/MB, CPU ms/request, and CPU utilization vs request size for Fast/Ultra reads and writes; the CPU is idle because non-WCE writes are so slow.]
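The sector-alignment note above matters in practice: with FILE_FLAG_NO_BUFFERING, the buffer address, the file offset, and the request size must all be sector aligned. A minimal sketch (file name and request size are illustrative; VirtualAlloc's page alignment satisfies the sector-alignment requirement):

    #include <windows.h>

    int main(void)
    {
        const DWORD dwREQUEST = 65536;                  /* 64 KB requests */
        char *pBuffer = (char *)VirtualAlloc(NULL, dwREQUEST,
                                             MEM_COMMIT, PAGE_READWRITE);
        DWORD dwBytes;

        HANDLE hFile = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING |
                                  FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (hFile == INVALID_HANDLE_VALUE || pBuffer == NULL)
            return 1;

        /* Each ReadFile goes straight to the disk (no file-cache copy). */
        while (ReadFile(hFile, pBuffer, dwREQUEST, &dwBytes, NULL) && dwBytes > 0)
        {
            /* do something with the data */
        }

        CloseHandle(hFile);
        VirtualFree(pBuffer, 0, MEM_RELEASE);
        return 0;
    }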
Summary
• Out of the box:
  – Read RAP ~ PAP (thanks, NTFS)
  – Write RAP ~ PAP/10 … PAP/2
• Buffering small IO is great!
• Buffering large IO is expensive.
• WCE is a dangerous way out, but frequently used.
• Parallelism tricks:
  – deep requests (async, overlapped)
  – striping (RAID0, RAID5)
  – allocation and other tricks
[Charts: out-of-the-box throughput and overhead vs request size: FS-buffered vs unbuffered reads and writes, with and without WCE.]

Bottleneck Analysis
• Drawn to linear scale:
  – disk R/W ~9 MBps
  – MemCopy ~50 MBps
  – memory read/write ~150 MBps
  – theoretical bus bandwidth 422 MBps = 66 MHz × 64 bits

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary

Kinds of Parallel Execution
[Figure: two forms of parallelism: Pipeline (one sequential step feeds the next) and Partition (inputs split N ways, outputs merge M ways, each partition running the same sequential program).]

Pipeline Requests to One Disk
• Does not help reads much
  – they were already pipelined by the disk controller.
• Helps writes a LOT
  – above 16 KB, 3-deep matches WCE.
• Pipelined (async, overlapped) IO is a BIG win (RAP ~ 85% of PAP).
[Charts: read and write throughput, one Fast disk, vs request size at request depths of 1, 3, and 8 buffers, and with WCE.]

Parallel Access To Data?
• At 10 MB/s, it takes 1.2 days to scan 1 Terabyte.
• 1,000x parallel (10 GB/s): a 100-second SCAN.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.

Pipeline Access: Stripe Across 4 Disks
• An 8-deep pipeline matches WCE.
• Stripes NEED the pipeline.
• 3-deep is good enough.
• Saturates at 15 MBps.
[Charts: read and write throughput for 4-disk stripes vs request size at request depths of 1, 3, and 8 buffers, and with WCE.]
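The deep-request results above come from keeping several overlapped requests outstanding against one handle, not from extra threads. A minimal sketch of a 3-deep read pipeline (depth, request size, and file name are illustrative):

    #include <windows.h>

    #define DEPTH        3
    #define REQUEST_SIZE 65536

    int main(void)
    {
        HANDLE hFile = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED,
                                  NULL);
        char      *apBuf[DEPTH];
        OVERLAPPED aOv[DEPTH];
        DWORD      dwOffset = 0, dwBytes, i;

        if (hFile == INVALID_HANDLE_VALUE) return 1;

        /* Issue the first DEPTH requests to fill the pipeline. */
        for (i = 0; i < DEPTH; i++) {
            apBuf[i] = (char *)VirtualAlloc(NULL, REQUEST_SIZE,
                                            MEM_COMMIT, PAGE_READWRITE);
            ZeroMemory(&aOv[i], sizeof(aOv[i]));
            aOv[i].Offset = dwOffset;                 /* low 32 bits of offset */
            aOv[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aOv[i]);
            dwOffset += REQUEST_SIZE;
        }

        /* As each request completes, consume it and issue the next one. */
        for (i = 0; ; i = (i + 1) % DEPTH) {
            if (!GetOverlappedResult(hFile, &aOv[i], &dwBytes, TRUE) ||
                dwBytes == 0)
                break;                                /* end of file or error */
            /* do something with apBuf[i], then reuse the slot */
            aOv[i].Offset = dwOffset;
            ResetEvent(aOv[i].hEvent);
            ReadFile(hFile, apBuf[i], REQUEST_SIZE, NULL, &aOv[i]);
            dwOffset += REQUEST_SIZE;
        }

        CloseHandle(hFile);
        return 0;
    }

The same structure, with WriteFile in place of ReadFile, is what lets 3-deep asynchronous writes approximate WCE.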
3 Stripes and You're Out!
• 3 disks can saturate the adapter.
• CPU time goes down with request size.
• Similar story with Ultra-Wide.
• FtDisk (striping) is cheap.
[Charts: read and write throughput vs request size for 1-4 disk stripes, 3-deep, Fast SCSI; CPU milliseconds per MB vs request size.]

Parallel SCSI Busses Help 2x
• A second SCSI bus nearly doubles read and WCE-write throughput.
• Write needs deeper buffers.
• The experiment is unbuffered (3-deep + WCE).
[Chart: one vs two SCSI busses: read and write + WCE throughput vs request size.]

File System Buffering & Stripes (Ultra-Wide Drives)
• FS buffering helps small reads.
• FS-buffered writes peak at 12 MBps.
• Write peaks at 20 MBps.
• Read peaks at 30 MBps.
• 3-deep async helps.
[Charts: three disks, 1-deep vs 3-deep: throughput vs request size for FS-buffered and unbuffered reads and writes, with and without WCE.]

PAP vs RAP
• Reads are easy, writes are hard.
• Async writes can match WCE.
[Figure: PAP vs RAP along the path: system bus 422 MBps (measured 142 MBps), PCI 133 MBps (72 MBps), SCSI 40 MBps (31 MBps), disks 10-15 MBps (9 MBps); application data and file system in between.]

Bottleneck Analysis
• NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI bus:
  – ~65 MBps unbuffered read
  – ~43 MBps unbuffered write
  – ~40 MBps buffered read
  – ~35 MBps buffered write
[Figure: each adapter ~30 MBps, PCI read/write ~70 MBps, memory read/write ~150 MBps.]

Hypothetical Bottleneck Analysis
• NTFS read/write with 12 disks, 4 SCSI busses, 2 PCI busses (not measured; we had only one PCI bus available, the 2nd one was "internal"):
  – ~120 MBps unbuffered read
  – ~80 MBps unbuffered write
  – ~40 MBps buffered read
  – ~35 MBps buffered write
[Figure: four adapters at ~30 MBps each, two PCI busses at ~70 MBps each, memory read/write ~150 MBps.]

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary

Stripes, Mirrors, Parity (RAID 0, 1, 5)
• RAID 0: stripes
  – bandwidth
• RAID 1: mirrors, shadows,…
  – fault tolerance
  – reads faster, writes 2x slower
• RAID 5: parity
  – fault tolerance
  – reads faster
  – writes 4x or 6x slower
[Figure: block layouts on three disks: RAID 0 stripes 0,3,6,.. / 1,4,7,.. / 2,5,8,..; RAID 1 keeps two copies of 0,1,2,..; RAID 5 rotates parity: 0,2,P2,.. / 1,P1,4,.. / P0,3,5,..]

Where To Do RAID?
• RAID in the host (= NT)
  – no special hardware
  – FtDisk is responsible for data integrity
  – can stripe across multiple busses/adapters
• RAID in the adapter
  – gets safe WCE if the cache is non-volatile
  – offloads the host
  – not good for WolfPack
• RAID in the disk controller
  – gets safe WCE if the cache is non-volatile
  – offloads the host
  – best data integrity for MSCS

NT Host-Based Striping is OK
• 3 Ultra disks per stripe.
• WCE is enabled in all cases.
• Requests are 3-deep.
[Charts: striping read and write throughput vs request size for controller-based, host-based, and array-based striping.]

Surprise: Good NT RAID5 Performance
• At 8 KB, performance is similar.
• Write performance is bad in all cases.
• This ignores read performance in the case of a disk fault.
• Above 32 KB requests, the CPU write cost is significant.
[Charts: RAID5 throughput and RAID5 CPU milliseconds per MB vs request size, for array-based and host-based reads and writes.]
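The 4x-6x write penalty comes from the parity update: a small RAID5 write must read the old data and the old parity, XOR them with the new data, and write both blocks back. A sketch of the arithmetic (illustrative; not the code path of FtDisk or of any particular controller):

    #include <stddef.h>

    /* RAID5 small-write parity update (illustrative).  One logical write
       becomes:
         1. read the old data block      (1 disk read)
         2. read the old parity block    (1 disk read)
         3. compute the new parity
         4. write the new data block     (1 disk write)
         5. write the new parity block   (1 disk write)
       i.e. four physical IOs per logical write, which is why the slides
       quote writes as 4x or 6x slower.                                   */
    void raid5_new_parity(const unsigned char *old_data,
                          const unsigned char *new_data,
                          const unsigned char *old_parity,
                          unsigned char       *new_parity,
                          size_t               nbytes)
    {
        size_t i;
        for (i = 0; i < nbytes; i++)
            new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
    }

Writing a full stripe at once lets the parity be computed from the new data alone, with no reads, which is one more reason large sequential writes fare so much better than small ones on RAID5.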
Controllers & Adapters are Complex
• Minimum response time ~300 µs.
• Typically ~1 ms for 8 KB.
• Many strange effects (e.g. the Ultra cache is busted).
[Chart: elapsed time (ms) vs request size (0-70 KB): controller cache vs controller prefetch for Narrow, Fast, and Ultra drives.]

Bus Overhead Grows
• Small requests (8 KB) are more than 1/2 overhead.
• 3x more disks means 5x more overhead.
[Chart: SCSI bus utilization, data vs overhead, for 1 disk at 8 KB, 1 disk at 64 KB, 2 disks at 64 KB, and 3 disks at 64 KB.]

Allocate/Extend Suppresses Async Writes
• When you allocate space, NT zeros it (both DRAM and disk).
• This prevents others from reading data you "delete".
• This "kills" pipelined writes.
• Solution: pre-allocate or reuse files whenever you can (see the pre-allocation sketch at the end).
• Do VERY large writes.
[Chart: Allocate/Extend while writing: 4-disk and 1-disk writes, 8-deep with no extend; when extending, 1-deep equals 8-deep.]

Stripe Alignment: Chunk vs Cluster
• A 64 KB read becomes two reads: 4 KB and 60 KB.
• Twice as many physical requests.
• The stripe has a chunk size (64 KB).
• The volume has a cluster size
  – the default is 4 KB (for big disks).
[Chart: alignment, 4-disk (Ultra), 3-deep: aligned vs unaligned reads and writes; figure: a 64 KB request split into 4 KB + 60 KB pieces across a chunk boundary.]

Other Issues
• Multi-processor
• DEC Alpha
• Memory-mapped files
• Fragmentation
• Ultra-2, Merced, FC,…
• NT 5
  – Veritas volume manager
  – 64-bit
  – performance improvements
  – I2O,...

Summary
• Read is easy, write is hard.
  – SCSI & FS read prefetch works.
  – Read RAP ~ 0.8 PAP; write RAP ~ 0.05 PAP to 0.8 PAP.
• NTFS buffering is good for small IOs
  – coalesces them into 64 KB requests.
• Bigger is better: 8 KB is ok, 64 KB is best.
• Deep requests help
  – 3-deep is good, 8-deep is better.
• WCE is fast but dangerous
  – 3-deep writes approximate WCE for > 8 KB requests.
• 3 disks can saturate a SCSI bus, both Fast-Wide (15 MBps) and Ultra-Wide (31 MBps).
• Memory speed is the ultimate limit with multiple disks and multiple PCI busses
  – 50 MBps copy, 150 MBps read/write.
• Avoid FS buffering above 16 KB
  – it costs 20 ms/MB of CPU.
• Preallocate & reuse files when possible
  – avoids Allocate/Extend synchronous IO.
• Software RAID5 performs well, but fault tolerance is a problem
  – writes are expensive in any case.

Pitfalls
• Read-before-write: 2 KB buffered IO.
• Allocate/Extend: synchronous write.
• Zoned disks => 50% speed bump.
• RAID alignment => 20% speed bump.

More Details at
• The web site has
  – the paper
  – sample code
  – the test program we used
  – these slides
• http://research.Microsoft.com/BARC/Sequential_IO/

Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
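For reference, one way to pre-allocate a file, as recommended under Allocate/Extend and Pitfalls, so the timed writes never extend it. A minimal sketch using standard Win32 calls (file name and size are illustrative; on NT 4.0 the 32-bit SetFilePointer used here limits the sketch to files under 2 GB):

    #include <windows.h>

    int main(void)
    {
        const LONG lFILE_SIZE = 100 * 1024 * 1024;          /* 100 MB */
        HANDLE hFile = CreateFile("C:\\output.dat",
                                  GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return 1;

        /* Set the end-of-file marker once, up front ...                  */
        SetFilePointer(hFile, lFILE_SIZE, NULL, FILE_BEGIN);
        SetEndOfFile(hFile);

        /* ... then seek back and do the (possibly overlapped) writes into
           the already-sized file.                                        */
        SetFilePointer(hFile, 0, NULL, FILE_BEGIN);
        /* WriteFile loop goes here */

        CloseHandle(hFile);
        return 0;
    }

Writes should then proceed sequentially from the start of the file; NTFS will still zero the gap if a write lands beyond the highest offset written so far.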