CyberBricks: The Future of Database and Storage Engines

Jim Gray
http://research.Microsoft.com/~Gray

Outline
• What storage things are coming from Microsoft?
• TerraServer: a 1 TB DB on the Web
• Storage metrics: Kaps, Maps, Gaps, Scans
• The future of storage: ActiveDisks

New Storage Software from Microsoft
• SQL Server 7.0:
  » Simplicity: auto-most-things
  » Scalability from Win95 to the enterprise
  » Data warehousing: built-in OLAP, VLDB
• NT 5:
  » Better volume management (from Veritas)
  » HSM architecture
  » IntelliMirror
  » Active Directory for transparency

Thin Client Support: TSO Comes to NT
• Lower per-client cost
• Huge centralized data stores
[Diagram: a "Hydra" server serving Net PCs, existing desktop PCs, MS-DOS, UNIX, and Mac clients, and dedicated Windows terminals]

Windows NT 5.0 IntelliMirror™
• Files and settings mirrored on client and server
• Great for mobile users
• Facilitates roaming
• Easy to replace PCs
• Optimizes network performance
• Means HUGE data stores

Outline
• What storage things are coming from Microsoft?
• TerraServer: a 1 TB DB on the Web
• Storage metrics: Kaps, Maps, Gaps, Scans
• The future of storage: ActiveDisks

Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be:
  » 1 TB
  » Unencumbered
  » Interesting to everyone everywhere
  » And not offensive to anyone anywhere
• Loaded:
  » 1.5 M place names from the Encarta World Atlas
  » 3 M sq km from USGS (1-meter resolution)
  » 1 M sq km from the Russian Space Agency (2 m)
• On the web (world's largest atlas)
• Sell images with a commerce server

Microsoft TerraServer Background
• The Earth is 500 tera-square-meters (tsm)
  » The USA is 10 tsm
  » 100 tsm of land lies between 70°N and 70°S
• We have pictures of 6% of it:
  » 3 tsm from USGS
  » 2 tsm from the Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks; store the chunks in the DB (sketched below)
• Image pyramid: 1.8×1.2 km² tile, 10×15 km² thumbnail, 20×30 km² browse image, 40×60 km² jump image
• Navigate with:
  » Encarta™ Atlas (globe, gazetteer)
  » StreetsPlus™ in the USA
• Someday: a multi-spectral image of everywhere, once a day / hour

USGS Digital Ortho Quads (DOQ)
• US Geological Survey "DOQ" imagery: 1×1 meter, 4 TB, continental US
• Most data not yet published
• Based on a CRADA: Microsoft TerraServer makes the data available
• New data coming

Russian Space Agency (SovInformSputnik) SPIN-2
(Aerial Images is the worldwide distributor)
• 1.5-meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km); more data coming (1 m)
• Selling imagery on the Internet
• Putting 2 tsm onto Microsoft TerraServer
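As a back-of-envelope check on the tiling arithmetic above, here is a minimal sketch in Python. The 1.5 TB and 10 KB figures come from the slides; treating the database as pure tile rows (ignoring index and metadata overhead) is my simplification.

```python
# Back-of-envelope sketch of the TerraServer tiling arithmetic.
# Constants are from the slides; the row-count math is just division.

TB, KB = 10**12, 10**3          # decimal units, as the slides use

compressed_imagery = 1.5 * TB   # 5:1 JPEG-compressed coverage
tile_size = 10 * KB             # each chunk is one row in SQL Server

tiles = compressed_imagery / tile_size
print(f"~{tiles:,.0f} tiles")   # ~150,000,000 ten-kilobyte rows

# Moving that through the 80 MBps backup path quoted on a later slide
# would take 1.5e12 / 80e6 s, i.e. roughly 5.2 hours of pure transfer.
```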
Putting 2 tsm onto Microsoft TerraServer: Demo
http://www.TerraServer.Microsoft.com/
• Navigate by coverage map to the White House; download the image; buy imagery from USGS
• Navigate by name to Venice; buy a SPIN-2 image and a Kodak photo; pop out to the Expedia street map of Venice
• Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)

The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 × 400 MHz Alpha CPUs
• 10 GB DRAM
• 324 9.2 GB StorageWorks disks
  » 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (~14 TB)
• Windows NT 4 EE, SQL Server 7.0

Software
[Architecture diagram: web clients (HTML browsers and a Java viewer) reach the TerraServer web site over the Internet; Internet Information Server 4.0 with Active Server Pages and Microsoft Site Server EE front the SQL Server 7 TerraServer DB (stored procedures via MTS), the Microsoft Automap ActiveX server, an image-delivery server, and the image provider site(s)]

System Management & Maintenance
• Backup and recovery:
  » STK 9710 tape robot
  » Legato NetWorker™
  » SQL Server 7 backup & restore
  » Clocked at 80 MBps peak (~200 GB/hr)
• SQL Server Enterprise Manager:
  » DBA maintenance
  » SQL Performance Monitor

Microsoft TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume (E:, F:, G:, H:)
• Build 30 20 GB files on each volume
• The DB is a file group of 120 files

Image Delivery and Load
• Incremental load of 4 more TB in the next 18 months
• Load steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place
[Diagram: DLT tapes arrive ("tar"); an AlphaServer 4100 runs ImgCutter into \Drop'N\Images; a LoadMgr DB on a second AlphaServer 4100 (60 4.3 GB drives) schedules DoJob/Wait-4-Load/Backup work over a 100 Mbit Ethernet switch into the AlphaServer 8400, its Enterprise Storage Array (3 × 108 9.1 GB drives), and the STK DLT tape library]

Some Tera-Byte Databases
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of the planet each week): 15 PB by 2007
• Federal clearing house, images of checks: 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)

[Scale-ladder slide, Kilo through Yotta: a letter (kilo), a novel (mega), a movie (giga), Library of Congress (text), LoC (sound + cinema), LoC (image), all photos, all disks, all tapes, all information! (yotta)]

Michael Lesk's Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search will be key enabling technologies

Outline
• What storage things are coming from Microsoft?
• TerraServer: a 1 TB DB on the Web
• Storage metrics: Kaps, Maps, Gaps, Scans
• The future of storage: ActiveDisks

Storage Latency: How Far Away Is the Data?
(latency in clock ticks, scaled so a register reference is "in my head", one minute away; see the sketch below)
• Registers: 1 — my head (1 min)
• On-chip cache: 2 — this room
• On-board cache: 10 — this campus (10 min)
• Memory: 100 — Sacramento (1.5 hr)
• Disk: 10⁶ — Pluto (2 years)
• Tape / optical robot: 10⁹ — Andromeda (2,000 years)
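The scaling in that analogy is easy to reproduce. A minimal sketch: the tick counts are the slide's; normalizing one tick to one minute (and the pretty-printing) is the analogy's.

```python
# Stretch time so a 1-tick register reference takes one minute,
# then express every level of the hierarchy in human units.

levels = [                          # (name, latency in clock ticks)
    ("registers", 1),               # "my head"
    ("on-chip cache", 2),           # "this room"
    ("on-board cache", 10),         # "this campus"
    ("memory", 100),                # "Sacramento"
    ("disk", 10**6),                # "Pluto"
    ("tape/optical robot", 10**9),  # "Andromeda"
]

MINUTES_PER_YEAR = 60 * 24 * 365

for name, ticks in levels:
    minutes = ticks                 # 1 tick == 1 minute at this scale
    if minutes < 60:
        print(f"{name:20s} {minutes:8.0f} min")
    elif minutes < 60 * 24:
        print(f"{name:20s} {minutes/60:8.1f} hours")
    else:
        print(f"{name:20s} {minutes/MINUTES_PER_YEAR:8.1f} years")
# disk -> ~1.9 years, robot -> ~1,900 years: hence "Pluto" and "Andromeda".
```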
MetaMessage: Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate, then nothing really changes
• Things getting MUCH BETTER:
  » Communication speed & cost: 1,000x
  » Processor speed & cost: 100x
  » Storage size & cost: 100x
• Things staying about the same:
  » Speed of light (more or less constant)
  » People (10x more expensive)
  » Storage speed (only 10x better)

Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: "Size vs Speed" and "Price vs Speed" for the hierarchy — cache, main memory, secondary (disk), online tape, nearline tape, offline tape — typical system capacities from 10⁴ to 10¹⁵ bytes and prices from 10⁻⁴ to 10² $/MB, plotted against access times from 10⁻⁹ to 10³ seconds]

Storage Ratios Changed in the Last 20 Years
• Media price: 4000x; bandwidth: 10x; accesses/second: 10x
• DRAM:disk $/MB ratio: 100:1 → 25:1
• Tape:disk $/GB ratio: 100:1 → 5:1
[Charts, 1980-2000: storage price (MB per kilo-dollar) vs time; disk accesses/second and bandwidth (MB/s) vs time; disk capacity (GB) vs time]

Disk Access Time
• Access time = SeekTime + RotateTime + ReadTime
  » Seek: 6 ms, improving ~5%/year
  » Rotate: 3 ms, improving ~5%/year
  » Read (transfer): 1 ms, improving ~25%/year
• Other useful facts:
  » Power rises faster than size³ (so small is indeed beautiful)
  » Small devices are more rugged
  » Small devices can use plastics (forces are much smaller); e.g., bugs fall without breaking anything

Standard Storage Metrics
• Capacity and price:
  » RAM: MB and $/MB — today 100 MB at 1 $/MB
  » Disk: GB and $/GB — today 10 GB at 50 $/GB
  » Tape: TB and $/TB — today 0.1 TB at 10 $/GB (nearline)
• Access time (latency):
  » RAM: 100 ns
  » Disk: 10 ms
  » Tape: 30-second pick, 30-second position
• Transfer rate:
  » RAM: 1 GB/s
  » Disk: 5 MB/s — arrays can go to 1 GB/s
  » Tape: 3 MB/s — not clear that striping works

New Storage Metrics: Kaps, Maps, Gaps, Scans
(see the sketch after this section)
• Kaps: kilobyte objects served per second — the file server / transaction processing metric
• Maps: megabyte objects served per second — the Mosaic metric
• Gaps: gigabyte objects served per hour — the video & EOSDIS metric
• Scans: scans of all the data per day — the data mining and utility metric

How to Get Lots of Maps, Gaps, Scans
• Parallelism: use many little devices in parallel
  » At 10 MB/s, one device takes 1.2 days to scan 1 terabyte
  » 1,000-way parallel: 100 seconds per terabyte scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel

Tape & Optical: Beware of the Media Myth
• Optical is cheap: 200 $/platter, 2 GB/platter → 100 $/GB (5x cheaper than disk)
• Tape is cheap: 100 $/tape, 40 GB/tape → 2.5 $/GB (100x cheaper than disk)

Tape & Optical Reality: Media Is 10% of System Cost
• Tape needs a robot (10 k$ … 3 M$) holding 10 … 1,000 tapes (at 40 GB each) → 20 $/GB … 200 $/GB (1x…10x cheaper than disk)
• Optical needs a robot (50 k$); 100 platters = 200 GB (TODAY) → 250 $/GB (more expensive than disk)
• Robots have poor access times
  » Not good for the Library of Congress (25 TB)
  » Data motel: data checks in, but it never checks out!
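Here is a hedged sketch of the Kaps/Maps/Gaps/Scans metrics defined above, computed for the "today" disk of the standard-metrics slide (10 ms access, 5 MB/s transfer, 10 GB capacity). The serve-one-access-plus-transfer model is my simplification.

```python
# Kaps, Maps, Gaps, Scans for a single disk (slide numbers).

KB, MB, GB = 10**3, 10**6, 10**9

def serve_time(obj_bytes, access_s, bw_Bps):
    """Seconds to serve one object: one access plus the transfer."""
    return access_s + obj_bytes / bw_Bps

access, bw, capacity = 10e-3, 5 * MB, 10 * GB

kaps  = 1 / serve_time(KB, access, bw)      # KB objects per second
maps  = 1 / serve_time(MB, access, bw)      # MB objects per second
gaps  = 3600 / serve_time(GB, access, bw)   # GB objects per hour
scans = 86400 / (capacity / bw)             # full scans per day

print(f"Kaps={kaps:.0f}/s  Maps={maps:.1f}/s  Gaps={gaps:.0f}/h  Scans={scans:.0f}/day")
# ~98 Kaps, ~4.8 Maps, ~18 Gaps, ~43 scans of 10 GB per day.
# A 1 TB store at 5 MB/s needs ~2.3 days per scan --
# hence the parallelism argument on the next slide.
```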
The Access Time Myth
• The myth: seek or pick time dominates
• The reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short
• Implication: many cheap servers beat one fast, expensive server
  » Shorter queues
  » Parallel transfer
  » Lower cost/access and cost/byte
• This is obvious for disk & tape arrays
[Figure: a request's time budget — wait, seek, rotate, transfer]

My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos
• 10 k$ robot: 10 tapes, 400 GB, 6 MB/s, 25 $/GB; 30 Maps, 15 Gaps; scan in 12 hours
• 100 robots: 1 M$, 40 TB, 25 $/GB; 3K Maps, 1.5K Gaps, 2 scans per day
• Many independent tape robots (like a disk farm)

The Metrics: Disk and Tape Farms Win
[Bar chart comparing a 1000x disk farm, an STK tape robot (6,000 tapes, 8 readers), and a 100x DLT tape farm on GB/k$, Kaps, Maps, and Scans/day; the disk farm dominates everything but GB/k$. Data motel: data checks in, but it never checks out]

Cost Per Access (3-Year)
[Bar chart of the same three configurations — 1000x disk farm, STK tape robot (6,000 tapes, 16 readers), 100x DLT tape farm — on Kaps/$, Maps/$, Gaps/$, and Scans/k$; e.g., 540,000 vs 100 vs 120 Kaps/$ and 67,000 vs 23 vs 100 Maps/$: the disk farm wins accesses-per-dollar by orders of magnitude]

Storage Ratios' Impact on Software
• Gone from 512 B pages to 8192 B pages (will go to 64 KB pages in 2006)
• Treat disks as tape:
  » Increased use of sequential access
  » Use disks for backup copies
• Use tape for:
  » VERY COLD data
  » Offsite archive
  » Data interchange

Summary
• Storage accesses are the bottleneck
• Accesses are getting larger (Maps, Gaps, Scans)
• Capacity and cost are improving
• BUT latencies and bandwidth are not improving much
• SO use parallel access (disk and tape farms) and sequential access (scans)

The Memory Hierarchy: Measuring & Modeling Sequential IO
• Where is the bottleneck?
• How does it scale with SMP, RAID, and new interconnects?
• Goals: balanced bottlenecks, low overhead, scale to many processors (10s) and many disks (100s)
[Diagram: application address space and file cache in memory, over the memory bus, PCI, adapter, SCSI, controller, and disks]

PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram: application data through file-system buffers to disk — system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk 10-15 MBps; a naive path measures ~7.2 MB/s at every step]

The Best Case: Temp File, No IO
• Temp-file read/write hits the file system cache
• The program uses a small (in-CPU-cache) buffer
• So write/read time is bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it, then read it
[Chart: temp read ~148 MBps, temp write ~136 MBps; this hardware is limited to ~150 MBps per processor]

Memcopy() Bottleneck Analysis
• Drawn to linear scale:
  » Disk R/W: ~9 MBps
  » MemCopy: ~50 MBps
  » Memory read/write: ~150 MBps
  » Theoretical bus bandwidth: 422 MBps = 66 MHz × 64 bits

PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE (write cache enable)
• Advertised vs realized: system bus 422 → 142 MBps, PCI 133 → 72 MBps, SCSI 40 → 31 MBps, disks 10-15 → 9 MBps, file system 10-15 MBps

Bottleneck Analysis
(a toy model of this analysis is sketched below)
• NTFS read/write with 9 disks, 2 SCSI buses, 1 PCI bus:
  » ~65 MBps unbuffered read
  » ~43 MBps unbuffered write
  » ~40 MBps buffered read
  » ~35 MBps buffered write
• Stage rates: adapters ~30 MBps each, PCI ~70 MBps, memory read/write ~150 MBps

Peak Throughput on Intel/NT
• NTFS read/write with 24 disks, 4 SCSI buses, 2 PCI buses (64-bit):
  » ~190 MBps unbuffered read
  » ~95 MBps unbuffered write
• So: ~0.8 TB/hr read, ~0.4 TB/hr write on a 25 k$ server
[Diagram: two 64-bit PCI buses, adapters at ~30 MBps each, memory read/write ~150 MBps]
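A minimal sketch of the bottleneck analysis above: model end-to-end sequential throughput as the minimum over the pipeline stages. The per-stage rates are the measured (RAP) figures from these slides; the min() model is my assumption.

```python
# Slowest-stage model of sequential IO throughput.

def pipeline_throughput(stages):
    """Each stage: (name, count, MBps_each). The slowest stage wins."""
    return min((count * mbps, name) for name, count, mbps in stages)

config = [                  # slide: 9 disks, 2 SCSI buses, 1 PCI bus
    ("disk",  9,  9),       # ~9 MBps measured per spindle
    ("SCSI",  2, 31),       # ~31 MBps realized of 40 MBps advertised
    ("PCI",   1, 72),       # ~72 MBps realized of 133 MBps advertised
]

mbps, limiter = pipeline_throughput(config)
print(f"~{mbps} MBps, limited by {limiter}")   # ~62 MBps, limited by SCSI
# Close to the ~65 MBps unbuffered read the slide reports.
```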
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
• How much can you sort for a penny? (budget sketched below)
  » Hardware and software cost, depreciated over 3 years
  » A 1 M$ system gets about 1 second; a 1 K$ system gets about 1,000 seconds
  » Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is 100-byte records (random data); the key is the first 10 bytes
• Must create the output file and fill it with a sorted version of the input file
• Daytona (product) and Indy (special) categories

PennySort
• Hardware (1107 $ machine):
  » 266 MHz Intel PPro
  » 64 MB SDRAM (10 ns)
  » Dual Fujitsu 3.2 GB EIDE disks (DMA)
• Software:
  » NT Workstation 4.3
  » NT 5 sort
• Performance:
  » Sorted 15 M 100-byte records (~1.5 GB), disk to disk
  » Elapsed time 820 sec; CPU time 404 sec
[Pie chart of the 1107 $ machine's cost: CPU 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%]

Cluster Sort: Conceptual Model
• Multiple data sources, multiple data destinations
• Multiple nodes
• Disks → sockets → disk → disk
[Diagram: three nodes each scatter their local A/B/C records over the network so all the A's, B's, and C's land sorted on their destination nodes]
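The budget rule above falls straight out of 3-year depreciation: a penny buys 1/100 of a dollar's share of the machine's 94,608,000-second life. A small sketch (the helper name is mine):

```python
# Seconds of machine time one penny buys, 3-year depreciation.

THREE_YEARS_S = 3 * 365 * 24 * 3600        # 94,608,000 seconds

def penny_seconds(system_price_dollars):
    return THREE_YEARS_S / (system_price_dollars * 100)

print(penny_seconds(1_000_000))   # ~0.95 s: "1 M$ system gets ~1 second"
print(penny_seconds(1_000))       # ~946 s:  "1 K$ system gets ~1,000 seconds"
print(penny_seconds(1_107))       # ~855 s budget for the 1107$ PennySort box,
                                  # which sorted its 15 M records in 820 s.
```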
Outline
• What storage things are coming from Microsoft?
• TerraServer: a 1 TB DB on the Web
• Storage metrics: Kaps, Maps, Gaps, Scans
• The future of storage: ActiveDisks

Crazy Disk Ideas
• Disk farm on a card: surface-mount disks
• Disk (magnetic store) on a chip: micro-machines in silicon
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM)

Remember Your Roots

Year 2002 Disks
• Big disk (10 $/GB): 3", 100 GB, 150 kaps (k accesses per second), 20 MBps sequential
• Small disk (20 $/GB): 3", 4 GB, 100 kaps, 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)

The Disk Farm on a Card
• The 1 TB disk card: an array of disks on a 14" card
• Can be used as 100 disks, 1 striped disk, 10 fault-tolerant disks, etc.
• LOTS of accesses/second and bandwidth
• Life is cheap, it's the accessories that cost ya: processors are cheap, it's the peripherals that cost ya (a 10 k$ disk card)

Put Everything in Future (Disk) Controllers
• It's not "if", it's "when?"
• Acknowledgements: Dave Patterson explained this to me a year ago; Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments

Technology Drivers: Disks
• Disks are on track for 100x in 10 years
  » A 2 TB 3.5" drive; shrink it to 1" and it is 200 GB
• Disk replaces tape?
• The disk is a supercomputer!

Data Gravity: Processing Moves to Transducers
• Processing moves to the data sources & sinks
• Move to where the power (and sheet metal) is
• Processors in:
  » Modems
  » Displays
  » Microphones (speech recognition) & cameras (vision)
  » Storage: data storage and analysis

It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer, you get:
  » Several network interfaces
  » A PostScript engine: CPU, memory, software, (soon) a spooler
  » And… a print engine

Functionally Specialized Cards
• Storage, network, and display cards: a P-mips processor and M MB of DRAM on an ASIC
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB

Basic Argument for x-Disks
• The future disk controller is a supercomputer:
  » 1 bips processor
  » 128 MB DRAM
  » 100 GB of disk, plus one arm
• Connects to the SAN via high-level protocols:
  » RPC, HTTP, DCOM, Kerberos, directory services, …
  » Commands are RPCs: management, security, …
  » Services file/web/db/… requests
  » Managed by a general-purpose OS with a good development environment
• Apps in the disk save data movement (sketched below)
  » Need a programming environment in the controller

The Slippery Slope
• If you add function to the server, then you add more function to the server
• Function gravitates to data
• Nothing = sector server … Everything = app server

Why Not a Sector Server? (Let's Get Physical!)
• Good idea; that's what we have today
• But:
  » A cache was added for performance
  » Sector remapping was added for fault tolerance
  » Error reporting and diagnostics were added
  » SCSI commands (reserve, …) keep growing
  » Sharing is problematic (space management, security, …)
• Slipping down the slope to a block server

Why Not a 1-D Block Server? (Put a Little on the Disk Server)
• Tried and true design: HSC (VAX cluster), EMC, IBM Sysplex (3980?)
• But look inside:
  » It has a cache, space management, and error reporting & management
  » RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  » Locking, remote replication, an OS
  » Security is problematic
  » The low-level interface moves too many bytes

Why Not a 2-D Block Server? (Put a Little More on the Disk Server)
• Tried and true design: Cedar → NFS
  » File server, cache, space management, …
  » Open file means many fewer messages
• Grows to have:
  » Directories + naming
  » Authentication + access control
  » RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  » Locking
  » Backup/restore/admin
  » Cooperative caching with the client
• File servers are a BIG hit: NetWare™; SNAP! is my favorite today

Why Not a File Server? (Put a Little More on the Disk Server)
• Tried and true design: Auspex, NetApp, NetWare, …
• Yes, but look at NetWare:
  » The file interface gives you an app-invocation interface
  » It became an app server: mail, DB, web, …
  » NetWare had a primitive OS: hard to program, so it optimized the wrong thing

Why Not Everything? (Allow Everything on the Disk Server: Thin Clients)
• Tried and true design: mainframes, minis, web servers, …
• Encapsulates data, minimizes data moves, scaleable
• It is where everyone ends up
• All the arguments against are short-term

Disk = Node
• Has magnetic storage (100 GB?)
• Has processor & DRAM
• Has SAN attachment
• Has an execution environment
[Stack diagram: applications, services, DBMS, file system, RPC, SAN driver, disk driver, OS kernel]

Technology Drivers: System on a Chip
• Integrate processing with memory on one chip
  » The chip is 75% memory now
  » 1 MB of cache >> 1960s supercomputers
  » A 256 Mb memory chip is 32 MB!
  » IRAM, CRAM, PIM, … projects abound
• Integrate networking with processing on the chip
  » The system bus is a kind of network
  » ATM, FibreChannel, Ethernet, … put logic on the chip
  » Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip

Technology Drivers: What If Networking Were as Cheap as Disk IO?
• TCP/IP: Unix/NT burns 100% of a CPU at 40 MBps
• Disk: Unix/NT burns 8% of a CPU at 40 MBps
• Why the difference? For TCP/IP the host does the packetizing, checksums, and flow control with small buffers; for disk the host bus adapter does the SCSI packetizing, checksums, and flow control, with DMA

Technology Drivers: The Promise of SAN/VIA — 10x in 2 Years
http://www.ViArch.org/
• Today:
  » Wires are 10 MBps (100 Mbps Ethernet)
  » ~20 MBps of TCP/IP saturates 2 CPUs
  » Round-trip latency is ~300 µs
• In the lab:
  » Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
  » Fast user-level communication: TCP/IP at ~100 MBps using 10% of each processor; round-trip latency ~15 µs
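A minimal sketch of the data-movement argument for x-Disks: push a selective scan into the disk controllers instead of shipping raw blocks to one host. The selectivity, farm size, and link rates below are illustrative assumptions, not slide numbers.

```python
# Ship-everything vs filter-at-the-disk, back of the envelope.

GB, TB, MB = 10**9, 10**12, 10**6

data        = 1 * TB       # table spread over the disk farm
selectivity = 0.01         # fraction of bytes the query actually needs
san_bw      = 100 * MB     # host's SAN link (lab-grade, per the VIA slide)
disks       = 100          # controllers, each owning a 10 GB slice
disk_bw     = 10 * MB      # per-disk sequential scan rate

# Conventional: ship everything to the host, which filters.
ship_all = data / san_bw                      # 10,000 s of wire time

# Active disk: every controller scans locally, ships only matches.
local_scan = (data / disks) / disk_bw         # 1,000 s, all in parallel
ship_hits  = (data * selectivity) / san_bw    # 100 s of wire time

print(f"ship-everything: {ship_all:,.0f} s on the wire")
print(f"active disks:    {local_scan + ship_hits:,.0f} s (scan + hits)")
```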
SAN: Standard Interconnect
• Gbps Ethernet: 110 MBps
• PCI: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps
• The LAN is faster than the memory bus?
• 1 GBps links in the lab
• 100 $ port cost soon
• The port is a computer

Technology Drivers: Plug & Play Software
• RPC is standardizing: DCOM, IIOP, HTTP
  » Gives huge TOOL LEVERAGE
  » Solves the hard problems for you: naming, security, directory service, operations, …
• Commoditized programming environments:
  » FreeBSD, Linux, Solaris, … + tools
  » NetWare + tools
  » WinCE, WinNT, … + tools
  » JavaOS + tools
• Apps gravitate to data
• A general-purpose OS on the controller runs the apps

Basic Argument for x-Disks
• The future disk controller is a supercomputer:
  » 1 bips processor
  » 128 MB DRAM
  » 100 GB of disk, plus one arm
• Connects to the SAN via high-level protocols:
  » RPC, HTTP, DCOM, Kerberos, directory services, …
  » Commands are RPCs: management, security, …
  » Services file/web/db/… requests
  » Managed by a general-purpose OS with a good development environment
• Move apps to the disk to save data movement
  » Need a programming environment in the controller

Outline
• What storage things are coming from Microsoft?
• TerraServer: a 1 TB DB on the Web
• Storage metrics: Kaps, Maps, Gaps, Scans
• The future of storage: ActiveDisks
• Papers and talks at http://research.Microsoft.com/~Gray