Martin Coetzer, Technical Consultant, Microsoft
Session UNC312

Agenda
• Exchange storage background
• Storage technology 2010+
• Large mailbox value
• E2010 storage architecture
  • Store innovations
  • ESE database innovations
• E2010 storage design
• Summary

Exchange 2003 HA/Storage Design: MSIT 4+3 SCC SAN example
• +1 IOPS/Mailbox
• 4 Active Nodes, 3 Passive Nodes
• 8 Processor cores, 4 GB of RAM
• 4000 Users/Server
• 250 MB Mailboxes
• Backups: Daily Full, Stream to disk/tape
• SAN Fabric A, SAN Fabric B
• RAID10, 3.5” 10K FC Disks
• Storage is single point of failure

Exchange 2007 HA/Storage Design: MSIT CCR + DAS example
• .33 IOPS/Mailbox
• ~4000 Mailboxes/Cluster
• 8 Processor cores, 16 GB of RAM
• 2 GB Mailboxes
• Backups: DPM, 15 min Incremental, Daily Express Full
• Hub Transport Server: File Share Witness, Transport Dumpster
• RAID5, 2.5” 10K SAS Disks
• No single points of failure!
[Diagram: CCR Active Node and Passive Node, each with its own RAID storage, joined by a Public Network and a Private Network; Transaction Log Shipping and Replay]

Disk Technology
• Disk capacity trend predicted to continue
  • 2TB Desktop class SATA disks available
  • 1TB Nearline/Midline SAS disks available
• Sequential throughput increasing linearly based on areal density
  • 2010 SATA = ~250MB/sec
• Random I/O performance not expected to improve substantially
  • 15K RPM is the ceiling

Random vs. Sequential Disk IO
Random IO
• Disk head has to move to process subsequent IO
• Head movement = High IO latency
• Seek Latency limits IOPS
Sequential IO
• Disk head does not move to process subsequent IO
• Stationary Head = Low IO latency
• Disk RPM speed limits IOPS
7.2K SATA Disk (20ms Latency): Random = 50 IOPS, Sequential = +300 IOPS!
IOPS = Inputs/Outputs (IOs) per second

FLASH/SSD: E2010 Scenarios
• NAND Flash best utilized by E2010 when used as a cache within storage stack
[Diagram: E2010 Mailbox Server storage stack with HBA/RAID, SATA SSD, Hybrid HDD and Enterprise SAN Array; flash types shown: PCM, NAND]
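The 7.2K SATA figures on the Random vs. Sequential Disk IO slide follow from simple arithmetic. The sketch below assumes the disk services one IO at a time, so that IOPS is roughly 1000 divided by the per-IO latency in milliseconds; the function names are illustrative and not part of any Exchange tooling.

```python
def iops_from_latency(latency_ms: float) -> float:
    """IOs per second when each IO takes latency_ms and IOs are serviced one at a time."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_from_iops(iops: float) -> float:
    """Inverse of iops_from_latency: average per-IO latency in milliseconds."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return 1000.0 / iops


# Random IO on the 7.2K SATA disk from the slide: 20 ms per IO.
random_iops = iops_from_latency(20.0)

# The slide's sequential figure (+300 IOPS) implies a per-IO latency of
# about 3.3 ms under the same one-IO-at-a-time assumption.
sequential_latency_ms = latency_from_iops(300.0)

if __name__ == "__main__":
    print(f"random: {random_iops:.0f} IOPS")
    print(f"sequential: {sequential_latency_ms:.2f} ms per IO")
```

Under this model the sixfold IOPS difference is entirely a latency difference, which is why the deck focuses on turning random IO into sequential IO rather than on faster spindles.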
E-mail Trends
[Chart: Messages Sent/Received Per User/Day, 2008 to 2012; vertical axis 0 to 250]
• The average corporate user, today, can expect to send and receive about 156 messages a day, and this number is expected to grow to about 233 messages a day by 2012. An increase of 33% over the four-year period. (Radicati, 2008)
• Business users report that they currently spend 19% of their work day, or close to 2 hours/day on email. (Radicati, 2007)

Large Mailbox Value
• Large Mailbox = 1-10GB+
  • “Aggregate Mailbox” = Primary mailbox + Archive Mailbox
  • ~1 Year of mail (minimum)
• Increased knowledge worker productivity
• Reduced mailbox management
• Client Accessibility (Outlook/OWA/Mobile)
• Eliminate/Reduce PST’s
• Eliminate/Reduce 3rd Party Archive

Time       Items      Mailbox Size (MB)
1 Day      200        10
1 Month    4,000      200
1 Year     48,000     2,400
4 Years    192,000    9,600
*Very Heavy Profile = 150 Receive + 50 Send/Day, 50KB, no deletions

Large Mailbox Challenges & Solutions: Client Experience
Challenges
• Outlook 2007 Performance (Cached Mode)
• Outlook 2007 (Online)/OWA Performance
• Items/folder Limitations
• View Creation Performance
• Client Search Performance
Solutions
• Performance Improvements: Office 2007 SP2 (KB953195)
• Updated OST sizing guidance (10GB)
• Utilize the E2010 Archive Mailbox to reduce data cached to OST
• E2010 Store/ESE changes
• E2010 Search Performance Improvements
  • Real-time result views
  • 2x increase in indexing performance

Large Mailbox Challenges & Solutions: Deployment/Operations
Challenges
• Long Backup Times
• Fast Recovery Requirements (RTO)
• High Storage Costs
  • IOPS (efficiently utilizing low performance/high capacity disks)
  • RAID overhead
• Move Mailbox Downtime
• Database Maintenance
  • Online Maintenance Duration (OLD)
  • DB corruption (-1018) pain point
  • DB re-seed performance hit on active copy
Solutions
• Backup off passive copies
• Daily Incremental/Weekly Full backups
• DPM Express Full Backups
• E2010 HA + Hold Policy is your backup
• E2010 HA
• E2010 Store/ESE changes
• E2010 Online Move Mailbox
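The Large Mailbox Value table can be reproduced from its footnote (150 received + 50 sent per day, 50KB per item, no deletions). The sketch below assumes 20 working days per month, 12 months per year and decimal megabytes (1 MB = 1000 KB); those three conventions are inferred from the table's numbers, not stated in the deck.

```python
ITEMS_PER_DAY = 150 + 50   # Very Heavy Profile: receive + send
AVG_ITEM_KB = 50           # average item size from the footnote
DAYS_PER_MONTH = 20        # assumption inferred from 4000 items / 200 per day
MONTHS_PER_YEAR = 12       # assumption inferred from 48,000 / 4,000


def mailbox_growth(days: int) -> tuple[int, int]:
    """Return (item count, mailbox size in MB) after `days` working days with no deletions."""
    items = ITEMS_PER_DAY * days
    size_mb = items * AVG_ITEM_KB // 1000  # decimal MB, as the table appears to use
    return items, size_mb


PERIODS = {
    "1 Day": 1,
    "1 Month": DAYS_PER_MONTH,
    "1 Year": DAYS_PER_MONTH * MONTHS_PER_YEAR,
    "4 Years": 4 * DAYS_PER_MONTH * MONTHS_PER_YEAR,
}

if __name__ == "__main__":
    for label, days in PERIODS.items():
        items, size_mb = mailbox_growth(days)
        print(f"{label:<8} {items:>8,} items {size_mb:>6,} MB")
```

With these assumptions a Very Heavy Profile user passes 2GB within the first year, which is the basis for the "~1 Year of mail (minimum)" sizing target.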
Exchange 2010 Storage Vision
Goal: Large, Fast, Low-cost Mailboxes
• IO Reduction
• Sequential IO
• SATA/Tier 2 Disk Optimization
• Storage Design Flexibility
• RAID’less Storage (JBOD)

IOPS Reductions: Store Schema Changes
Store Schema = The way the Store organizes data in the ESE Database
E2010: One simple theme
Move away from doing many, random, small size, disk IOs to doing fewer, sequential, large size, disk IOs.
Significant Benefits
• Fast/Efficient OWA/Outlook Online Mode
  • End user viewing for “cold” states/first time view creation
  • Calendar Operations
  • Search performance
• Outlook Cached Mode/Exchange Active Sync
  • OST sync = sequential IO
  • EAS sync = sequential IO
• Server Management
  • Move mailbox
  • Content Index Crawls

IOPS Reduction: Store Table Architecture
E2007 (table scopes shown: Per Database, Per Folder; Secondary Indexes used for Views)
Mailbox Table   Folders Table   Message Table (Msg)   Attachments Table   Message/Folder Table (MFT)
Jeff’s Mbx      Jeff:Inbox      Joe:Msg10             Jeff:Excel.xls      Joe:Inbox:H3
Ann’s Mbx       Ann:Drafts      Jeff:Msg32            Ann:Pic.bmp         Joe:Inbox:H2
Joe’s Mbx       Joe:Unread      Ann:Msg180            Joe:Help.doc        Joe:Inbox:H1

E2010 (table scopes shown: Per Database, Per Mailbox, Per View)
Mailbox Table   Folders Table   Message Header Table   Body           View Tables (e.g. From)
Jeff’s Mbx      Joe:Inbox       Joe:H10                Joe:Msg10      Joe:H920
Ann’s Mbx       Joe:Drafts      Joe:H302               Joe:Help.doc   Joe:H302
Joe’s Mbx       Joe:Unread      Joe:H920               Joe:Msg302     Joe:H10

New Store Schema = no more single instance storage within a DB

Store Schema Changes: Physical Contiguity
B+Tree = Table
• E2007 B+ Tree, DB Pages (Page Numbers): 1078, 92, 4577, 6, 872, 7210, 3278, 21, 9346
  Many, small size, IOs (1 per 8K page)
• E2010 B+ Tree, DB Pages (Page Numbers): 1078, 1079, 1080, 1081, 1082, 1083, 3456, 3457, 3458
  Fewer, larger size, sequential IOs

Store Schema Changes: Logical Contiguity
• E2007: Mailbox folders Inbox, Calendar, Drafts, For Follow-up, DL Mail; messages stored in the order M1, M3, M5, M4, M2
  Random: many, small size, IOs
• E2010: DL Mail = M1, Calendar = M2, Drafts = M3, For Follow-up = M4, Inbox = M5
  Sequential: fewer, large size, IOs

Store Schema Changes: Lazy View Updates
• Reducing IO by deferring view updates
• View updates utilize sequential IO
Example view: All Unread or Flagged items
Timeline: M1 arrives, M2 arrives, M1 flagged, M3 arrives, M2 deleted; then the user uses OWA/Outlook Online and switches to this view
• E2007, Nickel & Dime Approach: many, random, IOs (1 per update)
• E2010, Pay to Play Approach: fewer, sequential, IOs (1 per view)

Outlook 2007 SP2 Large Mailbox Performance on E2010

IOPS Reduction: ESE Changes
Optimize for new Store Schema
• Allocate database space in contiguous manner
• Maintain database contiguity over time
• Utilize space efficiently (Database compression)
Increase IO Sizes
• DB page size increased from 8KB to 32KB
• Improved read/write IO coalescing (Gap coalescing)
• Provide improved async read capability (Pre-read)
Increase Cache Effectiveness
• 100MB Checkpoint Depth (HA configurations only)
• DB Cache Compression (aka Dehydration)
• DB Cache Priority (aka Fast Evict)

IOPS Reduction: Space Management
Allocate space based on contiguity
Database Space Allocation Hints:
• Allocate DB space based on either data compactness or data contiguity (usage pattern)
[Diagram: DB Cache and Disk. Space Compactness (Random/Compact): Page 1 to Page 5 holding Used, Event History, Used, Msg Header, Msg Header. Space Contiguity (Sequential/Bloat): Page X, Page Y, Page Z holding Msg Header, Msg Header, Event History]

IOPS Reduction: Maintain Contiguity
New database maintenance architecture

ESE Function: Cleanup (deleted items/mailboxes)
  E2007 SP1: Cleanup performed during Online Defrag (OLD) which occurs during Online Maintenance (OLM) time window
  E2010: Cleanup performed at run time (when hard delete occurs). Happens during Store dumpster cleanup (OLM), pages are zeroed by default.
ESE Function: Space Compaction
  E2007 SP1: Database is compacted and space reclaimed during Online Defrag (OLD)
  E2010: Database is compacted and space reclaimed at run-time. Auto-throttled.
ESE Function: Maintain Contiguity
  E2007 SP1: N/A: Contiguity is compromised by space compaction
  E2010: Database is analyzed for contiguity and space at run time and is defragmented in the background (B+Tree Defrag/OLD2). Auto-throttled.
ESE Function: Database Checksum
  E2007 SP1: When configured, ½ of OLD maintenance window reserved for sequential scan (Checksum), manual throttle. Active DB copy only.
  E2010: Two options (both Active and Passive copies):
    1. Run DB Checksum in the background 24x7 (default). Sequential IO
    2. Run DB Checksum during OLM window. Sequential IO

IOPS Reduction: DB Contiguity Results
[Chart: DB Page Numbers. E2007 Message Folder Table (aka MFT): FRAGMENTED. E2010 Message Header Table (aka MsgHeader): CONTIGUOUS, with random deletes at the tail. Blue = contiguous (good), Red = fragmented (bad). *Production database analysis]

Mitigate DB Space Growth: Database Compression
• Store Schema change, Space Hints, B+Tree Defrag & 32KB page size combine to increase DB file size by 20%.
• Growth is 100% mitigated by Database Compression
• 7bit/XPRESS Compression for message headers and text/html bodies (Long Values)
[Chart: DB File Size Comparison for E2007/RTF, E2010/RTF, E2010/Mix, E2010/HTML; vertical axis 0.00 to 1.50; data labels 1.20, 1.00, 1.00, 0.88]

DB Space Analysis
Counts                 E2007 SP1     E2010
Mailbox Count          750           750
Tables                 14,754        92,435
Secondary Indexes      85,784        4,557
Pages                  28,486,144    5,814,032
Used Pages (%)         85.7%         86.7%
Available Pages (%)    14.3%         13.3%
1 Database, 750 x 250MB mailboxes, RTF = RTF Compressed, Mix = 77% HTML, 15% RTF, 8% Text, Avg. Message size = ~50KB

IOPS Reduction: DB Page Size Increased to 32KB
Reading a ~20KB message:
• E2007 (8 KB pages): Msg Header on Page 1, Msg Body on Page 3 and Page 5; 3 Read IO’s
• E2010 (32 KB pages): Msg Header and Msg Body on Page 1 (32KB); 1 Read IO

IOPS Reduction: IO Gap Coalescing (Read Case)
• E2007 DB Read Behavior: Page 1 (Msg Header), Page 3 (Msg Body) and Page 5 (Msg Body) are read into the DB Cache separately; 3 Read IO’s
• E2010 DB Read Behavior: Page 1 through Page 5 are read from disk together; the unwanted Page 2 and Page 4 go to temp buffers and only Pages 1, 3 and 5 enter the DB Cache; 1 Read IO

IOPS Reduction: 100MB Checkpoint Depth
• Checkpoint Depth = The amount of data that is waiting to be committed to the database file (edb).
• E2010 default Checkpoint Depth Max is increasing from 20MB to 100MB only on databases protected by E2010 HA (standalone still 20MB).
• Deep Checkpoint Benefit = Efficient DB writes (~40% reduction)
[Chart: 100MB Checkpoint Depth = 40% DB write IO reduction. Series: Database Pages Repeatedly Written/sec, DB Writes/sec (avg); horizontal axis: Checkpoint Depth (MB), 20 to 100. Loadgen Test: 3000 Mailbox, 12 DB, Outlook 2007 Online Very Heavy Profile]
• Deep Checkpoint Risks = long store shutdown times, long crash recovery times.
• Risk Mitigation: shut down databases in parallel, failover on store crash

IOPS Reduction: DB Cache Compression
Problem: New Store Schema + 32KB pages can reduce efficiency of cache. E.g. a page with 8KB of data consumes 32KB of memory in the DB Cache.
Solution: Implement DB Cache Compression to shrink partially used cached pages in memory, allowing a more effective cache.
1. 32KB page with only 8KB of data is read off disk
2. 32KB page is compressed to an 8KB in-memory image
• Up to 30% more cache/mailbox server
• More Cache = Less DB IO!

IOPS Reduction: DB Cache Priority
Problem: Background and recovery DB operations can pollute the cache. E.g. DB checksumming, OLD2, HA log replay.
Solution: Implement DB Cache Priority to allow lower cache priorities for background/replay operations.
[Diagram: DB Cache timeline (Past, Now, Future) showing where cache entries for Outlook Message Read, HA Log Replay and DB Maintenance (Passive) are placed relative to cache eviction]
ESE Caching Algorithm = LRU-K (Least Recently Used)

Exchange 2010 Storage Speeds and Feeds

Mailbox IO Characteristics: E2007 vs. E2010
DB IO                     E2007        E2010
IO Type                   Random       “Sequentialish”
Read:Write                1:1          3:2
Avg Read IO Size (KB)     12           52
Avg Write IO Size (KB)    8            60

Log IO                    E2007        E2010
IO Type                   Sequential   Sequential
Read:Write                0:1          0:1
Avg Read IO Size (KB)     n/a          n/a
Avg Write IO Size (KB)    10           10

• DB IO Sizes increase by 5x!!
• Log IO Write Size is the same...
3000 Mailboxes, 12 DB’s, 4MB DB Cache/Mailbox, Loadgen Outlook 2007 Online Very Heavy Profile, 250MB Mailbox Size

IOPS Reduction: E2007 vs. E2010 Results
DB IOPS Comparison: +70% Reduction!
[Chart: DB Read IO/Sec, DB Write IO/Sec and DB IO/Sec for E2007 vs. E2010; vertical axis 0 to 500]
3000 Mailboxes, 3MB DB Cache/user, Loadgen Outlook 2007 Online Very Heavy Profile, 250MB Mailbox Size, E2010 Beta

Exchange IOPS Trend
DB IOPS/Mailbox: +90% Reduction!
[Chart: DB IOPS/Mailbox for Exchange 2003, Exchange 2007 and Exchange 2010; vertical axis 0 to 1]

Optimize for SATA/Tier 2 Disks: DB Write IO “Burstiness”
Problem: Bursty DB writes negatively affect DB read and Log write latency
• The more write IO’s issued at a time, the more disk contention.
[Chart: IO Latency Based on Max DB Write IO’s (ms). Series: DB Read IO, Log Write IO; horizontal axis: Maximum DB Write IO's Issued, 2 to 64; vertical axis: Latency (ms), 0 to 120]
Solution: Throttle DB writes based on Checkpoint target (QoS), DB Write Smoothing
Single 7.2k SATA disk, logs/db on same spindle, Loadgen load generating 250 RPC Operations/second, ~50 IOPS

DB Write Smoothing: Results
E2010 Smooth DB IO Benefit: 50% Reduction!
[Chart: DB Read Latency (ms), Log Write Latency (ms) and RPC Average Latency for Exchange 2010 Baseline vs. Exchange 2010 Smooth DB IO; data labels 49, 34, 10.1, 3.7, 5.1, 0.7]
3000 Mailboxes, 3MB DB Cache/user, 12 x 7.2k SATA disks (DB/Logs on same spindles), Loadgen Outlook 2007 Online Very Heavy Profile

Putting It All Together: Mailboxes/Disk
• E2010 Storage improvements cannot be quantified in IOPS reductions alone
[Chart: Mailboxes/Disk. Exchange 2007 = 125, Exchange 2010 = +500]
250MB Mailbox Size, 3MB DB Cache/user, 12 x 7.2k SATA disks (DB/Logs on same spindles), Loadgen Outlook 2007 Online Very Heavy Profile, measured at <20ms RPC Average latency

JBOD/RAID'less Storage: Now an option!
• JBOD: 1 disk = 1 Database/Log
• Requires E2010 HA (3+ DB Copies)
• Annual Disk Failure Rate (AFR) = ~5%
JBOD Advantages
• Reducing Storage Costs/Complexity
• Eliminates unnecessary DB copies: Server and Storage redundancy can be symmetrical
• Reduces Disk IO: Eliminates RAID write penalty
• Enables Simple Storage Design: 1 disk = 1 database
• Enables Simple Storage Failure Recovery
JBOD Challenges
• Exchange HA/Storage must replace RAID functionality
• Disk Striping performance (e.g. RAID10) cannot be leveraged
• Disk Failure = Database Failover (~30 second outage)
• Re-enabling Resiliency = Spare disk assignment/partitioning/format/DB re-seed (scriptable)
• Soft Disk Errors (bad blocks) must be detected and repaired

JBOD/RAID'less Storage: E2010 Optimizations
• Improve HA storage failure detection and failover
  • HA now detects storage failures and automatically fails over
• Optimize HA Failovers/Switchovers
  • Failovers < 30 seconds
  • ESE tuned to maintain DB cache after failover (Cache warming)
• Improve storage failure detection (bad blocks/corruption)
  • Active/Passive copy background scan (Checksum)
  • Active/Passive copy Lost Write Detection
• Improve Database Seeding/Repair
  • Utilize DB passive copy for seeding source
  • Seed capability for Content Index Catalog
  • Reduce re-seeds by using Single Page Restore (Active and Passive)

JBOD/RAID'less Storage: Single Page Restore (Active)
1. Page corruption detected on Active Copy (e.g. -1018)
2. Active DB places marker in log stream to notify passive copies to ship up to date page
3. Passive receives log and replays up to marker, retrieves good page, invokes Replay Service callback and ships page
4. Active receives good page, writes page to DB. Page is restored.
5. Subsequent page repair from additional copies ignored
[Diagram: Database Availability Group (DAG) with Mailbox Server Node 1 (DB1-Active), Node 2 (DB1-CopyA) and Node 3 (DB1-CopyB); each node holds a Log and a Database containing Page1, Page2, Page3]

E2010 HA Storage Design Flexibility
More options to reduce storage cost
SAN
• HA = Shared Storage Clustering
• +1.0 IOPS/Mailbox
• 3.5” 15K 146GB FC Disks
• RAID10 for DB & Logs
• Dedicated Spindles
• Multi-path (HBA’s, FC Switches, SAN array controllers)
• Backup = Streaming off active
• Fast Recovery = Hardware VSS (Snapshots/Clones)
DAS (SAS)
• HA = CCR
• .33 IOPS/Mailbox
• 2.5” 146GB 10K SAS Disks
• RAID5 for DB
• RAID10 for Logs
• SAS Array Controller (w/ BBU)
• Backup = VSS Snapshot
• Fast Recovery = CCR
DAS (SATA/Tier2)
• HA = DAG (2+ DB copies)
• .11 IOPS/Mailbox
• 3.5” 1TB 7.2K SATA/Tier2 Disks
• RAID10 for DB & Logs
• SAS Array Controller (w/ BBU)
• Backup = VSS Snapshot/Optional
• Fast Recovery = Database Failover
JBOD (SATA/Tier2)
• HA = DAG (3+ DB copies)
• .11 IOPS/Mailbox
• 3.5” 1TB 7.2K SATA/Tier2 Disks
• 1 DB = 1 Disk
• SAS Array Controller (w/ BBU)
• Backup = VSS Snapshot/Optional
• Fast Recovery = Database Failover

E2010 Storage Design Flexibility
• Exchange Online Archive provides mailbox storage flexibility
  • One Mailbox per user or two
• E2010 optimized for DAS storage, SAN storage is fully supported
  • IOPS reductions/SATA optimizations enable lower performing storage
  • E2010 HA architected for DAS (simpler)
• JBOD* and RAID storage support
• E2010 optimized for Tier 2 (SATA) disks, Enterprise disks are fully supported
• SSD storage supported but not recommended for mainstream due to high $/GB
• Storage Groups are gone; Max 100 Databases/Server
• Max recommended DB Size = 2TB*
• Max recommended Folder Item Count = 100K**
*2+ copy E2010 HA only
**Assuming no 3rd party applications

E2010 Storage Requirements
Storage Guidance for Stand Alone, E2010 HA (2 copies) and E2010 HA (3+ copies):
• Storage Type: DAS, SAN (Fibre Channel, iSCSI)
• Disk Type: SAS, Fibre Channel, SATA/Tier2, SSD
• RAID: RAID recommended (Stand Alone and 2 copies); RAID optional (3+ copies)
• RAID Type: RAID-1/0, RAID-5, RAID-6; JBOD where RAID is optional
• DB/Log Isolation: Best Practice (Stand Alone); Not required (E2010 HA)
• Windows Disk Type: Basic (recommended), Dynamic (supported)
• Partition Type: GPT (recommended), MBR (supported)
• Partition Alignment: Windows 2008/R2 Default (1MB)
• File System: NTFS
• NTFS Allocation Unit Size: 64KB for both database and log volumes
• Encryption Support: Outlook Protection Rules, Bitlocker
See Appendix for full details

E2010 HA/JBOD Storage Example
Single Site, 3 Node, 3 Copy DAG
[Diagram: Database Availability Group (DAG) with Mbx Server 1, Mbx Server 2 and Mbx Server 3, each with 8 Cores and 32GB RAM and each holding a copy of DB1 through DB30 plus spare disks. Legend: Active copy, Passive copy, Spare Disk]
• 10,000 Mailboxes
• Heavy Profile: 120 Messages/day, .11 IOPS/Mailbox
• 2GB Mailbox Size
• 3,333 Active Mailboxes/Server
• 3 Nodes, 3 Copies = double disk failure resiliency
• 1TB 7.2k disks (SAS/SATA/Tier2)
• JBOD: 30 Disks/node
• Online Spares
• Battery Backed Caching Array Controller

Key Takeaways
Exchange Server 2010...
• Reduces DB IOPS by +70%...again!
• Optimizes for large mailboxes (+10GB) and 100K Item counts
• Optimizes for large/slow/low-cost disks (SATA/Tier2)
• Makes JBOD/RAID'less storage a viable option
• Enables unmatched storage flexibility to reduce costs

Resources
Tech·Ed Africa 2009 sessions will be made available for download the week after the event from: www.tech-ed.co.za
• www.microsoft.com/teched: International Content & Community
• www.microsoft.com/learning: Microsoft Certification & Training Resources
• http://microsoft.com/technet: Resources for IT Professionals
• http://microsoft.com/msdn: Resources for Developers

Related Content
• Microsoft Exchange Server 2010 Transition and Deployment (UNC310)
• High Availability in Microsoft Exchange Server 2010 (UNC301)
• Unified Messaging in Microsoft Exchange Server 2010 (UNC311)
• Microsoft Exchange Server 2010 Management Tools (UNC309)
• Storage in Microsoft Exchange Server 2010 (UNC312)
• Microsoft Hyper-V: Dos and Don'ts for Microsoft Exchange Server 2007 SP1 and 2010 (VIR308)
• Archiving and Retention in Microsoft Exchange Server 2010 (UNC307)

Call to Action
Learn More!
• Related Content at TechEd on “Related Content” Slide
• Attend in-person or consume post-event at TechEd Online
• Check out online learning/training resources
  • http://technet.microsoft.com/exchange/2010
  • http://technet.microsoft.com/office/ocs
Try It Out!
• Download the Exchange Server 2010 Beta Evaluation: http://www.microsoft.com/exchange/2010/try-it
• Get a 5-Day Trial of Office Communications Server 2007 R2: https://r2.uctrial.com/

10 pairs of MP3 sunglasses to be won
Complete a session evaluation and enter to win!

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

IOPS Reductions: Store Schema Elements
How do you move from random IO to Sequential IO?

Element: Physical Contiguity (ESE)
  E2007: Poor physical contiguity of leaf pages. Hence many, small size, IOs (1 for each page)
  E2010: Excellent physical contiguity of leaf pages. So fewer, large size IOs, spanning N pages (N ≈ 100)
Element: Logical Contiguity (Store)
  E2007: Headers for each folder kept in separate table. So many, small size, IOs spread over many tables
  E2010: Headers for an entire mailbox kept in a single table. Hence fewer, large sized, IOs on a single table
Element: Temporal Contiguity (View)
  E2007: All views and indexes updated each time a mail is delivered. So many, small size, IOs spread over time
  E2010: Views and indexes updated only when they are accessed by user. So fewer, large sized, IOs done together

IOPS Reduction: Maintain Contiguity Over Time
New E2010 behavior…
[Diagram: the Mailbox Messages table in three states: 1. Delivery (Contiguous), 2. Random Delete (Fragmented), 3. Defragmentation (Contiguous)]

IOPS Reduction: Write IO Gap Coalescing
• E2007 DB Write Behavior: DB Cache Pages 1 to 5 are Dirty, Clean, Dirty, Clean, Dirty; writes spaced out over time; 3 Write IO’s
• E2010 DB Write Behavior: DB Cache Pages 1 to 5 are Dirty, Clean, Dirty, Clean, Dirty; 1 Write IO

Big IO: How Big is Too Big?
• IO Latency increases with IO size
• E2010 Max IO Size = 256KB for Read, 384KB for Write
[Chart: Random DB IO Latency Based on Size. Series: Read, Write; vertical axis: IO Latency (ms), 0 to 25; horizontal axis: IO Size (KB), 0 to 1024]
SqlIO Test, 1x 750GB 7.2k SATA, no caching array controller

Optimize for SATA/Tier 2 Disks
Solution: Smooth DB Write IO
Throttle DB writes based on Checkpoint target (QoS)
• When Checkpoint Depth is between 1x and 1.24x of Checkpoint target, limit Max Outstanding DB writes/LUN to 1
• When Checkpoint Depth meets or exceeds 1.25x of Checkpoint target, ratchet up Max Outstanding DB writes/LUN
• The further behind on checkpoint, the more aggressively we raise the Max Outstanding DB writes/LUN (Maximum = 512/LUN)
• Works for everything from JBOD SATA through RAID10 SAN
[Chart: Max Outstanding DB Writes vs. Checkpoint Depth, 20MB Max Checkpoint example. Vertical axis: Max Outstanding DB Writes, 0 to 40; horizontal axis: Log Checkpoint Depth (MB), 26 to 44]

JBOD/RAID'less Storage: Lost Flush Detection
What is a lost flush?
A DB write IO that the disk subsystem/OS returned as completed but that did not actually get written to media, or was written in the wrong location (aka lost write).
Why are they so bad?
Your database may be logically corrupt and you do not know it!
How can they be detected in E2010? Two methods:
1. In Memory Flush Map (Active & Passive): memory overhead of 2 bits/page. Event ID 530 is fired when detected (-1119) and page can be patched.
2. Database Recovery: Event is fired (ID 516: timestamp mismatch, (-567)) and database must be re-seeded.
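The in-memory flush map's cost of 2 bits/page can be turned into a concrete memory figure. The sketch below combines it with two numbers stated elsewhere in the deck, the 32KB page size and the 2TB maximum recommended database size. Treating 2TB as 2 × 1024⁴ bytes is an assumption, and the function name is illustrative, not an ESE API.

```python
KIB = 1024
TIB = 1024 ** 4

PAGE_SIZE_BYTES = 32 * KIB   # E2010 page size from the deck
FLUSH_MAP_BITS_PER_PAGE = 2  # overhead quoted on the Lost Flush Detection slide


def flush_map_bytes(db_size_bytes: int,
                    page_size_bytes: int = PAGE_SIZE_BYTES,
                    bits_per_page: int = FLUSH_MAP_BITS_PER_PAGE) -> int:
    """Memory needed to track every page of a database at bits_per_page bits each."""
    pages = (db_size_bytes + page_size_bytes - 1) // page_size_bytes  # round up to whole pages
    return (pages * bits_per_page + 7) // 8                          # round up to whole bytes


if __name__ == "__main__":
    # 2TB database (assumed binary units): 2**26 pages, 2 bits each.
    overhead = flush_map_bytes(2 * TIB)
    print(f"{overhead / (KIB * KIB):.0f} MiB of flush map for a 2TB database")
```

Under these assumptions the map for a maximum-size database is 16 MiB, which is small next to the database cache on a 32GB mailbox server.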
Exchange 2010 High Availability
Simplified Mailbox High Availability and Disaster Recovery with New Unified Platform
• Evolution of Continuous Replication technology (Database Mobility)
• Easier than traditional clustering to deploy and manage
• Allows each database to have 16 replicated copies
• Provides full redundancy of Exchange roles on as few as two servers
[Diagram: Mailbox Servers in San Jose and New York, each holding copies of DB1 through DB5. Callouts: Recover quickly from disk and database failures; Replicate databases to remote datacenter]

E2010 High Availability Architecture
[Diagram: Client connecting through CAS/HUB; AD site Dallas and AD site San Jose; a Database Availability Group (DAG) of Mailbox Server 1 through Mailbox Server 5 with copies of DB1 through DB5 distributed across them, plus Mailbox Server 6]

JBOD/RAID'less Storage: Single Page Restore (Passive)
1. Page corruption detected on DB Copy (e.g. -1018)
2. Passive copy pauses log replay (log copying continues)
3. Passive retrieves the corrupted page # from the active using DB seeding infrastructure
4. Passive copy waits till log file which meets max required generation requirement is copied/inspected, then patches page
5. Passive resumes log replay
[Diagram: Database Availability Group (DAG) with Mailbox Server Node 1 (DB1-Active), Node 2 (DB1-CopyA) and Node 3 (DB1-CopyB); each node holds a Log and a Database containing Page1, Page2, Page3]

Exchange 2010 Storage Guidance
Preliminary Storage Guidance: Subject to Change!
Configurations compared:
• Stand Alone
• Database Availability Group: 2 nodes, 2 Database copies (2-copy DAG)
• Database Availability Group: 3+ nodes, 3+ Database copies (3+ copy DAG)

Storage Type
• Direct Attached Storage (DAS): Supported (all configurations)
• Storage Area Network (SAN), iSCSI: Supported (all configurations). Best Practice = Do not share physical disks backing Exchange data with other applications.
• Storage Area Network (SAN), Fibre Channel (FC): Supported (all configurations). Best Practice = Do not share physical disks backing Exchange data with other applications. For DAG configurations, Best Practice = Do not place both database copies on the same physical spindles.
• Network Attached Storage (NAS), SMB: Not Supported (all configurations)

Physical Disk Type
• SATA: Supported, requires battery backed caching array controller for data integrity (all configurations)
• SAS: Supported (all configurations)
• FC: Supported (all configurations)
• SSD (Flash Disk): Supported (all configurations)
• Physical Disk Write Caching (enabled): Not Supported (all configurations)

Storage RAID
• Stand Alone and 2-copy DAG: RAID recommended
• 3+ copy DAG: RAID optional
• EDB Volume: RAID5/6, RAID10, RAID1 (Stand Alone, 2-copy DAG); JBOD, RAID5/6, RAID10, RAID1 (3+ copy DAG)
• Log Volume: RAID1, RAID10 (Stand Alone, 2-copy DAG); JBOD, RAID1, RAID10 (3+ copy DAG)
• Disk Array RAID Stripe Size (KB): 256KB (all configurations)
• Storage Array Cache Settings: 75% Write Cache, 25% Read Cache (with Battery Backed Cache) (all configurations)

Database/Log file placement
• Database/Log Isolation
  • Stand Alone: Best Practice (for recoverability) = separate database file (.edb) and logs from same Database on to different volumes backed by different physical disks
  • 2-copy DAG: Database file (.edb) and logs from same Database can share same volume and same physical disk.
  • 3+ copy DAG: Database file (.edb) and logs from same Database can share same volume and same physical disk. This is a best practice for JBOD/RAID'less storage scenario where one or more volumes store the edb and log files backed by the same physical disk.
• Database Files/Volume
  • Stand Alone and 2-copy DAG: Based on backup methodology
  • 3+ copy DAG: RAID = based on backup methodology, JBOD = one DB file/volume is recommended
• Log Streams/Volume
  • Stand Alone and 2-copy DAG: Based on backup methodology
  • 3+ copy DAG: RAID = based on backup methodology, JBOD = one log stream/volume is recommended

Windows Disk Type (all configurations)
• Basic Disk: Recommended
• Dynamic Disk: Supported

Partition Type (all configurations)
• GUID Partition Table (GPT): Recommended
• Master Boot Record (MBR): Supported

Volume and file system settings (all configurations)
• Partition Alignment: Windows 2008 Default: 1MB
• Volume Path: Drive Letter or Mount Point (mount point host volume must be RAIDed)
• File System: NTFS support only
• NTFS Defragmentation: Not required, not recommended
• NTFS Allocation Unit Size: 64KB for both edb and log volumes
• NTFS Compression: Not Supported for Exchange Database files
• NTFS Encrypted File System (EFS): Not Supported for Exchange Database files
• Windows Bitlocker (volume encryption): Supported for all Exchange database and log files
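The alignment guidance above can be sanity-checked with a few lines of arithmetic: a partition offset is aligned when it is a whole multiple of both the NTFS allocation unit and the RAID stripe size. The sketch below uses the 1MB offset, 64KB allocation unit and 256KB stripe from the guidance table. The 63-sector legacy offset is included only as a contrasting example and is not taken from the deck; the function name is illustrative.

```python
KIB = 1024

PARTITION_OFFSET = 1024 * KIB    # Windows 2008 default alignment (1MB)
NTFS_ALLOCATION_UNIT = 64 * KIB  # recommended for edb and log volumes
RAID_STRIPE_SIZE = 256 * KIB     # recommended disk array stripe size

# Assumption for contrast only (not from the deck): older Windows versions
# commonly started the first partition at sector 63 with 512-byte sectors.
LEGACY_OFFSET = 63 * 512


def is_aligned(offset_bytes: int, unit_bytes: int) -> bool:
    """True when the partition offset falls on a boundary of the given unit."""
    return offset_bytes % unit_bytes == 0


if __name__ == "__main__":
    for name, offset in (("1MB default", PARTITION_OFFSET), ("legacy 63-sector", LEGACY_OFFSET)):
        print(name,
              "allocation unit aligned:", is_aligned(offset, NTFS_ALLOCATION_UNIT),
              "stripe aligned:", is_aligned(offset, RAID_STRIPE_SIZE))
```

Because 1MB is a multiple of both 64KB and 256KB, a 64KB cluster never straddles a stripe boundary, so one Exchange IO does not turn into two disk IOs.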