Best Practices of Exchange Server Preventative Maintenance
Brett Johnson (Brettjo@microsoft.com)
GTSC (UK) Exchange Escalation Engineer
Microsoft Services UK

Agenda
• What is Preventative Maintenance?
• Configuring Exchange
• Preparing for Exchange
• Maintaining Exchange
• Checklists
• Q&A

What is preventative maintenance?
• Prevention IS better than cure
• More than 60% of problems are caused by people or process, not technology
• The reliability of any system depends on the reliability of its components
• Problems are the exception, not the rule

What is preventative maintenance?
Around 90% of Exchange administrators never attempt maintenance until disaster hits. Why?
• Low understanding of the issues and problems
• No time, resources, or budget to address maintenance tasks
• The impact of "testing" on their production servers
• They assume the risk of doing nothing outweighs the risk of proactive maintenance
• They are technically capable of performing maintenance tasks, but realise that maintenance can be boring and take a long time

Configuring Microsoft Exchange

Configuring Exchange: Hardware
• Consistent Hardware Configurations
• Select High-Quality Hardware
• Updated Firmware and Drivers
• Hardware Compatibility List
Memory
• Use ECC Memory
• Use Ample Memory (Max 4 GB)

Configuring Exchange: Disk
• Correct Disk Caching Configuration
• Hardware RAID vs. Software RAID
• Disk Volume Configuration

Recommended Disk Layouts
• C: SYSTEM DRIVE - RAID 1 (mirror). 2 partitions: 1 for system files, 1 for the page file and MTA data (X.400)
• E: SMTP QUEUES - RAID 0+1 (stripe/mirror). 1 partition: SMTP mail queues
• F: TRANSACTION LOGS - RAID 0+1 (stripe/mirror). 1 partition: Exchange transaction logs
• G: DATABASE FILES - RAID 0+1 (stripe/mirror) or RAID 5. 1 partition: Exchange EDB files
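The recommended layout can also be captured as data so that every new build is checked the same way, in the spirit of "Consistent Hardware Configurations". The Python sketch below is hypothetical: the role names, the (drive, RAID level) tuples and the check_layout function are illustrative assumptions, not part of any Exchange tool. It encodes the RAID levels listed on the slide and flags transaction logs and database files placed on the same volume (the slide keeps them apart, on F: and G:).

```python
"""Hypothetical sketch: check a planned Exchange volume layout against the
recommended disk layout. Roles, drive letters and RAID levels are illustrative
values typed in by hand, not read from any real server."""

# Allowed RAID levels per role, as listed on the "Recommended Disk Layouts" slide.
RECOMMENDED_RAID = {
    "system": {"RAID 1"},
    "smtp_queues": {"RAID 0+1"},
    "transaction_logs": {"RAID 0+1"},
    "database": {"RAID 0+1", "RAID 5"},
}


def check_layout(layout):
    """Return a list of human-readable problems found in `layout`.

    `layout` maps a role name to a (drive_letter, raid_level) tuple.
    """
    problems = []
    for role, allowed in RECOMMENDED_RAID.items():
        if role not in layout:
            problems.append(f"{role}: no volume assigned")
            continue
        drive, raid = layout[role]
        if raid not in allowed:
            problems.append(f"{role} on {drive}: {raid} is not one of {sorted(allowed)}")
    # Transaction logs and database files should live on separate volumes.
    if "transaction_logs" in layout and "database" in layout:
        if layout["transaction_logs"][0] == layout["database"][0]:
            problems.append("transaction logs and database files share the same volume")
    return problems
```

For example, the slide's own layout ({"system": ("C:", "RAID 1"), "smtp_queues": ("E:", "RAID 0+1"), "transaction_logs": ("F:", "RAID 0+1"), "database": ("G:", "RAID 5")}) returns an empty list, while moving the logs onto G: with RAID 5 reports two problems.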
Configuring Exchange: Disk (Cont'd)
• Ensure Sufficient Disk Space
• Exchange Transaction Log Volumes
• Plan for Storage and I/O
• Avoid Disk Compression

Configuring Exchange: Windows Server
• Consistent Software Versions
• Set Maximum Log Size to 16 MB and Overwrite Events as Needed
• Configure Dr. Watson as the Default Debugger
• Set Recovery Options
• Make Sure There Is DC Resilience
• If running Windows Server 2003 with more than 1 GB of RAM, use the /3GB switch

Configuring Exchange: Microsoft Exchange Configuration
• Disable Circular Logging
• Set IS Maintenance Window (Staggered)
• Set Maximum Mailbox Quotas
• Assign Permissions to Groups, Not Users
• Have a Solid Naming Convention
• Use the Administrative Notes Field!

Microsoft Exchange Servers: Clustering
Two core configurations:
• EDC: 7-node (A/A/A/A/P/p/p)
• RDC: 5-node (A/A/A/P/p)
Mount points:
• Logs: mount point on the SG data volume
• SMTP: mount point on the SG1 data volume
• Backup: mount point on the backup drive

Best Practices: Exchange VS1
• E: SG1 DATA, with mount points E:\MP (SMTP) and E:\MP (SG1 LOGS)
• F: SG2 DATA, with mount point F:\MP (SG2 LOGS)
• G: SG3 DATA, with mount point G:\MP (SG3 LOGS)
• H: SG4 DATA, with mount point H:\MP (SG4 LOGS)
• More than 50% fewer drive letters
• Standardized naming: resources, nodes, EVS, disks, IP, backup sets

Microsoft Site Consolidation: EX2K3 Topology (Goal)
[Map: SAN-connected regional datacenters]
Topology data (before):
• 9 regional datacenters
• 72 physical sites
• 200 AD servers
• 215 Exchange servers
• 120 mailbox servers
• 100 MB mailbox size
Topology data (after):
• 7 regional datacenters
• 17 physical sites
• < 140 AD servers
• < 100 Exchange servers
• 31 mailbox servers
• 200 MB mailbox size
Do more with less:
• 2 fewer regional datacenters (22% less)
• 175 fewer servers (42% less)
• 55 fewer physical sites (76% less)
[Charts: Average Users Per Server (Mbx), Max Users Per Server (Mbx), Locations/Sites and Mailbox Servers for Exchange 5.5, Exchange 2000, Exchange 2003 and 2004]

Exchange Server Pre-Consolidation Measurement
[Map: mailboxes per site/RDC. TVP (2568), London (438), Stockholm (683), Madrid (465), Amsterdam (792), Dublin, Helsinki (142), Lisbon (210), Oslo (203), Copenhagen (232); 6000 users]

Post-Consolidation Measurement

Preparing For Microsoft Exchange

Preparing For Exchange: Server Configuration Log
This includes:
• Firmware and BIOS revisions
• Installed software and version information
• Service packs
• Hotfixes
• Symbols
• Hardware
• Services
• Network configuration
• Repair and recovery information
These records are useful in several ways:
• They enforce consistency and help determine which servers require upgrades
• They give good information to Microsoft support engineers

Preparing For Exchange
• Create an Operations Log
• Create Test Accounts
• Acquire a Production Test Server

Preparing For Exchange: Maintenance
• Plan Maintenance Windows Upfront (Clustering)
• Plan Patch and Hotfix Processes
Backup
• Develop a Backup/Restore Plan
• Standardize Tape Backup Formats
Recovery and Troubleshooting Planning
• Backup Contingency Plan
• Plan for Oversize Stores
• Fix/Resolve All Hardware Problems Immediately
• Build a Spare Parts Inventory at Your Site
• Perform Periodic Server Recovery Drills

Maintaining Microsoft Exchange… The Good Stuff

Maintaining Exchange: Backup and Restore
• Daily full online backups of the Information and Directory Stores, even if you are using VSS
• Online backups check the integrity of the Exchange stores by performing checksum verification
• Full online backups manage the size of the transaction log volume by purging transaction logs at the conclusion of the backup
• Users can continue to access mailboxes and public folders during an online backup
• Verify Backups
• Perform Recovery Drills
• Monitor Tape Drives for Maintenance

Maintaining Exchange: Daily Tasks
• Check Event Logs and Act on Them
• Check Backup Logs
• Check Perfmon Counters
• Implement MOM to Assist!
• Check Disk Space
• Check Badmail and Queues
• Check for Updates
• Test Mail Flow
• Back Up Server

Maintaining Exchange: Monitoring
• Baseline your current system; this lets you see whether issues are outside the norm
• Use Perfmon: Database, MSExchangeIS / Mailbox, Memory, Physical Disk, Process, Processor
• Consider Implementing Management Tools

What To Watch: Database (Information Store)
Database\Log Record Stalls/sec
• Indicates the number of log records per second that cannot be added to the log buffers because the log buffers are full. NOTE: The default msExchESEParamLogBuffers value is 84 for Exchange 2000 and 500 for Exchange 2003.
• Expected values: the average should be below 10 per second; spikes (maximum values) should not be higher than 100 per second.
Database\Log Threads Waiting
• Indicates the number of threads waiting to complete an update of the database by writing their data to the log. If this number is too high, the log may be a bottleneck.
• Expected values: the average should be below 10.

What To Watch: MSExchangeIS
MSExchangeIS\RPC Requests
• Indicates the number of MAPI RPC requests currently being serviced by the Microsoft Exchange Information Store service. The service can handle only 100 RPC requests simultaneously (the default maximum, unless configured otherwise) before it starts rejecting client requests.
• Expected values: should be below 30 at all times.
MSExchangeIS\RPC Averaged Latency
• Indicates the RPC latency in milliseconds, averaged over the past 1024 packets. This is usually in the 10-20 ms range on healthy servers.
• Expected values: should be below 50 ms at all times.
MSExchangeIS\RPC Operations/sec
• Indicates how many RPC operations per second are being requested of the Exchange store, and how many it is actually responding to. RPC Operations/sec should rise and fall in step with RPC Requests.
• Expected values: N/A.
MSExchangeIS\Virus Scan Queue Length
• The current number of outstanding requests queued for virus scanning.
• Expected values: N/A.

What To Watch: MSExchangeIS Mailbox
Active Client Logons
• The number of clients that performed any action within the last ten-minute interval.
• Expected values: a baseline is required; this depends on the number of users.

What To Watch: Paging File
Paging File\% Usage
• Indicates the percentage of the paging file used during the sample interval. A high value indicates that you may need to increase the size of Pagefile.sys or add more RAM.
• Expected values: should remain below 50%.

What To Watch: Memory
Memory\Available MBytes
• Indicates the amount of physical memory (in MB) immediately available for allocation to a process or for system use. This equals the sum of the memory on the standby (cached), free, and zero page lists.
• Expected values: at least 50 MB of memory should be available at all times.
Memory\Pages/sec
• Indicates the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It includes pages retrieved to satisfy faults in the file system cache, which are usually requested by applications.
• Expected values: should be below 1,000 at all times.
Memory\Pool Nonpaged Bytes
• Indicates the number of bytes in the kernel memory nonpaged pool, an area of system memory (physical memory used by the operating system) for kernel objects that cannot be written to disk and must remain in physical memory for as long as they are allocated.
• Expected values: no more than 100 MB of nonpaged pool memory should be in use.
Memory\Pool Paged Bytes
• Indicates the number of bytes in the kernel memory paged pool, an area of system memory for kernel objects that can be written to disk when they are not being used.
• Expected values: unless a backup or restore is taking place, no more than 180 MB of paged pool memory should be in use.

What PSS Uses Time and Time Again

What To Watch (SANs): PhysicalDisk
PhysicalDisk\Avg. Disk sec/Read
• Indicates the average time (in seconds) to read data from the disk. The counter reports seconds, so 20 ms appears as 0.020.
• DATABASE DRIVE: the average value should be below 20 ms; spikes (maximum values) should not be higher than 100 ms.
• LOG DRIVE: the average value should be below 5 ms; spikes should not be higher than 50 ms.
• SMTP DRIVE: the average value should be below 10 ms; spikes should not be higher than 50 ms.

What To Watch (SANs): PhysicalDisk
PhysicalDisk\Avg. Disk sec/Write
• Indicates the average time (in seconds) to write data to the disk.
• DATABASE DRIVE: the average value should be below 20 ms; spikes (maximum values) should not be higher than 100 ms.
• LOG DRIVE: the average value should be below 10 ms; spikes should not be higher than 50 ms.
• SMTP DRIVE: the average value should be below 10 ms; spikes should not be higher than 50 ms.

Inside of Microsoft: Storage I/O Design
[Chart: read (R) and write (W) latency in ms]
• Read and write IOPS > 6000
• Read latency < 10-15 ms
• Write latency < 2-6 ms
• 1.5 IOPS/user at peak
• 3:1 read:write ratio
• Focus on data LUNs
• Use Jetstress to validate
• Design based on Monday peaks!
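The disk latency thresholds and the 1.5 IOPS/user and 3:1 read:write planning figures lend themselves to a small script. The Python sketch below is hypothetical: it assumes you have already collected counter samples (for example by exporting Perfmon data) as lists of values in seconds, and the function and table names are illustrative, not part of any Microsoft tool. Jetstress remains the tool for validating a real storage design.

```python
"""Hypothetical helpers for the PhysicalDisk "What To Watch" thresholds and the
storage I/O planning figures. Assumption: latency samples are plain lists of
floats in seconds, as the Avg. Disk sec/Read and sec/Write counters report."""

# (average limit, spike limit) in milliseconds, per direction and drive role.
DISK_LATENCY_MS = {
    ("read", "database"): (20, 100),
    ("read", "log"): (5, 50),
    ("read", "smtp"): (10, 50),
    ("write", "database"): (20, 100),
    ("write", "log"): (10, 50),
    ("write", "smtp"): (10, 50),
}


def check_disk_latency(direction, drive_role, samples_sec):
    """Return a list of threshold violations for one drive's latency samples."""
    avg_limit, spike_limit = DISK_LATENCY_MS[(direction, drive_role)]
    samples_ms = [s * 1000 for s in samples_sec]  # convert seconds to ms
    average = sum(samples_ms) / len(samples_ms)
    peak = max(samples_ms)
    problems = []
    if average >= avg_limit:  # average "should be below" the limit
        problems.append(f"average {average:.1f} ms >= {avg_limit} ms")
    if peak > spike_limit:  # spikes "should not be higher than" the limit
        problems.append(f"spike {peak:.1f} ms > {spike_limit} ms")
    return problems


def peak_iops(users, iops_per_user=1.5, read_write_ratio=(3, 1)):
    """Estimate peak (total, read, write) IOPS from per-user and ratio figures."""
    total = users * iops_per_user
    reads, writes = read_write_ratio
    read_iops = total * reads / (reads + writes)
    return total, read_iops, total - read_iops
```

For example, peak_iops(4000) returns (6000.0, 4500.0, 1500.0): 4,000 users at 1.5 IOPS each produce 6,000 peak IOPS, split 4,500 reads and 1,500 writes at 3:1.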
Storage I/O Metrics

Maintaining Exchange
What you just saw:
• Troubleshooting Exchange 2003 Performance: http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/perfscalguide.mspx
• Exchange Technical Documentation Library: http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/default.mspx

Maintaining Exchange: Weekly Tasks
• Compare Server Against Baseline Config
• Verify Backed-Up Data with a Restore
Monthly Tasks (On Restored Data)
• ESEutil File Dump
• ESEutil Integrity Check
• ISInteg All Tests, Default Mode
Ad-Hoc Tasks
• ESEutil Defrag: every 12 months or after a large data move
• Full Disaster Recovery Test

Maintaining Exchange: ESEUtil
For maintenance we are interested in:
• Defragmentation (/d)
• Integrity (/g)
• File Dump (/m)
• Copy File (/y)
ESEutil is a powerful tool and must be used correctly. For these maintenance procedures, we recommend running it only against restored data, not production servers, unless you are working with PSS.

Maintaining Exchange: Defragmentation – Why?
• You have deleted a large number of mailboxes (high staff turnover, migration)
• You had to run a hard repair of the database (ESEUtil /p, which we do NOT recommend except as a last resort)
• You are experiencing a specific issue and have found a reference saying that an offline defrag will fix it
As a general rule, defrag to reclaim space only if you will reclaim more than 30% of the space. After the nightly online defrag, look for Event ID 1221 to get a conservative estimate of how much free space is in the database.

Maintaining Exchange: Defragmentation – How?
The basic command line to defragment a database is:
ESEUTIL /D database.edb
To use this simple version of the command:
• There must be free disk space on the local logical drive equal to 110% of the database size, for the temporary defragmentation database.
• The streaming database must be in the same folder as the .EDB file.
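The two rules of thumb above (reclaim more than 30%, and have 110% free space) can be checked before an offline defrag is scheduled. The Python sketch below is hypothetical: the function names are illustrative, the Event ID 1221 free-space figure is assumed to be typed in by hand, and "database size" is assumed to mean the combined .EDB and .STM size, since both files are rebuilt unless /I is used.

```python
"""Hypothetical pre-flight check before an offline defrag (ESEUTIL /D).

Assumptions: the free-space figure (in MB) comes from Event ID 1221 after the
nightly online defrag, and the 110% rule is applied to the combined size of
the .EDB and (optional) .STM files. Paths are placeholders."""
import os
import shutil

RECLAIM_THRESHOLD = 0.30  # only defrag if more than 30% would be reclaimed


def worth_defragmenting(db_size_mb, event_1221_free_mb):
    """True if the free space reported by Event ID 1221 exceeds 30% of the database."""
    return event_1221_free_mb / db_size_mb > RECLAIM_THRESHOLD


def required_temp_space_bytes(edb_path, stm_path=None):
    """Bytes of free space needed for the temporary database: 110%, rounded up."""
    size = os.path.getsize(edb_path)
    if stm_path is not None:
        size += os.path.getsize(stm_path)
    return (size * 11 + 9) // 10


def has_enough_space(edb_path, stm_path=None, temp_dir=None):
    """Compare the requirement with free space where the temp files will be written.

    temp_dir defaults to the folder holding the .EDB file (the simple /D case);
    pass another location if you plan to use /T and /F.
    """
    target = temp_dir or os.path.dirname(os.path.abspath(edb_path))
    free = shutil.disk_usage(target).free
    return free >= required_temp_space_bytes(edb_path, stm_path)
```

For example, a 10,000 MB store for which Event ID 1221 reports 4,000 MB free passes the 30% test (40%), while one reporting 2,000 MB free (20%) does not.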
Maintaining Exchange
Streaming database is in a different path:
ESEUTIL /D priv1.edb /Sd:\streaming\priv1.stm
Insufficient local drive space for defragmentation:
ESEUTIL /D priv1.edb /T\\Server2\d$\defrag.edb /F\\Server2\d$\defrag.stm
Automatically back up the original EDB + STM files (add this switch to the command line):
/B\\Server2\d$\priv1.edb
Skip defragmentation of the streaming database:
ESEUTIL /D priv1.edb /I

Maintaining Exchange: ESE Integrity Check (/g) – Why?
• A "dry run" of the repair function: problems that repair would address are reported in the <database>.integ.raw file
• The .raw file logs results for all tables in the database, not just those with problems
• It may abort prematurely if the damage is such that parts of the database must be repaired before other parts can be checked
• If it aborts before finishing, that does not necessarily mean repair is unlikely to succeed

Maintaining Exchange: Integrity Check (/g) – How?
The basic command line syntax to run an integrity check with ESEUtil is:
ESEUTIL /G database_filename.edb
For example:
ESEUTIL /G priv1.edb
To use this simple version of the command:
• There must be free disk space equivalent to 20% of the combined size of the .EDB and .STM files.
• The streaming database must be in the same folder as the .EDB file.

Maintaining Exchange: File Dump – Why?
• View header information for database, streaming database, checkpoint and transaction log files
• View header information for individual database pages
• Validate that a series of transaction log files forms a matched set and that all files are undamaged
• View space allocation inside the database and streaming database files
• View metadata for all tables or for a specific table in the database file

Maintaining Exchange: File Dump – How?
To view the header of a database, streaming database file or online backup patch file:
ESEUTIL /MH {filename.edb | filename.stm | filename.pat}
To view the header of a checkpoint file:
ESEUTIL /MK filename.chk
To view the header of a transaction log file:
ESEUTIL /ML filename.log

Defrag, Integrity, File Dump

Maintaining Exchange: ISINTEG
ISINTEG focuses on the logical database, whereas ESEutil focuses on the physical database.
ISINTEG has two major modes:
• Default mode: the tool runs the tests you specify and reports its findings.
• Fix mode: you specify optional switches instructing ISINTEG to run the specified tests and attempt to fix whatever it can.
For maintenance work we use DEFAULT mode.

Maintaining Exchange: ISINTEG – How?
ISINTEG -S server_name -L logfile_path_and_name -TEST alltests
You can also specify individual tests, separated by commas, for example:
-TEST folder,message,Msgref
As a general rule, perform all tests in a single ISINTEG command. Unless you are addressing a specific, limited problem in the database, running "alltests" is typically the most effective course.

Maintaining Exchange: ISINTEG – Notes
• ISINTEG can be run against remote servers by specifying a remote server name, but it is more efficient to run it at the server console.
• The Information Store service must be running for ISINTEG to work, but the database to be checked must be dismounted.
• You cannot run ISINTEG on raw database files or backups; you must run it on a server with Exchange installed and the Information Store running.
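The monthly tasks on restored data (ESEutil file dump, ESEutil integrity check, ISINTEG all tests in default mode) can be assembled into one repeatable plan. The Python sketch below is hypothetical: monthly_plan and integrity_check_space_bytes are illustrative names, and the server, database and log names are placeholders. It only builds argument lists using the switches shown on the slides; it never adds -fix, so ISINTEG stays in default (report-only) mode.

```python
"""Hypothetical helper that assembles the monthly maintenance commands for a
database restored to a test server. It builds argument lists only and does
not run anything. Server, database and log names are placeholders."""


def monthly_plan(edb, server, isinteg_log, stm=None, tests=("alltests",)):
    """Return the ESEUTIL and ISINTEG command lines, in the order to run them."""
    plan = [["ESEUTIL", "/MH", edb]]  # file dump: database header
    if stm is not None:
        plan.append(["ESEUTIL", "/MH", stm])  # file dump: streaming file header
    plan.append(["ESEUTIL", "/G", edb])  # integrity check (dry run of repair)
    # ISINTEG in default mode (no -fix): logical checks, report only.
    # Requires the Information Store running and the database dismounted.
    plan.append(["ISINTEG", "-S", server, "-L", isinteg_log, "-TEST", ",".join(tests)])
    return plan


def integrity_check_space_bytes(edb_size, stm_size=0):
    """Free space needed by ESEUTIL /G: 20% of .EDB + .STM size, rounded up."""
    return (edb_size + stm_size + 4) // 5
```

Each list is in the form accepted by Python's subprocess.run, so on a restored test server it could be executed with subprocess.run(cmd, check=True) once the database has been dismounted.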
Checklists

Q&A

Attend a free chat or webcast:
• http://www.microsoft.com/communities/chats/default.mspx
• http://www.microsoft.com/usa/webcasts/default.asp
List of newsgroups:
• http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx
MS Community Sites:
• http://www.microsoft.com/communities/default.mspx
Locate Local User Groups:
• http://www.microsoft.com/communities/usergroups/default.mspx
Community sites:
• http://www.microsoft.com/communities/related/default.mspx