Achieving Excellence in Designing and Maintaining SQL Server Transactional Replication Environments

How Microsoft IT improved stability and enhanced performance of transactional replication environments using a combination of best practices and design guidance, proactive monitoring, and good troubleshooting documentation.

Technical White Paper
Published: June 2012

The following content may no longer reflect Microsoft's current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.

Azhar Paul Taj

CONTENTS

Executive Summary
Introduction
  Situation
Solution
  Creating Best Practices and Design Guidance for Transactional Replication Environments
  Building a Transactional Replication Latency Monitoring Tool
  Developing a Transactional Replication Troubleshooting Guide
Best Practices and Design Guidance
  1. Implement the Right Reinitialization Process
  2. Save Replication Scripts
  3. Know Your Replication Workload
  4. See More of What Replication Is Doing
  5. Consider Using a Dedicated Distribution SQL Server Instance
  6. Break Up Large Transactions into Smaller Ones
  7. Open Multiple Channels of Communication to the Subscriber
  8. Evaluate Stored Procedure Execution Replication
  9. Replication Agent Jobs Standards
  10. Understand the Not for Replication Property
  11. Don't Always Push
  12. Published Database Log File Considerations
  13. Size the Distribution Database Appropriately
  14. Enable Replication Latency Monitoring
  15. Check Your Hardware
Benefits
Conclusion
For More Information
Appendix A: Working With and Troubleshooting Transactional Replication Agent Issues
  Working with Snapshot Agent Issues
  Working with Log Reader Agent Issues
  Working with Distribution Agent Issues

Situation

The SQL Operations team within Microsoft IT was facing several challenges with SQL Server transactional replication. Replication would break without warning, many applications had no replication monitoring, and the team had to work in a reactive mode when replication failed. The team needed to create design guidance for applications using transactional replication so that applications would be developed with replication stability and performance in mind. The team also needed to develop a monitoring solution so that it would receive proactive alerts in case of replication issues.
Solution

The SQL Operations team defined and built a Microsoft IT standard that encompassed a consistent set of best practices and design guidance to be used for all transactional replication environments at Microsoft. They also built a transactional replication latency monitoring tool to proactively monitor replication latency and alert them before an issue caused business impact. Finally, they developed a comprehensive troubleshooting guide (TSG) to help their database administrators consistently and easily resolve known replication issues.

Benefits

The overall stability and performance of replication systems improved greatly. Less downtime allowed users to be more productive. A consistent process for replication monitoring improved service delivery. Application owners were better able to measure their SLAs.

EXECUTIVE SUMMARY

Replication is a set of technologies for copying and distributing data and database objects from one database to another, and then synchronizing the data between the databases to maintain consistency. Transactional replication is employed in server-to-server scenarios that generally involve high throughput. Typical usage examples include reporting and data warehousing, improving scalability and availability, integrating data from multiple sites, integrating heterogeneous data, and offloading batch processing.

The SQL Operations team within Microsoft IT defined and developed a consistent set of guidelines and best practices for transactional replication environments. These guidelines and best practices have greatly helped in increasing stability and simplifying management while improving performance of replication environments at Microsoft. Furthermore, proactive monitoring and the creation of comprehensive troubleshooting guides have improved support efficacy for the business applications that the team is responsible for managing.

The purpose of this technical white paper is to share Microsoft knowledge, experiences, and best practices related to Microsoft® SQL Server® transactional replication. This paper is not intended to serve as a procedural guide. Each enterprise environment has unique characteristics and circumstances; therefore, each organization should adapt the plans and lessons learned, as described in this paper, to meet its specific needs. This paper assumes that readers are IT pros and technical decision-makers already familiar with SQL Server. Specifically, readers should possess a working knowledge of SQL Server transactional replication.

Note: In this paper, the word replication refers to Microsoft SQL Server transactional replication unless explicitly stated otherwise. To review some of the basic concepts of SQL Server replication, see:
SQL Server Replication at http://msdn.microsoft.com/en-us/library/ms151198.aspx
Replication Agents Overview at http://technet.microsoft.com/en-us/library/ms152501.aspx
Detailed product information is available in the Microsoft SQL Server 2008 TechCenter at http://technet.microsoft.com/en-us/sqlserver/default.aspx.

Note: For security reasons, the sample names of forests, domains, internal resources, organizations, and internally developed security file names used in this paper do not represent real resource names used within Microsoft and are for illustration purposes only.
Products & Technologies

SQL Server 2012
SQL Server 2008 R2
SQL Server 2008
SQL Server 2005
System Center 2012 Operations Manager

INTRODUCTION

SQL Operations, a team within Microsoft IT, is charged with providing Microsoft® SQL Server® support to the entire Microsoft IT organization. The team's goal is to improve application availability and performance by maintaining standardized SQL Server configurations as well as unified incident and problem management processes. Their core responsibilities include development of SQL Server standards and best practices for Microsoft IT, maintenance of SQL Server health and welfare, and incident response.

Situation

In late 2009, Microsoft IT was experiencing challenges in its transactional replication environments. Replication agents, which are implemented as SQL Server Agent jobs, were frequently running into errors and performing poorly. There was limited to no replication monitoring for many applications, and support teams had to work in a reactive mode in response to replication failures. In many cases, replication issues would manifest themselves or be discovered at such an advanced stage that reinitialization would be required for resolution. Reinitialization is a lengthy process during which the subscription database is not available, and unavailability of subscription databases can be costly. For example, if a subscription database is being used for reporting needs, reinitialization will cause those business reports to be down for the entire duration of the reinitialization process.

The SQL Operations team knew that they needed to address the challenges they were facing in stabilizing their replication environments from a few different angles. First, they needed well-designed applications from a SQL Server replication configuration standpoint. Second, they needed to be able to support those applications with proactive monitoring, so that when issues began to develop, they would be alerted and able to resolve them before the issues progressed to an outage. Finally, detailed documentation was needed to provide all the teams receiving replication alerts the information required to resolve the problems.

SOLUTION

To enhance the stability and performance of replication environments at Microsoft, the SQL Operations team developed a three-pronged approach. The activities included:

Creation and publication of design guidance and best practices to help developers design transactional replication for their applications, employing strategies that promote robustness and performance.
Development of a monitoring solution that could track replication latency and provide proactive alerts to incident teams before a complete replication system breakdown.
Development of a detailed and comprehensive transactional replication troubleshooting guide (TSG) that would help database administrators (DBAs) consistently and conveniently resolve known replication issues.

Creating Best Practices and Design Guidance for Transactional Replication Environments

Business application developers design applications for Microsoft IT. They may choose to use SQL Server replication for reporting, availability, or other data-duplication business needs. These application servers are then onboarded with the SQL Operations team, which then monitors the health and performance of those servers.
When there are issues, the team engages with application owners to refine their configurations, reducing the possibility of the same problems happening again. These engagements between the SQL Operations team and the application owners have, over time, helped the team better understand how replication behaves in different scenarios and how things can be designed and configured to mitigate the issues seen most frequently.

The SQL Operations team created a consolidated view of the information they accumulated from their experiences in supporting replication environments, their interactions with application developers, product information that was dispersed across multiple sources, and consultations with technical support engineers who provide product support to Microsoft customers. This consolidated view of development guidance and best practices is presented in detail in the section "Best Practices and Design Guidance" later in this document.

Building a Transactional Replication Latency Monitoring Tool

SQL Server Replication Monitor, a graphical tool that comes with the product, can be used to view transactional replication latency. In addition, transactional replication provides the tracer token feature, which offers a convenient way to measure latency in transactional replication topologies and to validate the connections among the publisher, distributor, and subscriber with the help of system-supplied stored procedures. A token (a small amount of data) is written to the transaction log of the publication database, marked as though it were a typical replicated transaction, and then sent through the system. This process allows a calculation of the amount of time that elapses between:

A transaction being committed at the publisher and the corresponding command being inserted in the distribution database at the distributor.
A command being inserted in the distribution database and the corresponding transaction being committed at a subscriber.

From these calculations, it can be determined:

Which subscribers are taking the longest time to receive a change from the publisher.
Which of the subscribers expected to receive the tracer token have not received it.

The SQL Operations team developed a transactional replication latency monitoring tool using Transact-SQL (built-in stored procedures and custom code) for maximum design flexibility. This tool uses self-discovery to determine the number of transactional replication publications and subscriptions in an environment. It allows DBAs to specify a custom replication latency threshold for each transactional replication subscription in an environment; the default latency threshold is two hours (120 minutes). The tool uses tracer tokens to monitor replication latency on production servers. Two types of tracer tokens are used: initial tokens, which discover the replication topology once per day, and regular tokens, which check replication latency at more frequent intervals. By default, regular tokens are inserted in the publication database every 15 minutes. The system checks whether the inserted tokens are able to make their way to the subscription databases within the designated latency thresholds.
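The tool builds on the system-supplied tracer token stored procedures. As a minimal sketch of how those procedures can be used directly (the publication name below is hypothetical and the wait time is arbitrary), a token can be posted and its latency history read as follows:

-- Run in the context of the publication database on the publisher.
DECLARE @TokenId INT;

-- Write a tracer token to the publication's transaction log (assumed publication name).
EXEC sys.sp_posttracertoken
    @publication = N'Pub_AdventureWorks',
    @tracer_token_id = @TokenId OUTPUT;

-- Give the Log Reader and Distribution Agents time to move the token, then report
-- distributor, subscriber, and overall latency for that specific token.
WAITFOR DELAY '00:05:00';

EXEC sys.sp_helptracertokenhistory
    @publication = N'Pub_AdventureWorks',
    @tracer_id = @TokenId;

The monitoring tool wraps this same pattern in self-discovery, per-subscription thresholds, and alerting logic.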
An alert is raised through Microsoft System Center Operations Manager or database mail if a token fails to reach the subscription database within the configured latency threshold, as illustrated in the following sample alert:

Error 60201. Transactional replication latency threshold exceeded. Publication Server: Publication_Server_Name. Publication [AdventureWorks_Db]: Pub_AdventureWorks. Subscriber: [Subscription_Server_Name].[AdventureWorks_Db]. Current Overall Latency: 125 min(s). Threshold: 120 min(s). Distributor Latency: "Null" sec(s). Subscriber Latency: "Null" sec(s). The last inserted tracer token has not made it to the subscriber yet. In fact, it has not even made it to the Distribution db. Please check the replication LogReader SqlServerAgent job.

This tool is executed via a custom SQL Server Agent job called _SQL_TranReplicationLatencyMonitor, which is created when the SQL Server replication latency monitoring tool installation script is run. This job contains three steps that perform the three major sub-functions of the tool, as described in Table 1.

Table 1. Steps of the _SQL_TranReplicationLatencyMonitor Job

Step 1: Replication Topology Discovery
Frequency: Once per day.
Activity: The following are discovered: publisher name; publication database name(s); publication name(s); distribution server name; distribution database name; subscriber name(s); subscription database name(s). Default replication latency thresholds are assigned for each subscription.

Step 2: Replication Latency Check
Frequency: Every 15 minutes.
Activity: A check is performed to determine whether overall replication latency is equal to, or greater than, the specified threshold for each subscription. Replication latency is monitored at the subscription level. The system assigns default replication latency thresholds, but these can be customized for each subscription, as needed.

Step 3: Alerting
Frequency: A check is performed every 15 minutes, but an alert is only raised if the same alert for the same publication–subscription pair has not been raised within the past 12 hours.
Activity: If replication latency is over the threshold for any of the subscriptions and a replication alert for the same publication–subscription pair has not been raised within the past 12 hours, a replication latency error will be raised. This error can be seen in the SQL Server Error log as well as in the Application Event Viewer log. If the replication latency check determines that an alert condition exists, an alert will be raised via System Center Operations Manager, database mail, or both. Alert suppression of 12 hours is built into the tool.

Customers have the option of receiving email alerts directly from this tool via database mail. This alternate method for receiving alerts can protect against System Center Operations Manager or similar monitoring tool issues.

Benefits of Proactive Monitoring

Most issues are resolved before they cause business impact. Proactive alerts are raised when subscriptions fall behind enough that their latency thresholds are crossed. In most cases, DBAs are able to resolve the issues causing latency long before latency becomes serious enough to affect the business.

Fewer instances of reinitialization.
Proactive monitoring reduces the number of replication latency issues that eventually require reinitialization and the extended downtime that might be needed for them. Again, most issues are resolved well before they reach a stage where reinitialization is the only option for resolution.

Users become more productive. Higher availability, because of fewer and faster-resolved issues, directly translates to users having more time to get their work done.

Better service level agreement (SLA) management for application owners. Business application owners can better manage their SLAs with their customers, because they can be immediately informed when latency thresholds are crossed. They can also analyze their historical data to see how many times latency thresholds were crossed within a specific period of time (for example, during a given month or quarter) to help them identify latency patterns or trends.

Note: More information about the transactional replication latency monitoring tool, including installation instructions and the Transact-SQL installation scripts, can be downloaded from http://gallery.technet.microsoft.com/scriptcenter/SQL-Server-Transactional-e34ed1e8.

Developing a Transactional Replication Troubleshooting Guide

TSGs provide repeatable troubleshooting steps and resolutions for known issues that support teams can use. Well-written and comprehensive TSGs can greatly increase the knowledge and comfort level of teams supporting a particular technology. The SQL Operations team documented the various transactional replication issues for the Snapshot, Log Reader, and Distribution Agents that they came across in their environments. After they put together a comprehensive TSG to support the replication environments at Microsoft, the support teams were better able to consistently and confidently resolve issues as they arose.

Note: Transactional replication issues that the SQL Operations team came across in Microsoft IT, along with potential solutions for them, can be seen in Appendix A of this document. This by no means represents a comprehensive list of all possible issues that one can encounter in transactional replication; however, the list is a great starting point for a new TSG. So, if your organization needs to create a new transactional replication TSG, you can start with the list from Appendix A and then, as you come across issues specific to your environment, add to the list as appropriate.

BEST PRACTICES AND DESIGN GUIDANCE

This section provides details of the transactional replication best practices and design guidance that the SQL Operations team came up with for Microsoft IT applications. Following the applicable guidelines has helped improve the stability and performance of the replication environments within Microsoft IT. These guidelines have also helped improve customer satisfaction and reduce support incidents.

1. Implement the Right Reinitialization Process

In transactional replication, reinitialization is the process of synchronizing the subscriber with the publisher again, just as it was synchronized when the subscription was first created. Once that data synchronization succeeds, only subsequent changes are moved from the publisher to the subscriber. Business applications should have a clear, detailed, thoroughly tested, and well-documented process for the reinitialization method to be used.
Frequently, the reinitialization process can take hours or even days for large databases (those with, say, more than 500 GB of data). Several options are available that can directly affect performance:

Reinitialization by running the replication Snapshot Agent.
Reinitialization by using the Replication Support Only option with sp_addsubscription.
Reinitialization by using the Initialize with Backup option with sp_addsubscription.

Often, the process of reinitialization is not given careful thought. It is important for application teams to test and develop a reliable and repeatable reinitialization process that works well and serves the business needs. Many scenarios may warrant reinitialization; here are a few of the common ones:

The server operating system, SQL Server, or the business application needs to be upgraded.
Replication has run into errors, and the subscriber has fallen too far behind to ever be able to catch up.
The subscription has been marked as inactive.

Reinitialization by Running the Replication Snapshot Agent

This method is generally useful for databases smaller than 300 GB, although there may be times when it is the only reinitialization option, for example, when a subscriber database needs to have only a subset of the tables or data present in the publication database. Using the replication Snapshot Agent, all of the subscriptions or just a single subscription can be reinitialized. To reinitialize all subscriptions for a particular publication, right-click the replication publication, and then click Reinitialize All Subscriptions, as shown in Figure 1.

Figure 1. Reinitializing all subscriptions

However, in many cases, only one subscription may need to be reinitialized. As illustrated in Figure 2, right-click the replication subscription, and then click Reinitialize.

Figure 2. Reinitializing a single subscription

Next, select the appropriate options in the Reinitialize Subscription(s) dialog box, shown in Figure 3. After clicking Mark for Reinitialization, run the replication Snapshot Agent to create the schema files for the published articles, index script files, and other objects. This process can sometimes be difficult for production systems, because shared locks are placed on tables during the initial part of the concurrent snapshot-generation process, and some blocking can occur. Next, all of the data from the published tables is bulk-copied using the bcp utility and stored in a replication snapshot folder. Finally, the Distribution Agent takes all of the object-creation scripts and the bcp data files from the snapshot folder and applies them to the subscriber.

Figure 3. The Reinitialize Subscription(s) dialog box

Depending on the size of the database as well as the quality and speed of the link between the publisher and the subscriber, this method of reinitialization can take a long time, hours or even days. This is another reason why this process must be well tested and properly documented, so that business teams and customers have the right expectations regarding how long the subscriber database might be offline.
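The same reinitialization can also be scripted rather than performed through the Management Studio dialogs. A minimal sketch, assuming a hypothetical publication and subscriber, marks a single subscription for reinitialization and then starts the Snapshot Agent job for the publication:

-- Run in the context of the publication database on the publisher.
-- Mark one subscription for reinitialization (all names are placeholders).
EXEC sp_reinitsubscription
    @publication = N'Pub_AdventureWorks',
    @article = N'all',
    @subscriber = N'Subscription_Server_Name',
    @destination_db = N'AdventureWorks_Db';

-- Start the Snapshot Agent job for the publication; the Distribution Agent then
-- applies the newly generated snapshot to the marked subscription.
EXEC sp_startpublication_snapshot
    @publication = N'Pub_AdventureWorks';

Scripting the steps this way makes the process easier to repeat and to include in the saved replication scripts discussed later in this section.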
Note: For more information about reinitialization by running the replication Snapshot Agent and for tips on ensuring that the snapshot process is as efficient as possible, visit http://blogs.msdn.com/b/repltalk/archive/2010/03/07/tips-to-improve-performance-when-applying-snapshot-in-transactional-replication.aspx.

Reinitialization by Using the Replication Support Only Option with sp_addsubscription

If the subscriber database has already been synchronized with the publication database through a backup, copy, and restore process, or through log shipping or database mirroring, there may be no need to run the costly Snapshot Agent. In this case, support teams can use the Replication Support Only option for reinitialization. To use this option, when creating the replication subscription with the stored procedure sp_addsubscription, provide the value Replication Support Only for the @sync_type parameter. For this process to be successful, the publisher and the subscriber should be in sync with each other, and no new transactions should be coming into the publisher while the subscription is being set up. After the sp_addsubscription script has run successfully, new data can then start coming into the publisher. The Log Reader and Distribution Agents will begin to work right away, without the Snapshot Agent ever coming into play. Here is a sample invocation of the sp_addsubscription procedure:

exec sp_addsubscription
    @publication = N'<Publication_Name>',
    @subscriber = N'<Subscription_Server_Name>',
    @destination_db = N'<Subscription_Database_Name>',
    @subscription_type = N'Push',
    @sync_type = N'Replication Support Only', -- The default is 'automatic', which means run the Snapshot Agent.
    @article = N'all',
    @update_mode = N'read only',
    @subscriber_type = 0

Reinitialization by Using the Initialize with Backup Option with sp_addsubscription

This option is useful for large databases, where the subscriber database contains most of the same data as the publication database. If the subscriber has only a subset of the tables, or a subset of the data in the tables that are in the publisher, it may be difficult to use this method. One benefit of this approach is that new data can still be written to the publisher while the subscription is being set up; compared to the Replication Support Only method of reinitialization, no publication server downtime is required. The database on the subscriber can be restored in no-recovery mode, and then differential or log backups can be restored on top. When all the restores are finished, the database can be recovered. Replication will start from after the last restored log sequence number. To use this method, complete the following steps:

1. Use the backup, copy, and restore process to restore a full database backup on the subscription SQL Server instance. Do not recover the database.

2. Restore differential or transaction log backups to make the subscription database as current as possible, and then recover the database.

3. Run the following sample script on the publisher server to set up the Initialize with Backup subscription:

Use <Published_Database_Name> -- Specify the published db name.
GO
DECLARE @PublicationNameVar NVARCHAR(200)
       ,@SubscriberVar SYSNAME
       ,@SubDatabaseVar SYSNAME
       ,@BackupDevicePath NVARCHAR(200)

IF @@SERVERNAME = N'<Publication_Server_Name>' -- Publication server name.
BEGIN
    SET @PublicationNameVar = N'<Publication_Name>' -- Publication name.
    SET @SubDatabaseVar = N'<Subscription_Db>' -- Subscriber db.
    SET @SubscriberVar = N'<Subscription_Server_Name>' -- Subscriber server.
    SET @BackupDevicePath = N'<E:\MSSQL\BAK\.....>' -- Path of the last restored differential or log backup file.
END

exec sp_addsubscription
    @publication = @PublicationNameVar,
    @subscriber = @SubscriberVar,
    @destination_db = @SubDatabaseVar,
    @sync_type = N'Initialize with Backup',
    @backupdevicetype = N'disk',
    @backupdevicename = @BackupDevicePath,
    @subscription_type = N'Push',
    @update_mode = N'Read Only'
GO

Post-reinitialization Considerations

Regardless of the method of reinitialization used for an environment, there might still be additional considerations:

Depending on how the subscription database is used, it may require a different set of indexes. If this is the case, the appropriate index-creation script should be run in the subscription database; otherwise, potential performance issues may occur while using the subscription database.

Any triggers, foreign key constraints, or check constraints present in the tables in the subscription database can slow down replication. If these objects are already present in the publication database, check to see whether their presence in the subscription database adds any business value. If the answer is no, then either define these objects with the Not for Replication property in the publication database or simply disable them on the subscriber.

2. Save Replication Scripts

All production applications should have scripts available to set up replication and to facilitate repairing replication if it breaks, as well as for use during upgrades, should replication need to be set up again. Replication scripts should be stored in the source code storage and collaboration environment for the benefit of development and operations teams.

Scripts should be developed so that individual replication components can be dropped and re-created separately. For example, if there are two publications and three subscriptions, the replication script should clearly call out which part of the script creates the first publication, which creates the first subscription, and so forth. The idea is that if any individual part of replication (for example, a particular publication) needs to be dropped and re-created, the fastest, easiest, and most convenient way of doing so should be available.

3. Know Your Replication Workload

Latency issues can often be seen when a large volume of transactions or commands is in the process of going from the publisher to the subscriber. The distribution database can be queried to see the volume of data being replicated, and it is good to understand and analyze this type of data. Knowledge about replication usage peaks and valleys can provide workload insights that may, in turn, help identify trends and potential opportunities for improving replication performance. Here are some examples:

Replication falls behind whenever a certain number of commands come in per hour that need to be replicated.
Replication falls behind on the weekends during the weekly index rebuild or when other maintenance jobs are running. It might be possible to improve performance by using a dedicated distribution server or by refining Replication Agent parameters, as discussed later in this section.

Certain maintenance jobs, like backups, may perform better when the replication load is light.

Note: For more information about understanding replication workloads and solutions to address periods of high latency related to high volumes of data, visit http://blogs.msdn.com/b/repltalk/archive/2010/10/20/determine-transactional-replication-workload-to-help-resolve-data-latency.aspx.

The following is an example of a distribution database script that can be used to see the volume of pending replication commands broken up by hour. Note that data in the distribution database does not persist forever, because it is purged periodically by the Distribution clean up job. In Microsoft IT, while studying a particular replication environment, typically a permanent table is created, and the output of this script is saved into that table on a periodic basis. This way, the team has data encompassing a longer period of time for analysis purposes.

/* Display the number of pending replication commands, broken up by the hour. */
if exists (select name from Tempdb.sys.objects where name like '#Results%')
begin
    Drop table #Results
end

select t.publisher_database_id,
       t.xact_seqno,
       max(t.entry_time) as EntryTime,
       count(c.xact_seqno) as CommandCount
into #Results
FROM MSrepl_commands c with (nolock)
LEFT JOIN msrepl_transactions t with (nolock)
    on t.publisher_database_id = c.publisher_database_id
   and t.xact_seqno = c.xact_seqno
GROUP BY t.publisher_database_id, t.xact_seqno

SELECT MPD.Publisher_Db
      ,datepart(year, R.EntryTime) as Year
      ,datepart(month, R.EntryTime) as Month
      ,datepart(day, R.EntryTime) as Day
      ,datepart(hh, R.EntryTime) as Hour
      ,sum(R.CommandCount) as CommandCountPerTimeUnit
FROM #Results R
inner join MSpublisher_databases MPD
    on R.publisher_database_id = MPD.Id
GROUP BY MPD.Publisher_Db
        ,datepart(year, R.EntryTime)
        ,datepart(month, R.EntryTime)
        ,datepart(day, R.EntryTime)
        ,datepart(hh, R.EntryTime)
ORDER BY MPD.Publisher_Db, 2, 3, 4, 5

4. See More of What Replication Is Doing

For more visibility into the activity and possible errors stemming from the Log Reader and Distribution Agents, run them with the –HistoryVerboseLevel parameter set to a value of 2. The default is 1.

Note: For more information about using the verbose history agent profile, visit http://blogs.msdn.com/b/repltalk/archive/2010/07/13/using-verbose-history-agent-profile-while-troubleshooting-replication.aspx.

Figure 4 and Figure 5 illustrate the difference in detail of Distribution Agent activity between –HistoryVerboseLevel 1 (the default) and –HistoryVerboseLevel 2. Note how much more detail and activity history is available with –HistoryVerboseLevel 2. In addition to greater detail, every 5 minutes, new Distribution Agent read and write performance statistics are made available through this interface. These statistics can be instrumental in identifying whether the distribution reader or writer thread is experiencing a performance problem. It is recommended that –HistoryVerboseLevel 2 be used while studying a replication environment.
Then, when the study is complete, revert to the default value of 1 to reduce the overhead. One valuable insight from this output is the possible identification of large transactions with a great number of commands in them. If this is the case, some optimization options may be available (for example, the –MaxCmdsInTran parameter, discussed later in this section).

Figure 4. –HistoryVerboseLevel 1

Figure 5. –HistoryVerboseLevel 2

5. Consider Using a Dedicated Distribution SQL Server Instance

For systems with a heavy load on the publisher, a dedicated distribution server can be used to reduce some of the replication processing overhead, thereby allowing the publisher to perform better. If the publisher is also used as the distribution server, the Snapshot and Log Reader Agents will run on the publisher. In this scenario, with push subscriptions the Distribution Agent also runs on the publisher, while with pull subscriptions the Distribution Agent runs on the subscriber. With a dedicated distribution SQL Server instance, all replication agents that would typically have run on the publisher run on the distributor instead.

If the number of commands needing to be replicated per hour is 100,000 or more, an organization might be said to have a busy replication environment and should consider using a separate distribution SQL Server instance. The script introduced in the section "Know Your Replication Workload" can be used to see the number of pending replication commands broken up by hour. Support teams should perform thorough testing to verify that adding a dedicated distribution SQL Server instance will help improve performance.

6. Break Up Large Transactions into Smaller Ones

For applications experiencing performance challenges resulting from large transactions, breaking down the transactions into smaller ones can help improve performance. Microsoft recommends that, as a first preference, the application development team modify the business application in question to use smaller transactions. Because it is not always possible to go back and modify an existing production application, the next best option is to break up large transactions within transactional replication by using the –MaxCmdsInTran parameter for the Log Reader Agent. This parameter has a downstream effect and helps the Distribution Agent stay current, because the Distribution Agent no longer has to wait until the Log Reader Agent is finished writing huge transactions into the distribution database. Instead, the Distribution Agent can start distributing commands as soon as each smaller transaction (created within replication with the help of the –MaxCmdsInTran parameter) is written to the distribution database. This way, the Distribution Agent waits less and can process more.

In Microsoft IT's replication environment, there have been applications where the number of commands per transaction would sometimes be 100,000 or more, sometimes even more than 500,000. In several of those cases, –MaxCmdsInTran was used to specify smaller transaction sizes between 2,000 and 5,000, and performance benefits were observed.

Note: For more information about enhancing transactional replication performance settings, visit http://msdn.microsoft.com/en-us/library/ms151762.aspx.
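As an illustration only (the server and database names are placeholders, and the value should be chosen through testing), the parameter is appended to the Run agent step command of the Log Reader Agent job, which might then look similar to the following:

-Publisher [PUB_SERVER] -PublisherDB [AdventureWorks_Db] -PublisherSecurityMode 1
-Distributor [DIST_SERVER] -DistributorSecurityMode 1 -Continuous -MaxCmdsInTran 5000

The Log Reader Agent must be restarted for a new parameter value to take effect.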
7. Open Multiple Channels of Communication to the Subscriber

In some cases, the Distribution Agent may struggle to keep up with the volume of data that it needs to distribute. To help speed up the Distribution Agent, use multiple parallel communication channels between the distributor and the subscriber by specifying the –SubscriptionStreams parameter. In many cases, doing so can greatly improve the performance of the Distribution Agent. A good number to start with is the number of processors on the subscription SQL Server instance. For example, if the number of processors on the subscriber is eight, start by trying –SubscriptionStreams 8. Use this setting as a baseline, and then tune the value of –SubscriptionStreams up or down to see where the best possible performance occurs.

Note: For more information about navigating the SQL Server replication SubscriptionStreams setting, visit http://blogs.msdn.com/b/repltalk/archive/2010/03/01/navigating-sql-replication-subscriptionstreams-setting.aspx.

8. Evaluate Stored Procedure Execution Replication

Typically, whatever data-modification statements (inserts, updates, and deletes) are run on the publisher are also run on the subscriber, one by one. All the statements have to be sent over the network and applied on the subscriber. In some cases, it might be more efficient to make those modifications on the publisher with the help of stored procedures, and then publish the stored procedure executions so that the same stored procedures are also executed on the subscriber. This practice can greatly reduce network traffic.

However, stored procedure replication may not be appropriate for all applications. If an article is filtered horizontally, so that there are different sets of rows at the publisher than at the subscriber, executing the same stored procedure on both sides will return different results. Similarly, if an update is based on a subquery of another, non-replicated table, executing the same stored procedure at both the publisher and the subscriber will return different results. SQL Server Books Online has a good discussion of some of the available choices when using stored procedure replication.

Note: For more information about publishing stored procedure execution in transactional replication, visit http://msdn.microsoft.com/en-us/library/ms152754.aspx and http://msdn.microsoft.com/en-us/library/ms151168.aspx.

9. Replication Agent Jobs Standards

Microsoft recommends running the Log Reader and Distribution Agents on a continuous basis rather than on a schedule, so that the subscriptions are as current as possible. By default, when replication agents are configured to run continuously, the Log Reader and Distribution Agents are configured to retry 2,147,483,647 times upon failure. This many retries, with the default retry interval of once per minute, would take 4,085 years before the agent would finally declare failure. It might be better to configure the retry attempts to a more reasonable number, say 60, or whatever makes sense for the business. With retries set to 60, the Log Reader and Distribution Agents would fail in 1 hour or more (depending on how long each retry takes). The idea is that the agent should fail within a reasonable amount of time and be able to generate an agent failure alert.
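Where a team does choose to lower the retry count, the change can be made on the Run agent step of the agent job. A minimal sketch follows, assuming a hypothetical Log Reader Agent job name; verify the correct step_id with sp_help_jobstep first (the Run agent step is commonly step 2):

-- Inspect the job steps to confirm which step is the Run agent step.
EXEC msdb.dbo.sp_help_jobstep @job_name = N'PUB_SERVER-AdventureWorks_Db-1';

-- Lower the retry count so the agent fails, and alerts, after roughly an hour
-- instead of retrying 2,147,483,647 times.
EXEC msdb.dbo.sp_update_jobstep
    @job_name = N'PUB_SERVER-AdventureWorks_Db-1',  -- hypothetical agent job name
    @step_id = 2,                                   -- Run agent step
    @retry_attempts = 60,
    @retry_interval = 1;                            -- minutes between retries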
In Microsoft IT's replication environment, the number of retries is not modified, because their transactional replication latency monitoring tool generates alerts when subscriber databases fall behind publisher databases by more than the specified time threshold.

Note: For more information, visit http://blogs.msdn.com/b/repltalk/archive/2010/08/25/sql-replication-agent-will-retry-for-4085-years.aspx.

10. Understand the Not for Replication Property

The presence of unnecessary triggers, foreign key constraints, and check constraints on subscription database tables can cause a Distribution Agent bottleneck. As an example, if a business rule implemented via a constraint has been validated at the publisher, it might not be useful to perform the same validation at the subscriber. Such objects should be marked as Not for Replication or, if this option is not available, disabled at the subscriber. This option is also available for identity columns.

Note: For more information about the Not for Replication property, visit http://blogs.msdn.com/b/repltalk/archive/2010/02/22/all-about-not-for-replication.aspx, and see "Controlling Constraints, Identities and Triggers with 'Not for Replication'" at http://msdn.microsoft.com/en-us/library/ms152529(v=SQL.105).aspx.

11. Don't Always Push

The Distribution Agent runs at the distributor for push subscriptions and at the subscriber for pull subscriptions. Evaluate push versus pull subscriptions to ensure that the choice makes the most sense for a specific application. In a local data center scenario, push subscriptions are preferred because of their manageability. When it is important to offload work from the distributor (for example, when there are many subscribers or the distribution server has lower technical specifications), pull subscriptions are preferred. In scenarios that involve geo-replication across a WAN, pull subscriptions have proved to be more effective than push subscriptions, as detailed in the MSDN case study "Replication Performance Gains with Microsoft SQL Server 2008 running on Windows 2008" at http://msdn.microsoft.com/en-us/library/dd263442(SQL.100).aspx.

12. Published Database Log File Considerations

As with any database, there should be only one log file for the published database, and the number of virtual log files (VLFs) should be as low as possible. To determine the number of VLFs, run Dbcc LogInfo(Published_Db_Name) on the publisher SQL Server instance; the number of rows returned is the number of VLFs the published database has. The greater the number of VLFs, the slower the Log Reader Agent performance, because it takes more time for the agent to scan through all the VLFs looking for transactions and commands marked for replication that need to be moved to the distribution database. In general, for a published database, having more than 10,000 VLFs can be detrimental to Log Reader Agent performance. Such an excessive number typically occurs when the log file autogrow value is configured too small. Depending on factors like server load and hardware, applications can experience Log Reader Agent slowness even with far fewer VLFs. An industry expert has even recommended keeping the number of VLFs under 50 for optimization purposes. This may be difficult to achieve, but the idea is to aim for keeping this number low.

To resolve problems caused by a high number of VLFs, first ensure that there is only one log file.
In the case of multiple log files, remove all but the first transaction log file. Next, shrink that one log file (to reduce the number of VLFs), and then specify a reasonable initial size for it. Also specify a reasonable autogrow value for this log file, to avoid a high VLF count recurring because of repeated autogrow-driven file growth.

13. Size the Distribution Database Appropriately

Make sure the distribution database is sized properly and that the autogrow settings make sense for a production database. All too often, these considerations are not part of the design process. Every now and then, a production distribution database is observed where the lone data file is configured to autogrow in 1 MB increments (the default). If the distribution database has to constantly autogrow to make room for incoming replicated data, this may cause replication slowness. Just like the data file, autogrow for the distribution database log file should be changed from the default of 10 percent to a more reasonable, non-percentage value that makes sense for the application.

Note: For more information about configuring distribution, visit http://technet.microsoft.com/en-us/library/ms151860.aspx.

14. Enable Replication Latency Monitoring

IT teams should have a mechanism to monitor replication latency and provide alerts when replication exceeds specified latency thresholds. As discussed earlier, the SQL Operations team at Microsoft created a custom tool that employs tracer tokens to help them monitor replication latency.

Note: For more information about monitoring the health of SQL Server replication, visit http://blogs.msdn.com/b/repltalk/archive/2010/09/20/how-to-monitor-the-health-of-sql-server-replication.aspx.

15. Check Your Hardware

Replication is not magic; it takes a finite amount of time and hardware resources to move data from the publisher to the subscriber. Even the best SQL Server configuration refinements may not be enough to improve performance if the hardware on which the application is running is not tuned for the workload. Components such as I/O subsystem performance, memory, and network adapter drivers are all important factors to consider. For example, if there are many messages like the following in the SQL Server Error log for the drives on which the published, distribution, or subscription databases reside, they could potentially indicate bottlenecks for the Log Reader and Distribution Agents:

SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [O:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].

In Performance Monitor, check the Avg. Disk sec/Read and Avg. Disk sec/Write counters for the hard disks on which the alerts occur. If the times are consistently greater than 15–20 ms, there may be an I/O bottleneck. Likewise, for the active network adapters, the Network Interface\Output Queue Length counter should consistently stay under 2.

BENEFITS

Providing consistent processes and documentation for transactional replication design, monitoring, and troubleshooting has been helpful for the SQL Operations team and the applications they support:

Greater awareness of design best practices has resulted in more robust and better-performing applications.
Proactive monitoring has resulted in less downtime for line-of-business applications, better SLA management for application owners, and higher productivity for users:

Downtime has been reduced, because most issues are resolved well before they cause any business impact. Consequently, instances of issues where reinitialization would have been required have gone down significantly.

Use of the replication latency monitoring tool has given application owners the ability to specify custom replication latency thresholds, providing them alert notifications when subscription databases fall behind publication databases. This has brought additional business value for application owners by allowing them to better measure against their SLAs. For example, if an application owner's SLA with internal customers is that the data on their reporting site (from the subscription database) should never be more than 2 hours old, then having no replication latency alerts generated in a given time frame can help the owner validate that the SLA was kept.

A side benefit of overall reduced downtime is that users now have more time to be productive and actually perform the work for which these business applications were designed.

Consolidated troubleshooting documentation for known issues has resulted in faster resolution of issues and greater confidence for support professionals working with this technology.

By using the three-pronged approach described in this paper, the SQL Operations team has seen about a 61 percent decrease in the number of transactional replication-related tickets since embarking on this approach. (The effective reduction is even greater than the stated number, because the number of servers SQL Operations is monitoring roughly doubled between 2010 and 2012, and that increase was not taken into account in this calculation.)

CONCLUSION

The ongoing engagements between the SQL Operations team and application owners have, over time, improved the overall understanding of how transactional replication works and how things can be designed and configured to mitigate some of the issues that were previously being seen. Best practices and design guidance, proactive monitoring, and good troubleshooting documentation form the three corners of an equilateral triangle, each corner of which is critical (see Figure 6).

Figure 6. Cornerstones of a stable transactional replication environment

IT support teams that incorporate these three ideas into their operations and support methodology will be able to reap benefits similar to those SQL Operations saw in their environment. With sound implementation of those ideas, any support team can be well on its way toward achieving excellence in designing and maintaining their transactional replication environments.

FOR MORE INFORMATION

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary.
To access information via the World Wide Web, go to:
http://www.microsoft.com
http://www.microsoft.com/technet/itshowcase

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This white paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

© 2012 Microsoft Corporation. All rights reserved. Microsoft and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are property of their respective owners.

APPENDIX A: WORKING WITH AND TROUBLESHOOTING TRANSACTIONAL REPLICATION AGENT ISSUES

Transactional replication issues that the SQL Operations team came across at Microsoft, along with potential solutions for them, are described in this appendix. This by no means represents a comprehensive list of all possible issues that can be encountered during transactional replication; however, the list is a great starting point for a new TSG. So, if an organization needs to create a new transactional replication TSG, it can start with the list in this appendix and add additional scenarios specific to the organization's environment.

Working with Snapshot Agent Issues

The Snapshot Agent job runs as an executable (snapshot.exe). To see the syntax and various parameters that can be specified for this executable, visit http://msdn.microsoft.com/en-us/library/ms146939.aspx. Table 2 shows some of the Snapshot Agent issues the SQL Operations team saw in Microsoft environments.

Table 2. Snapshot Agent Issues and Their Solutions

Scenario: The Snapshot Agent seems to hang.
Solution: Try running the Snapshot Agent from a Command Prompt window to troubleshoot the issue. For more details, see http://blogs.msdn.com/b/repltalk/archive/2010/03/17/troubleshooting-snapshot-agent-hang.aspx.

Scenario: The snapshot process is slow and needs to be optimized further.
Solution: To enhance the performance of the snapshot process, look at the ideas presented at http://blogs.msdn.com/b/repltalk/archive/2010/03/07/tips-to-improve-performance-when-applying-snapshot-in-transactional-replication.aspx.

Working with Log Reader Agent Issues

1. Run the command Exec sp_replcounters on the publisher against the published database. The result set indicates whether the Log Reader Agent is lagging behind and, if so, by how much. The following columns in the output of this stored procedure are of particular interest:

Replicated Transactions. The number of transactions in the log awaiting delivery to the distribution database.
Replication Rate trans/sec. The average number of transactions per second delivered to the distribution database.
Replication Latency (sec). The average time, in seconds, that transactions were in the log before being distributed. (Convert this number into hours or minutes, as appropriate.)

This data can be shared with the application support team. To see how far behind the Log Reader Agent is in terms of pending commands marked for replication, run the following query in the context of the published database (the query reads the database's transaction log):

select count(*) from fn_dblog(null, null) where Description = 'Replicate'

2. Run the command Dbcc LogInfo(Published_Db_Name) on the publisher SQL Server instance. The number of rows returned is the number of VLFs the published database has. For more information, review the section "Published Database Log File Considerations" earlier in this paper.

Table 3 provides a summary of scenarios and issues that IT teams might encounter while working with the Log Reader Agent. The Log Reader Agent job runs as an executable (LogRead.exe). To see the syntax and various parameters that can be specified for this executable, see http://msdn.microsoft.com/en-us/library/ms146878.aspx.

Table 3. Log Reader Agent Issues and Potential Solutions

Scenario: The Log Reader Agent is timing out.
Resolution: Add the –QueryTimeOut parameter to the Run agent step of the failing Log Reader SQL Server Agent job. The default value of this parameter is 1,800 seconds (30 minutes). Try putting in a higher number; for example, 0 specifies unlimited time. When the Log Reader Agent has caught up, remove this parameter.

Scenario: The Log Reader Agent is slow on a SQL Server instance where the published database is also being mirrored.
Resolution: The paper "SQL Server Replication: Providing High Availability using Database Mirroring" describes how the Log Reader Agent behaves in this situation: http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx.

Scenario: Log Reader Agent error: The process could not execute 'sp_replcmds' on servername.
Resolution: See the Microsoft Support article at http://support.microsoft.com/kb/811030.

Scenario: The Log Reader Agent is slow because the underlying disk subsystem is slow.
Resolution: Check the Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg. Disk sec/Write Performance Monitor counters for the disks on which the published database, and especially its log files, reside. Ideally, those numbers should be 15 ms or less. If there are several messages like the one below in the SQL Server Error log for the log file drives in question, it may indicate a disk bottleneck:

SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [O:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].
Table 3 provides a summary of scenarios and issues that IT teams might encounter while working with the Log Reader Agent. The Log Reader Agent job runs as an executable (LogRead.exe). To see the syntax and various parameters that can be specified with this file, see http://msdn.microsoft.com/en-us/library/ms146878.aspx.

Table 3. Log Reader Agent Issues and Potential Solutions

Scenario: The Log Reader Agent is timing out.
Resolution: Add the –QueryTimeOut parameter to the Run Agent step of the failing Log Reader SQL Server Agent job. The default value of this parameter is 1,800 seconds (30 minutes). Try putting in a higher number; a value of 0 specifies unlimited time. When the Log Reader Agent has caught up, remove this parameter.

Scenario: The Log Reader Agent is slow on a SQL Server instance where the published database is also being mirrored.
Resolution: The paper "SQL Server Replication: Providing High Availability using Database Mirroring", available at http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx, describes how the Log Reader Agent behaves in this situation.

Scenario: Log Reader Agent error: The process could not execute 'sp_replcmds' on servername.
Resolution: See the Microsoft Support article at http://support.microsoft.com/kb/811030.

Scenario: The Log Reader Agent is slow, because the underlying disk subsystem is slow.
Resolution: Check the Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg. Disk sec/Write Performance Monitor counters for the disks on which the published database, and especially its log files, reside. Ideally, those numbers should be 15 ms or less. If there are several messages like the one below in the SQL Server error log for the log file drives in question, it may indicate a disk bottleneck:

SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [O:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].

Receiving this message occasionally does not necessarily indicate that there is an issue. However, receiving many such messages at regular intervals in the SQL Server error log can indicate a disk bottleneck. Also, look into the possibility of the publication and distribution databases sharing the same drives and experiencing contention.

Scenario: The Log Reader Agent is slow or is failing with the generic error message, "Unable to execute sp_replcmds."
Resolution: Look at the parameter settings for the Log Reader Agent provided in the MSDN article, "Enhance Transactional Replication Performance", at http://msdn.microsoft.com/en-us/library/ms151762.aspx to see whether tuning one or more of them makes sense:

–MaxCmdsInTran. This parameter specifies the maximum number of statements grouped into a transaction as the Log Reader Agent writes commands to the distribution database. The default is 0 (unlimited). Specifying a smaller size can be useful in the case of large transactions.
–ReadBatchSize. This parameter specifies the maximum number of transactions read out of the transaction log of the publishing database per processing cycle. The default is 500. When a large number of transactions is written to a publication database but only a small subset of those is marked for replication, use this parameter to increase the read batch size of the Log Reader Agent. For troubleshooting purposes, however, specify a low number, even 1.
–PollingInterval. This parameter specifies how often, in seconds, the log is queried for replicated transactions. The default is 5 seconds. Decrease this value to poll the log more frequently; doing so can result in lower latency for the delivery of transactions from the publication database to the distribution database.

Note: While making a change, measure and record the before and after values of the following Performance Monitor counters to determine whether the change helped: SQLServer:Replication LogReader: Delivered Cmds/sec, SQLServer:Replication LogReader: Delivered Trans/sec, and SQLServer:Replication LogReader: Delivery Latency.

Scenario: Log Reader error: Repldone log scan occurs before the current start of replication.
Resolution: See Chris Skorlinski's blog, "Troubleshooting LogReader Error repldone log scan occurs before the current start of replication", at http://blogs.msdn.com/b/repltalk/archive/2010/04/11/troubleshooting-logreader-error-repldone-log-scan-occurs-before-the-current-start-of-replication.aspx.

Scenario: Log Reader error: The process could not execute sp_repldone/sp_replcounters.
Resolution: See Chris Skorlinski's blog at http://blogs.msdn.com/b/repltalk/archive/2010/02/19/the-process-could-not-execute-sp-repldone-sp-replcounters.aspx.

Scenario: Possible slow Log Reader Agent performance, with a high number of VLFs for the transaction log of the published database.
Resolution: Review the section, "Published Database Log File Considerations," earlier in this paper.

Scenario: Get more information about the failing Log Reader Agent.
Resolution: Generate an output file by specifying the –Output and –OutputVerboseLevel parameters in the Run Agent step of the Log Reader Agent job (a sample command line follows this entry). Make sure that the path of the output file specified with –Output is valid. Use values of 3 or 4 with –OutputVerboseLevel; the higher the number, the more output is generated (the default is 2). Next, analyze the output file generated. See the Microsoft Support article at http://support.microsoft.com/kb/949523. When troubleshooting is finished, remove these parameters; otherwise, they could fill up the drive on which the specified output file resides.
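As an illustration of the entry above, here is a hedged sketch of what the Run Agent step command line might look like with the output parameters appended. The server and database names are illustrative only; the parameters already present in the job step should be left unchanged.

LogRead.exe -Publisher "PUBSRV01" -PublisherDB "ProdDb" -Distributor "DISTSRV01" -DistributorSecurityMode 1 -Continuous -Output "C:\Temp\LogReaderAgent.log" -OutputVerboseLevel 3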
Scenario: Get further detailed troubleshooting.
Resolution: Chris Skorlinski's blog at http://blogs.msdn.com/b/repltalk/archive/2010/02/07/repltalk-start-here.aspx talks about transactional replication conversations. The first part of the article covers the Log Reader Agent and its reader and writer threads.

Working with Distribution Agent Issues

The Distribution Agent job runs as an executable (Distrib.exe). To see the syntax and various parameters that can be specified with this file, visit http://msdn.microsoft.com/en-us/library/ms147328.aspx.

To see how far behind the Distribution Agent is in terms of the number of commands in the distribution database that still need to be applied to the subscription database, use the Replication Monitor shown in Figure 7. (An equivalent T-SQL check is sketched after the figure.)

Figure 7. Using the Replication Monitor
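The following T-SQL sketch is an addition to the original guidance: it performs an equivalent check on the distributor with sp_replmonitorsubscriptionpendingcmds, which returns the number of pending commands for a subscription and an estimate of how long applying them will take. All server, database, and publication names are illustrative.

-- Run on the distributor, in the distribution database
USE distribution;
EXEC sp_replmonitorsubscriptionpendingcmds
    @publisher = N'PUBSRV01',       -- illustrative publisher name
    @publisher_db = N'ProdDb',      -- illustrative published database
    @publication = N'ProdDb_Pub',   -- illustrative publication name
    @subscriber = N'SUBSRV01',      -- illustrative subscriber name
    @subscriber_db = N'ProdDb',     -- illustrative subscription database
    @subscription_type = 0;         -- 0 = push subscription, 1 = pull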
Table 4 provides a summary of scenarios and issues that IT teams might encounter while working with the Distribution Agent.

Table 4. Distribution Agent Issues and Potential Solutions

Scenario: The Distribution Agent is timing out.
Resolution: Add the –QueryTimeOut parameter to the Run Agent step of the failing Distribution Agent SQL Server Agent job. The default value of this parameter is 1,800 seconds (30 minutes). Try putting in a higher number; a value of 0 specifies unlimited time. When the Distribution Agent has caught up, remove this parameter.

Scenario: The row was not found at the subscriber when applying the replicated command. (Source: MSSQLServer, Error number: 20598)
Resolution: More information about this error can be found at http://www.microsoft.com/products/ee/transform.aspx?ProdName=Microsoft+SQL+Server&ProdVer=09.00&EvtSrc=MSSQLServer&EvtID=20598. It may be possible to use the following steps to remove discrepancies between the problem table on the publisher and the subscriber, in turn resolving the replication problem:

1. Use the following statement to find the replication error captured in the system table distribution.dbo.MSrepl_errors:

select top 300 * from distribution.dbo.MSrepl_errors (nolock) where time > getdate() - .05 order by time desc

2. Run sp_browsereplcmds on the distribution SQL Server instance, in the distribution database, to see the table and command on which the replication failure is occurring. Use the xact_seqno from the previous step for both @xact_seqno_start and @xact_seqno_end:

Exec Distribution.dbo.sp_browsereplcmds @xact_seqno_start = '0x0000001800000533000400000000', @xact_seqno_end = '0x0000001800000533000400000000' --, @publisher_database_id = 10, @command_id = 1

3. Use the tablediff.exe utility, typically found at C:\Program Files\Microsoft SQL Server\100\COM, to find the discrepancies between this table on the publisher and the subscriber. Use the command column from the output of sp_browsereplcmds from the previous step to determine the name of the table with discrepancies. Optionally, use the –f switch to generate a script that inserts the missing rows on the subscriber. Here is a sample invocation of this tool:

tablediff -SourceServer AzharTaj1 -SourceDatabase ProdDb -SourceTable People -SourceSchema dbo -DestinationServer AzharTaj2 -DestinationDatabase ProdDb -DestinationTable People -DestinationSchema dbo -c -o C:\Temp\Repl_Fix_Script.txt -f

Look for the replication fix script generated in the output file specified above. Run the script in the subscription database in question to resolve the problem.

Scenario: Violation of the PRIMARY KEY constraint 'PK_TableName'. Cannot insert duplicate key in object 'dbo.TableName'. (Source: MSSQLServer, Error number: 2627)
Resolution: More information about this error can be found at http://www.microsoft.com/products/ee/transform.aspx?ProdName=Microsoft+SQL+Server&ProdVer=09.00&EvtSrc=MSSQLServer&EvtID=2627. It may be possible to use the following steps to manually delete specific rows on the subscriber that might have accidentally been inserted there and by virtue of their presence are causing primary key violations. Note: This can happen in transactional replication when a row already exists on the subscriber, perhaps in a scenario where someone accidentally inserted data on the subscriber.

1. Use the following statement to find the replication error captured in the system table distribution.dbo.MSrepl_errors:

select top 300 * from distribution.dbo.MSrepl_errors (nolock) where time > getdate() - .05 order by time desc

2. Run sp_browsereplcmds on the distribution SQL Server instance, in the distribution database, to see the table and command on which the replication failure is occurring. Use the xact_seqno from the previous step for both @xact_seqno_start and @xact_seqno_end:

Exec Distribution.dbo.sp_browsereplcmds @xact_seqno_start = '0x0000001800000533000400000000', @xact_seqno_end = '0x0000001800000533000400000000' --, @publisher_database_id = 10, @command_id = 1

3. Use the command column from the output of sp_browsereplcmds from the previous step to determine the name of the table in which the insert is occurring. From the column values of the insert statement, determine whether that record already exists on the subscriber. If it does, delete just that row; doing so should resolve the replication issue. (A hypothetical example follows this entry.)
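As a hypothetical illustration of step 3, suppose the replicated command returned by sp_browsereplcmds is an insert into dbo.People keyed on a PersonId column (both the table and the key value below are illustrative, not from the original guidance). The check and the targeted delete on the subscriber might look like this:

-- Run in the subscription database; the key value comes from the replicated
-- insert command returned by sp_browsereplcmds (illustrative value)
SELECT * FROM dbo.People WHERE PersonId = 12345;

-- If the row already exists and matches the replicated insert, delete only that row
DELETE FROM dbo.People WHERE PersonId = 12345;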
Scenario: The Distribution Agent is slow or is "erroring out."
Resolution: Look at the parameters for the Distribution Agent at http://msdn.microsoft.com/en-us/library/ms151762.aspx to determine whether one or more of them make sense in this scenario (a sample command line follows this entry):

–SubscriptionStreams [0|1|2|...64]. This is the number of connections allowed per Distribution Agent to apply batches of changes in parallel to a subscriber, while maintaining many of the transactional characteristics present when using a single thread. The default is 1 for a transactional SQL Server subscription. In the case of a high incoming transaction rate from the publisher, this parameter may help. A good place to start is to set this value equal to the number of processors on the subscriber. For more information, see the blog "Navigating SQL Replication Subscription Streams setting" at http://blogs.msdn.com/b/repltalk/archive/2010/03/01/navigating-sql-replication-subscriptionstreams-setting.aspx.
–CommitBatchSize. This is the number of transactions to be issued to the subscriber before a COMMIT statement is issued. The default is 100. Committing a set of transactions has a fixed overhead; by committing a larger number of transactions less frequently, the overhead is spread across a larger volume of data.
–PacketSize. This is the packet size in bytes. The default is 4,096 bytes. See the blog "Tune Replication Performance using Packet Size" at http://blogs.msdn.com/b/repltalk/archive/2010/03/11/tune-replication-performance-using-packetsize.aspx for the possibilities of tuning this value.
–SkipErrors native_error_id [:...n]. This colon-separated list specifies the error numbers for the agent to skip.
–PollingInterval. This is how often, in seconds, the distribution database is queried for replicated transactions. The default is 5 seconds. Decrease this value to poll the distribution database more frequently.

While making a change, measure and record the before and after values of the following Performance Monitor counters to determine whether the change helped: SQLServer:Replication Dist: Delivered Cmds/sec, SQLServer:Replication Dist: Delivered Trans/sec, and SQLServer:Replication Dist: Delivery Latency.
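For illustration, here is a hedged sample of a Distribution Agent Run Agent step command line with some of these parameters applied. The server, database, and publication names are illustrative, and the parameter values are only starting points to be validated against the Performance Monitor counters listed above.

Distrib.exe -Publisher "PUBSRV01" -PublisherDB "ProdDb" -Publication "ProdDb_Pub" -Distributor "DISTSRV01" -DistributorSecurityMode 1 -Subscriber "SUBSRV01" -SubscriberDB "ProdDb" -SubscriberSecurityMode 1 -SubscriptionStreams 4 -CommitBatchSize 1000 -PacketSize 8192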
Scenario: The Distribution Agent is slow, because the underlying disk subsystem is slow.
Resolution: Check the Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg. Disk sec/Write Performance Monitor counters for the disks on which the published database, and especially its log files, reside. Ideally, those numbers should be 15 ms or less. Several messages like the one below in the SQL Server error log for the log file drives in question may indicate a disk bottleneck:

SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].

Receiving this message occasionally does not necessarily indicate that there is an issue. However, many such messages received at regular intervals in the SQL Server error log can indicate a disk bottleneck. Also, look into the possibility of the publication and distribution databases sharing the same drives and experiencing contention.

Scenario: Distribution Agent error: The process could not connect to Subscriber 'SubscriberServerName.'
Resolution: See Chris Skorlinski's blog, "Distribution Agent fails with: Error Locating Server/Instance Specified [xFFFFFFFF]", at http://blogs.msdn.com/b/repltalk/archive/2010/04/26/distribution-agent-fails-with-error-locating-server-instance-specified-xffffffff.aspx. Note: This error can also occur if there have been changes on the cluster and the SPN is not registered correctly.

Scenario: Distribution Agent error: Msg 0, Level 20, State 0, Line 0. A severe error occurred on the current command. The results, if any, should be discarded.
Resolution: See Chris Skorlinski's blog, "Distribution Agent Fails with error Msg 0, Level 20, State 0, Line 0", at http://blogs.msdn.com/b/repltalk/archive/2010/04/05/distribution-agent-fails-with-error-msg-0-level-20-state-0-line-0.aspx.

Scenario: Get more information about the failing Distribution Agent.
Resolution: Generate an output file by specifying the –Output and –OutputVerboseLevel parameters in the Run Agent step of the Distribution Agent job of concern. Make sure that the path of the output file specified with –Output is valid. A value of 2 is the highest value that can be specified for –OutputVerboseLevel in the case of the Distribution Agent and is also the default. Next, analyze the output file generated. When troubleshooting is complete, remove these parameters; otherwise, the drive on which the specified output file resides might fill up.

Scenario: Get further detailed troubleshooting.
Resolution: Chris Skorlinski's blog at http://blogs.msdn.com/b/repltalk/archive/2010/02/07/repltalk-start-here.aspx talks about transactional replication conversations. The latter half of the article covers the Distribution Agent reader and writer threads.