Achieving Excellence in
Designing and Maintaining
SQL Server Transactional
Replication Environments
How Microsoft IT improved stability and enhanced
performance of transactional replication
environments using a combination of best practices
and design guidance, proactive monitoring, and
good troubleshooting documentation.
Technical White Paper
Published: June 2012
The following content may no longer reflect Microsoft’s current position or infrastructure. This
content should be viewed as reference documentation only, to inform IT business decisions
within your own company or organization.
Azhar Paul Taj
CONTENTS
Executive Summary
Introduction
    Situation
Solution
    Creating Best Practices and Design Guidance for Transactional Replication Environments
    Building a Transactional Replication Latency Monitoring Tool
    Developing a Transactional Replication Troubleshooting Guide
Best Practices and Design Guidance
    1. Implement the Right Reinitialization Process
    2. Save Replication Scripts
    3. Know Your Replication Workload
    4. See More of What Replication Is Doing
    5. Consider Using a Dedicated Distribution SQL Server Instance
    6. Break Up Large Transactions into Smaller Ones
    7. Open Multiple Channels of Communication to the Subscriber
    8. Evaluate Stored Procedure Execution Replication
    9. Replication Agent Jobs Standards
    10. Understand the Not for Replication Property
    11. Don’t Always Push
    12. Published Database Log File Considerations
    13. Size the Distribution Database Appropriately
    14. Enable Replication Latency Monitoring
    15. Check Your Hardware
Benefits
Conclusion
For More Information
Appendix A: Working With and Troubleshooting Transactional Replication Agent Issues
    Working with Snapshot Agent Issues
    Working with Log Reader Agent Issues
    Working with Distribution Agent Issues
Situation
The SQL Operations team within
Microsoft IT was facing some
challenges with SQL Server
transactional replication. Replication
would break without warning. Many
applications had no replication
monitoring. The team had to
work reactively whenever
replication failed. The team needed
to create design guidance for
applications using transactional
replication so that applications would
be developed with replication stability
and performance in mind. The team
also needed to develop a monitoring
solution so that they would receive
proactive alerts in case of replication
issues.
Solution
The SQL Operations team defined
and built a Microsoft IT standard that
encompassed a consistent set of best
practices and design guidance to be
used for all transactional replication
environments at Microsoft. They also
built a transactional replication latency
monitoring tool to proactively monitor
replication latency and alert them
before an issue caused business
impact. Finally, they developed a
comprehensive troubleshooting guide
(TSG) to help their
database administrators consistently
and easily resolve known replication
issues.
Benefits
• The overall stability and performance of replication systems improved greatly.
• Less downtime allowed users to be more productive.
• A consistent process for replication monitoring improved service delivery.
• Application owners were better able to measure their SLAs.
EXECUTIVE SUMMARY
Replication is a set of technologies for copying and distributing data and database objects
from one database to another, and then synchronizing the data between the databases to
maintain consistency. Transactional replication is employed in server-to-server scenarios that
generally involve high throughput. Typical usage examples include reporting and data
warehousing, improving scalability and availability, integrating data from multiple sites,
integrating heterogeneous data, and offloading batch processing.
The SQL Operations team within Microsoft IT defined and developed a consistent set of
guidelines and best practices for transactional replication environments. These guidelines
and best practices have greatly helped in increasing stability and simplifying management
while improving performance of replication environments at Microsoft. Furthermore, proactive
monitoring and the creation of comprehensive troubleshooting guides have improved the
support efficacy for the business applications that the team is responsible for managing.
The purpose of this technical white paper is to share Microsoft knowledge, experiences, and
best practices related to Microsoft® SQL Server® transactional replication. This paper is not
intended to serve as a procedural guide. Each enterprise environment has unique
characteristics and circumstances; therefore, each organization should adapt the plans and
lessons learned, as described in this paper, to meet its specific needs. This paper assumes
that readers are IT pros and technical decision-makers already familiar with SQL Server.
Specifically, readers should possess a working knowledge of SQL Server transactional
replication.
Note: In this paper, the word replication refers to Microsoft SQL Server Transactional
Replication unless explicitly stated otherwise.
To review some of the basic concepts of SQL Server replication, see:
• SQL Server Replication at http://msdn.microsoft.com/en-us/library/ms151198.aspx
• Replication Agents Overview at http://technet.microsoft.com/en-us/library/ms152501.aspx
Detailed product information is available in the Microsoft SQL Server 2008 TechCenter at
http://technet.microsoft.com/en-us/sqlserver/default.aspx.
Note: For security reasons, the sample names of forests, domains, internal resources,
organizations, and internally developed security file names used in this paper do not
represent real resource names used within Microsoft and are for illustration purposes only.
Products & Technologies
• SQL Server 2012
• SQL Server 2008 R2
• SQL Server 2008
• SQL Server 2005
• System Center 2012 Operations Manager
Achieving Excellence in Designing and Maintaining Transactional Replication Environments
INTRODUCTION
SQL Operations, a team within Microsoft IT, is charged with providing Microsoft®
SQL Server® support to the entire Microsoft IT organization. The team’s goal is to improve
application availability and performance by maintaining standardized SQL Server
configurations as well as unified incident and problem management processes. Their core
responsibilities include development of SQL Server standards and best practices for
Microsoft IT, maintenance of SQL Server health and welfare, and incident response.
Situation
In late 2009, Microsoft IT was experiencing challenges in its transactional replication
environments. Replication agents, which are implemented as SQL Server Agent jobs, were
frequently running into errors and performing poorly. There was limited to no replication
monitoring for many applications, and support teams had to work in a reactive mode in
response to replication failures. In many cases, replication issues would manifest themselves
or be discovered at such an advanced stage that reinitialization would be required for
resolution. Reinitialization is a lengthy process during which the subscription database is not
available. Unavailability of subscription databases can be costly. For example, if a
subscription database is being used for reporting needs, reinitialization will cause those
business reports to be down for the entire duration of the reinitialization process.
The SQL Operations team knew that they needed to address the challenges they were facing
in stabilizing their replication environments from a few different angles. First, they needed
well-designed applications from a SQL Server replication configuration standpoint. Second,
they needed to be able to support those applications with proactive monitoring, so that when
issues began to develop, they would be alerted and be able to resolve them before they
progressed to an outage. Finally, detailed documentation was needed to provide all the
teams receiving replication alerts the information required to resolve the problems.
SOLUTION
To enhance the stability and performance of replication environments at Microsoft, the
SQL Operations team developed a three-pronged approach. The activities included:
• Creation and publication of design guidance and best practices to help developers
design transactional replication for their applications, employing strategies that promote
robustness and performance.
• Development of a monitoring solution that could track replication latency and provide
proactive alerts to incident teams before a complete replication system breakdown.
• Development of a detailed and comprehensive transactional replication troubleshooting
guide (TSG) that would help database administrators (DBAs) consistently and
conveniently resolve known replication issues.
Creating Best Practices and Design Guidance for Transactional
Replication Environments
Business application developers design applications for Microsoft IT. They may choose to
use SQL Server replication for reporting, availability, or other data-duplication business
needs. These application servers are then on-boarded with the SQL Operations team, which
monitors the health and performance of those servers. When issues arise, the team engages
with application owners to refine their configurations and reduce the chance of the same
problems recurring. Over time, these engagements have helped the team better understand
how replication behaves in different scenarios and how it can be designed and configured to
mitigate the issues seen most frequently.
The SQL Operations team consolidated the information accumulated from four sources: their
experience supporting replication environments, their interactions with application
developers, product information dispersed across multiple sources, and consultations with
technical support engineers who provide product support to Microsoft customers. The
resulting development guidance and best practices are presented in detail in the section “Best
Practices and Design Guidance,” later in this document.
Building a Transactional Replication Latency Monitoring Tool
SQL Server Replication Monitor, a graphical tool that comes with the product, can be used to
view transactional replication latency. In addition, transactional replication provides the tracer
token feature, which offers a convenient way to measure latency in transactional replication
topologies and to validate the connections among the publisher, distributor, and subscriber
with the help of system-supplied stored procedures. A token (a small amount of data) is
written to the transaction log of the publication database, marked as though it were a typical
replicated transaction, and then sent through the system. This process allows a calculation of
the amount of time that elapses between:
• A transaction being committed at the publisher and the corresponding command being
inserted in the distribution database at the distributor.
• A command being inserted in the distribution database and the corresponding
transaction being committed at a subscriber.
From these calculations, it can be determined:
• Which subscribers are taking the longest time to receive a change from the publisher.
• Which of the subscribers expected to receive the tracer token have not received it.
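The tracer token measurement described above can be tried by hand with the system stored procedures sp_posttracertoken and sp_helptracertokenhistory. The following is a minimal sketch, not the monitoring tool's actual code. It assumes the script is run at the publisher in the publication database, and the publication name and the 30-second wait are placeholders chosen for illustration.

```sql
-- Run at the publisher, in the publication database.
-- <Publication_Name> is a placeholder; replace it before running.
DECLARE @publication SYSNAME;
DECLARE @tokenId INT;
SET @publication = N'<Publication_Name>';

-- Write a tracer token into the transaction log of the publication database.
EXEC sys.sp_posttracertoken
     @publication     = @publication,
     @tracer_token_id = @tokenId OUTPUT;

-- Arbitrary wait (an assumption) so the Log Reader and Distribution agents
-- have a chance to move the token through the topology.
WAITFOR DELAY '00:00:30';

-- One row per subscription: distributor_latency, subscriber, subscriber_db,
-- subscriber_latency, and overall_latency, all in seconds.
-- A NULL latency means the token has not reached that stage yet.
EXEC sys.sp_helptracertokenhistory
     @publication = @publication,
     @tracer_id   = @tokenId;
```

A NULL distributor_latency points toward the Log Reader Agent, while a NULL subscriber_latency with a populated distributor_latency points toward the Distribution Agent, which matches the two intervals listed above.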
The SQL Operations team developed a transactional replication latency monitoring tool using
Transact-SQL (built-in stored procedures and custom code), for maximum design flexibility.
This tool uses self-discovery to determine the number of transactional replication publications
and subscriptions in an environment. It allows DBAs to specify a custom replication latency
threshold for each transactional replication subscription in an environment. The default
latency threshold is two hours (120 minutes).
The tool uses tracer tokens to monitor replication latency on production servers. Two types of
tracer tokens are used: initial tokens—used for discovering the replication topology once per
day—and regular tokens—used to check replication latency at more frequent intervals. By
default, regular tokens are inserted in the publication database every 15 minutes. The system
checks to see whether the inserted tokens are able to make their way to the subscription
databases within the designated latency thresholds. An alert is raised through Microsoft
System Center Operations Manager or database mail if a token fails to reach the subscription
database within the configured latency threshold, as illustrated in the following sample alert:
Error 60201. Transactional replication latency threshold exceeded. Publication
Server: Publication_Server_Name. Publication [AdventureWorks_Db]:
Pub_AdventureWorks. Subscriber:
[Subscription_Server_Name].[AdventureWorks_Db]. Current Overall Latency: 125
min(s). Threshold: 120 min(s). Distributor Latency: "Null" sec(s). Subscriber
Latency: "Null" sec(s). The last inserted tracer token has not made it to the
subscriber yet. In fact, it has not even made it to the Distribution db. Please check
the replication LogReader SqlServerAgent job.
This tool is executed via a custom SQL Server Agent job called
_SQL_TranReplicationLatencyMonitor, which is created when the SQL Server replication
latency monitoring tool installation script is run. This job contains three steps that perform the
three major sub-functions of the tool, as described in Table 1.
Table 1. Steps of the _SQL_TranReplicationLatencyMonitor Job

Step 1
Job step function: Replication Topology Discovery
Frequency: Once per day
Activity: The following are discovered:
• Publisher name
• Publication database name(s)
• Publication name(s)
• Distribution server name
• Distribution database name
• Subscriber name(s)
• Subscription database name(s)
• Default replication latency thresholds are assigned for each subscription

Step 2
Job step function: Replication Latency Check
Frequency: Every 15 minutes
Activity:
• A check is performed to determine whether overall replication latency is equal to, or
greater than, the specified threshold for each subscription.
• Replication latency is monitored at the subscription level. The system assigns default
replication latency thresholds, but these can be customized for each subscription, as
needed.

Step 3
Job step function: Alerting
Frequency: A check is performed every 15 minutes, but an alert is only raised if the same
alert for the same publication–subscription pair has not been raised within the past 12 hours.
Activity:
• If replication latency is over the threshold for any of the subscriptions and a replication
alert for the same publication–subscription pair has not been raised within the past
12 hours, a replication latency error will be raised. This error can be seen in the
SQL Server Error log as well as in the Application Event Viewer log.
• If the replication latency check determines that an alert condition exists, an alert will be
raised via System Center Operations Manager, database mail, or both. Alert suppression
of 12 hours is built into the tool.
• Customers have the option of receiving email alerts directly from this tool via database
mail. This alternate method for receiving alerts can protect against System Center
Operations Manager or similar monitoring tool issues.
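The threshold check and 12-hour alert suppression described in Table 1 can be sketched in Transact-SQL as follows. This is an illustrative sketch under stated assumptions, not the tool's implementation. The variable values and the @AlertLog table are hypothetical stand-ins for data the tool would read from tracer token history and from its own permanent tables.

```sql
-- Hypothetical inputs; the real tool would derive these from tracer token
-- history and a per-subscription threshold table.
DECLARE @Publication      SYSNAME;
DECLARE @Subscriber       SYSNAME;
DECLARE @LatencyMinutes   INT;
DECLARE @ThresholdMinutes INT;
SET @Publication      = N'Pub_AdventureWorks';
SET @Subscriber       = N'Subscription_Server_Name';
SET @LatencyMinutes   = 125;
SET @ThresholdMinutes = 120;   -- default threshold named in this paper

-- Hypothetical alert history. A table variable is used only to keep the sketch
-- self-contained; suppression across job runs needs a permanent table.
DECLARE @AlertLog TABLE (
    Publication SYSNAME  NOT NULL,
    Subscriber  SYSNAME  NOT NULL,
    RaisedAt    DATETIME NOT NULL
);

-- Alert only when latency is at or over the threshold AND no alert for the
-- same publication-subscription pair was raised in the past 12 hours.
IF @LatencyMinutes >= @ThresholdMinutes
   AND NOT EXISTS (SELECT 1
                   FROM @AlertLog
                   WHERE Publication = @Publication
                     AND Subscriber  = @Subscriber
                     AND RaisedAt    > DATEADD(HOUR, -12, GETDATE()))
BEGIN
    INSERT INTO @AlertLog (Publication, Subscriber, RaisedAt)
    VALUES (@Publication, @Subscriber, GETDATE());

    -- WITH LOG writes the message to the SQL Server error log and the Windows
    -- Application log; it requires sysadmin membership or ALTER TRACE permission.
    -- An ad hoc message is raised as error 50000; a fixed number such as 60201
    -- would first have to be registered with sp_addmessage.
    RAISERROR(N'Transactional replication latency threshold exceeded. Publication: %s. Subscriber: %s. Current Overall Latency: %d min(s). Threshold: %d min(s).',
              16, 1, @Publication, @Subscriber, @LatencyMinutes, @ThresholdMinutes) WITH LOG;
END
```

Writing the alert to the error log lets any monitoring product that already watches that log pick it up, which fits the paper's description of alerts surfacing through System Center Operations Manager.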
Benefits of Proactive Monitoring
• Most issues are resolved before they cause business impact. Proactive alerts are
raised when subscriptions fall behind enough that their latency thresholds are crossed.
In most cases, DBAs are able to resolve the issues causing latency long before latency
becomes serious enough to affect the business.
• Fewer instances of reinitialization. Proactive monitoring reduces the number of
replication latency issues that eventually require reinitialization and the extended
downtime that might be needed for them. Again, most issues are resolved well before
they reach a stage where reinitialization is the only option for resolution.
• Users become more productive. Higher availability because of fewer and faster
resolved issues directly translates to users having more time to get their work done.
• Better service level agreement (SLA) management for application owners.
Business application owners can better manage their SLAs with their customers,
because they can be immediately informed when latency thresholds are crossed. They
can also analyze their historical data as to how many times latency thresholds were
crossed within a specific period of time (for example, during a given month or quarter)
to help them identify latency patterns or trends.
Note: More information about the transactional replication latency monitoring tool, including
installation instructions and the Transact-SQL installation scripts, can be downloaded at
http://gallery.technet.microsoft.com/scriptcenter/SQL-Server-Transactional-e34ed1e8.
Developing a Transactional Replication Troubleshooting Guide
TSGs provide repeatable troubleshooting steps and resolutions for known issues that support
teams can use. Well-written and comprehensive TSGs can greatly increase the knowledge
and comfort level of teams supporting a particular technology. The SQL Operations team
documented the various transactional replication issues for the Snapshot, Log Reader, and
Distribution Agents that they came across in their environments. After they put together a
comprehensive TSG to support the replication environments at Microsoft, the support teams
were better able to consistently and confidently resolve issues as they arose.
Note: Transactional replication issues that the SQL Operations team came across in
Microsoft IT, along with potential solutions for them, can be seen in Appendix A of this
document. This by no means represents a comprehensive list of all possible issues that one
can encounter in transactional replication; however, the list is a great starting point for a new
TSG. So, if your organization needs to create a new transactional replication TSG, you can
start with the list from Appendix A; then, as and when you come across issues specific to
your environment, add to the list as appropriate.
BEST PRACTICES AND DESIGN GUIDANCE
This section provides details of the transactional replication best practices and design
guidance that the SQL Operations team came up with for Microsoft IT applications. Following
the applicable guidelines has helped improve the stability and performance of the replication
environments within Microsoft IT. These guidelines have also helped improve customer
satisfaction and reduce support incidents.
1. Implement the Right Reinitialization Process
In transactional replication, reinitialization is the process of synchronizing the subscriber
with the publisher from scratch. After this initial data synchronization succeeds, only
changes are moved from the publisher to the subscriber. Business applications should have a
clear, detailed, thoroughly tested, and well-documented process for the reinitialization method
to be used. For large databases (those with, say, more than 500 GB of data), the
reinitialization process can frequently take hours or days. Several options are available, and
the choice can directly affect performance:
• Reinitialization by running the replication Snapshot Agent.
• Reinitialization by using the Replication Support Only option with sp_addsubscription.
• Reinitialization by using the Initialize with backup option with sp_addsubscription.
Often, the process of reinitialization is not given careful thought. It is important for application
teams to test and develop a reliable and repeatable reinitialization process that works well
and serves the business needs. Many scenarios may warrant reinitialization; here are a few
of the common ones:
• The server operating system, SQL Server, or the business application needs to be
upgraded.
• Replication has run into errors, and the subscriber has fallen too far behind to ever be
able to catch up.
• The subscription has been marked as inactive.
Reinitialization by Running the Replication Snapshot Agent
This method is generally useful for databases smaller than 300 GB, although there may be
times when this is the only reinitialization option—for example, when a subscriber database
needs to have only a subset of the tables or data present in the publication database. Using
the replication Snapshot Agent, all of the subscriptions or just a single subscription can be
reinitialized. To reinitialize all subscriptions for a particular publication, right-click the
replication publication, and then click Reinitialize All Subscriptions, as shown in Figure 1.
Figure 1. Reinitializing all subscriptions
However, in many cases, only one subscription may need to be reinitialized. As illustrated in
Figure 2, right-click the replication subscription, and then click Reinitialize.
Figure 2. Reinitializing a single subscription
Next, select the appropriate options in the Reinitialize Subscription(s) dialog box, shown in
Figure 3. After clicking Mark for Reinitialization, run the replication Snapshot Agent to
create the schema files for the published articles, index script files, and other objects. This
process can sometimes be difficult for production systems, because shared locks are placed
on tables during the initial part of the concurrent snapshot-generation process, and some
blocking can occur. Next, all of the data from the published tables is bulk-copied using the
bcp utility and stored in a replication snapshot folder. Finally, the Distribution Agent takes all
of the object-creation scripts and the bcp data files from the snapshot folder and applies
them to the subscriber.
Figure 3. The Reinitialize Subscription(s) dialog box
Depending on the size of the database as well as the quality and speed of the link between
the publisher and the subscriber, this method of reinitialization can take a long time—hours or
even days. This is another reason why this process must be well tested and properly
documented, so that business teams and customers have the right expectations regarding
how long the subscriber database might be offline.
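The same reinitialization can also be scripted instead of driven through the dialog boxes, which makes the process easier to test and repeat. The sketch below assumes a push subscription and uses the system stored procedures sp_reinitsubscription and sp_startpublication_snapshot. The names in angle brackets are placeholders, not values from a real environment.

```sql
-- Run at the publisher, in the publication database.
-- Replace the placeholders in angle brackets before running.

-- Mark a single subscription for reinitialization.
-- Passing N'all' for @subscriber marks every subscription to the publication.
EXEC sp_reinitsubscription
     @publication    = N'<Publication_Name>',
     @article        = N'all',
     @subscriber     = N'<Subscription_Server_Name>',
     @destination_db = N'<Subscription_Database_Name>';

-- Start the Snapshot Agent job so that a new snapshot is generated for the
-- Distribution Agent to apply at the subscriber.
EXEC sp_startpublication_snapshot
     @publication = N'<Publication_Name>';
```

Keeping this script alongside the other replication scripts gives the team a documented, repeatable path whose duration can be measured in a test environment.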
Note: For more information about reinitialization by running the replication Snapshot Agent
and for tips on ensuring that the snapshot process is as efficient as possible, visit
http://blogs.msdn.com/b/repltalk/archive/2010/03/07/tips-to-improve-performance-when-applying-snapshot-in-transactional-replication.aspx.
Reinitialization by Using the Replication Support Only Option with
sp_addsubscription
If the subscriber database has already been synchronized with the publication database
through a backup, copy, restore process or through log shipping or database mirroring, there
may be no need to run the costly Snapshot Agent. In this case, support teams can use the
Replication Support Only option for reinitialization. To use this option, when creating the
replication subscription with the stored procedure sp_addsubscription, provide the value
Replication Support Only for the @sync_type parameter.
For this process to be successful, the publisher and the subscriber should be in sync with
each other, and no new transactions should be coming into the publisher while the
subscription is being set up. After the sp_addsubscription script has run successfully, new
data can then start coming into the publisher. The Log Reader and Distribution agents will
begin to work right away, without the Snapshot Agent ever coming into play.
Here is a sample invocation of the sp_addsubscription procedure:
exec sp_addsubscription
@publication = N'<Publication_Name>',
@subscriber = N'<Subscription_Server_Name>',
@destination_db = N'<Subscription_Database_Name>',
@subscription_type = N'Push',
@sync_type = N'Replication Support Only',
-- default is 'automatic', which means run the snapshot agent.
@article = N'all',
@update_mode = N'read only',
@subscriber_type = 0
Reinitialization by Using the Initialize with Backup Option with
sp_addsubscription
This option is useful for large databases, where the subscriber database contains most of the
same data as the publication database. If the subscriber has only a subset of the tables or a
subset of the data in the tables that are in the publisher, it may be difficult to use this method.
One benefit of using this approach is that when replication is set up, new data can still be
written into the publisher while setting up the subscription. Compared to the Replication
Support Only method of reinitialization, no publication server downtime is required. The
database on the subscriber can be restored in no-recovery mode, and then differential or log
backups can be restored on top. When all the restores are finished, the database can be
recovered. Replication will start from after the last restored log sequence number.
To use this method, complete the following steps:
1. Use the backup, copy, and restore process to restore a full database backup on the
subscription SQL Server instance. Do not recover the database.
2. Restore differential or transaction log backups to make the subscription database as
current as possible, and then recover the database.
3. Run the following sample script on the publisher server to set up the Initialize with
Backup subscription:
USE <Published_Database_Name>    -- Specify the published db name.
GO
DECLARE
     @PublicationNameVar NVARCHAR(200)
    ,@SubscriberVar      SYSNAME
    ,@SubDatabaseVar     SYSNAME
    ,@BackupDevicePath   NVARCHAR(200)

IF @@SERVERNAME = N'<Publication_Server_Name>'    -- Publication Server Name.
BEGIN
    SET @PublicationNameVar = N'<Publication_Name>'           -- Publication Name
    SET @SubDatabaseVar     = N'<Subscription_Db>'            -- Subscriber DB
    SET @SubscriberVar      = N'<Subscription_Server_Name>'   -- Subscriber server
    SET @BackupDevicePath   = N'<E:\MSSQL\BAK\.....>'
    -- Path of the last restored differential or log backup file.
END

exec sp_addsubscription
    @publication = @PublicationNameVar,
    @subscriber = @SubscriberVar,
    @destination_db = @SubDatabaseVar,
    @sync_type = N'Initialize with Backup',
    @backupdevicetype = N'disk',
    @backupdevicename = @BackupDevicePath,
    @subscription_type = N'Push',
    @update_mode = N'Read Only'
GO
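Steps 1 and 2 above, together with the publication setting that initialization from a backup depends on, can be sketched as follows. This is a hedged outline rather than a complete procedure. The backup paths and names are placeholders, the two parts run on different servers, and the assumption is that the publication option allow_initialize_from_backup is enabled before the backups used here are taken.

```sql
-- Part 1: at the publisher, in the publication database.
-- Allow subscriptions to this publication to be initialized from a backup.
EXEC sp_changepublication
     @publication = N'<Publication_Name>',
     @property    = N'allow_initialize_from_backup',
     @value       = N'true';

-- Part 2: at the subscriber. Paths are placeholders; WITH MOVE clauses may be
-- needed if the file layout differs from the publisher.

-- Step 1: restore the full backup but leave the database unrecovered.
RESTORE DATABASE [<Subscription_Database_Name>]
    FROM DISK = N'<Path_To_Full_Backup>'
    WITH NORECOVERY;

-- Step 2: restore the most recent log backup and recover the database.
-- Any earlier differential or log backups would be restored WITH NORECOVERY first.
RESTORE LOG [<Subscription_Database_Name>]
    FROM DISK = N'<Path_To_Last_Log_Backup>'
    WITH RECOVERY;
```

The file restored last in Part 2 is the one whose path is passed to @backupdevicename in the sample script, in line with the comment in that script.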
Post-reinitialization Considerations
Regardless of the method of reinitialization used for an environment, there might still be
additional considerations:
• Depending on how the subscription database is used, it may require a different set of
indexes. If this is the case, the appropriate index-creation script should be run in the
subscription database; otherwise, potential performance issues may occur while using
the subscription database.
• Any triggers, foreign key constraints, or check constraints present in the tables in the
subscription database can slow down replication. If these objects are already present in
the publication database, check to see whether the presence of these objects in the
subscription database will add any business value. If the answer is no, then either define
these objects with the Not for Replication property in the publication database or simply
disable them on the subscriber.
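The two alternatives in the last consideration can be sketched as follows. All object names (dbo.Orders, dbo.Customers, FK_Orders_Customers, trg_Orders_Audit) are hypothetical and exist only for illustration; the options are alternatives, not steps to run together.

```sql
-- Option A: define the constraint with NOT FOR REPLICATION so that it is not
-- enforced when a replication agent applies changes.
-- Hypothetical tables and columns.
ALTER TABLE dbo.Orders WITH CHECK
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId)
    NOT FOR REPLICATION;

-- Option B: at the subscriber, disable objects that add no business value there.
-- Stop enforcing an existing foreign key or check constraint:
ALTER TABLE dbo.Orders NOCHECK CONSTRAINT FK_Orders_Customers;

-- Disable an existing trigger:
DISABLE TRIGGER dbo.trg_Orders_Audit ON dbo.Orders;
```

Option B is easy to reverse with CHECK CONSTRAINT and ENABLE TRIGGER, which can matter if the subscriber is later promoted to take application writes.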
2. Save Replication Scripts
All production applications should have scripts available to set up replication. These scripts
make it easier to repair replication if it breaks and to set it up again during upgrades.
Replication scripts should be kept in source control and in the team's collaboration
environment, for the benefit of both development and operations teams.
Scripts should be developed such that individual replication components can be dropped and
recreated separately. For example, if there are two publications and three subscriptions, the
replication script should clearly call out which part of the script creates the first publication,
which creates the first subscription and so forth. The idea is that if any individual part of
replication (for example, a particular publication) needs to be dropped and re-created, the
fastest, easiest, and most convenient way of doing so should be available.
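One way to lay out such a script is to give each component its own clearly labeled, independently runnable section. The fragment below is a sketch of the sections for a single subscription only; the section banners are an assumed convention, the names in angle brackets are placeholders, and a complete script would also contain the publication, article, and agent job sections.

```sql
-- =====================================================================
-- SECTION: Publication 1 / Subscription 1 - DROP
-- Run at the publisher, in the publication database.
-- =====================================================================
EXEC sp_dropsubscription
     @publication    = N'<Publication_Name>',
     @article        = N'all',
     @subscriber     = N'<Subscription_Server_Name>',
     @destination_db = N'<Subscription_Database_Name>';
GO

-- =====================================================================
-- SECTION: Publication 1 / Subscription 1 - CREATE
-- Run at the publisher, in the publication database.
-- A push subscription normally also needs its Distribution Agent job
-- created (sp_addpushsubscription_agent); that call is omitted here.
-- =====================================================================
EXEC sp_addsubscription
     @publication       = N'<Publication_Name>',
     @subscriber        = N'<Subscription_Server_Name>',
     @destination_db    = N'<Subscription_Database_Name>',
     @subscription_type = N'Push',
     @sync_type         = N'automatic',
     @article           = N'all',
     @update_mode       = N'read only';
GO
```

Because each section ends with GO and depends only on its placeholders, a DBA can select and run one section without touching the rest of the topology.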
3. Know Your Replication Workload
Latency issues can often be seen when a large volume of transactions or commands is in the
process of going from the publisher to the subscriber. The distribution database can be
queried to see the volume of data being replicated. It is good to understand and analyze this
type of data. Knowledge about replication usage peaks and valleys can help provide
workload insights that may in turn help identify trends and potential opportunities for
improving replication performance. Here are some examples:
• Replication falls behind whenever a certain number of commands come in per hour that
need to be replicated.
• Replication falls behind on the weekends during the weekly index rebuild or when other
maintenance jobs are running.
• It might be possible to improve performance by using a dedicated distribution server or
by refining Replication Agent parameters, as discussed later in this section.
• Certain maintenance jobs like backups may perform better when the replication load is
light.
Note: For more information about understanding replication workloads and solutions to
address periods of high latency related to high volumes of data, visit
http://blogs.msdn.com/b/repltalk/archive/2010/10/20/determine-transactional-replication-workload-to-help-resolve-data-latency.aspx.
The following is an example of a distribution database script that can be used to see the
volume of pending replication commands broken up by hour. Note that data in the distribution
database does not persist forever, because it is purged periodically by the Distribution
Clean up job. In Microsoft IT, while studying a particular replication environment, typically a
permanent table is created, and then the output of this script is saved into that table on a
periodic basis. This way, the team has data encompassing a longer period of time for
analysis purposes.
/* Display the number of pending replication commands, broken up by the
   hour. Run this script in the distribution database. */
IF OBJECT_ID('tempdb..#Results') IS NOT NULL
    DROP TABLE #Results;

-- Count the commands in each transaction still held in the distribution database.
SELECT t.publisher_database_id, t.xact_seqno,
       MAX(t.entry_time) AS EntryTime, COUNT(c.xact_seqno) AS CommandCount
INTO #Results
FROM MSrepl_commands c WITH (NOLOCK)
LEFT JOIN MSrepl_transactions t WITH (NOLOCK)
    ON t.publisher_database_id = c.publisher_database_id
   AND t.xact_seqno = c.xact_seqno
GROUP BY t.publisher_database_id, t.xact_seqno;

-- Roll the per-transaction counts up by published database and hour.
SELECT MPD.publisher_db
    ,DATEPART(year, R.EntryTime) AS Year
    ,DATEPART(month, R.EntryTime) AS Month
    ,DATEPART(day, R.EntryTime) AS Day
    ,DATEPART(hh, R.EntryTime) AS Hour
    ,SUM(R.CommandCount) AS CommandCountPerTimeUnit
FROM #Results R INNER JOIN MSpublisher_databases MPD
    ON R.publisher_database_id = MPD.id
GROUP BY MPD.publisher_db
    ,DATEPART(year, R.EntryTime)
    ,DATEPART(month, R.EntryTime)
    ,DATEPART(day, R.EntryTime)
    ,DATEPART(hh, R.EntryTime)
-- Sort by year as well, so that results spanning a year boundary stay in order.
ORDER BY MPD.publisher_db, 2, 3, 4, 5;
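The practice of saving the hourly counts into a permanent table might be sketched as follows. This is a minimal sketch, not the script Microsoft IT used: the table name dbo.ReplWorkloadHistory, its columns, and the idea of running the INSERT from a scheduled job are assumptions made for illustration. It also assumes that the distribution database is named distribution, and it uses only the system tables that the script above already queries.

```sql
-- Hypothetical history table; the name and columns are illustrative assumptions.
-- Create it in a DBA or utility database rather than in the distribution database.
IF OBJECT_ID('dbo.ReplWorkloadHistory') IS NULL
    CREATE TABLE dbo.ReplWorkloadHistory
    (
        CaptureTime  datetime NOT NULL DEFAULT (GETDATE()),
        PublisherDb  sysname  NOT NULL,
        EntryHour    datetime NOT NULL,
        CommandCount bigint   NOT NULL
    );

-- Snapshot the per-hour command counts currently held in the distribution database.
INSERT INTO dbo.ReplWorkloadHistory (PublisherDb, EntryHour, CommandCount)
SELECT  mpd.publisher_db,
        DATEADD(hour, DATEDIFF(hour, 0, t.entry_time), 0) AS EntryHour,  -- truncate to the hour
        COUNT_BIG(*) AS CommandCount
FROM    distribution.dbo.MSrepl_commands AS c WITH (NOLOCK)
JOIN    distribution.dbo.MSrepl_transactions AS t WITH (NOLOCK)
          ON  t.publisher_database_id = c.publisher_database_id
          AND t.xact_seqno = c.xact_seqno
JOIN    distribution.dbo.MSpublisher_databases AS mpd
          ON  mpd.id = c.publisher_database_id
GROUP BY mpd.publisher_db,
         DATEADD(hour, DATEDIFF(hour, 0, t.entry_time), 0);
```

Because successive captures overlap until the cleanup job purges older rows, the same EntryHour can appear under several CaptureTime values. When analyzing the history, a reasonable choice is to take the largest CommandCount recorded for each hour.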
4. See More of What Replication Is Doing
For more visibility into the activity and possible errors stemming from the Log Reader and
Distribution agents, run them with the –HistoryVerboseLevel parameter with a value of 2.
The default is 1.
Note: For more information about using the verbose history agent profile, visit
http://blogs.msdn.com/b/repltalk/archive/2010/07/13/using-verbose-history-agent-profile-while-troubleshooting-replication.aspx.
Figure 4 and Figure 5 illustrate the difference in detail of Distribution Agent activity between
–HistoryVerboseLevel 1 (the default) and –HistoryVerboseLevel 2. Note how much more
detail and activity history is available with –HistoryVerboseLevel 2. In addition to greater
detail, every 5 minutes, new Distribution Agent read and write performance statistics are
made available through this interface. These statistics can be instrumental in identifying
whether the distribution reader or writer thread is experiencing a performance problem. It is
recommended that –HistoryVerboseLevel 2 be used while studying a replication
environment. Then, when the study is complete, revert to the default value of 1 to reduce the
overhead. One valuable insight from this output is the possible identification of large
transactions with a great number of commands in them. If this is the case, then some
optimization options may be available (e.g., the –MaxCmdsInTran parameter, discussed
later on in this section).
Figure 4. –HistoryVerboseLevel 1
Figure 5. –HistoryVerboseLevel 2
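The same agent history that Replication Monitor displays can also be read directly from the distribution database. The query below is a minimal sketch under the assumption that the distribution database is named distribution; it reads the MSdistribution_history and MSdistribution_agents system tables, and the TOP (50) limit is an arbitrary choice.

```sql
-- Run at the distributor. Most recent Distribution Agent history rows, newest first.
SELECT TOP (50)
        a.name      AS AgentName,
        h.[time]    AS LoggedAt,
        h.delivered_commands,
        h.delivery_rate,
        h.delivery_latency,
        h.comments  -- the message text shown in the agent history view
FROM    distribution.dbo.MSdistribution_history AS h WITH (NOLOCK)
JOIN    distribution.dbo.MSdistribution_agents  AS a
          ON a.id = h.agent_id
ORDER BY h.[time] DESC;
```

With the higher verbosity level in effect, more rows accumulate in this table, which is one reason to revert to the default once the study is complete.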
5. Consider Using a Dedicated Distribution SQL Server Instance
For systems with a heavy load on the publisher, a dedicated distribution server can be used
to reduce some of the replication processing overhead, thereby allowing the publisher to
perform better. If the publisher is also used as the distribution server, the Snapshot and Log
Reader agents will run on the publisher. In this scenario, using push subscriptions, the
Distribution Agent also runs on the publisher; while in the case of pull subscriptions, the
Distribution Agent runs on the subscriber. With a dedicated distribution SQL Server instance,
all replication agents that would typically have run on the publisher run on the distributor.
If the number of commands needing to be replicated per hour is 100,000 or more, an
organization might be said to have a busy replication environment and should consider using
a separate distribution SQL Server instance. The script introduced in the section, “Know Your
Replication Workload,” can be used to see the number of pending replication commands
broken up by hour. Support teams should perform thorough testing to verify that adding a
dedicated distribution SQL Server instance will help improve performance.
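To check an environment against the 100,000 commands per hour guideline, the hourly counts can be filtered directly. The query below is a sketch rather than a definitive test: it assumes the distribution database is named distribution, and it only sees commands that the cleanup job has not yet purged, so it may understate past activity.

```sql
-- Run at the distributor. Lists the hours that meet or exceed the "busy" guideline.
DECLARE @BusyThreshold int = 100000;  -- commands per hour, from the guideline above

SELECT  mpd.publisher_db,
        DATEADD(hour, DATEDIFF(hour, 0, t.entry_time), 0) AS EntryHour,
        COUNT_BIG(*) AS CommandCount
FROM    distribution.dbo.MSrepl_commands AS c WITH (NOLOCK)
JOIN    distribution.dbo.MSrepl_transactions AS t WITH (NOLOCK)
          ON  t.publisher_database_id = c.publisher_database_id
          AND t.xact_seqno = c.xact_seqno
JOIN    distribution.dbo.MSpublisher_databases AS mpd
          ON  mpd.id = c.publisher_database_id
GROUP BY mpd.publisher_db,
         DATEADD(hour, DATEDIFF(hour, 0, t.entry_time), 0)
HAVING  COUNT_BIG(*) >= @BusyThreshold
ORDER BY EntryHour;
```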
6. Break Up Large Transactions into Smaller Ones
For applications experiencing performance challenges resulting from large transactions,
breaking down the transactions into smaller ones can help improve performance. Microsoft
recommends that as a first preference, the application development team modify the
business application in question to use smaller transactions. Because it is not always
possible to go back and modify an existing production application, the next best option is to
break up large transactions within transactional replication by using the –MaxCmdsInTran
parameter for the Log Reader Agent. This parameter has a downstream effect and helps the
Distribution Agent stay current, because now the Distribution Agent does not have to wait
until the Log Reader Agent is finished writing huge transactions into the distribution database.
Instead, the Distribution Agent can start distributing commands as soon as the smaller
transaction (created within replication with the help of the –MaxCmdsInTran parameter) is
written to the distribution database. This way, the Distribution Agent has to wait less and can
process more.
In Microsoft IT’s replication environment, there have been applications where the number of
commands per transaction would sometimes be 100,000 or more, sometimes even more
than 500,000. In several of those cases, –MaxCmdsInTran was used to specify smaller
transaction sizes between 2,000 and 5,000, and performance benefits were observed.
Note: For more information about enhancing transactional replication performance settings,
visit http://msdn.microsoft.com/en-us/library/ms151762.aspx
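Before choosing a –MaxCmdsInTran value, it can help to confirm that large transactions are actually present. The following query is a minimal sketch that lists the transactions with the most commands currently held in the distribution database. It assumes the distribution database is named distribution, and the TOP (20) limit is arbitrary.

```sql
-- Run at the distributor. Largest transactions by command count.
SELECT TOP (20)
        mpd.publisher_db,
        c.xact_seqno,
        COUNT_BIG(*) AS CommandCount
FROM    distribution.dbo.MSrepl_commands AS c WITH (NOLOCK)
JOIN    distribution.dbo.MSpublisher_databases AS mpd
          ON mpd.id = c.publisher_database_id
GROUP BY mpd.publisher_db, c.xact_seqno
ORDER BY CommandCount DESC;

-- If transactions with hundreds of thousands of commands appear, one option is to
-- append a value in the range described above to the Log Reader Agent job step, e.g.:
--   -MaxCmdsInTran 5000
```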
7. Open Multiple Channels of Communication to the Subscriber
In some cases, the Distribution Agent may struggle to keep up with the volume of data that it
needs to distribute. To help speed up the Distribution Agent, use multiple parallel
communication channels between the distributor and the subscriber by using the
–SubscriptionStreams parameter. In many cases, doing so can greatly improve the
performance of the Distribution Agent. A good number to start with is the number of
processors on the subscription SQL Server instance. For example, if the number of
processors on the subscriber is eight, start by trying –SubscriptionStreams 8. Try this
setting as a baseline, then tune the value of –SubscriptionStreams up or down to see
where the best possible performance occurs.
Note: For more information about navigating the SQL Server replication
SubscriptionStreams setting, visit
http://blogs.msdn.com/b/repltalk/archive/2010/03/01/navigating-sql-replication-subscriptionstreams-setting.aspx.
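The starting value described above can be looked up rather than guessed. The sketch below reads the processor count at the subscriber and then lists the current Distribution Agent job step commands, so that the parameter can be appended to the right step. The comment at the end shows the eight-processor example from the text; nothing here changes any setting.

```sql
-- Run at the subscriber: logical processor count, used as the starting value.
SELECT cpu_count FROM sys.dm_os_sys_info;

-- Run on the server that hosts the Distribution Agent job (the distributor for
-- push subscriptions): show the current agent command line for each job.
SELECT  j.name AS JobName,
        s.step_id,
        s.command
FROM    msdb.dbo.sysjobs     AS j
JOIN    msdb.dbo.sysjobsteps AS s
          ON s.job_id = j.job_id
WHERE   s.subsystem = N'Distribution';

-- For a subscriber with eight processors, the baseline would be to append the
-- following to the command shown above, then tune up or down from there:
--   -SubscriptionStreams 8
```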
8. Evaluate Stored Procedure Execution Replication
Typically, whatever data-modification statements (inserts, updates, and deletes) are run on
the publisher, the same are also run on the subscriber, one by one. All the statements have
to be sent over the network and applied on the subscriber. In some cases, it might be more
efficient to make those modifications on the publisher with the help of stored procedures, and
then publish the stored procedure executions so that the same stored procedures are also
executed on the subscriber.
This practice can greatly reduce network traffic. However, stored procedure replication may
not be appropriate for all applications. If an article is filtered horizontally, so that there are
different sets of rows at the publisher than at the subscriber, executing the same stored
procedure on both sides will return different results. Similarly, if an update is based on a
subquery of another, non-replicated table, executing the same stored procedure at both the
publisher and the subscriber will return different results. SQL Server Books Online has a
good discussion of some of the available choices while using stored procedure replication.
Note: For more information about publishing stored procedure execution in transactional
replication, visit http://msdn.microsoft.com/en-us/library/ms152754.aspx and
http://msdn.microsoft.com/en-us/library/ms151168.aspx.
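Publishing the execution of a stored procedure is done by adding the procedure as an article with an execution article type. The call below is a minimal sketch: the publication name ProdDb_Pub and the procedure name usp_ArchiveOrders are hypothetical, and the choice of the serializable variant is an assumption that should be checked against the Books Online discussion referenced above.

```sql
-- Run at the publisher, in the publication database.
EXEC sp_addarticle
     @publication   = N'ProdDb_Pub',         -- hypothetical publication name
     @article       = N'usp_ArchiveOrders',  -- hypothetical article name
     @source_owner  = N'dbo',
     @source_object = N'usp_ArchiveOrders',  -- hypothetical stored procedure
     -- 'serializable proc exec' replicates the execution only when the procedure
     -- runs inside a serializable transaction; 'proc exec' replicates every execution.
     @type          = N'serializable proc exec';
```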
9. Replication Agent Jobs Standards
Microsoft recommends running the Log Reader and Distribution agents on a continuous
basis rather than on a schedule, so that the subscriptions are as current as possible. By
default, when replication agents are configured to run continuously, the Log Reader and
Distribution agents are configured to retry 2,147,483,647 times upon failure. This many
retries, with the default retry interval of once per minute, would take 4,085 years before the
agent would finally declare failure. It might be better to configure those retry attempts for a
more reasonable number—say, 60 or so, or whatever makes sense for the business. With
retries set to 60, Log Reader and Distribution agents would fail in 1 hour or more (depending
on how long each retry takes). The idea is that the agent should fail within a reasonable
amount of time and be able to generate an agent failure alert.
In Microsoft IT’s replication environment, the number of retries is not modified, because their
sophisticated transactional replication latency monitoring tool generates alerts when
subscriber databases fall behind publisher databases by more than the specified time
threshold.
Note: For more information, visit
http://blogs.msdn.com/b/repltalk/archive/2010/08/25/sql-replication-agent-will-retry-for-4085-years.aspx.
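The 4,085-year figure follows from simple arithmetic, and the retry settings live on the agent's SQL Server Agent job step. The sketch below shows both. The job name is hypothetical, and the step number is an assumption (the step that runs the agent is commonly the second step, but this should be verified for each job).

```sql
-- Default retries at one per minute, converted to years (roughly 4,085).
SELECT 2147483647 / (60.0 * 24 * 365) AS YearsOfRetries;

-- Limit the retries so that the agent fails, and can raise an alert, within about an hour.
EXEC msdb.dbo.sp_update_jobstep
     @job_name       = N'PUBSRV-ProdDb-1',  -- hypothetical agent job name
     @step_id        = 2,                   -- assumption: the step that runs the agent
     @retry_attempts = 60,
     @retry_interval = 1;                   -- minutes between retries
```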
10. Understand the Not for Replication Property
The presence of unnecessary triggers, foreign keys, and check constraints on subscription
database tables can cause a Distribution Agent bottleneck. As an example, if a business rule
implemented via a constraint has been validated at the publisher, it might not be useful to
perform the same validation at the subscriber. Such objects should be marked as Not for
Replication or, if this option is not available, disabled at the subscriber. This option is also
available for identity columns.
Note: For more information about the Not for Replication property, visit
http://blogs.msdn.com/b/repltalk/archive/2010/02/22/all-about-not-for-replication.aspx and
“Controlling Constraints, Identities and Triggers with ‘Not for Replication’” at
http://msdn.microsoft.com/en-us/library/ms152529(v=SQL.105).aspx.
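To find candidates for this option, the catalog views in the subscription database can be queried for triggers and constraints that are not yet marked Not for Replication. The following is a sketch only; whether a given object should be marked is a business decision. The table and constraint names in the second part are hypothetical.

```sql
-- Run in the subscription database: objects still enforced for replication agent changes.
SELECT 'trigger' AS ObjectType, OBJECT_NAME(parent_id) AS TableName, name
FROM   sys.triggers
WHERE  parent_class = 1              -- table and view triggers only
  AND  is_not_for_replication = 0
UNION ALL
SELECT 'foreign key', OBJECT_NAME(parent_object_id), name
FROM   sys.foreign_keys
WHERE  is_not_for_replication = 0
UNION ALL
SELECT 'check constraint', OBJECT_NAME(parent_object_id), name
FROM   sys.check_constraints
WHERE  is_not_for_replication = 0;

-- Re-creating a check constraint with the option (hypothetical table and constraint):
ALTER TABLE dbo.Orders DROP CONSTRAINT CK_Orders_Quantity;
ALTER TABLE dbo.Orders
    ADD CONSTRAINT CK_Orders_Quantity CHECK NOT FOR REPLICATION (Quantity > 0);
```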
11. Don’t Always Push
The Distribution Agent runs at the distributor for push subscriptions and at the subscriber for
pull subscriptions. Evaluate push versus pull subscriptions to determine which makes the most
sense for a specific application. In a local data center scenario, push subscriptions are
preferred because of their manageability. When it is important to offload work from the
distributor (for example, when there are many subscribers or the distribution server has
lower technical specifications), pull subscriptions are preferred.
In scenarios that involve geo-replication across a WAN, pull subscriptions have proved to be
more effective than push subscriptions, as detailed in the MSDN case study, “Replication
Performance Gains with Microsoft SQL Server 2008 running on Windows 2008” at
http://msdn.microsoft.com/en-us/library/dd263442(SQL.100).aspx.
12. Published Database Log File Considerations
As with any database, there should be only one log file for the published database. The
number of virtual log files (VLFs) should be as low as possible. To determine the number of
VLFs, run Dbcc LogInfo(Published_Db_Name) on the publisher SQL Server instance. The
number of rows returned is the number of VLFs the published database has. The greater the
number of VLFs, the slower the Log Reader Agent performance, because it will take more
time for the agent to scan through all the VLFs looking for transactions and commands
marked for replication that need to be moved to the distribution database. In general, for a
published database, having more than 10,000 VLFs can be detrimental for Log Reader Agent
performance. Such an excessive number typically occurs when the log file autogrow value is
configured for a small number. Depending on factors like server load and hardware,
applications can experience Log Reader Agent slowness even with far fewer VLFs. An
industry expert has even recommended keeping the number of VLFs to less than 50 for
optimization purposes. This may be difficult to achieve, but the idea is to aim for keeping this
number low.
To resolve problems caused by a high number of VLFs, first ensure that there is only one log
file. In the case of multiple log files, remove all but the first transaction log file. Next, shrink
that one log file (to reduce the number of VLFs), and then specify a reasonable initial size for
it. Finally, specify a reasonable autogrow value for this log file, so that repeated small
autogrow events do not drive the VLF count up again.
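The shrink-and-resize steps above might look like the following. This is a minimal sketch under stated assumptions: the database name ProdDb and the logical log file name ProdDb_log are hypothetical, and the 8 GB size and 1 GB growth increment are illustrative values, not recommendations from this paper. A log backup may be needed before the shrink can release space, and the work is best done in a quiet period.

```sql
USE ProdDb;  -- hypothetical published database

DBCC LOGINFO;  -- one row per VLF; note the row count before the change

-- Confirm there is a single log file and find its logical name and growth setting.
SELECT name, size / 128 AS SizeMB, growth, is_percent_growth
FROM   sys.database_files
WHERE  type_desc = 'LOG';

-- Shrink the log as far as the active portion allows (target size is in MB).
DBCC SHRINKFILE (N'ProdDb_log', 1);

-- Set a deliberate initial size and a fixed, non-percentage growth increment.
ALTER DATABASE ProdDb
MODIFY FILE (NAME = N'ProdDb_log', SIZE = 8GB, FILEGROWTH = 1GB);

DBCC LOGINFO;  -- the row count should now be much lower
```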
13. Size the Distribution Database Appropriately
Make sure the distribution database is sized properly and that the autogrow settings make
sense for a production database. All too often, these considerations are not part of the design
process. Every now and then, a production distribution database is observed where the lone
data file is configured to autogrow in 1 MB increments (default). If the distribution database
has to constantly autogrow to make room for replicated data coming in, then this may cause
replication slowness. Just like the data file, autogrow for the distribution database log file
should be changed from the default of 10 percent to a more reasonable, non-percentage
number, which makes sense for the application.
Note: For more information about configuring distribution, visit
http://technet.microsoft.com/en-us/library/ms151860.aspx.
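Reviewing and changing the autogrow settings takes only a few statements. The sketch below assumes that the distribution database is named distribution and that its logical file names are distribution and distribution_log; the first query shows the actual names, which should be used instead if they differ. The growth increments are illustrative values only.

```sql
USE distribution;

-- Current size and growth settings for the data and log files.
SELECT name, type_desc, size / 128 AS SizeMB, growth, is_percent_growth
FROM   sys.database_files;

-- Replace the small or percentage-based defaults with fixed increments.
ALTER DATABASE distribution
MODIFY FILE (NAME = N'distribution', FILEGROWTH = 512MB);      -- logical name assumed

ALTER DATABASE distribution
MODIFY FILE (NAME = N'distribution_log', FILEGROWTH = 256MB);  -- logical name assumed
```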
14. Enable Replication Latency Monitoring
IT teams should have a mechanism to monitor replication latency and provide alerts when
replication exceeds specified latency thresholds. As discussed earlier, the SQL Operations
team at Microsoft created a custom tool that employs tracer tokens to help them monitor
replication latency.
Note: For more information about monitoring the health of SQL Server replication, visit
http://blogs.msdn.com/b/repltalk/archive/2010/09/20/how-to-monitor-the-health-of-sql-server-replication.aspx.
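Tracer tokens can also be posted and read by hand, which is a reasonable way to experiment before building or adopting a monitoring tool. The following is a minimal sketch and not the Microsoft IT tool: the publication name ProdDb_Pub is hypothetical, and the 30-second wait is an arbitrary allowance for the token to reach the subscriber.

```sql
-- Run at the publisher, in the publication database.
DECLARE @tokenId int;

-- Write a tracer token into the transaction log of the published database.
EXEC sys.sp_posttracertoken
     @publication     = N'ProdDb_Pub',  -- hypothetical publication name
     @tracer_token_id = @tokenId OUTPUT;

-- Give the token time to flow through the distributor to the subscriber.
WAITFOR DELAY '00:00:30';

-- Report distributor, subscriber, and overall latency for this token.
EXEC sys.sp_helptracertokenhistory
     @publication = N'ProdDb_Pub',
     @tracer_id   = @tokenId;
```

A latency value that is still NULL after the wait means the token has not yet arrived at that stage, which is itself a sign that replication is behind.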
15. Check Your Hardware
What replication does is not magic. It takes a finite amount of time and hardware resources to
move the data from the publisher to the subscriber. Even the best SQL Server configuration
setting refinements may not be enough to improve performance if the hardware on which the
application is running is not tuned for the workload. Components such as I/O subsystem
performance, memory, and network adapter drivers are all important factors to consider. For
example, if there are many messages like the following in the SQL Server Error log for the
drives on which the published, distribution, or subscription databases reside, they could
potentially be causing bottlenecks for the Log Reader and Distribution agents:
SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than
15 seconds to complete on file [O:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].
In the Performance Monitor, check the Avg. Disk sec/Read and Avg. Disk sec/Write counters
for the hard disks on which the alerts occur. If the times are consistently greater than
15–20 ms, there may be an I/O bottleneck. Likewise, for the active network adapters, the
Network Interface\Output Queue Length counter should consistently stay under 2.
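SQL Server also keeps its own per-file I/O statistics, which can complement the Performance Monitor counters. The query below is a sketch that derives average read and write latency per database file from sys.dm_io_virtual_file_stats. The figures are cumulative since the instance started, so they smooth over short spikes and should be read alongside the counters rather than in place of them.

```sql
-- Run on the publisher, distributor, or subscriber instance.
SELECT  DB_NAME(vfs.database_id) AS DatabaseName,
        mf.physical_name,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS AvgReadMs,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS AvgWriteMs
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
          ON  mf.database_id = vfs.database_id
          AND mf.file_id     = vfs.file_id
ORDER BY AvgWriteMs DESC;
```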
BENEFITS
Providing consistent processes and documentation for transactional replication design,
monitoring, and troubleshooting has been helpful for the SQL Operations team and the
applications they support:

• Greater awareness of design best practices has resulted in more robust and
  better-performing applications.
• Proactive monitoring has resulted in less downtime for line-of-business applications,
  better SLA management for application owners, and higher productivity for users:
  - Downtime has been reduced, because most issues are resolved well before they
    cause any business impact. Consequently, instances of issues where reinitialization
    would have been required have gone down significantly.
  - Use of the replication latency monitoring tool has provided application owners the
    ability to specify custom replication latency thresholds, providing them alert
    notifications when the subscription databases fall behind publication databases.
    This has brought additional business value for application owners by allowing them
    to better measure against their SLAs. For example, if an application owner’s SLA
    with his or her internal customers is that the data on their reporting site (from the
    subscription database) should never be more than 2 hours old, then having no
    replication latency alerts generated in a given time frame can help the owner
    validate that he or she was able to keep the SLA.
  - A side benefit of overall reduced downtime is that now users have more time to be
    productive and actually perform the work for which these business applications have
    been designed.
• Consolidated troubleshooting documentation for known issues has resulted in faster
  resolution of issues and greater confidence for support professionals working with this
  technology.
By using the three-pronged approach described in this paper, the SQL Operations team has
seen about a 61 percent decrease in the number of transactional replication-related tickets
since embarking on this approach. (The actual percentage reduction in this case is much
more than the stated number, because the number of servers SQL Operations is monitoring
has roughly doubled between 2010 and 2012, and the increase in the number of servers was
not taken into account in this calculation.)
CONCLUSION
The ongoing engagements between the SQL Operations team and application owners have
over the course of time improved the overall understanding of how transactional replication
works and how things can be designed and configured to mitigate some of the issues that
were previously being seen. Best practices and design guidance, proactive monitoring, and
good troubleshooting documentation form the three corners of an equilateral triangle, each
corner of which is critical (see Figure 6).
Figure 6. Cornerstones of a stable transactional replication environment
IT support teams that can incorporate these three ideas into their operations and support
methodology will be able to reap similar benefits to what SQL Operations saw in their
environment. With sound implementation of those ideas, any support team can be well on its
way toward achieving excellence in designing and maintaining their transactional replication
environments.
FOR MORE INFORMATION
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at
(800) 933-4750. Outside the 50 United States and Canada, please contact your local
Microsoft subsidiary. To access information via the World Wide Web, go to:
http://www.microsoft.com
http://www.microsoft.com/technet/itshowcase
The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it
should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the
accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for
any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places, and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be
inferred.
© 2012 Microsoft Corporation. All rights reserved.
Microsoft and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United
States and/or other countries.
All other trademarks are property of their respective owners.
APPENDIX A: WORKING WITH AND TROUBLESHOOTING
TRANSACTIONAL REPLICATION AGENT ISSUES
Transactional replication issues that the SQL Operations team came across at Microsoft,
along with potential solutions for them, are described in this appendix. This by no means
represents a comprehensive list of all possible issues that can be encountered during
transactional replication; however, the list is a great starting point for a new TSG. So, if an
organization needs to create a new transactional replication TSG, it can start with the list in
this appendix and add additional scenarios specific to the organization’s environment.
Working with Snapshot Agent Issues
The Snapshot Agent job runs as an executable (snapshot.exe). To see the syntax and
various parameters that can be specified with this job, visit
http://msdn.microsoft.com/en-us/library/ms146939.aspx.
Table 2 shows some of the Snapshot Agent issues the SQL Operations team saw in
Microsoft environments.
Table 2. Snapshot Agent Issues and Their Solutions
Scenario: Snapshot Agent seems to hang.
Possible solution or further reading: Try running the Snapshot Agent from a Command Prompt
window to troubleshoot the issue. For more details, see
http://blogs.msdn.com/b/repltalk/archive/2010/03/17/troubleshooting-snapshot-agent-hang.aspx.

Scenario: The snapshot process is slow and needs to be optimized further.
Possible solution or further reading: To enhance the performance of the snapshot process,
look at the ideas presented at
http://blogs.msdn.com/b/repltalk/archive/2010/03/07/tips-to-improve-performance-when-applying-snapshot-in-transactional-replication.aspx.
Working with Log Reader Agent Issues
1. Run the command Exec sp_replcounters on the publisher against the published
   database. The result set indicates whether the Log Reader Agent is lagging behind
   and, if so, by how much. The columns in the output of this stored procedure have the
   following meanings:
   • Replicated Transactions. The number of transactions in the log awaiting delivery
     to the distribution database
   • Replication Rate tran/sec. The average number of transactions per second
     delivered to the distribution database
   • Replication Latency (sec). The average time in seconds that transactions were in
     the log before being distributed (Convert this number into hours or minutes, as
     appropriate.)
   This data can be shared with the application support team. To see how far behind the
   Log Reader Agent is by way of pending commands marked for replication, run the
   following query against the transaction log of the published database:
select count(*) from fn_dblog(null, null)
where Description = 'Replicate'
2. Run the command Dbcc LogInfo(Published_Db_Name) on the publisher SQL Server
   instance. The number of rows returned is the number of VLFs the published database
   has. For more information on this, review the section, “Published Database Log File
   Considerations,” earlier in this paper.
Table 3 provides a summary of scenarios and issues that IT teams might encounter while
working with the Log Reader Agent. The Log Reader Agent job runs as an executable
(LogRead.exe). To see the syntax and various parameters that can be specified with this file, see
http://msdn.microsoft.com/en-us/library/ms146878.aspx.
Table 3. Log Reader Agent Issues and Potential Solutions
Scenario: The Log Reader Agent is timing out.
Resolution and more information: Add the –QueryTimeOut parameter to the Run Agent step of
the failing Log Reader SQL Server Agent job. The default value of this parameter is
1,800 seconds (30 minutes). Try a higher value, or specify 0 for unlimited time. When the
Log Reader Agent has caught up, remove this parameter.

Scenario: The Log Reader Agent is slow on a SQL Server instance where the published
database is also being mirrored.
Resolution and more information: The paper, “SQL Server Replication: Providing High
Availability using Database Mirroring,” describes how the Log Reader Agent behaves in this
situation:
http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx

Scenario: Log Reader Agent error: The process could not execute 'sp_replcmds' on
servername.
Resolution and more information: See the Microsoft Support article at
http://support.microsoft.com/kb/811030.

Scenario: The Log Reader Agent is slow, because the underlying disk subsystem is slow.
Resolution and more information: Check the Logical Disk: Avg. Disk sec/Read and Logical
Disk: Avg. Disk sec/Write Performance Monitor counters for the disks on which the published
database and especially its log files reside. Ideally, those numbers should be 15 ms or less.
If there are several messages like the one below in the SQL Server Error log for the log file
drives in question, it may indicate a disk bottleneck:
SQL Server has encountered 16 occurrence(s) of I/O requests taking longer than
15 seconds to complete on file [O:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].
Receiving this message occasionally does not necessarily indicate that there is an issue.
However, receiving many such messages at regular intervals in the SQL Server Error log can
indicate a disk bottleneck.
Also, look into the possibility of publication and distribution databases sharing the same
drives and experiencing contention.

Scenario: The Log Reader Agent is slow or is failing with the generic error message,
“Unable to execute sp_replcmds.”
Resolution and more information: Look at the parameter settings for the Log Reader Agent
provided in the MSDN article, “Enhance Transactional Replication Performance,” to see if
tuning one or more of them makes sense:
http://msdn.microsoft.com/en-us/library/ms151762.aspx
• –MaxCmdsInTran. This parameter specifies the maximum number of statements grouped
  into a transaction as the Log Reader Agent writes commands to the distribution database.
  The default is 0 (unlimited). Specifying a smaller size can be useful in the case of large
  transactions.
• –ReadBatchSize. This parameter specifies the maximum number of transactions read out
  of the transaction log of the publishing database per processing cycle. The default is 500.
  When a large number of transactions is written to a publication database but only a small
  subset of those are marked for replication, use this parameter to increase the read batch
  size of the Log Reader Agent. For purposes of troubleshooting, however, specify a low
  number, even 1.
• –PollingInterval. This parameter specifies how often, in seconds, the log is queried for
  replicated transactions. The default is 5 seconds. Decrease this value to poll the log more
  frequently. Note that doing so can result in lower latency for the delivery of transactions
  from the publication database to the distribution database.
Note: While making a change, measure and record the before and after values of the
following Performance Monitor counters to determine whether the change helped:
• SQLServer: Replication LogReader: Delivered Cmds/sec
• SQLServer: Replication LogReader: Delivered Trans/sec
• SQLServer: Replication LogReader: Delivery Latency

Scenario: Log Reader error: Repldone log scan occurs before the current start of
replication.
Resolution and more information: See Chris Skorlinski’s blog, “Troubleshooting LogReader
Error repldone log scan occurs before the current start of replication” at
http://blogs.msdn.com/b/repltalk/archive/2010/04/11/troubleshooting-logreader-error-repldone-log-scan-occurs-before-the-current-start-of-replication.aspx.

Scenario: Log Reader error: The process could not execute sp_repldone/sp_replcounters.
Resolution and more information: See Chris Skorlinski’s blog at
http://blogs.msdn.com/b/repltalk/archive/2010/02/19/the-process-could-not-execute-sp-repldone-sp-replcounters.aspx.

Scenario: Possible slow Log Reader Agent performance, with a high number of VLFs for the
transaction log of the published database.
Resolution and more information: Review the section, “Published Database Log File
Considerations,” discussed earlier in this paper.

Scenario: Get more information about the failing Log Reader Agent.
Resolution and more information: Generate an output file by specifying the –Output and
–OutputVerboseLevel parameters in the Run Agent step of the Log Reader Agent job. Make sure
that the path of the output file specified with –Output is valid. Use values of 3 or 4 with
–OutputVerboseLevel. (The higher the number, the more output is generated; the default
is 2.) Next, analyze the output file generated.
See the Microsoft Support article, http://support.microsoft.com/kb/949523.
When troubleshooting is finished, remove these parameters. Otherwise, the output file
could fill up the drive on which it resides.

Scenario: Get further detailed troubleshooting.
Resolution and more information: Chris Skorlinski’s blog discusses transactional replication
conversations at
http://blogs.msdn.com/b/repltalk/archive/2010/02/07/repltalk-start-here.aspx.
The first part of this article talks about the Log Reader Agent and reader and writer
threads.
Working with Distribution Agent Issues
The Distribution Agent job runs as an executable (Distrib.exe). To see the syntax and various
parameters that can be specified with this file, visit
http://msdn.microsoft.com/en-us/library/ms147328.aspx.
To see how far behind the Distribution Agent is in terms of the number of commands in the
distribution database that need to be applied to the subscription database, use the
Replication Monitor shown in Figure 7.
Figure 7. Using the Replication Monitor
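The pending-command count shown in Replication Monitor can also be retrieved with a stored procedure in the distribution database, which is convenient for scripted checks. The call below is a sketch: the server, database, and publication names are hypothetical, and it assumes that the distribution database is named distribution.

```sql
-- Run at the distributor.
EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
     @publisher         = N'PUBSRV',      -- hypothetical publisher instance
     @publisher_db      = N'ProdDb',      -- hypothetical published database
     @publication       = N'ProdDb_Pub',  -- hypothetical publication
     @subscriber        = N'SUBSRV',      -- hypothetical subscriber instance
     @subscriber_db     = N'ProdDb',      -- hypothetical subscription database
     @subscription_type = 0;              -- 0 = push, 1 = pull
```

The result reports the number of commands still waiting to be applied and an estimate of the time needed to deliver them.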
Table 4 provides a summary of scenarios and issues that IT teams might encounter while
working with the Distribution Agent.
Table 4. Distribution Agent Issues and Potential Solutions
Scenario: The Distribution Agent is timing out.
Resolution or more information: Add the –QueryTimeOut parameter to the Run Agent step of
the failing Distribution SQL Server Agent job. The default value of this parameter is
1,800 seconds (30 minutes). Try a higher value, or specify 0 for unlimited time. When the
Distribution Agent has caught up, remove this parameter.
The row was not found at the
subscriber when applying the
replicated command. (Source:
MSSQLServer, Error number:
20598)
More information about this error can be found at
http://www.microsoft.com/products/ee/transform.aspx?ProdName=Microsoft+SQL+Server&ProdVer=09.00&EvtSrc=MSSQLServer&EvtID=20598.
It may be possible to use the following steps to remove
discrepancies between the problem table on the publisher and
subscriber, in turn resolving the replication problem:
1. Use the following statement to find the replication error
captured in system table distribution.dbo.MSrepl_errors:
select top 300 *
from distribution.dbo.MSrepl_errors with (nolock)
where time > getdate() - .05  -- last 0.05 day (about 72 minutes)
order by time desc
2. Run sp_browsereplcmds on the distribution
SQL Server instance in the distribution database to see the
table and command on which the replication failure is
occurring. Use xact_seqno from the previous step for both
@xact_seqno_start and @xact_seqno_end:
Exec Distribution.dbo.sp_browsereplcmds
@xact_seqno_start = '0x0000001800000533000400000000',
@xact_seqno_end = '0x0000001800000533000400000000'
--, @publisher_database_id = 10, @command_id = 1
3. Use the tablediff.exe utility, typically found at C:\Program
Files\Microsoft SQL Server\100\COM, to find the discrepancies
between this table on the publisher and the subscriber. Use the
command column from the output of sp_browsereplcmds
from the previous step to determine the name of the table with
discrepancies. Optionally, use the -f switch to generate a
script to insert the missing rows on the subscriber. Here is a
sample invocation of this tool:
tablediff -SourceServer AzharTaj1 -SourceDatabase ProdDb -SourceTable People -SourceSchema dbo -DestinationServer AzharTaj2 -DestinationDatabase ProdDb -DestinationTable People -DestinationSchema dbo -c -o C:\Temp\Repl_Fix_Script.txt -f
Look for the replication fix script generated in the output file
provided above. Run the script in the subscription database in
question to resolve the problem.
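The tablediff invocation in step 3 carries many switches, and a dropped hyphen is easy to miss. As a minimal sketch, the following Python helper assembles the argument list from the same values used in the sample above. The function name build_tablediff_args is hypothetical, the switches are only those that appear in the sample, and the helper assumes the table has the same database, schema, and name on both servers.

```python
def build_tablediff_args(source_server, destination_server, database,
                         table, schema="dbo", output_file=None,
                         generate_fix_script=False):
    """Return the tablediff argument list for comparing one table.

    Hypothetical helper: assumes identical database, schema, and table
    names on the publisher and the subscriber, as in the sample above.
    """
    args = [
        "tablediff",
        "-SourceServer", source_server,
        "-SourceDatabase", database,
        "-SourceTable", table,
        "-SourceSchema", schema,
        "-DestinationServer", destination_server,
        "-DestinationDatabase", database,
        "-DestinationTable", table,
        "-DestinationSchema", schema,
        "-c",  # column-level comparison, as in the sample
    ]
    if output_file:
        args += ["-o", output_file]
    if generate_fix_script:
        args.append("-f")
    return args


# Server, database, and file names are the placeholders from the sample.
args = build_tablediff_args("AzharTaj1", "AzharTaj2", "ProdDb", "People",
                            output_file=r"C:\Temp\Repl_Fix_Script.txt",
                            generate_fix_script=True)
print(" ".join(args))
```

Keeping the arguments as a list rather than one long string makes it harder to lose a switch prefix, and the list can be handed to a process launcher without extra quoting.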
Violation of the PRIMARY KEY
constraint 'PK_TableName'.
Cannot insert duplicate key in
object 'dbo.TableName'.
(Source: MSSQLServer, Error
number: 2627)
Note: This can happen in
transactional replication when
a row already exists on the
subscriber, perhaps in a
scenario where someone
accidentally inserted data on
the subscriber.
More information about this error can be found at
http://www.microsoft.com/products/ee/transform.aspx?ProdName=Microsoft+SQL+Server&ProdVer=09.00&EvtSrc=MSSQLServer&EvtID=2627.
It may be possible to use the following steps to manually delete
specific rows on the subscriber (that might have accidentally been
inserted there and by virtue of their presence are causing primary
key violations):
1. Use the following statement to find the replication error
captured in system table distribution.dbo.MSrepl_errors:
select top 300 *
from distribution.dbo.MSrepl_errors with (nolock)
where time > getdate() - .05  -- last 0.05 day (about 72 minutes)
order by time desc
2. Run sp_browsereplcmds on the distribution
SQL Server instance in the distribution database to see the
table and command on which the replication failure is
occurring. Use xact_seqno from the previous step for both
@xact_seqno_start and @xact_seqno_end:
Exec Distribution.dbo.sp_browsereplcmds
@xact_seqno_start = '0x0000001800000533000400000000',
@xact_seqno_end = '0x0000001800000533000400000000'
--, @publisher_database_id = 10, @command_id = 1
3. Use the command column from the output of
sp_browsereplcmds from the previous step to determine the
name of the table in which the insert is occurring. From the
column values of the insert statement, determine whether that
record already exists on the subscriber. If it does, delete just
that row. Doing so should resolve the replication issue.
The Distribution Agent is slow
or is “erroring out.”
Look at the parameters for the Distribution Agent at
http://msdn.microsoft.com/en-us/library/ms151762.aspx to
determine whether one or more of them make sense in this
scenario:
• SubscriptionStreams [0|1|2|...64]. This is the number of
connections allowed per Distribution Agent to apply batches of
changes in parallel to a subscriber while maintaining many of
the transactional characteristics present when using a single
thread. The default is 1 for a transactional SQL Server
subscription. In case of a high incoming transaction rate from
the publisher, this parameter may help. A good place to start is
to set this value equal to the number of processors on the
subscriber. For more information, see the “Navigating SQL
Replication Subscription Streams setting” blog at
http://blogs.msdn.com/b/repltalk/archive/2010/03/01/navigating-sql-replication-subscriptionstreams-setting.aspx.
• CommitBatchSize. This is the number of transactions to be
issued to the subscriber before a COMMIT statement is issued.
The default is 100. Committing a set of transactions has a fixed
overhead; by committing a larger number of transactions less
frequently, the overhead is spread across a larger volume of
data.
• PacketSize. This is the packet size in bytes. The default is
4,096 (bytes). Visit the “Tune Replication Performance using
Packet Size” blog at
http://blogs.msdn.com/b/repltalk/archive/2010/03/11/tune-replication-performance-using-packetsize.aspx for the
possibilities of tuning this value.
• SkipErrors native_error_id [: ...n]. This colon-separated list
specifies the error numbers for the agent to skip.
• PollingInterval. This is how often, in seconds, the distribution
database is queried for replicated transactions. The default is
5 seconds. Decrease this value to poll the distribution
database more frequently.
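As a rough illustration of the guidance above, the following Python sketch derives a starting SubscriptionStreams value from the subscriber's processor count and renders a set of parameters as text for the agent command line. The function names are hypothetical, and the CommitBatchSize and PollingInterval values are illustrative assumptions rather than recommendations; only the defaults and the upper bound of 64 come from the list above.

```python
MAX_SUBSCRIPTION_STREAMS = 64  # upper bound of the range listed above


def starting_subscription_streams(subscriber_processors):
    """Starting point suggested above: one stream per subscriber
    processor, capped at the listed maximum of 64."""
    if subscriber_processors < 1:
        raise ValueError("processor count must be at least 1")
    return min(subscriber_processors, MAX_SUBSCRIPTION_STREAMS)


def render_agent_parameters(params):
    """Render a dict such as {"SubscriptionStreams": 8} as the text
    that would be appended to the Distribution Agent command line."""
    return " ".join(f"-{name} {value}" for name, value in params.items())


params = {
    "SubscriptionStreams": starting_subscription_streams(8),
    "CommitBatchSize": 200,  # illustrative: twice the default of 100
    "PollingInterval": 2,    # illustrative: below the default of 5 seconds
}
print(render_agent_parameters(params))
```

Changing one parameter at a time makes it easier to attribute any difference in throughput to that change.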
While making a change, measure and record the before and after
values of the following Performance Monitor counters to determine
whether the change helped:
• SQLServer: Replication Dist: Delivered Cmds/sec
• SQLServer: Replication Dist: Delivered Trans/sec
• SQLServer: Replication Dist: Delivery Latency
The Distribution Agent is slow
because the underlying disk
subsystem is slow.
Check the Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg.
Disk sec/Write Performance Monitor counters for the disks on which
the published database and especially its log files reside. Ideally,
those numbers should be 15 ms or less. Several messages like the
one below in the SQL Server error log for the log file drives in
question may indicate a disk bottleneck:
SQL Server has encountered 16 occurrence(s) of I/O
requests taking longer than 15 seconds to complete on
file [H:\MSSQL\DATA\ProdDb.ldf] in database [ProdDb].
Receiving this message occasionally does not necessarily indicate
that there is an issue. However, many such messages received at
regular intervals in the SQL Server error log can indicate a disk
bottleneck. Also, look into the possibility of publication and
distribution databases sharing the same drives and experiencing
contention.
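The Avg. Disk sec/Read and Avg. Disk sec/Write counters report seconds, so the 15 ms guideline corresponds to a counter value of 0.015. The following Python sketch, with a hypothetical function name and made-up sample values, summarizes a set of collected samples against that threshold.

```python
THRESHOLD_SECONDS = 0.015  # 15 ms, the guideline stated above


def summarize_latency(samples_seconds, threshold=THRESHOLD_SECONDS):
    """Return (average in ms, fraction of samples above the threshold)."""
    if not samples_seconds:
        raise ValueError("need at least one sample")
    average_ms = 1000 * sum(samples_seconds) / len(samples_seconds)
    slow = sum(1 for s in samples_seconds if s > threshold)
    return average_ms, slow / len(samples_seconds)


# Made-up Avg. Disk sec/Write samples for a log file drive.
samples = [0.004, 0.009, 0.031, 0.052, 0.012]
average_ms, slow_fraction = summarize_latency(samples)
print(f"average {average_ms:.1f} ms, "
      f"{slow_fraction:.0%} of samples above 15 ms")
```

Looking at the share of slow samples alongside the average helps separate an occasional spike from a sustained bottleneck.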
Distribution Agent error: The
process could not connect to
Subscriber
'SubscriberServerName'.
See Chris Skorlinski’s blog “Distribution Agent fails with: Error
Locating Server/Instance Specified [xFFFFFFFF]” at
http://blogs.msdn.com/b/repltalk/archive/2010/04/26/distribution-agent-fails-with-error-locating-server-instance-specified-xffffffff.aspx.
Note: This error can also occur if there have been changes on the
cluster and the SPN is not registered correctly.
Distribution Agent error:
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the
current command. The results,
if any, should be discarded.
See Chris Skorlinski’s blog, “Distribution Agent Fails with error Msg
0, Level 20, State 0, Line 0”, at
http://blogs.msdn.com/b/repltalk/archive/2010/04/05/distribution-agent-fails-with-error-msg-0-level-20-state-0-line-0.aspx.
Get more information about
the failing Distribution Agent.
Generate an output file by specifying the -Output and
-OutputVerboseLevel parameters in the Run Agent step of the
Distribution Agent job of concern. Make sure that the path of the
output file specified with -Output is valid. A value of 2 is the highest
value that can be specified for -OutputVerboseLevel in the case
of the Distribution Agent and is also the default. Next, analyze the
output file generated.
When troubleshooting is complete, remove these parameters.
Otherwise, the drive on which the specified log file resides might fill
up.
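Because the logging parameters are meant to be temporary, it can help to treat adding and removing them as a matched pair. The following Python sketch shows one way to do that on the text of an agent command. The function names, server names, and output path are hypothetical, and the sketch assumes the output path contains no spaces.

```python
import re


def add_output_parameters(command, output_path, verbose_level=2):
    """Append -Output and -OutputVerboseLevel to an agent command string.

    Assumes output_path contains no spaces.
    """
    return f"{command} -Output {output_path} -OutputVerboseLevel {verbose_level}"


def remove_output_parameters(command):
    """Strip -Output and -OutputVerboseLevel, with their values, again."""
    command = re.sub(r"\s+-OutputVerboseLevel\s+\d+", "", command,
                     flags=re.IGNORECASE)
    command = re.sub(r"\s+-Output\s+\S+", "", command, flags=re.IGNORECASE)
    return command


# Hypothetical agent command; the server and database names are placeholders.
original = "-Subscriber [SUB1] -SubscriberDB [ProdDb] -Distributor [DIST1]"
with_logging = add_output_parameters(original, r"C:\Temp\dist_agent.log")
print(with_logging)
print(remove_output_parameters(with_logging))
```

Removing the parameters restores the original command text, which gives a simple check that troubleshooting settings were not left behind.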
Get further detailed
troubleshooting.
Chris Skorlinski’s blog at
http://blogs.msdn.com/b/repltalk/archive/2010/02/07/repltalk-start-here.aspx talks about transactional replication conversations. The
latter half of this article talks about the Distribution Agent reader and
writer threads.