Replicating your precious data to a variety of sites empowers distributed computing
Microsoft SQL Server version 6.0 provides strong support for replication right out of the box. It's tightly integrated with the core database engine and, for a first release, surprisingly flexible and easy to use. Even better, it's free: No add-ons or extra fees apply.
I've been testing the product for several months and have been impressed by its feature set and stability. Just as importantly, new graphical user interface (GUI) tools have streamlined setup and administration tasks. Using Enterprise Manager, a new distributed management toolset, I added new subscribers to my central publisher by dragging them into my Replication Topology window. I had replication up and running in about 15 minutes.
This article explains basic SQL Server 6.0 replication functionality and discusses its underlying architecture. I've tried to address many of the questions my colleagues and clients have asked. Hopefully, I've covered many of yours as well. Armed with this information, you'll be one step closer to implementing replication in your own environment.
SQL Server replication allows asynchronous distribution of data from a master copy known as the publisher to any number of read-only slave sites called subscribers. Publishers and subscribers are not kept 100% in synch, since data is distributed asynchronously. But a guaranteed delivery mechanism exists, so all data will eventually reach its destination. With asynchronous replication, processing continues if a network link or server goes down. This may not solve all distributed data needs, but "real-enough-time" data allows greater availability and scalability than can be achieved with a two-phase commit (2PC) strategy.
Replication is implemented using a Publish and Subscribe metaphor. Servers are set up as publishers, subscribers, or, in some cases, both. Datasets, known as publications, are created and made available to subscription servers. Publications are groups of tables marked for replication; they are defined by using one or many articles. An article refers to a single table and is the basic unit of replication. Servers are not limited to one role or the other, as a single table can publish and subscribe at the same time, permitting very flexible configurations. The classic example of rolling up data into a corporate repository and then redistributing the results back to the branch offices is one reason you might want to use replication.
SQL Server publications can include almost any type of data, although a table must have a unique primary key to be part of an article. (That's not a serious limitation, since tables without primary keys usually don't occur in well designed databases.) Other exceptions include system tables and columns containing text or image data types. Tables containing the text or image data types can be included in articles, but the text/image data will be passed to subscribers as NULL.
Publications cannot include system objects, such as stored procedures and triggers. Replication is still useful without this functionality, but it would be even more powerful with it. A single database acting as a master repository for programmable objects would be invaluable in many large production and development environments. I hope Microsoft adds this capability to SQL Server soon.
To keep things flexible, SQL Server articles support horizontal and vertical partitioning. Imagine a marketing system with a table called Customer. An article, Customer_Address_From_Virginia, could be horizontally partitioned to include only rows for Virginia customers. A vertical partition might limit the replicated columns to address specific information. You can use a simple where clause or a stored procedure created with FOR REPLICATION to enforce the horizontal partition. SQL Server creates an internal view to support the vertical partition.
In Listing 1, each row of the table is examined based on a Boolean condition. Only rows that evaluate to true are marked for replication. The NOLOCK clause in this example is a new enhancement which tells the optimizer to perform a dirty read. When a replication filter is executed, it requests a lock on the published table which can produce deadlock situations. You should use the NOLOCK clause when creating a manual horizontal partition to avoid this problem.
Subscriptions are just what they sound like and can be made to an entire publication or a specific article. Administration and configuration may be initiated by the SQL Server system administrator (SA) of either the publishing or the subscribing machine. A publishing administrator can create a "push subscription" on any server where he or she has SA rights. The data is "pushed" out to the subscriber with no further administration required by the destination machine. Alternatively, the SA of a subscription server can create a "pull subscription" by subscribing to an existing publication. Both methods produce exactly the same results. Push subscriptions simplify administration from a central location, while pull subscriptions preserve a degree of autonomy, since publishing and subscribing can be split between SAs. If security is a big concern, "pull subscriptions" may be the way to go. Remember that subscriptions are read-only copies of a publication. SQL Server does not enforce this restriction, so it's up to you. Failure to do so will cause administrative nightmares.
SQL Server supports two types of replication: snapshots and log-based replication. Snapshots refresh an entire table on a scheduled basis. Conceptually, the table is dropped and rebuilt based on the publisher's schema, and all data is copied from the publisher to the subscriber. SQL Server provides several advanced options for managing snapshots. Log-based replication, on the other hand, is incremental in nature. The replication log-reader scans the publishing database's log and forwards transactions to a distribution server where they are parceled out to subscribers. By default, the log-reader runs continuously but can be configured to "wake up" at specific intervals.
Three Replication Server Roles
Servers play three distinct roles in a replication environment: publication, distribution, and subscription. The Publication Server maintains databases available for publication. This server contains the source data that will be replicated to subscribers and is the only place where data should be changed. The log of each database enabled for publication is scanned on a regular basis, and new, committed transactions are sent to the Distribution Server.
The Distribution Server is the heart and soul of the replication process. It runs three main processes: synch navigation, log-reader, and distribution. It also maintains the distribution database, which stores replicated transactions until all subscribers have committed them.
The Subscription Server receives and executes SQL commands sent from the Distribution Server. The Mslast_job_info table is the only thing added to a Subscription Server to support replication. This table allows SQL Server to determine the last replicated transaction committed by a particular subscriber. It is queried when the distribution process wakes up. Since no specific processes are run on the subscriber, replicating to non-SQL Server subscribers becomes a no-brainer via Open Database Connectivity (ODBC).
SQL Server currently supports replication to ODBC data sources, even though this functionality is not yet documented. I've configured replication for Access data sources, and everything seemed to work fine. I assume there are a few kinks to iron out; otherwise Microsoft would have documented this key piece of functionality, which will be fully supported when SQL Server 6.5 is released (planned for later this year).
A single machine can perform the roles of all three servers, or they can be split across multiple boxes. Running Publication and Distribution Servers from the same machine will be common, but high-volume environments will benefit by distributing these tasks. The Distribution Server is the real work-horse, and a heavy replication load will become a serious resource hog when run from a Publication Server that supports high-volume on-line transaction processing (OLTP) or Decision Support Systems (DSS) (see Figure 1
To run a Publication Server and a Distribution Server from the same machine, you'll need at least 32MB of RAM--even more to achieve reasonable performance. I'd suggest at least 48MB for any type of production environment, but based on your needs, significantly more may be required. If the processes are distributed, 16MB per machine is the bare-bones minimum; 32MB is more realistic.
Disk space on the Distribution Server is another key consideration. The distribution database transaction log cannot be truncated beyond the point of the last transaction uncommitted by a subscriber. If a subscriber goes down at 9:00 AM Monday, the distribution log cannot be truncated past that point and has the potential to grow very quickly. There are ways to deal with this situation, such as temporarily disabling replication, but it is an important issue to consider.
Architecture and Internals
Replication is a component of SQL Executive, a turbo-charged management/ scheduling facility which replaces the old SQL Monitor. It is tightly integrated with Windows NT and consists of the following managers: Alert Manager, Event Manager, Replication Manager, and Task Manager. All four play a role in replication, but only the last two are actually required to make it work.
The Replication Manager runs three critical processes. Each process (synchronization, log-reader, and distribution) is physically executed on the Distribution Server. The Task Manager independently controls each task and is responsible for running them at the scheduled time. Feedback is provided based on success or failure. Informational events can be generated to the Windows NT event log, or a message can be sent to SAs via email.
The synchronization process ensures that the schema and data located on each subscriber is an exact copy of the article or publication being replicated. This process wakes up at predefined intervals searching for new subscribers. If a subscriber is found, two files are created on the Distribution Server. The first file contains the SQL script used to recreate the table, and the second file contains the data which will be loaded into the subscriber using the Bulk Copy Program (BCP).
Automatic synchronization might not be practical when replicating a very large table over a slow WAN. The process could take too long or place too much stress on a fragile WAN link. You can perform manual synchronization in these situations. Schema and data files are still created, but the user is required to make sure they are applied. Be careful! Until SQL Server is informed that manual synchronization is complete, replication to the destination database is stopped, even for publications and articles not waiting on a synchronization event.
Subscribers can receive replicated transactions as soon as the synchronization process completes. The log-reader scans each published database searching for committed transactions that are marked for replication. These transactions are forwarded to the Distribution Server and stored until applied to all subscribers. The log on the publishing database can be truncated using normal procedures after marked transactions have been sent to the Distribution Server.
When the log-reader finds an uncommitted transaction marked for replication, it reconstructs the original SQL statement and forwards it to the Distribution Server. Sending SQL commands rather than data reduces network traffic and allows SQL Server to replicate changes to ODBC data sources. Replication to all subscribers, including SQL Server 6.0, is done via ODBC.
By default, all INSERT, DELETE, or UPDATE activity within a publication is captured by the log-reader and sent on its way. This is the requirement in most situations, but you can configure the log-reader to support special needs. SQL Server allows you to specify normal, customized, or no replication for INSERT, DELETE, or UPDATE transactions. You can also implement custom processing for all INSERT statements, or you might decide not to replicate DELETE statements at all.
The distribution process is responsible for applying replicated transactions to a subscription database. It defaults to running continuously but can be set to run only at specific intervals. This technique might be useful when replicating data over a WAN link unavailable at certain times of the day.
The distribution process searches for any new transactions sent by the log-reader and applies them to all subscribers. When a subscriber commits a transaction, the jobid column of its Mslast_job_info table is updated. This table is queried at the beginning of each distribution event. Since SQL Server knows the last transaction applied to a subscriber, it can easily identify transactions that are pending delivery. Once committed by all subscribers, the transaction can be immediately deleted from the distribution database or maintained for a specified period to facilitate subscriber recovery and the generation of historical information (see Figure 2
Transaction latency, or the time it takes between commit by the publisher and commit by the subscriber, will vary in real-life situations based on any number of factors, but the whole process can happen very quickly. My testing shows that transactions can be replicated in less than two seconds. While this test was performed under "ideal" conditions, the modular nature of SQL Server replication should allow it to meet most needs.
The log-reader and distribution process are both multithreaded. The log-reader uses a separate thread to scan the log of each database enabled for publishing. The distribution task uses a separate thread to send transactions to each subscription database it services. Early information from Microsoft indicates that the log-reader can sustain throughput of approximately 60 transactions/second for each individual database. Transactions can be distributed to each subscriber at approximately the same speed. Throughput is currently limited by the single-thread-per-publisher and single-thread-per-subscriber architecture.
In theory, a backlog could occur if the publisher commits transactions faster than they can be distributed to subscribers. I haven't had the opportunity to fully test this scenario, but I would suggest exploring it further before attempting to replicate large amounts of data in a high-volume transaction environment. (Note: These comments are based on my testing of a beta version of the product. The final release, which should be available by press time, may offer greater throughput.)
Sophisticated and Free!
Replication in SQL Server 6.0 for Windows NT wraps sophisticated features in an easy-to-use package. Combined with SQL Server's aggressive pricing and the fact that replication is bundled for free, I expect its use to grow dramatically over the next year, forcing Microsoft's competitors to rethink their packaging of it as an expensive add-on.