Disaster-recovery plans need to evolve as implementation matures
In April 2012, Microsoft SharePoint 2010 marked its second anniversary. Although I still hear about new SharePoint deployments, organizations that have been using the product for a year or longer are broadening its scope and increasing their dependency on it. Whereas many started off using SharePoint for basic collaboration, other workloads -- such as those for business intelligence (BI), Enterprise Content Management (ECM), and social networking -- are becoming common. Many organizations are also enhancing SharePoint with third-party products or developing their own custom solutions on top of it. As more and more content goes into SharePoint, its storage footprint expands as well. And with plenty being written and reported about SharePoint governance, we know that SharePoint is starting to mature as a solution.
By now, I'm guessing that most of you have read something about SharePoint disaster recovery. I'm hoping all of you who are in charge of a production implementation already have a solid, tested plan. If not, SharePoint Pro Magazine has some good primers, which you'll find in the "Related Resources" box. The goal of this article is to go a bit deeper and illustrate how your disaster-recovery plans need to evolve as your SharePoint implementation matures. Specifically, I'll discuss how disaster recovery is affected by
- custom code
- very large content databases
- Remote BLOB Storage (RBS)
In my experience, these factors represent the most common complications for the recovery process. They are also common trends in more mature SharePoint implementations. Although the information in this article is intended for SharePoint 2010 environments, most of it also applies to Windows SharePoint Services (WSS) 3.0 and Microsoft Office SharePoint Server (MOSS) 2007. WSS and MOSS don't support RBS, but a related form, called External BLOB Storage (EBS), can be used with SharePoint 2007 Service Pack 1 (SP1) and later.
How custom code affects disaster recovery
To me, one of SharePoint's most amazing achievements is its flexibility as an out-of-the-box product. Colleagues of mine often refer to SharePoint as a Swiss Army knife or Play-Doh, and I agree: It's a universal tool that can be molded into a seemingly infinite number of forms. However, SharePoint's greatest flexibility comes not from its rich UI but from its underlying technology platform. SharePoint's object model (i.e., API) is the technology platform that allows developers and ISVs to build powerful business solutions. A testament to this flexibility is the teeming ecosystem of SharePoint vendors and products, which are built on top of SharePoint's platform.
Of course, this flexibility has a downside, assuming that we're talking about custom code that is deployed directly to web servers. Such code can drastically complicate disaster recovery. Let's look at a simple and fairly common situation.
Your company is looking for some custom Web Parts to use in a knowledge-base solution that will be developed in SharePoint. The developer team writes the code, tests the Web Parts, and deploys them into production. For each Web Front End (WFE), an assembly DLL is copied to the global assembly cache, web.config files are modified, and Web Part-related files are added to the SharePoint root (i.e., 14 hive). Months later, one of the load-balanced WFE servers crashes and is replaced with a new server, which is joined to the farm. The next day, users report sporadic problems with the knowledge base. After a couple of hours of troubleshooting, you find that only requests handled by the new server are causing the problem. Only then do you remember that some custom code had been deployed to the original server. After you manually deploy the files and update web.config, the problem is solved.
Fortunately, a much better solution exists for this scenario. Instead of manually deploying code, you can use SharePoint Solution Packages (WSPs) to automate the deployment of custom code and configuration changes. If you use a farm-based solution package for custom Web Parts, the custom code is deployed automatically shortly after a server joins the farm. My advice is to require a WSP for any custom-code deployment. Even better, develop the code as a sandboxed solution (also a WSP) whenever possible.
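As a sketch, deploying the Web Parts as a farm solution from the SharePoint 2010 Management Shell might look like the following (the file path and package name are hypothetical):

```powershell
# Add the package to the farm's solution store (path is hypothetical)
Add-SPSolution -LiteralPath "C:\Deploy\KnowledgeBaseWebParts.wsp"

# Deploy to all web applications; -GACDeployment is required because
# the package places an assembly in the global assembly cache
Install-SPSolution -Identity "KnowledgeBaseWebParts.wsp" `
    -AllWebApplications -GACDeployment
```

Once a farm solution is deployed this way, SharePoint's timer service reapplies it to any rebuilt WFE that joins the farm, which is exactly what the manual approach in the scenario above failed to do.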
What if you have third-party applications installed on the WFE? These applications will probably need to be reinstalled. In most cases, you should reinstall them either just before or just after the server joins the farm, but check the apps' installation guides.
What if you make manual configuration changes to a server? For example, suppose you change the docicon.xml file to add icon support for PDF files, or you modify web.config for a web application that will use forms-based authentication. In these cases, it's best to maintain a log that documents manual changes that are made to servers. Be sure to save the file in a recoverable location, not just in SharePoint -- you might need access to the file when SharePoint is offline. This way, if the server is ever replaced, you'll know exactly what to do, saving a lot of time and frustration during the recovery. If your WFE servers are virtualized, you have the luxury of taking and restoring a snapshot. Of course, you can always take a complete OS backup of the server, as well. Just be sure that at least one of these approaches is part of your recovery strategy.
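To illustrate the kind of change worth logging, the PDF icon tweak is typically a one-line addition to DOCICON.XML in the SharePoint root; the icon file name below is an assumption (use whatever image you copied to the TEMPLATE\IMAGES directory):

```xml
<!-- In <SharePoint root>\TEMPLATE\XML\DOCICON.XML, inside <ByExtension> -->
<Mapping Key="pdf" Value="pdficon.gif" />
```

A change like this lives only on the web server's file system, not in a content database, which is why it disappears when the server is rebuilt unless your log (or a snapshot) captures it.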
How large content databases affect disaster recovery
In July 2011, Microsoft released revised guidance for sizing content databases. (For details, see "SharePoint Server 2010 capacity management: Software boundaries and limits.") To summarize, the supportable limit for a single content database increased to 4TB; there is no explicit limit for document archives such as a record center. Note that a number of caveats, including possible changes to the performance of your database storage layer and the need to adjust your disaster-recovery plan, apply to these revised limits.
With respect to disaster recovery, the problem with large content databases is how long they take to back up and restore. As SharePoint matures in organizations, it often increases in importance. This importance usually translates into tighter recovery objectives. In fact, I'm starting to hear about cases in which recovery time objectives (RTOs) are being reduced to just a few minutes. Yikes! Assuming that most recovery operations (excluding the Recycle Bin, of course) start with a content database restore, how do you meet your service level agreements (SLAs) when it takes 6 hours just to restore the database?
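A back-of-the-envelope sketch makes the tension concrete. The throughput figure below is an assumption, not a measurement; substitute what your own backup pipeline actually sustains:

```python
def restore_hours(db_size_gb, throughput_gb_per_hour):
    """Estimated wall-clock time to restore a content database."""
    return db_size_gb / throughput_gb_per_hour

# Assume a restore pipeline that sustains roughly 170 GB/hour
# (network + disk combined; your environment will differ).
THROUGHPUT = 170

# A 4TB database -- the revised supported limit -- takes about a day:
print(round(restore_hours(4096, THROUGHPUT), 1))   # ~24.1 hours

# A database kept on the "200GB diet" restores in just over an hour:
print(round(restore_hours(200, THROUGHPUT), 1))    # ~1.2 hours
```

The arithmetic is trivial, but it is worth running against your real throughput numbers before you sign off on an RTO measured in minutes.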
The first solution is to do whatever you can to limit the size of your databases. In most scenarios, I still recommend that content databases be kept on a 200GB diet. When structuring site collections within content databases, look at your usage patterns and isolate unique patterns to separate databases. For example, do not store your write-intensive team sites within the same content database that holds your read-centric intranet portal. My Sites (remember, each My Site is a site collection) should always be stored in separate content databases, and preferably associated with a separate web application, to better control in which database new My Sites are created. Large archives, such as a record or document center, should each have its own separate content database. With this approach in place, you not only have smaller content databases, you can also choose to back up read-centric or less-important content databases less often.
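The grouping logic above can be sketched as a simple allocation routine: isolate each usage pattern into its own series of databases and start a new database whenever the current one would exceed the cap. Database and site names here are hypothetical:

```python
from collections import defaultdict

MAX_DB_GB = 200  # the "200GB diet" recommended in the article

def assign_databases(site_collections):
    """Group site collections by usage pattern; open a new content database
    for a pattern whenever the current one would exceed the size cap.
    site_collections: list of (name, pattern, size_gb) tuples."""
    databases = defaultdict(list)   # database name -> site collections
    sizes = defaultdict(float)      # database name -> allocated GB
    counters = defaultdict(int)     # pattern -> current database index
    for name, pattern, size_gb in site_collections:
        db = f"WSS_Content_{pattern}_{counters[pattern]}"
        if sizes[db] + size_gb > MAX_DB_GB and databases[db]:
            counters[pattern] += 1
            db = f"WSS_Content_{pattern}_{counters[pattern]}"
        databases[db].append(name)
        sizes[db] += size_gb
    return dict(databases)

sites = [
    ("TeamA", "Teams", 120), ("TeamB", "Teams", 150),  # write-intensive
    ("Portal", "Intranet", 80),   # read-centric, isolated by pattern
    ("Records", "Archive", 190),  # large archive gets its own database
]
print(assign_databases(sites))
```

Because the read-centric and archive patterns land in their own databases, they can also be put on a less frequent backup schedule, as suggested above.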
Another solution is to be sure that you have deployed SP1 for SharePoint 2010. One of the service pack's features is the ability to store deleted websites in the second-stage (site-collection) Recycle Bin. Now, when that blog site is accidentally deleted, a site-collection administrator can restore it. This capability will save you a lot of time and effort.
As your total content continues to grow, you'll probably learn that SharePoint's native backup and restore features are just too limiting. It's often necessary to invest in third-party backup and restore solutions to meet SLA requirements. A number of great products are available and will pay for themselves if they can help you to recover quickly from just one disaster, big or small. And these tools are wonderful for day-to-day recovery needs such as item-level recovery, helping you to meet your recovery objectives.
Of course, high-availability protection, such as Microsoft SQL Server clustering, database mirroring, or the brand-new AlwaysOn feature of SQL Server 2012, should be part of your complete disaster recovery strategy. Although these options won't help when you need to restore content, they will keep your farm running if a database server goes down.
How RBS affects disaster recovery
As content databases grow, so does the underlying storage. With content databases being stored on premium, tier 1 storage, managing storage costs becomes its own challenge. One solution is to use RBS to store documents outside the content database in more affordable storage, as Figure 1 shows. Not only does this option reduce storage costs, it might also speed your SQL Servers by offloading taxing read/write requests for documents.
The disaster-recovery challenge is that you now have two entities to back up and restore: the content database and the location of the external documents, called the BLOB store. With RBS, the content database uses pointers to reference the external BLOBs, and these pointers must be kept in sync. Although this might sound difficult, it doesn't need to be. Let's look inside RBS to understand why.
First, you should know that files in the BLOB store are immutable, meaning that they never change. Editing a document in SharePoint creates a new BLOB rather than replacing the document in the existing BLOB. This happens regardless of the library's versioning settings. Deleting a document from a library (and the Recycle Bin) removes the metadata and flags the pointer as deleted, but the BLOB is kept for a designated period. Over time, you end up with extra, orphaned files in the BLOB store.
RBS uses maintenance jobs to clean out old pointers and these orphaned files. This process is called garbage collection and effectively resyncs everything. You're probably asking, what does "old" mean? The window is configurable and commonly set to 30 days, meaning that only files that have been orphaned for more than 30 days are removed; the value you choose is often driven by your recovery SLAs.
Having this window (officially called the garbage collection time window) is convenient for day-to-day restore operations when using RBS. For example, let's say that a document is accidentally deleted on Monday morning. SharePoint no longer has the document's metadata, but the BLOB file is still intact. Simply by restoring Sunday night's database backup, you can restore the item's metadata and BLOB pointer, which still points to the existing file in the BLOB store. In other words, you do not need to restore from the BLOB store.
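The decision logic in that example can be sketched as follows; the 30-day value is the common default mentioned above, and the function names are mine, not part of any RBS API:

```python
GC_WINDOW_DAYS = 30  # garbage collection time window (configurable)

def blob_still_in_store(days_since_delete):
    """A deleted document's BLOB survives in the BLOB store until
    garbage collection removes it after the configured window."""
    return days_since_delete <= GC_WINDOW_DAYS

def restore_sources_needed(days_since_delete):
    """Within the window, restoring the content database (metadata plus
    BLOB pointer) is enough, because the pointer still resolves to a file
    in the BLOB store. Past the window, the BLOB store backup is needed too."""
    if blob_still_in_store(days_since_delete):
        return ["content database backup"]
    return ["content database backup", "BLOB store backup"]

print(restore_sources_needed(1))    # deleted Monday, restored from Sunday's backup
print(restore_sources_needed(45))   # beyond the window: both backups required
```

This is why the Monday-morning scenario above needs only Sunday night's database backup, while the 45-day scenario in the next paragraph does not.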
In a few scenarios, however, you do need to restore from the BLOB store, so you must back it up. These scenarios include a BLOB store failure or the need to restore from a backup that's beyond your time window (e.g., 45 days ago). In these cases, recovery is more difficult, but these are not usual restore operations.
You can schedule your content database and BLOB store backups to run together or back to back. Either way, start the content database backup first and schedule it so that it finishes before the BLOB store backup finishes; that way, every BLOB pointer captured in the database backup refers to a BLOB that the store backup also captures. If and when you need to restore a database and BLOB store together, do so in reverse order; that is, start the BLOB store restore first and then start the content database restore.
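As a small sanity check you could apply to your backup job schedule, the ordering rule reduces to two comparisons (times below are hypothetical hours on a nightly schedule):

```python
def backup_order_ok(db_start, db_end, blob_start, blob_end):
    """The content database backup must start first and finish no later
    than the BLOB store backup, so every BLOB the database references
    is present in the store backup (immutable BLOBs make this safe)."""
    return db_start <= blob_start and db_end <= blob_end

# Back-to-back nightly jobs: database 01:00-03:00, BLOB store 03:00-06:00
print(backup_order_ok(db_start=1, db_end=3, blob_start=3, blob_end=6))  # True

# Reversed jobs would leave the database referencing uncaptured BLOBs
print(backup_order_ok(db_start=3, db_end=6, blob_start=1, blob_end=3))  # False
```

On restore, the order flips: bring back the BLOB store first, then the content database.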
It's also worth mentioning that many storage providers, such as EMC and NetApp, allow you to mirror a BLOB store. In such cases, your disaster-recovery steps are easier and faster, but be sure to check with your storage vendor so you know your options.
If you decide to use RBS, know that features vary among vendors. For example, Microsoft offers a free RBS provider called FILESTREAM. When FILESTREAM runs in local mode (i.e., when the BLOB store is on the local SQL Server), a database backup includes the BLOBs. However, FILESTREAM has drawbacks, such as more involved installation and maintenance. When choosing a provider, do your homework and consider all aspects of your environment, including disaster recovery.
One final note: RBS cannot be used to extend Microsoft's supported limits for content databases. For example, externalizing 5TB from a single content database that is used for project team sites would be an unsupportable design. When planning your maximum database sizes, be sure to include both internal content (the actual database size) and external content from the BLOB store.
To learn more about the inner workings of RBS, download the excellent Microsoft SQL Server white paper "Remote BLOB Storage"; for more on its benefits and considerations, see the AvePoint white paper "Optimize SharePoint Storage with BLOB Externalization."
Has your SharePoint environment matured to the point where custom code, large content databases, or RBS solutions are being considered? If not, just give it some time -- it probably will. When I talk to enterprises and medium-sized companies, one of their top issues is how to manage large content databases and whether RBS can help; most of these environments also have some form of custom code, whether an in-house or off-the-shelf product. But don't fear: Use these suggestions to take control of SharePoint, before it takes control of you.