Generally it's a good idea to keep firmware up to date for components like storage controllers. But just like you would never put an Exchange 2013 cumulative update into production without testing, you would never apply a firmware update without checking... or would you? Well, if you use IBM storage you might have run into a situation where DAG replication was affected by a firmware update. We can all learn from this - like what to do when replication goes bad and how to manage firmware updates better.
The IBM service bulletin titled “Affected ServeRAID controllers inadvertently supported hard drive sector size 512e and 4k - IBM Servers” has got to be a masterpiece of how to obscure a support statement. I’m sure it was written by someone who is awfully literate in all aspects of storage, but it sure does an excellent job of disguising the issues in a mass of incoherent text.
To save you the trouble of reading through the advisory, the issue is that IBM updated the firmware for some of its ServeRAID storage controllers and promptly affected the way that Database Availability Group (DAG) members replicate updates for database copies. The advisory says “In anticipation of the future availability of 512e hard drives with 4096 physical sector size and 512 byte logical sector sizes, LSI made firmware changes in versions 12.12.0-0111, 12.12.0-0126 and 12.12-0-0133 code for the IBM ServeRAID M5014, M5015 and M5025 SAS/SATA controllers to have all volumes indicate a physical sector size of 4096 bytes.”
The bulletin then goes on to say that “It is recommended to allow Microsoft to assist in guiding you as to when to apply the latest IBM ServeRAID firmware firmware (sic) across each node.” And later “Work the plan that Microsoft offers to ensure each server is running the same version of firmware of 12.13.0-0179 or later.” In essence, it’s a Microsoft problem because Exchange doesn’t support our firmware so get with them to figure out what to do…
If you read the Exchange 2013 storage configuration options page, you find that:
“Support requires that all copies of a database reside on the same physical disk type. For example, it is not a supported configuration to host one copy of a given database on a 512-byte sector disk and another copy of that same database on a 512e disk. Also be aware that 4-kilobyte (KB) sector disks are not supported for any version of Microsoft Exchange and 512e disks are not supported for any version of Exchange prior to Exchange Server 2010 SP1.”
So you have to be careful about the sector size of the disks that you use with Exchange, and you have to make sure that every copy of a database in a DAG sits on disks of the same sector type. The 4K sector is an “Advanced Format” designed to facilitate the processing of larger files. However, the big benefit of the older format is that it is supported on just about every storage system that exists today. In addition, I haven’t heard any great cry of protest because Microsoft doesn’t support 4K sectors, so I assume that the need is not widely felt within the customer base.
In this case, the firmware update changed the disk sector type that the controller reports from 512 to 512e (all volumes now indicate a physical sector size of 4096 bytes). That's fine in itself because both sector types are supported by Exchange 2010 SP1 and later. The problem occurs when Exchange realizes that the sector size has changed for one of the copies of a database.
Troubleshooting guru Tim McMichael blogged about the issue of changing disk sectors in DAGs in 2013. That post applies to Exchange 2010 DAGs but its meaning and impact also carry forward to Exchange 2013 DAGs. There’s tons of good information in that post (his knowledge of the most obscure parts of Exchange high availability is one of the reasons why you should come to hear Tim speak at Exchange Connections in September), but the two biggest points that I take from it are:
- All DAG members need to use disks configured with the same sector size. If you don’t have this in place, replication will become confused and continuous mode (or block-mode) replication will stop working when the sector mismatch is detected.
- To fix the problem, you have to disable continuous replication on all DAG members, update the firmware or disks, and then re-enable continuous replication. That could be quite a lot of work, which is one good reason why you should get disks set up right in the first place (and be careful with firmware updates).
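The first point is easy to check before you let a firmware update loose. Here's a minimal Python sketch of that check, assuming you have already collected the logical and physical sector sizes for the volume behind each database copy (for example, from the "Bytes Per Sector" and "Bytes Per Physical Sector" values that `fsutil fsinfo ntfsinfo` reports). The function names and the sample server and database names are my own inventions for illustration; nothing here comes from Exchange itself.

```python
from collections import defaultdict


def classify_disk(logical_bytes, physical_bytes):
    """Name the sector format from the logical and physical sector sizes."""
    if logical_bytes == 512 and physical_bytes == 512:
        return "512 native"
    if logical_bytes == 512 and physical_bytes == 4096:
        return "512e"
    if logical_bytes == 4096 and physical_bytes == 4096:
        return "4K native"
    return "unknown"


def find_sector_mismatches(copies):
    """Report databases whose copies sit on different disk types.

    copies maps database name -> {server name: (logical, physical)}.
    Returns {database: {disk type: [servers]}} for mismatched databases only.
    """
    mismatches = {}
    for database, per_server in copies.items():
        by_type = defaultdict(list)
        for server, (logical, physical) in sorted(per_server.items()):
            by_type[classify_disk(logical, physical)].append(server)
        if len(by_type) > 1:
            mismatches[database] = dict(by_type)
    return mismatches


if __name__ == "__main__":
    # Hypothetical inventory: EX2 has had the firmware update, EX1 has not.
    inventory = {
        "DB1": {"EX1": (512, 512), "EX2": (512, 4096)},
        "DB2": {"EX1": (512, 512), "EX3": (512, 512)},
    }
    for database, groups in find_sector_mismatches(inventory).items():
        print(database, groups)
```

Any database that shows up in the result has copies on mixed disk types, which is exactly the unsupported configuration described in the Microsoft guidance quoted earlier.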
Block-mode replication first appeared in Exchange 2010 SP1 and is controlled automatically by the Microsoft Exchange Replication service. The Replication service always starts in file mode, which means that replication depends on copying and replaying transaction logs. But if network connections between DAG members are healthy and no problems exist in copying transaction logs, then the Replication service switches to block-mode replication and sends transactions to other DAG members as soon as database buffers fill. It’s a very efficient and effective way of making sure that DAG members stay as up-to-date as possible.
No UI exists to control block-mode replication and no EMS cmdlet is available either, so you have to resort to the time-honored method of a registry hack to force the Replication service to only use file mode replication until the problem is fixed. To do this, you create a new DWORD value called DisableGranularReplication at HKLM\Software\Microsoft\ExchangeServer\Vxx\Replay\Parameters (Vxx = V14 for Exchange 2010 or V15 for Exchange 2013). Set the value to 1 and restart the Replication service to enforce the change.
Eventually, after you have uncovered the mysteries contained in whatever storage service bulletin you’ve been sent and applied whatever fixes are necessary, you can remove the registry value and restart the Replication service. This time the Replication service will behave as normal and start in file mode before transitioning to block-mode if replication proceeds smoothly.
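If you have more than a couple of DAG members to touch, it helps to generate the commands rather than type them. The following Python sketch only builds the command lines for setting and later removing the registry value; it doesn't run anything. It assumes the key lives under HKLM\Software, that the Replication service's short name is MSExchangeRepl, and that you'll run the output in an elevated prompt on each DAG member. The function names are mine, so treat this as a starting point rather than a tested procedure.

```python
# Assumed short name of the Microsoft Exchange Replication service.
REPLICATION_SERVICE = "MSExchangeRepl"
VALUE_NAME = "DisableGranularReplication"
# Registry version key per Exchange release, as described in the text.
VERSION_KEYS = {"2010": "v14", "2013": "v15"}


def parameters_key(exchange_version):
    """Build the Replay\\Parameters registry path for an Exchange version."""
    version_key = VERSION_KEYS[exchange_version]
    return r"HKLM\Software\Microsoft\ExchangeServer\{}\Replay\Parameters".format(
        version_key
    )


def file_mode_commands(exchange_version, force_file_mode):
    """Return the commands that force file mode or restore normal behavior.

    force_file_mode=True creates the DWORD value set to 1;
    force_file_mode=False deletes the value again.
    Either way the Replication service is restarted afterwards.
    """
    key = parameters_key(exchange_version)
    if force_file_mode:
        registry = 'reg add "{}" /v {} /t REG_DWORD /d 1 /f'.format(key, VALUE_NAME)
    else:
        registry = 'reg delete "{}" /v {} /f'.format(key, VALUE_NAME)
    return [
        registry,
        "net stop {}".format(REPLICATION_SERVICE),
        "net start {}".format(REPLICATION_SERVICE),
    ]


if __name__ == "__main__":
    print("Before the firmware work:")
    for command in file_mode_commands("2013", force_file_mode=True):
        print("  " + command)
    print("After every DAG member is fixed:")
    for command in file_mode_commands("2013", force_file_mode=False):
        print("  " + command)
```

Printing the commands instead of executing them means you can review exactly what will change on each server before you commit, which seems appropriate given how this whole mess started.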
It would be nice if hardware vendors tested their code before it was released, but as we know that is not always possible when a particular piece of hardware is used with many different applications. Deciding how to test new code is often very difficult. But it would be nice if the service advisories were written in simple, clear English so that their meaning shone through loud and clear. Is that too much to ask?
Follow Tony @12Knocksinna