How flawed firmware can really give your DAG some replication headaches

Generally it's a good idea to keep firmware up to date for components like storage controllers. But just as you would never put an Exchange 2013 cumulative update into production without testing, you would never apply a firmware update without checking... or would you? Well, if you use IBM storage you might have run into a situation where DAG replication was affected by a firmware update. We can all learn from this, such as what to do when replication goes bad and how to manage firmware updates better.

The IBM service bulletin titled “Affected ServeRAID controllers inadvertently supported hard drive sector size 512e and 4k - IBM Servers” has got to be a masterpiece of how to obscure a support statement. I’m sure it is written by someone who is awfully literate in all aspects of storage but it sure does an excellent job of disguising the issues in a mass of incoherent text.

To save you the problem of reading through the advisory, the issue is that IBM updated the firmware for some of its ServeRAID storage controllers and promptly affected the way that Database Availability Group (DAG) members replicate updates for database copies. The advisory says “In anticipation of the future availability of 512e hard drives with 4096 physical sector size and 512 byte logical sector sizes, LSI made firmware changes in versions 12.12.0-0111, 12.12.0-0126 and 12.12-0-0133 code for the IBM ServeRAID M5014, M5015 and M5025 SAS/SATA controllers to have all volumes indicate a physical sector size of 4096 bytes.”

The bulletin then goes on to say that “It is recommended to allow Microsoft to assist in guiding you as to when to apply the latest IBM ServeRAID firmware firmware (sic) across each node.” And later “Work the plan that Microsoft offers to ensure each server is running the same version of firmware of 12.13.0-0179 or later.” In essence, it’s a Microsoft problem because Exchange doesn’t support our firmware so get with them to figure out what to do…

If you read the Exchange 2013 storage configuration options page, you find that:

“Support requires that all copies of a database reside on the same physical disk type. For example, it is not a supported configuration to host one copy of a given database on a 512-byte sector disk and another copy of that same database on a 512e disk. Also be aware that 4-kilobyte (KB) sector disks are not supported for any version of Microsoft Exchange and 512e disks are not supported for any version of Exchange prior to Exchange Server 2010 SP1.”

So you have to be careful about the sector size of the disks that you use with Exchange, and you have to make sure that all copies of a database in a DAG sit on disks with the same sector size. The 4K sector is an “Advanced Format” designed to facilitate the processing of larger files. However, the big benefit of the older format is that it is supported on just about every storage system that exists today. In addition, I haven’t heard any great cry of protest because Microsoft doesn’t support 4K sectors, so I assume that the need is not widely felt within the customer base.
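The support rule quoted above can be expressed as a small check. What follows is only a sketch in Python, not anything Exchange or IBM ships: the function names and the inventory data are hypothetical, and it assumes you have already collected the logical and physical bytes per sector for the volume that holds each database copy (for instance from the "Bytes Per Sector" and "Bytes Per Physical Sector" values that fsutil fsinfo ntfsinfo reports on recent Windows versions).

```python
from collections import defaultdict


def classify_sector_format(logical_bytes, physical_bytes):
    """Map a (logical, physical) sector-size pair to a disk format label."""
    known = {
        (512, 512): "512",    # native 512-byte sectors
        (512, 4096): "512e",  # 4K physical sectors with 512-byte emulation
        (4096, 4096): "4K",   # native 4K sectors, not supported by Exchange
    }
    try:
        return known[(logical_bytes, physical_bytes)]
    except KeyError:
        raise ValueError(
            f"unrecognized sector sizes: {logical_bytes}/{physical_bytes}"
        ) from None


def find_problem_databases(copies):
    """Return {database: {server: format}} for databases that break the rule.

    `copies` is an iterable of (database, server, logical_bytes,
    physical_bytes) tuples. A database is reported when its copies sit on
    different disk formats or when any copy sits on a native 4K disk.
    """
    formats = defaultdict(dict)
    for database, server, logical_bytes, physical_bytes in copies:
        formats[database][server] = classify_sector_format(
            logical_bytes, physical_bytes
        )

    problems = {}
    for database, per_server in formats.items():
        seen = set(per_server.values())
        if len(seen) > 1 or "4K" in seen:
            problems[database] = per_server
    return problems


# Hypothetical inventory: EX02 reports 4096-byte physical sectors for DB01
# (as after the firmware change), while EX01 still reports 512.
copies = [
    ("DB01", "EX01", 512, 512),
    ("DB01", "EX02", 512, 4096),
    ("DB02", "EX01", 512, 512),
    ("DB02", "EX02", 512, 512),
]

for database, per_server in find_problem_databases(copies).items():
    print(database, per_server)
```

Running something like this before and after a controller firmware update on one DAG member would show whether the update changed what the volumes report, before you roll it out to the other members.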

In this case, IBM changed the reported disk sector type from 512 to 512e in the firmware update. That's fine in itself because both sector types are now supported by Exchange. The problem occurs when Exchange realizes that the sector size has changed for one of the copies of a database.

Troubleshooting guru Tim McMichael blogged about the issue of changing disk sectors in DAGs in 2013. That post applies to Exchange 2010 DAGs but its meaning and impact also carry forward to Exchange 2013 DAGs. There’s tons of good information in that post (his knowledge on the most obscure parts of Exchange high availability is one of the reasons why you should come to hear Tim speak at Exchange Connections in September), but the two biggest points that I take from it are:

  1. All DAG members need to use disks configured with the same sector size. If you don’t have this in place, replication will become confused and continuous mode (or block-mode) replication will stop working when the sector mismatch is detected.
  2. To fix the problem, you have to disable continuous replication on all DAG members, update the firmware or disks, and then re-enable continuous replication. That could be quite a lot of work, which is one good reason why you should get disks set up right in the first place (and be careful with firmware updates).

Block-mode replication first appeared in Exchange 2010 SP1 and is controlled automatically by the Microsoft Exchange Replication service. The Replication service always starts in file mode, which means that replication depends on copying and replaying transaction logs. But if network connections between DAG members are healthy and file-mode replication is fully caught up, then the Replication service switches to block-mode replication and ships transaction data to other DAG members as it is written to the active copy's log buffer, rather than waiting for a complete log file to close. It’s a very efficient and effective way of making sure that DAG members stay as up-to-date as possible.

No UI exists to control block-mode replication and no EMS cmdlet is available either, so you have to resort to the time-honored method of a registry hack to force the Replication service to use only file-mode replication until the problem is fixed. To do this, you create a new DWORD value called DisableGranularReplication at HKLM\Software\Microsoft\ExchangeServer\Vxx\Replay\Parameters (Vxx = V14 for Exchange 2010 or V15 for Exchange 2013). Set the value to 1 and restart the Replication service to enforce the change.
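If you have to make the same change on several DAG members, it can help to generate a .reg file rather than type the value by hand on each server. The Python sketch below does only that. The function name is hypothetical, it assumes the key sits under HKEY_LOCAL_MACHINE\SOFTWARE, and it does not touch the registry or restart anything itself: you still have to import the file on each server and restart the Replication service (MSExchangeRepl) afterwards.

```python
def build_reg_file(version_key, disable_block_mode):
    """Return .reg file text that sets or removes DisableGranularReplication.

    version_key is "V14" (Exchange 2010) or "V15" (Exchange 2013).
    With disable_block_mode=True the value is set to DWORD 1 (file mode only);
    with False the value is deleted so that normal behavior resumes.
    """
    if version_key not in ("V14", "V15"):
        raise ValueError(f"unexpected version key: {version_key}")

    # Assumption: the full path lives under HKLM\SOFTWARE.
    key = (
        "HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\ExchangeServer\\"
        f"{version_key}\\Replay\\Parameters"
    )
    # In .reg syntax, "name"=- deletes the named value.
    data = "dword:00000001" if disable_block_mode else "-"

    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"DisableGranularReplication"={data}',
        "",
    ]
    return "\r\n".join(lines)


# Example: text to force file-mode replication on an Exchange 2010 server.
print(build_reg_file("V14", True))
```

Calling build_reg_file with False produces the matching file that removes the value once the firmware or disks are consistent again.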

Eventually, after you have uncovered the mysteries contained in whatever storage service bulletin you’ve been sent and applied whatever fixes are necessary, you can remove the registry value and restart the Replication service. This time the Replication service will behave as normal and start in file mode before transitioning to block-mode if replication proceeds smoothly.

It would be nice if hardware vendors tested their code before it was released, but as we know that is not always possible when a particular piece of hardware is used with many different applications. Deciding how to test new code is often very difficult. But it would be nice if the service advisories were written in simple, clear English so that their meaning shone through loud and clear. Is that too much to ask?

Follow Tony @12Knocksinna

Discuss this Blog Entry 3

on Jun 5, 2014

Hi Tony,
Been there, done that. I am running IBM System x 3630 M3 servers for my DAG in my main site (M4 servers in my DR site) and I encountered this. As you can imagine, I have a fun story about this exact situation.

A little over a year ago I tried to update my firmware on the ServeRAID m5015 cards in my Exchange 2010 servers (on a Saturday night) and I ran into this issue; replication wasn't occurring. Since it was about 4:00a I just down-graded the firmware to the previous version and everything was fine. The following Monday I did research on the issue and discovered the blog posts (both on the EHLO blog and Tim's blog) and even added comments.
The following Saturday I was ready to pull the trigger and upgrade the firmware level again and do all the massaging that Tim explained, but I thought "I better check IBM's website to see if they released an update over the last week", and they did. So I applied it and it CHANGED THE SECTOR SIZE BACK TO 512B! Yeah, it was nice that I didn't have to do all of the clean-up, but it set a very bad precedent in my mind: I do not know when the 4K support will be added or removed with each firmware revision. Additionally, three months later I added a server to our DR site with the ServeRAID m5025, and its firmware level was set with Advanced Format, so I had to back-track through a few versions before I found one that would work with 512B sectors.

You'd think you can get that info from the release notes, but no, it wasn't in the notes that I read at the time. Hell, the support site has even REMOVED some older versions from the release notes (ones that I still have the old firmware files for), maybe to make it seem like they didn't exist.

So as you can surmise, I have been too scared to try and update the RAID controller firmware on my Exchange servers for a year now. I'm going to have to do it (probably next month), but this whole process has left me very leery of doing so. The IBM article you shared (which did not exist last year when I ran into this issue) does state "There are no plans to support 512e or 4096 sector size hard drives for the IBM ServeRAID M5014, M5015 or M5025 SAS/SATA controllers.", so maybe it won't be a big deal now, but I still have my concerns...

on Jun 6, 2014

@PNewell, what a tale of woe you tell! It's a great story though because it underlines the need for administrators to test every aspect of their hardware before putting it into production. And like we are told so often, the devil is often in the detail, like a firmware update....

on Jun 6, 2014

Absolutely! I'm just glad I didn't update the firmware on all of my DAG members that night! :p
Never a dull moment. :)


Contributors

Tony Redmond

Tony Redmond is a senior contributing editor for Windows IT Pro and the author of Microsoft Exchange Server 2010 Inside Out (Microsoft Press) and Microsoft Exchange Server 2013 Inside Out: Mailbox...