How Microsoft IT runs one of the world’s largest federation services
If you’ve been reading this column for a while, you’ve probably realized that sooner or later you’ll need to implement some kind of federation service in your identity infrastructure. This service will allow you to provide single sign-on (SSO) to cloud-based services, whether hosted on premises or in the public cloud, for your enterprise users, using their enterprise credentials. If you don’t provide SSO, your users will be forced to find their own ways of using these cloud service providers, and probably not in a way you’d prefer. In this column, I’ll review the production federation service of a well-known enterprise: Microsoft.
To find out how Microsoft runs its federation service, I sat down with my friend and ex-Directory Services MVP, Laura Hunter, at the Cloud Identity Summit. Laura is an ex-MVP because she accepted a position as identity and access management architect for Microsoft IT, specifically for federation services. Besides her principal responsibilities with the federation infrastructure, she speaks at various conferences to show IT pros how federation is managed in what's probably the largest production federation environment in the world.
Microsoft started “dogfooding” federation with the release of Active Directory Federation Services (AD FS) 1.0 at the time of Windows Server 2003 R2. The company’s original reason for implementing AD FS wasn’t to provide access to what we now think of as cloud applications (remember, this was around 2005), but to make it easier for its employees to access Microsoft’s external providers. The first federated trusts for the company were payroll, HR, employee benefits, and the Microsoft company store. Establishing these trusts made it possible for employees to use their enterprise Microsoft credentials to access the provider’s resources.
In 2010, Microsoft IT upgraded its federation service to AD FS 2.0, which supports the widely used SAML protocol. That upgrade, coupled with the rise of cloud computing, resulted in an explosion of use for this service. Microsoft developers began creating new applications, and re-architecting existing applications, to use claims-based authentication instead of traditional integrated Windows authentication. Laura estimates that Microsoft IT is currently managing approximately 900 relying party trusts, though not all of them are for production services. (A single production service might need as many as six trusts, one for each stage of its lifecycle, such as proof of concept, development, customer test, integration test, and so on.)
Perhaps surprisingly, a large number of these applications are on premises within the Microsoft network. An important feature of claims-aware applications is that, to the applications, the traditional corporate firewall (the “flaming brick wall,” as security expert Gunnar Peterson puts it) doesn’t exist because all the application’s traffic goes over always-open ports 80 (HTTP) or 443 (HTTPS). As a result, claims-aware applications are very portable and are equally comfortable inside or outside that corporate firewall.
Figure 1 shows an overview of Microsoft IT’s identity and access management (IAM) environment. It consists of three major areas: Microsoft’s internal network, called CorpNet; its extranet (DMZ), for collaboration with partners; and cloud services. Let’s look at CorpNet first. Naturally, Microsoft uses all the identity tools at its disposal, so it uses Forefront Identity Manager (FIM) to integrate the company’s HR database into the product’s metaverse. This metaverse is “upstream” of its AD environment and feeds select HR data into it. As you might suspect, a company like Microsoft with tens of thousands of developers has a pretty complicated AD configuration.
It’s important to remember that when the phrase “Log on using your enterprise credentials” is casually tossed around in federation scenarios, this authentication process is often a lot more complicated than it sounds. Many companies don’t have a single domain, or forest, that contains everyone’s user accounts. For a variety of reasons, user accounts might be scattered across multiple forests. Microsoft, for example, has eight different AD production forests comprising 18 production domains, any one of which might contain a user’s corporate-sanctioned credentials. (Of course there are many test and development forests with separate, isolated credentials.) Because providing a separate federation service for each credential store would be cost- and labor-intensive, Microsoft has configured its major account forests to use forest trusts, with selective authentication where required, to allow users to access resources (like federation) across the forests. Along with the multi-forest AD environment, IT’s production AD FS service interacts with other claims sources (e.g., physical security), authorization services, and more than 2,500 IT-supported line-of-business (LOB) applications.
Microsoft’s extranet environment exists to allow Microsoft employees to sponsor credentials for partners and vendors for collaboration purposes, and to allow these partners to access resources such as SharePoint. An AD FS proxy is another key component of the extranet, which I’ll review in more detail later.
Finally, Microsoft’s cloud computing environment is an enormous and vitally important facet of Microsoft’s computing story. This environment falls into three categories. Office 365, Microsoft’s Software as a Service (SaaS) version of its most popular desktop and server applications, is used by Microsoft internally (in addition to the service’s external customers) and uses the Dirsync service to synchronize corporate users’ identities with the service. Windows Azure is Microsoft’s Platform as a Service (PaaS) offering. PaaS provides a platform for developing SaaS applications. It was the first Microsoft cloud computing product for the simple reason that Microsoft’s own developers needed a platform for creating SaaS versions of the company’s enterprise software. As you might expect, Windows Azure is very heavily used at Microsoft, and AD FS, along with the Windows Azure AppFabric Access Control Service (ACS), facilitates this. Finally, Microsoft uses a wide variety of third-party cloud computing service providers and partners (such as the previously mentioned payroll service).
Even though federation is a new service in the IT world, don’t make the mistake of thinking it isn’t an important service. One way to think of a federation service is as a gateway between the Kerberos world and the claims-based world. Claims-based authentication uses claims wrapped in a digitally signed token. The standard for enterprise authentication is AD, of course, and it uses Kerberos tickets. Making enterprise authentication work with claims-aware applications means that tickets must be transformed to tokens, and vice versa. This transformation is the main function of the Security Token Service (STS) component of a federation service such as AD FS. This means that as companies begin to use claims-aware applications both externally and internally, the federation service quickly becomes part of the mission-critical infrastructure. Just count the number of arrows leading to and from AD FS and its proxy service in Figure 1 to see how critical it is to Microsoft!
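To make the idea of “claims wrapped in a digitally signed token” a little more concrete, here is a minimal Python sketch of what an STS does conceptually. It is not how AD FS is implemented. AD FS issues SAML tokens signed with an X.509 token-signing certificate, whereas this toy version assumes a shared HMAC key and a JSON payload purely for illustration. The names issue_token, validate_token, and SIGNING_KEY are hypothetical, and the authenticated_user dictionary stands in for an identity that has already been validated against AD (for example, via a Kerberos ticket).

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared key, for illustration only. A real STS such as AD FS
# signs tokens with an X.509 token-signing certificate, not a shared secret.
SIGNING_KEY = b"demo-signing-key"


def issue_token(authenticated_user: dict, audience: str) -> str:
    """Wrap claims about an already-authenticated user in a signed token."""
    claims = {
        "upn": authenticated_user["upn"],
        "groups": sorted(authenticated_user["groups"]),
        "audience": audience,  # the relying party this token is meant for
    }
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8")
    )
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def validate_token(token: str, expected_audience: str) -> dict:
    """Relying-party side: check the signature, then read the claims."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(
        SIGNING_KEY, payload.encode("ascii"), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: token was altered or not issued by this STS")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["audience"] != expected_audience:
        raise ValueError("token was issued for a different relying party")
    return claims
```

The point of the sketch is the division of labor: the application never sees a password or a Kerberos ticket. It only has to trust the issuer’s signature and read the claims, which is why the same application works equally well inside or outside the firewall.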
Laura’s advice to companies planning a federation service (that should be most of you) is to look at your requirements, because those requirements will determine what kind of federation architecture you need. She says, “At the end of the day, federation is pretty simple. It’s about my people accessing your stuff, or your people accessing my stuff, or my people accessing a provider’s stuff. Who are your customers? Who are you trying to authenticate to what applications?” An enterprise that wants to authenticate its users to SaaS apps should probably have an on-premises federation service. An ISV that wants to make it easy for users to authenticate to a cloud-based application should probably host its federation service in the cloud, too.
Laura likes to joke, “If you’re having trouble setting up AD FS, it’s either a problem with PKI or a typo.” On a more serious note, she recommends that you build your federation service with the end state in mind—in other words, plan for high availability from the beginning. Based on my AD experience, I’d suggest that you build in lifecycle management for your federated trusts from the start, just like you should be doing lifecycle management for AD users, groups, and computers.
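As one way to picture lifecycle management for federated trusts, the following Python sketch keeps a simple inventory of trusts and flags the ones that are due for review. It is only an illustration under stated assumptions: the TrustRecord fields, the stage names, and the review-date convention are hypothetical and are not part of AD FS or of Microsoft IT’s actual process.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrustRecord:
    """One relying party trust in a hypothetical inventory."""
    identifier: str   # e.g. the relying party identifier URI
    service: str      # the business service the trust supports
    stage: str        # e.g. "development", "integration test", "production"
    owner: str        # who to contact when the trust comes up for review
    review_due: date  # date by which the owner must reconfirm the trust


def trusts_needing_review(records, today):
    """Return trusts whose review date has arrived, oldest first."""
    due = [r for r in records if r.review_due <= today]
    return sorted(due, key=lambda r: r.review_due)


def stages_by_service(records):
    """Group trust stages by service to see how many trusts each one carries."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.service, []).append(record.stage)
    return grouped
```

Even an inventory this simple answers the two questions that matter once you have hundreds of trusts: who owns each one, and is it still needed?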
Don’t forget to also take the requirements for an AD FS proxy into account. You’ll want an AD FS proxy (an AD FS installation option) as part of your architecture in addition to the core AD FS service. Why do you need a proxy? Unlike the AD FS service itself, the proxy doesn’t have to be joined to a domain; it’s usually used in a DMZ to forward external authentication requests to the AD FS service. In Microsoft’s case, it’s used to allow employees outside the corporate network to use claims-aware applications. It also allows extranet partners to use some of these applications. Like the core AD FS service, it should also be configured for high availability.
Federation isn’t a “nice to have” add-on. It will quickly become a mandatory high-availability service of your IT infrastructure. Leading by example, Microsoft IT demonstrates federation’s importance. To quote Microsoft Technical Fellow John Shewchuk, “Identity is the glue that binds federated IT together.” And a federation service, whether it’s maintained on premises or hosted in the cloud, is the glue that binds your AD and claims-aware applications together to help create a federated IT.