Microsoft’s Azure AD authentication outage: What went wrong


On September 28 and 29 this week, a number of Microsoft customers worldwide were hit by a cascading series of problems that left many unable to access their Microsoft apps and services. On October 1, Microsoft posted its post-mortem on the outages, outlining what happened and the steps it plans to take to prevent this kind of issue in the future.

Starting around 5:30 p.m. ET on Monday, September 28, customers began reporting they couldn’t sign in to Microsoft and third-party applications that use Azure Active Directory (Azure AD) for authentication. (Yes, this means Office 365 and other Microsoft cloud services.) Those who were already signed in were less likely to have had issues. According to Microsoft’s report, users in the Americas and Australia were likely to be impacted more than those in Europe and Asia.

Microsoft acknowledged it was a service update targeting an internal validation test ring that caused a crash in Azure AD backend services. “A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process,” officials said.

Azure AD is designed to be geo-distributed and deployed with multiple partitions across multiple data centers around the world, and is built with isolation boundaries. Microsoft normally applies changes across a validation ring that doesn’t include customer data, followed by four additional rings over the course of several days before they hit production. But this week the SDP didn’t correctly target the validation ring due to a defect, and all rings were targeted concurrently, causing service availability to degrade, Microsoft’s report says.
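
To make the ring idea concrete, here is a minimal sketch of a staged rollout that only ever advances one ring at a time and halts if the previous ring looks unhealthy. It is purely illustrative and is not Microsoft's SDP: the ring names, the `Rollout` class and the `is_healthy` callback are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical ring names: the report only describes a validation ring
# followed by four additional rings.
RINGS = ["validation", "ring-1", "ring-2", "ring-3", "ring-4"]


@dataclass
class Rollout:
    """Tracks which rings a change has reached, in order."""
    rings: List[str] = field(default_factory=lambda: list(RINGS))
    deployed: List[str] = field(default_factory=list)

    def deploy_next(self, is_healthy: Callable[[str], bool]) -> str:
        """Deploy to exactly one more ring and return its name.

        Refuses to advance if the most recently deployed ring is unhealthy.
        """
        if self.deployed and not is_healthy(self.deployed[-1]):
            raise RuntimeError(f"halting: {self.deployed[-1]} is unhealthy")
        if len(self.deployed) == len(self.rings):
            raise RuntimeError("rollout already complete")
        target = self.rings[len(self.deployed)]
        self.deployed.append(target)
        return target
```

Because each call touches a single ring, a bad change in this sketch can only reach the rings that were already deployed, rather than every ring at once as happened in the incident described above.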

Microsoft engineering knew within five minutes that something was wrong. During the next 30 minutes, Microsoft started taking steps to expedite