This typo sparked a Microsoft Azure outage

Microsoft Azure DevOps, a suite of application lifecycle services, stopped working in the South Brazil region for about ten hours on Wednesday due to a basic code error.

On Friday Eric Mattingly, principal software engineering manager, offered an apology for the disruption and revealed the cause of the outage: a simple typo that deleted seventeen production databases.

Mattingly explained that Azure DevOps engineers occasionally take snapshots of production databases to look into reported problems or test performance improvements. And they rely on a background system that runs daily and deletes old snapshots after a set period of time.

During a recent sprint – a group project in Agile jargon – Azure DevOps engineers performed a code upgrade, replacing deprecated Microsoft.Azure.Managment.* packages with supported Azure.ResourceManager.* NuGet packages.

The result was a large pull request of changes that swapped API calls in the old packages for those in the newer packages. The typo occurred in the pull request – a code change that has to be reviewed and merged into the applicable project. And it led the background snapshot deletion job to delete the entire server.

“Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database,” said Mattingly.

Azure DevOps has tests to catch such issues, but according to Mattingly, the errant code only runs under certain conditions and thus isn’t well covered under existing tests. Those conditions, presumably, require the presence of a database snapshot that is old enough to be caught by the deletion script.

Mattingly said Sprint 222 was deployed internally (Ring 0) without incident due to the absence of any snapshot databases. Several days later, the software changes were deployed to the customer environment (Ring 1) for the South Brazil scale unit (a cluster of servers for a specific role). That environment had a snapshot database old enough to trigger the bug, which led the background job to delete the “entire Azure SQL Server and all seventeen production databases” for the scale unit.

The data has all been recovered, but it took more than ten hours. There are several reasons for that, said Mattingly.

One is that since customers can’t revive Azure SQL Servers themselves, on-call Azure engineers had to handle that, a process that took about an hour for many.

Another reason is that the databases had different backup configurations: some were configured for Zone-redundant backup and others were set up for the more recent Geo-zone-redundant backup. Reconciling this mismatch added many hours to the recovery process.

“Finally,” said Mattingly, “Even after databases began coming back online, the entire scale unit remained inaccessible even to customers whose data was in those databases due to a complex set of issues with our web servers.”

These issues arose from a server warmup task that iterated through the list of available databases with a test call. Databases in the process of being recovered chucked up an error that led the warm-up test “to perform an exponential backoff retry resulting in warmup taking ninety minutes on average, versus sub-second in a normal situation.”

Further complicating matters, this recovery process was staggered and once one or two of the servers started taking customer traffic again, they’d get overloaded, and go down. Ultimately, restoring service required blocking all traffic to the South Brazil scale unit until everything was sufficiently ready to rejoin the load balancer and handle traffic.

Various fixes and reconfigurations have been put in place to prevent the issue from recurring.

“Once again, we apologize to all the customers impacted by this outage,” said Mattingly. ®

READ SOURCE