Windows network connection validating identity
If a clean VM (one in which no customer code has executed) times out its GA connection three times in a row, the HA decides that a hardware problem must be the cause since the GA would otherwise have reported an error.
The HA then reports to the FC that the server is faulty and the FC moves it to a state called Human Investigate (HI).
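The three-strikes rule above can be sketched as a small counter. This is a minimal illustration, not the actual HA code: the class name `CleanVmMonitor`, its method names, and the returned action strings are all hypothetical, and it assumes a successful GA connection resets the count.

```python
from dataclasses import dataclass

# From the text: three consecutive GA connection timeouts on a clean VM.
MAX_CONSECUTIVE_TIMEOUTS = 3


@dataclass
class CleanVmMonitor:
    """Hypothetical model of how an HA might track one clean VM."""
    consecutive_timeouts: int = 0
    reported_faulty: bool = False

    def on_ga_connected(self) -> str:
        # Assumption: a successful connection clears the streak.
        self.consecutive_timeouts = 0
        return "ok"

    def on_ga_timeout(self) -> str:
        """Record one timeout and return the action the HA would take."""
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= MAX_CONSECUTIVE_TIMEOUTS:
            # Third timeout in a row: blame hardware and tell the FC,
            # which then moves the server to Human Investigate (HI).
            self.reported_faulty = True
            return "report_faulty_to_fc"
        # Otherwise reinitialize the VM's OS and try again.
        return "reinitialize_vm"
```

The weakness the incident exposed is visible in the sketch: the counter cannot distinguish a hardware fault from a software bug that makes every GA fail in the same way.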
To prevent a cascading software bug from causing the outage of an entire cluster, the FC has an HI threshold that, when hit, essentially moves the whole cluster to a similar HI state.
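A rough sketch of the HI threshold follows. It assumes the threshold is a simple count of servers in HI; the text does not give the real value or how it is measured. The names `FabricController`, `report_faulty`, and `cluster_in_hi` are hypothetical, and the sketch only records a cluster-level flag rather than modeling what that state changes.

```python
class FabricController:
    """Hypothetical model of the FC's per-cluster HI bookkeeping."""

    def __init__(self, servers, hi_threshold):
        # Every server starts healthy; hi_threshold is an assumed count.
        self.state = {name: "healthy" for name in servers}
        self.hi_threshold = hi_threshold
        self.cluster_in_hi = False

    def report_faulty(self, server):
        """Handle an HA report; return True once the cluster is in HI."""
        self.state[server] = "HI"
        hi_count = sum(1 for s in self.state.values() if s == "HI")
        if hi_count >= self.hi_threshold:
            # Too many servers failed: treat it as a possible software
            # bug and put the whole cluster into a similar HI state.
            self.cluster_in_hi = True
        return self.cluster_in_hi
```

For example, with four servers and an assumed threshold of two, the second faulty report trips the cluster-level flag.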
When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.
As part of its standard autonomic failure recovery operations for a server in the HI state, the FC will service heal any VMs that were assigned to the failed server by reincarnating them to other servers.
In a case like this, when the VMs are moved to available servers the leap day bug will reproduce during GA initialization, resulting in a cascade of servers that move to HI.
Windows Azure comprises many different services, including Compute, Storage, Networking and higher-level services like Service Bus and SQL Azure.
This partial service outage impacted Windows Azure Compute and dependent services: Access Control Service (ACS), Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services.