![]() "In parallel, we're investigating more immediate actions that can be taken to provide relief to users." ![]() "We're currently analyzing error patterns to understand what factors are contributing to the issue," they tweeted. Within an hour after acknowledging the second outage, the Microsoft engineers said they were taking a two-pronged approach to the issue. The Beast of Redmond publishes full assessments of outages within fourteen days, and The Register awaits that document with interest – as, we imagine, will Azure customers.Then came the second glitch, acknowledged by Microsoft at around 1615 ET in a tweet. Microsoft also admitted "our automation was incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts."Īnd that's just the stuff the tech giant was able to discover in its immediate post-incident review, compiled within three days of an incident. Some kit needed to be replaced, while some components needed to be installed in different servers. ![]() "As a result, our onsite datacenter team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting," the report states. Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down. Microsoft also had trouble understanding why its storage infrastructure didn't come back online. "Moving forward, we are evaluating ways to ensure that the load profiles of the various chiller subsets can be prioritized so that chiller restarts will be performed for the highest load profiles first," the document states. The analysis also suggests the prepared emergency procedures did not include provisions for an incident of this sort. ![]() "We have temporarily increased the team size from three to seven, until the underlying issues are better understood, and appropriate mitigations can be put in place." "Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner," the report states. The software colossus's report offers a very detailed timeline of events that shows how its on-site team made it onto the datacenter's roof to inspect chillers exactly an hour after the power sag, and that the chillers' manufacturer had boots on the ground two hours and 39 minutes after the incident commenced.īut the document also notes that Microsoft had just three of its own people on site on the night of the outage, and admits that was too few. Voyager 2 found! Deep Space Network hears it chattering in space.Bank of Ireland outage sees customers queue for 'free' cash – or maybe any cash.Cisco's Duo Security suffers major authentication outage.UK flights disrupted by 'technical issue' with air traffic computer system.Which is when bits of Azure and other Microsoft cloud services started to evaporate. With just one chiller working in data halls that need five, "thermal loads had to be reduced by shutting down servers." "We had two chillers that were in standby which attempted to restart automatically – one managed to restart and came back online, the other restarted but was tripped offline again within minutes," Microsoft's report states.
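The promised fix – restarting the chillers that serve "the highest load profiles first" – is, in effect, a sort over chiller groups by the load they carry. A minimal sketch, with hypothetical names and figures; Microsoft has not published how its restart logic actually works:

```python
from dataclasses import dataclass

@dataclass
class ChillerSubset:
    name: str
    load_profile_kw: float  # thermal load served by this subset (invented figure)
    online: bool

def restart_order(subsets: list[ChillerSubset]) -> list[ChillerSubset]:
    """Offline chiller subsets, highest load profile first."""
    offline = [s for s in subsets if not s.online]
    return sorted(offline, key=lambda s: s.load_profile_kw, reverse=True)

# The offline group carrying the biggest thermal load gets restarted first.
groups = [
    ChillerSubset("group-a", load_profile_kw=1200.0, online=False),
    ChillerSubset("group-b", load_profile_kw=450.0, online=False),
    ChillerSubset("group-c", load_profile_kw=800.0, online=True),
]
for g in restart_order(groups):
    print(f"restart {g.name} ({g.load_profile_kw:.0f} kW)")
```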
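And the automation that was "incorrectly approving stale requests" while marking healthy nodes unhealthy: the usual guard against that failure mode is to timestamp each request and re-check the node's current state before acting on it. A sketch of the idea only – the cutoff and names are invented, and this is not Microsoft's code:

```python
import time

STALE_AFTER_SECONDS = 300  # hypothetical cutoff; the report gives no figure

def should_mark_unhealthy(request_ts: float,
                          node_is_healthy_now: bool,
                          now: float | None = None) -> bool:
    """Decide whether to act on a 'mark node unhealthy' request.

    Drops requests older than the cutoff and re-checks the node's
    current state, so a verdict formed mid-incident can't pull an
    already-recovered node back out of rotation.
    """
    now = time.time() if now is None else now
    if now - request_ts > STALE_AFTER_SECONDS:
        return False  # stale request: drop it rather than approve it
    if node_is_healthy_now:
        return False  # node has recovered: leave it in rotation
    return True
```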