In a lengthy update to its service status page, Microsoft has explained what caused the extended downtime that Outlook.com and Exchange ActiveSync users experienced earlier this week, and how it resolved the problem. The company says that it has "restored service so all customers should have normal access from all of their devices," though as The Next Web notes, there is still an issue for "a small percentage of mobile users" as of this writing.
Microsoft's explanation details the triage work system administrators needed to go through to identify and resolve the outage. The main issue was "a failure in a caching service that interfaces with devices using Exchange ActiveSync." That failure set off a cascade in which devices flooded Microsoft's servers with more traffic than the servers could handle, taking down Outlook.com and SkyDrive for some users. To fix it, Microsoft was forced to block Exchange ActiveSync for a short time, giving it breathing room to fix web access before turning EAS back on. That caused a backlog for mobile devices, but more importantly, Microsoft says it needed to change its infrastructure to prevent this issue from cropping up again, which the company says it has done.
This incident was a result of a failure in a caching service that interfaces with devices using Exchange ActiveSync, including most smart phones. The failure caused these devices to receive an error and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly, with the effect that some customers were unable to access their Outlook.com email and unable to share their SkyDrive files via email.
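To get a feel for how retrying devices can turn a single failure into a flood, here is a rough sketch in Python. It is purely illustrative: the function `attempts_in_window`, its parameters, and the timing values are hypothetical, and nothing here reflects how Exchange ActiveSync clients or Microsoft's servers actually behave. It simply counts how many connection attempts one device would make during an outage, first with a fixed retry delay and then with a delay that doubles after each failure (a commonly used technique known as exponential backoff, which Microsoft's post does not mention).

```python
def attempts_in_window(window, first_delay, factor=1.0, max_delay=float("inf")):
    """Count the connection attempts one device makes while the service
    keeps returning errors for `window` seconds.

    All names and numbers are assumptions made for illustration only.
    """
    attempts = 0
    elapsed = 0.0
    delay = first_delay
    while elapsed < window:
        attempts += 1                             # this attempt fails
        elapsed += delay                          # wait before trying again
        delay = min(delay * factor, max_delay)    # grow the wait, up to a cap
    return attempts


# One device retrying every second during a 60-second outage.
fixed = attempts_in_window(60, 1)
# The same device doubling its wait after each failure, capped at 32 seconds.
backoff = attempts_in_window(60, 1, factor=2, max_delay=32)

print(fixed, backoff)
```

Multiplied across a very large number of phones, the gap between those two counts is the difference between a manageable trickle of retries and the kind of traffic surge the statement describes.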
Microsoft says "we apologize for letting ... customers down this week." The outage began three days ago, on August 14th, and though web access was restored that day, it wasn't until today that Microsoft got through the ActiveSync backlog. Hopefully the still-pending issue for mobile users will be resolved quickly.