Teri Radichel

Summary

The article discusses the implications of human error in IT, as exemplified by the recent 5-hour Microsoft outage caused by a router configuration change, and emphasizes the importance of proper governance, controls, and architecture to prevent and withstand such outages in critical infrastructure, particularly cloud services.

Abstract

The article reflects on the significant impact of human error in critical infrastructure, taking the recent Microsoft Outage as a case study. It highlights that a single configuration error can have widespread repercussions, affecting many users and services. The author stresses the need for better governance and automated processes to prevent such errors, suggesting that companies like Microsoft should implement stricter controls and checks, such as requiring two people for critical changes. The piece also touches on the role of cloud providers as critical infrastructure and the potential for outages to disrupt society, drawing parallels to cybersecurity attacks. It further explores the history of cloud outages, the importance of learning from past incidents, and the necessity of designing systems that can withstand failures, with Netflix's approach to infrastructure resilience as an example. The author advocates for a shift from trust-based systems to automated governance that can prevent mistakes, and for a thorough understanding of the shared responsibility model in cloud security.

Opinions

  • Human error is a significant risk to critical infrastructure, as evidenced by the Microsoft Outage being caused by a single person's router configuration change.
  • Companies should architect proper governance and controls around BGP (Border Gateway Protocol) to prevent errors, including implementing separation of duties and requiring multiple individuals for critical changes.
  • Cloud providers are considered critical infrastructure, and their outages can have a debilitating impact on society and national security.
  • The author suggests that organizations should not only rely on people doing the right thing but also implement automated governance to prevent mistakes, which could be caused by human error, compromised credentials, or social engineering attacks.
  • There is a need for companies to consider how a single insider could cause a major outage and to put in place security architecture to mitigate such risks.
  • The article advocates for architecting systems for failure, using Netflix's approach during the AWS S3 outage as a positive example of infrastructure resilience.
  • Customers of SaaS products like Microsoft Teams and O365 are at the mercy of the provider's architecture and should perform thorough security assessments.
  • The shared responsibility model in cloud security is highlighted, with the author suggesting that BGP should be included in the list of security responsibilities for cloud providers.
  • The author plans to explore the integration of Okta for authentication and authorization in cloud environments in future posts, indicating a focus on practical solutions to enhance security.

About the 5-hour Microsoft Outage

What self-sabotaging action could take your company down for 5 hours?

This is related to but not exactly part of my series on Automating Cybersecurity Metrics. The Code.

Free Content on Jobs in Cybersecurity | Sign up for the Email List

I’m taking a break today to write about the recent Microsoft outage because it is so important for all of us to learn from these incidents and figure out how we can prevent and withstand future outages.

I was waiting for Microsoft to publish a report on the matter publicly but ended up obtaining the information from other sources. In this blog post, I consider the impact of this incident and what might have prevented it. Also, what can you do about outages like this that affect your business?

Initially, I wondered if the outage was an attack related to the West's recent agreement to send tanks to Ukraine, something they had been haggling over for months. But no, it was apparently a self-imposed disaster. That's all I can tell at the moment, unless Microsoft reports some other root cause beneath the surface of what appears to be the reason for the incident. I can think of a few other possibilities, which are listed below, but so far it seems to be a case of human error.

What caused the Microsoft Outage?

The technical source of this outage is that someone changed a router configuration. A router is a device that controls where packets get sent on the Internet. But it wasn't just one router: the person pushed a command that went out to a whole group of routers, and that change caused them all to send traffic to the wrong place. The result was a 5-hour Microsoft outage that affected many people who were trying to get work done with Microsoft products.

Technically, the outage was caused by a change involving BGP (Border Gateway Protocol), an error-prone protocol that routers use to advertise routes to one another across networks. The problem is that when the wrong routes are advertised, they can propagate very quickly as well.
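
To see why a bad route can pull traffic off course so effectively, note that routers generally prefer the most specific matching prefix for a destination. Below is a minimal sketch of that longest-prefix-match behavior using Python's standard ipaddress module; the prefixes and next hops are made up for illustration and are not the actual routes involved in this incident.

```python
import ipaddress

# Illustrative routing table: prefix -> next hop (hypothetical values)
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "legitimate path",
    # A mistakenly advertised, more specific prefix wins longest-prefix match
    ipaddress.ip_network("203.0.113.128/25"): "wrong path",
}

def best_route(destination: str) -> str:
    """Return the next hop for the most specific prefix containing the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if dest in net]
    most_specific = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[most_specific]

print(best_route("203.0.113.200"))  # "wrong path": the /25 beats the legitimate /24
```

Because every router that accepts the more specific announcement makes the same choice, one bad advertisement can redirect traffic across many networks very quickly.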

I first saw this report from ThousandEyes, which deduced the problem by analyzing packet loss, disruptions, and BGP routes:

Organizations cannot put many technical guardrails around these changes within BGP itself. However, companies can architect proper governance and controls around BGP.

How do the engineers who make changes to BGP systems get access to those systems? How many people are involved in each change? Have automated processes and testing been architected around these changes to prevent errors? Are processes simply a matter of trust, or are they automated with proper checks and balances? Are the people making the changes fully qualified? Are employees rotated on a regular basis to prevent collusion?
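
As one hedged example of what automated checks around such changes could look like, here is a minimal sketch that validates a proposed route announcement against an allowlist of prefixes the organization is authorized to announce before anything gets pushed. The allowlist, prefixes, and function name are hypothetical, not any provider's actual tooling.

```python
import ipaddress

# Hypothetical allowlist of prefixes this organization is authorized to announce
AUTHORIZED_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("192.0.2.0/24"),
]

def validate_announcement(prefix: str) -> None:
    """Block a proposed announcement unless it falls within an authorized prefix."""
    proposed = ipaddress.ip_network(prefix)
    if not any(proposed.subnet_of(allowed) for allowed in AUTHORIZED_PREFIXES):
        raise ValueError(f"{prefix} is not within an authorized prefix; change blocked")

validate_announcement("198.51.100.0/25")  # passes: within an authorized /24
validate_announcement("203.0.113.0/24")   # raises ValueError: not authorized
```

A check like this belongs in the change pipeline alongside peer review, not as a replacement for it.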

I am not sure of the legality of this point, so consult a lawyer. But do companies that manage highly critical infrastructure test whether employees are willing to make changes when offered money by third parties? Also, are employees trained on how they might be approached, how someone may "make friends" or become romantically involved with them to obtain information, or in some cases offer to pay them to make a change? See the link to the Tesla article below.

Are Clouds Considered Critical Infrastructure?

They should be.

I don’t know about you, but when I can’t see my calendar or access communication systems, my day is pretty shot. That’s why I have multiple means of accessing those systems. But if the system itself is down, I’m a bit out of luck. I might not be able to log into certain systems if I cannot access my email or text messages, for example.

Think about how much of America runs on the major cloud providers these days. What sort of impact would an extended outage have on US operations (and on other countries as well)?

What is critical infrastructure in the US?

CISA defines critical infrastructure as follows:

Critical infrastructure describes the physical and cyber systems and assets that are so vital to the United States that their incapacity or destruction would have a debilitating impact on our physical or economic security or public health or safety. The Nation’s critical infrastructure provides the essential services that underpin American society.

Cloud providers control the access to much of our nation’s data. This data runs many systems that help the US function on a daily basis. That’s why it is so, so important for global technology companies who support these systems not to make a mistake.

We have enough problems with attackers and cybersecurity that we don’t need more self-imposed disasters. And if we take a look through some of the largest outages in cloud provider history — that’s exactly what is causing some of the biggest disruptions.

In addition, should a mistake happen, how quickly can a cloud provider recover? Do they plan for and test rollbacks or corrections to misconfigurations?

Cloud Outage History

AWS has had a couple of S3 outages, among other incidents, that took down large parts of the Internet because so many systems rely on S3. I think the first major incident that made people understand how reliant we are on cloud providers occurred in 2017 with AWS S3:

In the above outage, the error was caused by a typo. A human error. One single person took out a major portion of the Internet. Now mind you, I’m not blaming that person, I’m saying you should take a look at the processes that could allow such an incident to happen in your organization.

Google also has had some outages, with one of the largest in 2020.

Google says that the global authentication system outage which affected most consumer-facing services on Monday was caused by a bug in the automated quota management system impacting the Google User ID Service.

Once again, this massive cloud outage was a self-imposed error, not a malicious cyber attack. In this case it was not a human configuration mistake, but somewhere along the way improper testing — by a human — allowed a bug to slip through that caused this problem. I wrote about proper testing a number of times on this blog.

Cloudflare had an incident in June 2022 caused by a BGP change:

Microsoft Outages and Incidents

Having monitored cloud outages for a while and being a user of all the major cloud providers, I seem to recall Microsoft having the biggest number of problems with outages, cybersecurity incidents, and downtime. I haven’t sat down to record and count the incidents and data breaches, though I’ve thought about it.

I wrote about my own experience not even being able to create a VM for days here:

Here’s an example of customer data exposed through a misconfiguration of a storage bucket:

Here’s another one caused by a server misconfiguration:

The Firewall Times has compiled a list of breaches, some of which are actual cyberattacks:

The number of visible incidents could relate to different approaches to governance at different cloud providers. See more on governance below, but first, let’s look at some other recent incidents and their root causes.

Taking a look at the root cause of incidents in general

You can view a few recent outage reports here, and I’m very thankful that Microsoft is transparent about this information. I’m simply looking for solutions to problems in this post by finding the root cause. What is causing the most cloud outages and security incidents? In a lot of cases, it is human error. We can take a look at the incidents that affected Microsoft Azure in January:

1/31/2023 — Lack of alerting and/or response to resource exhaustion

1/25/2023 — Proper steps not followed during a change.

1/23/2023 — Hardware fault (not a human error)

1/18/2023 — BGP change

Two of the above (50%) are human errors.

I would argue that resource exhaustion should be monitored and addressed before that problem occurs, but I don’t have all the details. I’m sure it’s not simple but it can be done.

As for the hardware failure, these things happen, but it seems like it should have been possible to remove the failing device from rotation even if its management interface was not available. Perhaps Microsoft needs to consider how those systems are managed and perform some additional automation and testing.
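
To make the idea concrete, here is a rough sketch of that kind of automation; the device names, health-check URLs, and threshold are hypothetical, and this is not Microsoft's actual tooling. A monitor pulls a device out of rotation after repeated failed health checks instead of waiting for someone to reach its management interface.

```python
import urllib.error
import urllib.request

# Hypothetical pool of devices serving traffic, keyed by device name
POOL = {
    "device-a": "https://device-a.example.internal/health",
    "device-b": "https://device-b.example.internal/health",
}
FAILURE_THRESHOLD = 3
failures = {name: 0 for name in POOL}

def check_and_rotate() -> None:
    """Remove a device from rotation after repeated failed health checks."""
    for name, url in list(POOL.items()):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                healthy = resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            healthy = False
        failures[name] = 0 if healthy else failures[name] + 1
        if failures[name] >= FAILURE_THRESHOLD:
            POOL.pop(name)  # stop routing traffic here; alert operations separately
```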

Running a cloud platform is not easy, and especially one of the scale of the major cloud providers. (Running systems in one rack in a data center was not easy for me and I didn’t enjoy it.) But if these cloud providers are so critical to so many systems we need to evaluate mistakes after they happen and find ways to prevent them in the future.

By the way, you can also view the incident history for other cloud providers to see what types of incidents they are facing and how they are addressing them.

AWS incident history:

Google incident history:

Cloud Governance at Cloud Providers

The AWS outage above resulted from a human typo that took down the S3 service. Since that time, AWS has learned from this mistake and has taken steps to limit human error. Though I've still heard of a few incidents related to human error at AWS, it seems like there are far fewer of these at AWS than at Azure.

Perhaps the reason is because AWS has implemented better governance, as I wrote about in these posts:

At some point after that incident, I heard Stephen Schmidt, CISO at AWS, say in a presentation that it now takes two people to make a critical change. Microsoft should do the same.
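
As a hedged sketch of what a two-person rule could look like in a change pipeline (the data structure and names are made up for illustration, not AWS's or Microsoft's actual mechanism), a deployment step can refuse to apply a critical change unless someone other than the author has approved it:

```python
from dataclasses import dataclass, field

@dataclass
class CriticalChange:
    author: str
    description: str
    approvals: set = field(default_factory=set)

def apply_change(change: CriticalChange) -> None:
    """Apply a critical change only when a second person has approved it."""
    independent_approvers = change.approvals - {change.author}
    if not independent_approvers:
        raise PermissionError("Critical change requires approval by a second person")
    print(f"Applying: {change.description}")

change = CriticalChange(author="alice", description="Update WAN router policy")
change.approvals.add("alice")   # self-approval does not count
# apply_change(change)          # would raise PermissionError at this point
change.approvals.add("bob")
apply_change(change)            # two people are now involved: alice and bob
```

In practice this is usually enforced with approval gates in the change or deployment pipeline rather than custom code.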

I’ve been writing a lot about separation of duties in my latest blog series and why it matters. Separation of duties not only helps prevent data breaches, it helps prevent human error that can lead to outages.

Sometimes an outage can have a similar impact to a data breach, and availability is one of the key factors in the security CIA triad — confidentiality, integrity, and availability.

Another reason the possibility of these human errors is so concerning is that attackers use phishing and social engineering to trick insiders into making changes, or even pay them to do so.

In this case at Tesla, an insider was offered a substantial amount of money to insert malware into the Tesla network:

Kriuchkov asked the insider to insert into the organization’s computer network malware provided by Kriuchkov and his associates. Following insertion, a distributed denial of service attack would occur which was designed to occupy the information security team. Meanwhile the malware would be inserted into the network and corporate data would be extracted. The company would be asked to pay to get their data back. For his cooperation, the insider and Kriuchkov would both be well compensated.

How much would an insider need to be paid to get them to make a rogue BGP change? To spy on mobile network traffic at a telecom company? To steal a mobile SIM card? To insert a proxy to spy on traffic as it passes through routing devices on major Internet backbone devices?

Governance architecture in your own environment

Separation of duties applies not just to major cloud providers. Have you considered what changes within your own organization could cause a major outage and take down your systems? In my latest series, I've been considering some changes that could be used to take over cloud environments, like this one:

In subsequent posts, I proposed a potential solution to that problem and am now in the process of testing it. I simply paused to write this post because it is timely.

Have you thought through the ways in which a single insider could take out your organization through:

  • A mistake?
  • Compromised credentials?
  • A social engineering or phishing attack?

What kind of governance and security architecture have you put in place to prevent these egregious errors that cause significant damage? Do you have automated governance that prevents mistakes rather than simply trusting people to do the right thing? People make mistakes, can be tricked, or may simply be offered a lot of money to make an unwanted change. Your reliance on people may be misguided, as I wrote about in my book at the bottom of this post.

Architect proper governance. That is really what my blog series has turned into, and partially where it will end up.

Architect for failure

In addition to segregation of duties and proper governance — have you architected systems that can withstand failure, where possible?

Even though AWS went down in the above example, it only went down in one region. Companies that rely on the cloud for their infrastructure and could not afford that outage should have built multi-region support into their architecture. I was hosting a meetup in Seattle at the time (yes, I'm working on it; stay tuned in March) and Netflix came to speak. I asked them if they went down during the above AWS S3 outage.

No.

They had designed their infrastructure to withstand that outage. Someone else told me that customers could not start new movies but existing movie-watchers were unaffected.

Where possible, architect systems for failure. Think through what could go wrong and how systems can automatically recover or withstand the impact of an outage. Test your architecture and your recovery plan.
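
As one hedged illustration (the endpoints are hypothetical, and real multi-region resilience involves much more than client-side retries, such as data replication and health-based DNS failover), a caller can at least fail over to a copy of the service in another region when its primary region is unavailable:

```python
import urllib.error
import urllib.request

# Hypothetical copies of the same service deployed in two regions
ENDPOINTS = [
    "https://api.us-east-1.example.com/status",
    "https://api.us-west-2.example.com/status",
]

def fetch_with_failover() -> bytes:
    """Try each regional endpoint in order and return the first successful response."""
    last_error = None
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_error = err  # region unavailable; try the next one
    raise RuntimeError("All regions failed") from last_error
```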

When you don’t control the infrastructure — ask the right questions

Where possible, customers need to architect and plan for cloud failures. But in the case of Microsoft Teams and O365, customers do not have control over that infrastructure or architecture since those are SaaS products. You are at Microsoft's mercy. Have you performed a proper security assessment and asked them about their portion of the cloud security responsibility model, as I wrote about here?

I suppose I should add BGP to the list in that post.

Now back to trying to architect separation of duties into authentication and authorization in a cloud environment. I’m taking a look at Okta in future posts to see if and how that system may be able to help.

Follow for updates.

Teri Radichel | © 2nd Sight Lab 2023

About Teri Radichel:
~~~~~~~~~~~~~~~~~~~~
⭐️ Author: Cybersecurity Books
⭐️ Presentations: Presentations by Teri Radichel
⭐️ Recognition: SANS Award, AWS Security Hero, IANS Faculty
⭐️ Certifications: SANS ~ GSE 240
⭐️ Education: BA Business, Master of Software Engineering, Master of Infosec
⭐️ Company: Penetration Tests, Assessments, Phone Consulting ~ 2nd Sight Lab
Need Help With Cybersecurity, Cloud, or Application Security?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
🔒 Request a penetration test or security assessment
🔒 Schedule a consulting call
🔒 Cybersecurity Speaker for Presentation
Follow for more stories like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
❤️ Sign Up for my Medium Email List
❤️ Twitter: @teriradichel
❤️ LinkedIn: https://www.linkedin.com/in/teriradichel
❤️ Mastodon: @teriradichel@infosec.exchange
❤️ Facebook: 2nd Sight Lab
❤️ YouTube: @2ndsightlab
Microsoft Outage
BGP
Governance
Insider
Incident