Teri Radichel

Summary

The article discusses the implications of human error in IT, as exemplified by the recent 5-hour Microsoft outage caused by a router configuration change, and emphasizes the importance of proper governance, controls, and architecture to prevent and withstand such outages in critical infrastructure, particularly cloud services.

Abstract

The article reflects on the significant impact of human error in critical infrastructure, taking the recent Microsoft Outage as a case study. It highlights that a single configuration error can have widespread repercussions, affecting many users and services. The author stresses the need for better governance and automated processes to prevent such errors, suggesting that companies like Microsoft should implement stricter controls and checks, such as requiring two people for critical changes. The piece also touches on the role of cloud providers as critical infrastructure and the potential for outages to disrupt society, drawing parallels to cybersecurity attacks. It further explores the history of cloud outages, the importance of learning from past incidents, and the necessity of designing systems that can withstand failures, with Netflix's approach to infrastructure resilience as an example. The author advocates for a shift from trust-based systems to automated governance that can prevent mistakes, and for a thorough understanding of the shared responsibility model in cloud security.

Opinions

  • Human error is a significant risk to critical infrastructure, as evidenced by the Microsoft Outage being caused by a single person's router configuration change.
  • Companies should architect proper governance and controls around BGP (Border Gateway Protocol) to prevent errors, including implementing separation of duties and requiring multiple individuals for critical changes.
  • Cloud providers are considered critical infrastructure, and their outages can have a debilitating impact on society and national security.
  • The author suggests that organizations should not only rely on people doing the right thing but also implement automated governance to prevent mistakes, which could be caused by human error, compromised credentials, or social engineering attacks.
  • There is a need for companies to consider how a single insider could cause a major outage and to put in place security architecture to mitigate such risks.
  • The article advocates for architecting systems for failure, using Netflix's approach during the AWS S3 outage as a positive example of infrastructure resilience.
  • Customers of SaaS products like Microsoft Teams and O365 are at the mercy of the provider's architecture and should perform thorough security assessments.
  • The shared responsibility model in cloud security is highlighted, with the author suggesting that BGP should be included in the list of security responsibilities for cloud providers.
  • The author plans to explore the integration of Okta for authentication and authorization in cloud environments in future posts, indicating a focus on practical solutions to enhance security.

About the 5-hour Microsoft Outage

What self-sabotaging action could take your company down for 5 hours?

This is related to but not exactly part of my series on Automating Cybersecurity Metrics. The Code.

Free Content on Jobs in Cybersecurity | Sign up for the Email List

I’m taking a break today to write about the recent Microsoft outage because it is so important for all of us to learn from these incidents and figure out how we can prevent and withstand future outages.

I was waiting for Microsoft to publish a report on the matter publicly but ended up obtaining the information from other sources. In this blog post, I consider the impact of this incident and what might have prevented it. Also, what can you do about outages like this that affect your business?

Initially, I wondered if the outage was an attack related to the West's recent agreement to send tanks to Ukraine, something they had been haggling over for months. But no, it was apparently a self-imposed disaster. That's all I can tell at the moment, unless Microsoft reports some other root cause beneath the surface of what appears to be the reason for the incident. I can think of a few other possibilities, which are listed below, but so far it seems to be a case of human error.

What caused the Microsoft Outage?

The technical source of this outage is that someone changed a router configuration. A router is a device that controls where packets get sent on the Internet. But it wasn't just one router: the person pushed a command that went out to a whole group of routers, and that change caused them all to send traffic to the wrong place. The result was a 5-hour Microsoft outage that affected many people who were trying to get work done with Microsoft products.

Technically, the outage was caused by a change involving BGP (Border Gateway Protocol), an error-prone protocol that routers use to advertise routes to one another across networks. The problem is that when the wrong routes are advertised, they can propagate very quickly as well.
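
To see why a bad route can pull traffic off course so effectively, note that routers generally prefer the most specific matching prefix for a destination. Below is a minimal sketch of that longest-prefix-match behavior using Python's standard ipaddress module; the prefixes and next hops are made up for illustration and are not the actual routes involved in this incident.

```python
import ipaddress

# Illustrative routing table: prefix -> next hop (hypothetical values)
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "legitimate path",
    # A mistakenly advertised, more specific prefix wins longest-prefix match
    ipaddress.ip_network("203.0.113.128/25"): "wrong path",
}

def best_route(destination: str) -> str:
    """Return the next hop for the most specific prefix containing the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if dest in net]
    most_specific = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[most_specific]

print(best_route("203.0.113.200"))  # "wrong path": the /25 beats the legitimate /24
```

Because every router that accepts the more specific announcement makes the same choice, one bad advertisement can redirect traffic across many networks very quickly.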

I first saw this report from ThousandEyes, which deduced the problem by analyzing packet loss, disruptions, and BGP routes:

Organizations cannot put many technical guardrails around these changes within BGP itself. However, companies can architect proper governance and controls around BGP.

How do the engineers who make changes to BGP systems get access to those systems? How many people are involved in each change? Have automated processes and testing been architected around these changes to prevent errors? Are processes simply a matter of trust, or are they automated with proper checks and balances? Are the people making the changes fully qualified? Are employees rotated on a regular basis to prevent collusion?
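
As one hedged example of what automated checks around such changes could look like, here is a minimal sketch that validates a proposed route announcement against an allowlist of prefixes the organization is authorized to announce before anything gets pushed. The allowlist, prefixes, and function name are hypothetical, not any provider's actual tooling.

```python
import ipaddress

# Hypothetical allowlist of prefixes this organization is authorized to announce
AUTHORIZED_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("192.0.2.0/24"),
]

def validate_announcement(prefix: str) -> None:
    """Block a proposed announcement unless it falls within an authorized prefix."""
    proposed = ipaddress.ip_network(prefix)
    if not any(proposed.subnet_of(allowed) for allowed in AUTHORIZED_PREFIXES):
        raise ValueError(f"{prefix} is not within an authorized prefix; change blocked")

validate_announcement("198.51.100.0/25")  # passes: within an authorized /24
validate_announcement("203.0.113.0/24")   # raises ValueError: not authorized
```

A check like this belongs in the change pipeline alongside peer review, not as a replacement for it.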

I am not sure of the legality of this point, so consult a lawyer. But do companies that manage highly critical infrastructure test whether employees are willing to make changes when offered money by third parties? Also, are employees trained on how they might be approached, how someone may "make friends" or become romantically involved with them to obtain information, or in some cases offer to pay them to make a change? See the link to the Tesla article below.

Are Clouds Considered Critical Infrastructure?

They should be.

I don’t know about you, but when I can’t see my calendar or access communication systems, my day is pretty shot. That’s why I have multiple means of accessing those systems. But if the system itself is down, I’m a bit out of luck. I might not be able to log into certain systems if I cannot access my email or text messages, for example.

Think about how much of America runs on the major cloud providers these days. What sort of impact would an extended outage have on US operations (and on other countries as well)?

What is critical infrastructure in the US?

CISA defines critical infrastructure as follows:

Critical infrastructure describes the physical and cyber systems and assets that are so vital to the United States that their incapacity or destruction would have a debilitating impact on our physical or economic security or public health or safety. The Nation’s critical infrastructure provides the essential services that underpin American society.

Cloud providers control the access to much of our nation’s data. This data runs many systems that help the US function on a daily basis. That’s why it is so, so important for global technology companies who support these systems not to make a mistake.

We have enough problems with attackers and cybersecurity that we don’t need more self-imposed disasters. And if we take a look through some of the largest outages in cloud provider history — that’s exactly what is causing some of the biggest disruptions.

In addition, should a mistake happen, how quickly can a cloud provider recover? Do they plan for and test rollbacks or corrections to misconfigurations?

Cloud Outage History

AWS has had a couple of S3 outages, among other incidents, that took down large parts of the Internet because so many systems rely on S3. I think the first major incident that made people understand how reliant we are on cloud providers occurred in 2017 with AWS S3:

In the above outage, the error was caused by a typo. A human error. One single person took out a major portion of the Internet. Now mind you, I’m not blaming that person, I’m saying you should take a look at the processes that could allow such an incident to happen in your organization.

Google also has had some outages, with one of the largest in 2020.

Google says that the global authentication system outage which affected most consumer-facing services on Monday was caused by a bug in the automated quota management system impacting the Google User ID Service.

Once again, this massive cloud outage was a self-imposed error, not a malicious cyber attack. In this case it was not a human configuration mistake, but somewhere along the way improper testing — by a human — allowed a bug to slip through that caused this problem. I wrote about proper testing a number of times on this blog.

Cloudflare had an incident in June 2022 caused by a BGP change:

Microsoft Outages and Incidents

Having monitored cloud outages for a while and being a user of all the major cloud providers, I seem to recall Microsoft having the biggest number of problems with outages, cybersecurity incidents, and downtime. I haven’t sat down to record and count the incidents and data breaches, though I’ve thought about it.

I wrote about my own experience not even being able to create a VM for days here:

Here’s an example of customer data exposed through a misconfiguration of a storage bucket:

Here’s another one caused by a server misconfiguration:

The Firewall Times has compiled a list of breaches, some of which are actual cyberattacks:

The number of visible incidents could relate to different approaches to governance at different cloud providers. See more on governance below, but first, let’s look at some other recent incidents and their root causes.

Taking a look at the root cause of incidents in general

You can view a few recent outage reports here, and I’m very thankful that Microsoft is transparent about this information. I’m simply looking for solutions to problems in this post by finding the root cause. What is causing the most cloud outages and security incidents? In a lot of cases, it is human error. We can take a look at the incidents that affected Microsoft Azure in January:

1/31/2023 — Lack of alerting and/or response to resource exhaustion

1/25/2023 — Proper steps not followed during a change.

1/23/2023 — Hardware fault (not a human error)

1/18/2023 — BGP change

Two of the above (50%) are human errors.

I would argue that resource exhaustion should be monitored and addressed before that problem occurs, but I don’t have all the details. I’m sure it’s not simple but it can be done.

As for the hardware failure, these things happen, but it seems like it should have been possible to remove the failing device from rotation even if its management interface was not available. Perhaps Microsoft needs to consider how those systems are managed and perform some additional automation and testing.
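
To make the idea concrete, here is a rough sketch of that kind of automation; the device names, health-check URLs, and threshold are hypothetical, and this is not Microsoft's actual tooling. A monitor pulls a device out of rotation after repeated failed health checks instead of waiting for someone to reach its management interface.

```python
import urllib.error
import urllib.request

# Hypothetical pool of devices serving traffic, keyed by device name
POOL = {
    "device-a": "https://device-a.example.internal/health",
    "device-b": "https://device-b.example.internal/health",
}
FAILURE_THRESHOLD = 3
failures = {name: 0 for name in POOL}

def check_and_rotate() -> None:
    """Remove a device from rotation after repeated failed health checks."""
    for name, url in list(POOL.items()):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                healthy = resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            healthy = False
        failures[name] = 0 if healthy else failures[name] + 1
        if failures[name] >= FAILURE_THRESHOLD:
            POOL.pop(name)  # stop routing traffic here; alert operations separately
```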

Running a cloud platform is not easy, and especially one of the scale of the major cloud providers. (Running systems in one rack in a data center was not easy for me and I didn’t enjoy it.) But if these cloud providers are so critical to so many systems we need to evaluate mistakes after they happen and find ways to prevent them in the future.

By the way, you can also view the incident history for other cloud providers to see what types of incidents they are facing and how they are addressing them.

AWS incident history:

Google incident history:

Cloud Governance at Cloud Providers

The AWS outage above resulted from a human typo that took down the S3 service. Since that time, AWS has learned from this mistake and has taken steps to limit human error. Though I've still heard of a few incidents related to human error at AWS, it seems like there are far fewer of these at AWS than at Azure.

Perhaps the reason is because AWS has implemented better governance, as I wrote about in these posts:

At some point after that incident, I heard Stephen Schmidt, CISO at AWS, say in a presentation that it now takes two people to make a critical change. Microsoft should do the same.
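
As a hedged sketch of what a two-person rule could look like in a change pipeline (the data structure and names are made up for illustration, not AWS's or Microsoft's actual mechanism), a deployment step can refuse to apply a critical change unless someone other than the author has approved it:

```python
from dataclasses import dataclass, field

@dataclass
class CriticalChange:
    author: str
    description: str
    approvals: set = field(default_factory=set)

def apply_change(change: CriticalChange) -> None:
    """Apply a critical change only when a second person has approved it."""
    independent_approvers = change.approvals - {change.author}
    if not independent_approvers:
        raise PermissionError("Critical change requires approval by a second person")
    print(f"Applying: {change.description}")

change = CriticalChange(author="alice", description="Update WAN router policy")
change.approvals.add("alice")   # self-approval does not count
# apply_change(change)          # would raise PermissionError at this point
change.approvals.add("bob")
apply_change(change)            # two people are now involved: alice and bob
```

In practice this is usually enforced with approval gates in the change or deployment pipeline rather than custom code.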

I’ve been writing a lot about separation of duties in my latest blog series and why it matters. Separation of duties not only helps prevent data breaches, it helps prevent human error that can lead to outages.

Sometimes an outage can have a similar impact to a data breach, and availability is one of the key factors in the security CIA triad — confidentiality, integrity, and availability.

Another reason the possibility of these human errors is so concerning is that attackers use phishing and social engineering to trick insiders into making changes, or even pay them to do so.

In this case at Tesla, an insider was offered a substantial amount of money to insert malware into the Tesla network:

Kriuchkov asked the insider to insert into the organization’s computer network malware provided by Kriuchkov and his associates. Following insertion, a distributed denial of service attack would occur which was designed to occupy the information security team. Meanwhile the malware would be inserted into the network and corporate data would be extracted. The company would be asked to pay to get their data back. For his cooperation, the insider and Kriuchkov would both be well compensated.

How much would an insider need to be paid to get them to make a rogue BGP change? To spy on mobile network traffic at a telecom company? To steal a mobile SIM card? To insert a proxy to spy on traffic as it passes through routing devices on major Internet backbone devices?

Governance architecture in your own environment

Separation of duties applies not just to major cloud providers. Have you considered what changes within your own organization could cause a major outage and take down your systems? In my latest series, I've been considering some changes that could be used to take over cloud environments, like this one:

In subsequent posts, I proposed a potential solution to that problem and am now in the process of testing it. I simply paused to write this post because it is timely.

Have you thought through the ways in which a single insider could take out your organization through:

  • A mistake?
  • Compromised credentials?
  • A social engineering or phishing attack?

What kind of governance and security architecture have you put in place to prevent these egregious errors that cause significant damage? Do you have automated governance that prevents mistakes rather than simply trusting people to do the right thing? People make mistakes, can be tricked, or may simply be offered a lot of money to make an unwanted change. Your reliance on people may be misguided, as I wrote about in my book at the bottom of this post.

Architect proper governance. That is really what my blog series has turned into, and partially where it will end up.

Architect for failure

In addition to segregation of duties and proper governance — have you architected systems that can withstand failure, where possible?

Even though AWS went down in the above example, it only went down in one region. Companies that rely on the cloud for their infrastructure and could not afford that outage should have built multi-region support into their architecture. I was hosting a meetup in Seattle at the time (yes, I'm working on it; stay tuned in March) and Netflix came to speak. I asked them if they went down during the above AWS S3 outage.

No.

They had designed their infrastructure to withstand that outage. Someone else told me that customers could not start new movies but existing movie-watchers were unaffected.

Where possible, architect systems for failure. Think through what could go wrong and how systems can automatically recover or withstand the impact of an outage. Test your architecture and your recovery plan.
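
As one hedged illustration (the endpoints are hypothetical, and real multi-region resilience involves much more than client-side retries, such as data replication and health-based DNS failover), a caller can at least fail over to a copy of the service in another region when its primary region is unavailable:

```python
import urllib.error
import urllib.request

# Hypothetical copies of the same service deployed in two regions
ENDPOINTS = [
    "https://api.us-east-1.example.com/status",
    "https://api.us-west-2.example.com/status",
]

def fetch_with_failover() -> bytes:
    """Try each regional endpoint in order and return the first successful response."""
    last_error = None
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_error = err  # region unavailable; try the next one
    raise RuntimeError("All regions failed") from last_error
```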

When you don’t control the infrastructure — ask the right questions

Where possible, customers need to architect and plan for cloud failures. But in the case of Microsoft Teams and O365, customers do not have control over that infrastructure or architecture since those are SaaS products. You are at Microsoft's mercy. Have you performed a proper security assessment and asked them about their portion of the cloud security responsibility model, as I wrote about here?

I suppose I should add BGP to the list in that post.

Now back to trying to architect separation of duties into authentication and authorization in a cloud environment. I’m taking a look at Okta in future posts to see if and how that system may be able to help.

Follow for updates.

Teri Radichel | © 2nd Sight Lab 2023

About Teri Radichel:
~~~~~~~~~~~~~~~~~~~~
⭐️ Author: Cybersecurity Books
⭐️ Presentations: Presentations by Teri Radichel
⭐️ Recognition: SANS Award, AWS Security Hero, IANS Faculty
⭐️ Certifications: SANS ~ GSE 240
⭐️ Education: BA Business, Master of Software Engineering, Master of Infosec
⭐️ Company: Penetration Tests, Assessments, Phone Consulting ~ 2nd Sight Lab
Need Help With Cybersecurity, Cloud, or Application Security?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
🔒 Request a penetration test or security assessment
🔒 Schedule a consulting call
🔒 Cybersecurity Speaker for Presentation
Follow for more stories like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
❤️ Sign Up for my Medium Email List
❤️ Twitter: @teriradichel
❤️ LinkedIn: https://www.linkedin.com/in/teriradichel
❤️ Mastodon: @teriradichel@infosec.exchange
❤️ Facebook: 2nd Sight Lab
❤️ YouTube: @2ndsightlab
Microsoft Outage
BGP
Governance
Insider
Incident