Incremental Service Control Policy Rollouts to Prevent Production Outages

ACM.203 Testing and rolling out new Service Control Policies in a safe and controlled manner

Part of my series on Automating Cybersecurity Metrics. Cloud Governance. AWS Organizations. The Code.

In the last post I deployed a working organizational CloudTrail using CloudFormation.

Create an Organization CloudTrail with CloudFormation

ACM.202 Automating deployment of a trail to monitor events across all AWS accounts

I can do more to fix that but I’m tired of CloudTrail and KMS keys at the moment. I may revisit it later. At least we have logs across our organization.

Sandbox environment

In this post I’m creating a “Sandbox” OU and account for testing things that fall outside of my governance rules. I’m also going to add a backup account. As I mentioned, I’m trying to move resources from an old account to a new one so I can use my new account structure and shut down the old account. I used my existing OU and account code from prior posts.

Why a Sandbox?

A company recently implemented a new Service Control Policy that took down a production environment. I’m sure they tested it before deploying it to the entire organization, but somehow something was missed. It is complicated to roll out changes to an organization because there may be many different types of applications running in different environments.

Ideally you want to have a mirror of your production environment in an alternate account so you can test your deployments. I wrote about that here:

Defining AWS Accounts and Organizational Units

ACM.180 Defining accounts and organizational units based on trust boundaries and roles to protect critical assets

In addition, you want to test your SCPs incrementally before you roll them out. Ideally AWS would have a “test mode” for SCPs where it would only alert, not block, but that’s not how AWS policies work at the moment. So we’ll need to use an alternate strategy to roll out policy changes.

Incremental Rollouts

The first step in deploying a new SCP change is to test it in an environment that does not affect your organization as a whole, such as a Sandbox OU and account. I often talk about this in classes: you need a place where people can test new things prior to bringing them into your organization as a whole. This is where you might start testing a new Service Control Policy you’re working on. You apply it only to the Sandbox OU initially, or even to a single account within that OU where you’re testing it out.

Once you have your changes working in your Sandbox OU and accounts, you push them to your dev environment. In a large organization, you might roll them out to a small portion of your dev environment with a team that is aware of the changes and can help you resolve any issues. Maybe you only apply the SCP to a single account at a time. You have the person ultimately responsible for that environment confirm there are no issues. Then you roll out the changes to the entire dev environment.

Once things are working in dev you move the changes over to QA. As I mentioned in my prior post, the QA team may have different things running than the dev environment — such as different testing tools not used in the dev environment. You’ll want to resolve any issues there, and possibly start with a small portion of QA, then move to all of QA.

Next you move on to staging. Ideally you have a staging environment that mirrors production. You can refresh that environment to match production if something is amiss. Restoring huge databases frequently may not always be realistic, but compute resources should be easily refreshed from source. This environment can be used for testing new SCPs before pushing them to production.

But you say: “The IP addresses are different! Configuration in production is different!”

You should have deployment scripts with parameterized IP addresses and ranges so each environment uses the correct IP ranges, secrets, or anything else that differs between environments without changing the deployment code. If you have to change your code (as opposed to your config files or secrets that are only key-value pairs) then your deployment is not really secure or testable. The code cannot be integrity-checked and what has been tested may not actually be what was deployed.

If things break because you have to manually change something and a person fat-fingered the deployment, then change the deployment to pull those values from a secret or parameter store instead of altering them each time they are deployed.
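Here is a minimal sketch of that separation in Python. Everything in it is hypothetical and only for illustration: the `TEMPLATE` string, the `ENVIRONMENTS` values, and the function names are not from my actual deployment code. The point is that the template is identical in every environment, so its hash can be compared between what was tested and what was deployed, while only the key-value pairs differ.

```python
import hashlib
from string import Template

# Hypothetical deployment template; identical in every environment.
TEMPLATE = Template("allow_cidr=$allowed_cidr\ndb_host=$db_host\n")

# Hypothetical per-environment key-value pairs. In practice these would
# come from a parameter store or secrets manager, not from source code.
ENVIRONMENTS = {
    "staging": {
        "allowed_cidr": "10.10.0.0/16",
        "db_host": "db.staging.example.internal",
    },
    "production": {
        "allowed_cidr": "10.20.0.0/16",
        "db_host": "db.prod.example.internal",
    },
}


def template_digest() -> str:
    """SHA-256 of the unrendered template, the part that must not change."""
    return hashlib.sha256(TEMPLATE.template.encode("utf-8")).hexdigest()


def render(environment: str) -> str:
    """Fill in the template with one environment's values.

    substitute() raises KeyError if a value is missing, so an incomplete
    configuration fails before anything is deployed.
    """
    return TEMPLATE.substitute(ENVIRONMENTS[environment])
```

Calling `render("staging")` and `render("production")` produces different output from the same template, and `template_digest()` stays the same for both.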

Of course there is a risk that the first time they are deployed something might break. Someone has to type the first value in a key-value pair somewhere for certain application parameters. But that’s not your executable code or policy code. To resolve that issue in a back office financial environment I created a test script (credit to my coworker, Tom, who gave me the idea) to verify the values in a deployment before actually deploying the code.

For example, the code could check that IP addresses and domain names were accessible and that a trial run of the application that didn’t actually hit the database ran without errors. Get creative and figure out how to prevent the errors you see in deployments pre-deployment. But make sure all policy and executable code is parameterized in such a way the code itself does not change as it moves between environments.
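A pre-deployment check along those lines might look something like the sketch below. This is not the script I wrote for that financial environment; the function names and config keys are hypothetical. It only checks that IP values parse and that host names resolve. A fuller check would also try to connect and do a dry run of the application.

```python
import ipaddress
import socket
from typing import Callable, Dict, List


def default_resolver(hostname: str) -> str:
    """Resolve a host name using the system resolver."""
    return socket.gethostbyname(hostname)


def check_config(
    config: Dict[str, str],
    ip_keys: List[str],
    host_keys: List[str],
    resolve: Callable[[str], str] = default_resolver,
) -> List[str]:
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    for key in ip_keys:
        value = config.get(key, "")
        try:
            # Accepts a single address or a CIDR range such as 10.0.0.0/16.
            ipaddress.ip_network(value, strict=False)
        except ValueError:
            problems.append(f"{key}: '{value}' is not a valid IP address or range")
    for key in host_keys:
        value = config.get(key, "")
        if not value:
            problems.append(f"{key}: no host name provided")
            continue
        try:
            resolve(value)
        except OSError:  # socket.gaierror is a subclass of OSError
            problems.append(f"{key}: '{value}' does not resolve")
    return problems
```

The deployment would only proceed if `check_config(...)` returns an empty list. The `resolve` parameter exists so the check can be tested without touching the network.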

Also be careful to test any conditions thoroughly that change what gets deployed in different environments — conditional code should not change what is deployed in staging and production. Only configuration values should change at that point.

Conditions and Mappings in CloudFormation Templates

ACM.32 Preventing the Confused Deputy Attack in Batch Job Roles

After a successful roll out in staging, try production. Roll out incrementally for different regions, customers, or a small subset of applications. Then finally, after verifying nothing breaks, roll out to all portions of your production environment.

Once you’ve tested your policy across all OUs, you can move it from the development, QA, staging, and production environments up to the OU above them. If you want to deploy the change across OUs above that you would continue with the same type of incremental roll out for other environments beyond those OUs.
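The incremental rollout described in this section could be sketched roughly as follows. This is an illustration under assumptions, not code from this series: the `roll_out` function, the `attach` and `verified` callbacks, and the stage names are hypothetical. With boto3, `attach` could wrap the Organizations `attach_policy(PolicyId=..., TargetId=...)` call, and `verified` stands in for the person responsible for each environment confirming nothing broke.

```python
from typing import Callable, Dict, List

# Order of environments described in this post.
STAGES = ["sandbox", "dev", "qa", "staging", "production"]


def roll_out(
    policy_id: str,
    targets_by_stage: Dict[str, List[str]],
    attach: Callable[[str, str], None],
    verified: Callable[[str, str], bool],
) -> List[str]:
    """Attach policy_id to one target at a time, stage by stage.

    Stops at the first target whose owner does not confirm the environment
    still works, and returns the targets the policy was attached to so far.
    """
    attached = []
    for stage in STAGES:
        for target_id in targets_by_stage.get(stage, []):
            attach(policy_id, target_id)
            attached.append(target_id)
            if not verified(stage, target_id):
                return attached  # halt; later stages are never touched
    return attached
```

Targets within a stage can be single accounts first and the whole OU last, which matches starting with a small portion of an environment before covering all of it. Moving the policy up to the parent OU would be a separate step after the last stage succeeds.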

Isolated environment

I’m going to create a Sandbox OU for testing. The Sandbox OU should be completely isolated from the production environment in your AWS organization so that malware in a Sandbox account cannot traverse your networks or get into other accounts via misconfigured IAM policies. You’ll want to think that through and consider potential attack paths. If someone obtained a session or credentials used in your Sandbox environment, is there any way they can use those credentials to access your production environment?
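One narrow piece of that question can be checked mechanically: does any production role’s trust policy allow a principal from the Sandbox account to assume it? The sketch below is a hypothetical helper, not part of my code base. It only inspects the `Principal` element of `Allow` statements in a trust policy document you have already retrieved. It ignores conditions, resource policies, and network paths, so a clean result is not proof of isolation.

```python
from typing import Dict, List


def principals(statement: Dict) -> List[str]:
    """Return the AWS principals named in one trust policy statement."""
    principal = statement.get("Principal", {})
    if principal == "*":
        return ["*"]
    aws = principal.get("AWS", [])
    return [aws] if isinstance(aws, str) else list(aws)


def trusts_account(trust_policy: Dict, account_id: str) -> bool:
    """True if any Allow statement names account_id or a wildcard principal."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be in a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        for arn in principals(statement):
            # Principals may be a bare account ID or an ARN containing it.
            if arn == "*" or arn == account_id or f":{account_id}:" in arn:
                return True
    return False
```

You would run `trusts_account(policy, sandbox_account_id)` against the trust policy of each role in your production accounts and investigate anything that returns `True`.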

This particular Sandbox is created at the root level because I need to move over some resources from another account. I also want to be able to use the root administrator account. This is my “organization-Sandbox” for testing things that might affect the organization as a whole or trying out things that are not allowed by governance SCPs. You probably only want your most-experienced, security-conscious team members deploying here.

You might also want to create a Sandbox under your governance OU called a “developer-Sandbox” where you have a relaxed SCP for developer sandbox testing but any existing rules enforced across the organization at the governance OU level will still apply. It depends what you are trying to test and how those governance rules affect what you are trying to do.

What cannot be incrementally tested

If you have something that truly cannot be incrementally tested, then you may need to set up a separate AWS Organizations account for testing those changes. I have used this in the past for testing changes to AWS Control Tower, for example. However, with my new setup, I don’t think I will need separate accounts. I’m still testing everything so I may change my mind later as I test out new ideas.

About Teri Radichel:
~~~~~~~~~~~~~~~~~~~~
⭐️ Author: Cybersecurity Books
⭐️ Presentations: Presentations by Teri Radichel
⭐️ Recognition: SANS Award, AWS Security Hero, IANS Faculty
⭐️ Certifications: SANS ~ GSE 240
⭐️ Education: BA Business, Master of Software Engineering, Master of Infosec
⭐️ Company: Penetration Tests, Assessments, Phone Consulting ~ 2nd Sight Lab
