Automate AWS Infrastructure with Boto 3: AWS Health Checks

Part 2 — How to write a Python script to automate AWS health checks

When I first joined a DevOps/SRE team, I realized that a lot of simple AWS infrastructure changes took up a large chunk of our engineering team’s time. I didn’t want to spend my valuable coding time on these manual, yet essential, tasks, so I set out on a mission to automate them. Since I had wanted to build my Python scripting skills anyway, I found a way to solve two problems at once: using Boto 3, the AWS software development kit for Python, to automate my simple, manual AWS tasks.

For the second installment in this series I wanted to cover using Boto 3 and Python to automate AWS health checks for instances and their services and events. If you came from Part 1, then you already know how to import Boto 3 and create a client to use in your script. While we created an EC2 client in the last script, we want to create one for ECS here. Using ECS is going to give us the information we need about services and events, something that EC2 does not provide.

ecs_client = boto3.client('ecs', region_name='us-east-1')

You can then use this client to perform any of the methods listed in the ECS service section of the Boto 3 documentation. The documentation is super helpful in listing all of the resources you can use, their various methods, and how exactly to configure these methods. Feel free to explore!

Writing the Script

Picture this scenario. You have lots of ECS clusters in your environments. Each cluster has multiple services, each with its own events, tasks, and target groups. If a cluster is unhealthy, it will begin to drain. However, it can be hard to tell which clusters are draining. In the past, I’ve found the only way to see whether the number of running services is decreasing is by constantly refreshing the AWS CLI. To avoid doing this, you can print out detailed event messages, which can be much more helpful in troubleshooting.

Let’s figure out how we can write a script that tells us which clusters are healthy, which are not, and what is going wrong.

clusters = ecs_client.list_clusters()['clusterArns']

You probably noticed that the clusters this gives us are ARNs: they start with arn:aws:ecs followed by the region and the account ID (a string of digits). Because we only want the name of the cluster, not the full ARN the query returns, we use the split function to get the name after the /.

There are different ways to do this, but this is what I find easiest.

for cluster in clusters:
    clusterName = cluster.split('/')[1]
    print(clusterName)
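As a small variation on the split above, here is a minimal sketch of the same idea wrapped in a helper. The function name and the sample ARN are hypothetical and only there for illustration. It cuts at the last / rather than the first, which gives the same result for ARNs shaped like the ones shown here.

```python
def cluster_name_from_arn(arn: str) -> str:
    """Return the text after the last '/' in an ECS cluster ARN."""
    # rsplit with maxsplit=1 cuts only at the final '/', so the result is
    # always the trailing name segment (or the whole string if no '/' exists).
    return arn.rsplit('/', 1)[-1]


# Hypothetical ARN, used only for illustration.
example_arn = 'arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster'
print(cluster_name_from_arn(example_arn))
```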

Look at the documentation. What method would we have to use in order to find the services for a cluster? If you look at the documentation, there is really only one method that would make sense — list_services. We can see that this method needs the name of the cluster in order to find the service, which is where we will use the clusterName variable we just found.

for cluster in clusters:
    clusterName = cluster.split('/')[1]
    services = ecs_client.list_services(cluster=clusterName)['serviceArns']
    print(services)
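One thing to keep in mind: list_services returns its results a page at a time, so a single call may not include every service in a large cluster. A minimal sketch of one way to handle that, assuming you pass in an ECS client like the one created earlier, is to use the client's paginator. The helper name list_all_service_arns is hypothetical.

```python
def list_all_service_arns(ecs_client, cluster):
    """Collect every service ARN in a cluster, following pagination."""
    # get_paginator returns an object that repeats the list_services call
    # with the next token until no pages are left.
    paginator = ecs_client.get_paginator('list_services')
    arns = []
    for page in paginator.paginate(cluster=cluster):
        arns.extend(page['serviceArns'])
    return arns


# Possible usage inside the loop, with the client from earlier:
# services = list_all_service_arns(ecs_client, clusterName)
```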

Good! We found the services using the list_services method and narrowed its response down to the list of service ARNs with the ['serviceArns'] key. Now, what method can we use to help us find the events of each service?

The describe_services method seems to be the only one that lists out events for each service. Typically, when scripting with Boto 3, if you use a list method, you will then want to use a describe method.

for cluster in clusters:
    clusterName = cluster.split('/')[1]
    services = ecs_client.list_services(cluster=clusterName)['serviceArns']
    service_description = ecs_client.describe_services(cluster=cluster, services=services)
    print(service_description)

Good! We found the service description by using describe_services and specifying the cluster and services that we found earlier.

Oops! Looks like some of our clusters have no services, so their services list is empty. This means we need to add a statement so that if this is the case, the script still runs without any errors.

for cluster in clusters:
    clusterName = cluster.split('/')[1]
    services = ecs_client.list_services(cluster=clusterName)['serviceArns']
    if services:
        service_description = ecs_client.describe_services(cluster=cluster, services=services)
        print(service_description)

Problem fixed!
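A related limit is worth guarding against: as far as I know, describe_services accepts at most 10 services per call. If a cluster can have more than that, one option is to send the ARNs in batches. This is only a sketch under that assumption; the helper name and the batch_size parameter are hypothetical.

```python
def describe_services_in_batches(ecs_client, cluster, service_arns, batch_size=10):
    """Call describe_services on slices of service_arns and merge the results."""
    descriptions = []
    # An empty list produces no iterations, so no call is made at all.
    for start in range(0, len(service_arns), batch_size):
        batch = service_arns[start:start + batch_size]
        response = ecs_client.describe_services(cluster=cluster, services=batch)
        descriptions.extend(response['services'])
    return descriptions


# Possible usage inside the loop, with the names from the script above:
# service_list = describe_services_in_batches(ecs_client, cluster, services)
```

A side effect of this approach is that an empty list of services simply returns an empty result.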

if services:
    # describe_services returns a dict; the per-service details are under 'services'
    service_names = ecs_client.describe_services(cluster=cluster, services=services)['services']
    for service in service_names:
        print(service['desiredCount'])
        print(service['runningCount'])

Now we want to use the service descriptions we found (stored in service_names) to find important metrics like the desired count and the running count. We must loop through each service in that list in order to find these metrics for each one.

This might still be hard to read. How can we make it clear to our user which number is the desired count and which is the running count?

for service in service_names:
    desired_count = service['desiredCount']
    running_count = service['runningCount']
    if desired_count != running_count:
        print(f'SERVICES RUNNING: {running_count}/{desired_count}')

That’s better! This way we only display the counts for services that don’t have the correct number of tasks running. We also print how many are running out of how many are SUPPOSED TO BE running.
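With several services in a cluster, it can also help to say which service each count belongs to. Here is a minimal sketch, assuming each service description includes a serviceName field alongside the two counts. The function name and the sample data are hypothetical.

```python
def count_mismatches(services):
    """Return one report line per service whose running count differs from desired."""
    lines = []
    for service in services:
        desired = service['desiredCount']
        running = service['runningCount']
        if desired != running:
            lines.append(f"{service['serviceName']}: {running}/{desired} running")
    return lines


# Hypothetical data shaped like the 'services' list of a describe_services response.
sample_services = [
    {'serviceName': 'web', 'desiredCount': 3, 'runningCount': 3},
    {'serviceName': 'worker', 'desiredCount': 2, 'runningCount': 0},
]
for line in count_mismatches(sample_services):
    print(line)
```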

Now we want to look at event messages. They give us helpful information to tell us the current state of the services and tasks in our environment. Event messages are something that may otherwise be hard to find and read. You’d have to go through each event one by one to see its state.

We want to easily be able to see if a service in our environment has failed, and what exactly the message is.

Earlier, we saw that the describe_services method lists information about events. Now it’s time to save that information to an events variable! Let’s look at the JSON structure of the response and see how we can filter it so we only get the events.
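The snippet for this step is not shown here, so the following is only a sketch of what it could look like. It assumes the response has a top-level services list and that each service carries its own events list. The helper name and the sample response are hypothetical.

```python
def collect_events(response):
    """Flatten the events of every service in a describe_services response."""
    events = []
    for service in response['services']:
        # .get guards against a service entry that has no 'events' key.
        events.extend(service.get('events', []))
    return events


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    'services': [
        {'serviceName': 'web',
         'events': [{'message': '(service web) has reached a steady state.'}]},
        {'serviceName': 'worker',
         'events': [{'message': '(service worker) was unable to place a task.'}]},
    ]
}
events = collect_events(sample_response)
```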

This is going to find all the events and event information for every service.

for event in events:
    print(event['message'])

We only really care about the event message, so let’s find that and print that for the user.

If we look at all of these event messages we see that every healthy service has a message of “has reached a steady state.” Since we only care about the services that are failing and DO NOT have a steady state, let’s filter these messages out.

for event in events:
    message = event['message']
    healthy = re.search('has reached a steady state', message)
    if healthy is None:
        print(message)

I am using regex to search for a certain string in the event messages. If you use this, make sure you import the re module at the top of your script.

import re

If re.search returns None, that means ‘has reached a steady state’ is not in the message, so something may have gone wrong and we want to see it. If it returns a match, the service is healthy and we can skip the message. We also want to make it easier to read event messages that indicate something may have gone wrong.

for event in events:
    message = event['message']
    healthy = re.search('has reached a steady state', message)
    if healthy is None:
        print('###FAILED###')
        print('------------------')
        print(message)
        print('------------------')
        print('###FAILED###')
        print('                  ')
        print('                  ')

There we go! That is a lot easier to read.
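One caveat: a service's event list can include older events, so a service that has since recovered may still show up as failed. A possible refinement, sketched here on the assumption that each event carries a createdAt timestamp, is to check only the newest event per service. The helper name and the sample events are hypothetical.

```python
from datetime import datetime, timezone

STEADY_STATE = 'has reached a steady state'


def newest_event(events):
    """Return the event with the latest createdAt, or None if there are none."""
    if not events:
        return None
    return max(events, key=lambda event: event['createdAt'])


# Hypothetical events, shaped like the 'events' list of one service.
sample_events = [
    {'createdAt': datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
     'message': '(service web) was unable to place a task.'},
    {'createdAt': datetime(2020, 1, 1, 12, 5, tzinfo=timezone.utc),
     'message': '(service web) has reached a steady state.'},
]

latest = newest_event(sample_events)
# Only report the service if its most recent event is not a steady-state message.
if latest is not None and STEADY_STATE not in latest['message']:
    print(latest['message'])
```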

Now we can easily see what may be going wrong in our environments without searching through every cluster for service and event information.
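To pull the steps together, here is a rough sketch of how the pieces of this walkthrough could be combined into one function. It is not the original script: the function name and the report format are hypothetical, it returns lines instead of printing them, and it uses a plain substring check in place of the regex.

```python
def check_ecs_health(ecs_client):
    """Return report lines for services that look unhealthy."""
    report = []
    for cluster_arn in ecs_client.list_clusters()['clusterArns']:
        cluster_name = cluster_arn.split('/')[1]
        services = ecs_client.list_services(cluster=cluster_name)['serviceArns']
        if not services:
            continue  # skip clusters with no services, as in the guard above
        described = ecs_client.describe_services(cluster=cluster_arn, services=services)
        for service in described['services']:
            name = service['serviceName']
            desired = service['desiredCount']
            running = service['runningCount']
            if desired != running:
                report.append(f'{cluster_name}/{name}: running {running}/{desired}')
            for event in service.get('events', []):
                if 'has reached a steady state' not in event['message']:
                    report.append(f"{cluster_name}/{name}: {event['message']}")
    return report


# Possible usage, assuming boto3 is installed and credentials are configured:
# import boto3
# for line in check_ecs_health(boto3.client('ecs', region_name='us-east-1')):
#     print(line)
```

Taking the client as a parameter keeps the function easy to try against a stand-in object before pointing it at a real account.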

Check out the next installment in our series where we’ll cover automating AWS Snapshots.

Originally published at https://www.capitalone.com.

DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.