A Bash Error Handler For a Bash Custom Lambda Runtime

ACM.319 Handling multiple errors and different types of errors in different functions

Part of my series on Automating Cybersecurity Metrics. Lambda. Network Security. GitHub Security. Container Security. Deploying a Static Website. The Code.


In the last post I figured out why my requests to AWS services were excessively slow after deploying VPC endpoints, and fixed it.

Troubleshooting VPC Endpoints

ACM.318 When you cannot access AWS Services or your response time slows down after deploying VPC endpoints and how to…

medium.com

Before I complete my AWS Secrets Manager solution, I need to fix my error handling because it only works for one line of code. Oops. And let’s just say this was not simple. But if it works, it will save me a ton of time in the future and can easily be baked into every container that uses my Bash runtime. Caveat — I’m sure I will find and fix other errors in the future, but it seems to be working and here’s the short version of the problems I faced along the way.

TLDR;
* Set some Bash flags to make sure your error messages show up where intended
* AWS CLI seems to be redirecting standard error to standard out
* For other Bash commands you would need to do that yourself if you want that to occur
* That means you'd need to know if you were handling an AWS CLI command or a Bash command when handling errors.
* Try to abstract your error handler out to a single function call as I did here.
* A single function call to a standard function reduces errors and missed error logging.
* If you pass variables between Bash files, pass them via temp files. (Lambda allows you to write to the /tmp directory.)
* Something was setting certain variables (like one containing my request ID) to an empty string for some reason - not in my code.
* Handling strings with multiple quotes is very, very tricky.
* Make sure your error messages do not have quotes in them.
* A divide by zero error is a good thing to test.
* The difference between an initialization and an invocation error is the request id.
* If you try to post an initialization error when you have a request id you will get a STATE TRANSITION ERROR message which tells you pretty much nothing. (Hope AWS will improve that.)
* If you have any errors in your error handling, the message won't get back to the Lambda function.
* Local testing is good.
* sts get-caller-identity is not working in Lambda
* Using temporary credentials for testing is not perfect, but reduces risk.
* Never assume the input values are "safe" or correct. Test and validate everything.
* The local environment uses HTTP. I presume you can use HTTPS on Lambda. Need to test further.

I wrote about using a trap and trying to catch errors here but we need to be a bit more specific.

Adding Error Handling to Bash Custom Lambda Runtime

ACM.306 Using Bash trap to capture and handle errors in bash scripts

medium.com

The problem is that, the way I wrote that code, the function exits after one line of code and the full function never completes. I got this working, but honestly I’m not exactly sure what is going on with all the error handling. Here are some things I noticed.

Divide by zero error

When I tried to capture the error message for a divide by zero error, I couldn’t seem to capture the error message using this format:

error=$($(( 1/0 )) 2>&1)

I was able to get the error message working with my final solution below.
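
One likely reason the format above captures nothing: Bash performs the arithmetic expansion before it applies the redirection on that command, so the message goes to the shell's original standard error. A minimal sketch of an alternative, assuming it is acceptable to wrap the expression in a command group so the redirection is in place first. This is an illustration only, not the final solution used later in this post.

```bash
#!/usr/bin/env bash
# Redirect the whole group first, then let the arithmetic expansion fail inside it.
# The "|| true" keeps a script running under "set -e" from exiting on the failure.
error=$( { echo $(( 1/0 )); } 2>&1 ) || true

echo "captured: ${error}"
```

The variable should now hold the shell's division by zero message instead of being empty.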

AWS CLI output and errors

The AWS CLI functions handle outputs and errors in unexpected ways.

In one case, I ran a command and I could not get both the error message and the function output by using this format:

response=$(aws secretsmanager list-secrets 2>&1)

If I used the above I couldn’t get the results of a successful response. What I realized was that if I redirected standard out to standard error I would get both results — but in my error handling function.

response=$(aws secretsmanager list-secrets 1>&2)

I got error and success outputs with that solution. The problem with that is that every response ended up in my error handling function which terminated the program.

What I figured out is that behind the scenes, AWS must be already pushing standard error to standard out. So all you need to get both is as follows:

response=$(aws secretsmanager list-secrets)

The problem is that to catch an error, I have to catch the EXIT system signal as noted in my prior post on this topic. Both success and failure responses end up in the error response function.

I could try to exclude exit code zero, but that doesn’t exactly work. Somehow I got this working below, though I’m not actually sure how I stopped a normal exit from hitting my error function. What I did in the end seems to be working, and I will keep testing as time goes on to make sure.
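
To see which stream a message actually arrived on, it can help to capture the two streams separately along with the exit code. A minimal sketch, using a hypothetical stand-in function (fake_cli) instead of a real AWS CLI call so it runs anywhere; the temp file and the exit code value are arbitrary choices for the demo.

```bash
#!/usr/bin/env bash
# fake_cli is a hypothetical stand-in for a CLI call: it writes a result to
# standard out, a message to standard error, and returns a non-zero exit code.
fake_cli() {
  echo '{"SecretList": []}'
  echo "An error occurred (example)" >&2
  return 254
}

err_file=$(mktemp)

# Standard out goes into the variable, standard error into the temp file.
# "|| status=$?" records the exit code without tripping "set -e".
status=0
response=$(fake_cli 2>"$err_file") || status=$?
error_text=$(cat "$err_file")
rm -f "$err_file"

echo "exit code: ${status}"
echo "stdout:    ${response}"
echo "stderr:    ${error_text}"
```

With a plain command substitution and no redirection, only standard out lands in the variable; standard error still reaches the terminal or log, which can make it look as if both were returned together.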

Exit Codes

In my last post I was checking exit codes and I was confused, thinking that the AWS CLI was returning a SIGHUP exit code. Actually, AWS has its own set of exit codes. The system signals and exit codes are two separate things.

You can find the AWS CLI return codes here:

AWS CLI Return Codes - AWS CLI 2.13.21 Command Reference

These are the following return codes returned at the end of execution of a CLI command: - Limited to commands, at least…

awscli.amazonaws.com

Because every response was being sent to my error handling function, I had to check for exit code 0 and ignore it. I think I now have a solution that avoids this, pending additional testing. The list above is still helpful for understanding what type of error we are getting back from the AWS CLI.

System Signal Codes

I found a possibly better list of system signals and what they mean here:

It also tells us what command we can run to get the specific signals on our system. YAY.

man 7 signal

Eventually I find what I’m looking for 😊

So we can use that to define custom error codes and what they mean and return that in our error messages to be nicer to people using our code.
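
For a quick check without leaving the shell, the kill builtin can translate signal numbers to names. A small sketch; the point that a child killed by signal N is reported with exit status 128 + N is standard shell behavior rather than something established in this post.

```bash
#!/usr/bin/env bash
# Translate signal numbers to names with the kill builtin.
for number in 1 2 15; do
  echo "signal ${number} is SIG$(kill -l "${number}")"
done

# A child killed by a signal is reported to the parent as 128 + signal number.
status=0
bash -c 'kill -TERM $$' || status=$?
echo "exit status after SIGTERM: ${status}"
```

This is one way to tell a signal apart from an ordinary exit code: values above 128 usually mean the process was terminated by a signal.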

Thoughtful Error Handling

Your error handler is one of your most important security defenses

medium.com

First I created a list of the return codes and values, but I named them SYS_ to not conflict with anything that might be used by the OS.

I created a function to call to report back the system error message. If there’s some way to do this from the system, I don’t know but this will work. I can also make the error messages more specific if needed.

I only added the above logic for certain error codes I initially plan to trap. I left the rest in the comments at the bottom of the file in case I need to add them later.

Next I inserted a syntax error to trigger an error from Bash rather than from the AWS CLI. As it turns out, a Bash syntax error, at least on Amazon Linux, also returns 2. So 2 is not just for an “Interrupt from keyboard.”

For this reason, I added syntax error to error code 2 as shown.
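
The code list and lookup function described above might look roughly like the following. This is a sketch under assumptions: the SYS_ prefix comes from the post, but the specific codes chosen, the function name get_signal_description, and the wording of each description are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical list: the SYS_ prefix follows the post, the exact entries are assumed.
SYS_SIGHUP=1
SYS_SIGINT=2
SYS_SIGQUIT=3
SYS_SIGTERM=15

# Return a human-readable description for a trapped code.
get_signal_description() {
  local code="$1"
  case "${code}" in
    "${SYS_SIGHUP}")  echo "Hangup detected on controlling terminal" ;;
    "${SYS_SIGINT}")  echo "Interrupt from keyboard or Bash syntax error" ;;
    "${SYS_SIGQUIT}") echo "Quit from keyboard" ;;
    "${SYS_SIGTERM}") echo "Termination signal" ;;
    *)                echo "Unknown error code ${code}" ;;
  esac
}

get_signal_description 2
get_signal_description 99
```

Keeping the descriptions in one function means the wording can be made more specific later without touching the callers.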

I could test other error messages as needed. But what was happening at this point is that if there was no error message, I was using the signal description. In fact, I figured out a way to capture the divide by 0 error message in my code at the bottom of the post.

Attempting to get the response from a function in another file

The response for the divide by zero error was initially not set because I sourced a file that simply executes some lines of code without setting the response used in the error handling function. I tried various methods to capture the error message from the command which did not work.

Finally I resorted to creating a temp file and using tee to send both standard error and standard out to the temp file. Then I display that temp file on exit. That was the only way I could get the response from a divide by zero error.

Creating a generic error handling function

As noted, the AWS errors already seem to send back both standard error and standard out. When I tried to apply tee directly to AWS commands I had issues. Perhaps I was doing something wrong, but I also didn’t want to add error handling for every single line in my function.

I created a file called errors.sh, which is now included in my Dockerfile, and I include it in my function file.

I create a temp file at the top using the mktemp command associated with the variable name TMP.

Then I add the definitions of the signal commands I am trapping, and the method to get the description for each signal.

The next two functions set and unset traps.

set_trap: sets a trap on the signals we want to capture and send to the error handler.

unset_trap: unsets the trap once we are in the error handler. What I figured out was that other signals initiated while in the error handling function would lead to state transition errors (ERROR State > ERROR state is not allowed, I gather.)
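
A minimal sketch of what such a pair of functions could look like. The signal list and the handler name (handle_error) are assumptions for illustration; the post does not show which signals are trapped.

```bash
#!/usr/bin/env bash
# The signal list and the handler name are assumptions for illustration.
TRAPPED_SIGNALS="EXIT HUP INT QUIT TERM"

handle_error() {
  echo "handler called with exit code $1"
}

set_trap() {
  # Single quotes delay the expansion of $? until the trap actually fires.
  # Word splitting of the signal list is intended here.
  trap 'handle_error $?' ${TRAPPED_SIGNALS}
}

unset_trap() {
  # "trap -" restores the default action for each listed signal.
  trap - ${TRAPPED_SIGNALS}
}

# Demo in subshells so the calling shell's own traps are untouched.
( set_trap; exit 3 ) || true               # the handler runs on exit
( set_trap; unset_trap; exit 3 ) || true   # the handler does not run
```

Calling unset_trap as the first step inside the handler is what keeps a second signal from re-entering it.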

Next I updated my send_errors function as follows.

Stop trapping exit codes

First, I unset the trap.

This prevents the invalid state transition errors I mentioned above, because we stop capturing system signals that would otherwise call send_error again while it is already executing.

Ignore exit code 0

If the exit code is 0 return 0.

Now what I need to test here is that this returns and executes any code that needs to execute after this point, but I think it does. I’ve just spent a lot of time on this so will save more testing for future posts.

Read the error message from the temp file into a variable

I’m going to show you how I sent the temp file to the error output in a minute. Presuming the error message exists in the temp file, I capture it in the msg variable and delete the temp file.

I also realized at this point that the sample code from AWS is not cleaning up the header temp files it creates when the function exits (unless that is happening behind the scenes in the RIE). To be safe I’ll delete those when the function exits as well, since these files can hang around between functions.

If the error message in the temp file is blank, use the signal message

Here’s where I use the signal error message function I created above. If there was a problem generating a proper error message, send the system code message.
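
The steps so far (unset the trap, ignore exit code 0, read the temp file, fall back to the generic description) can be sketched as one function. The names build_error_message and describe_code are hypothetical, and describe_code is only a stand-in for the signal description lookup described earlier.

```bash
#!/usr/bin/env bash
# Assumed names: TMP holds the captured output; describe_code stands in for
# the signal description lookup.
TMP=$(mktemp)

describe_code() {
  echo "Process ended with code $1"
}

build_error_message() {
  local code="$1" msg=""

  trap - EXIT            # stop trapping so the handler cannot re-enter itself

  if [ "${code}" -eq 0 ]; then
    return 0             # a clean exit is not an error
  fi

  if [ -f "${TMP}" ]; then
    msg=$(cat "${TMP}")  # whatever the failing command wrote
    rm -f "${TMP}"       # clean up so the file does not linger between runs
  fi

  if [ -z "${msg}" ]; then
    msg=$(describe_code "${code}")   # fall back to the generic description
  fi

  echo "${msg}"
}

echo "bash: 1/0: division by 0" > "${TMP}"
echo "first:  $(build_error_message 1)"   # captured message from the temp file
echo "second: $(build_error_message 2)"   # temp file is gone, fallback is used
echo "third:  $(build_error_message 0)"   # exit code 0 produces no message
```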

Echo the error message in case of problems sending the error to Lambda

I echo out the error message, because if the credentials are set but invalid, the end of the function that attempts to send error messages to Lambda will cause an error and the error message we are processing doesn’t make it to the Lambda output terminal window. It usually exists in the RIE terminal window, depending on how it was trapped. But in this case, it always gets output — I think. Still testing.

Call the appropriate API to log the error with the Lambda Service

As noted before, the difference between the initialization error and the processing error seems to be whether or not a request ID exists.

We can check to see if the request ID exists and set the error type accordingly.

Then we can use the error type and the information above to formulate the error message to send to Lambda and the API call. No more need to track the state in the outer function.

Triggering the error handler

We set some flags to make sure the script fails the way we want.

-e exit immediately on error
-o pipefail makes a pipeline return the exit status of the last (rightmost) command that failed, or zero if every command succeeded, instead of only the status of the final command.
-u treat unset variables as an error when substituting
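
A short sketch of what each flag changes. Each case runs in its own child shell so the flags do not leak into the calling shell; the variable name NOT_SET_ANYWHERE is made up for the demo.

```bash
#!/usr/bin/env bash
# -e: the script stops at the first failing command.
bash -c 'set -e; false; echo "not reached"' || echo "-e: exited with $?"

# -u: expanding an unset variable is an error.
bash -c 'set -u; echo "${NOT_SET_ANYWHERE}"' 2>/dev/null || echo "-u: exited with $?"

# pipefail: a failure anywhere in the pipeline fails the whole pipeline.
bash -c 'false | true' && echo "without pipefail: pipeline succeeded"
bash -c 'set -o pipefail; false | true' || echo "with pipefail: exited with $?"
```

The pipefail case matters most for this post, because piping a function into tee would otherwise always report the exit status of tee.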

set man page

Manual Page for set

linuxcommand.org

Then I source my errors.sh file and call the set_trap function to start capturing system signals.

Now initially I started by inserting my commands to redirect and capture the errors in a temp file in the functions/handler.sh file but that did not work very well. I would have to set the error handling redirect on every command. Also, it doesn’t work consistently.

By setting the errors in the higher level RIE code I can just capture errors in one place. What I did was move all the code to execute the functions/handler.sh file and log the response back to Lambda in its own function.

Then I just have one function to call, and I can capture the response of that one function and handle any errors that come out of it like this:

process_function 2>&1 | tee "$TMP"
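
A runnable sketch of that pattern, with a hypothetical process_function body standing in for the real one. It assumes pipefail is enabled, since otherwise the pipeline would report the exit status of tee rather than of the function.

```bash
#!/usr/bin/env bash
set -o pipefail   # without this, tee's success would hide the function's failure

TMP=$(mktemp)

# Hypothetical stand-in for the function that runs the handler code.
process_function() {
  echo "step one ok"
  echo "something went wrong" >&2
  return 3
}

# Both streams are shown on screen and copied into the temp file.
status=0
process_function 2>&1 | tee "${TMP}" || status=$?

echo "pipeline status: ${status}"
echo "captured in temp file:"
cat "${TMP}"
rm -f "${TMP}"
```

An error handler that runs later can read the temp file to recover the message, which is the role the file plays in this post.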

One thing I also did was move my code to set up credentials into a credentials.sh file. In order to reference values set in that function, I had to source (include) it in my functions/handler.sh file.

What I realized next was that the REQUEST_ID set in the main file was not accessible in the errors.sh file when the error was triggered by a line in credentials.sh.

After much trial and tribulation, like my error message that I could not retrieve in all cases, I ended up putting my request ID in a temp file also. Though the values of other variables in my error handler passed from the root file to the error handling file just fine, the request ID kept getting set to an empty string.

I scoured all my code. I don’t think I am doing that. I should probably put all my variables in a temp file or explicitly pass them into each function.

The only thing I can think of is that the variable is statically set and used when the file is sourced. In the example below the request ID is an empty string when I source the file.

I change the variable later. Now I’m setting it in a temp file to get around the issue:

Note that although a request ID should never have a quote in it, I remove quotes just to be safe. I’m using double quotes everywhere, and inserting double quotes into a value might result in errors or worse, injection attacks. I should do a better job of encoding the errors, but for now I’m focused on getting to a working state.

Then I can grab the request ID later like this:
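
One possible explanation for the empty value, which this post does not confirm: Bash runs each side of a pipeline in a subshell, so a variable assigned inside a function that is piped into another command is not visible in the parent shell afterwards. A file survives that boundary. A minimal sketch with hypothetical names (REQUEST_ID_FILE, save_request_id, load_request_id, handle_request):

```bash
#!/usr/bin/env bash
REQUEST_ID_FILE=$(mktemp)   # hypothetical location for the stored request ID

save_request_id() {
  # Strip double and single quotes before storing, as a basic precaution.
  printf '%s' "$1" | tr -d "\"'" > "${REQUEST_ID_FILE}"
}

load_request_id() {
  cat "${REQUEST_ID_FILE}"
}

REQUEST_ID=""

handle_request() {
  REQUEST_ID="abc-123"              # set inside the piped function
  save_request_id "\"abc-123\""     # also persisted to the file, quotes removed
  echo "handled"
}

handle_request | cat                # the left side of the pipe runs in a subshell

echo "variable after pipeline: '${REQUEST_ID}'"
echo "file after pipeline:     '$(load_request_id)'"
```

The variable is still empty in the parent shell, while the value read back from the file is intact.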

What happened is that I was setting the URL where we send the error to init or invocation based on whether or not I have a request ID. I explained why in the error handling prior post. I kept getting an init URL every time.

That also led me to realize that if you try to send an init error when the function is already initialized and has a request ID, you will get the INVALID STATE ERROR. That error message is very confusing. Why can’t it just say:

“You are trying to send an init error when the Lambda function is already initialized.”

But just remember that you cannot send an init error if you have a request ID. I had altered my code to check whether the request ID was present or not to set the value for the API like this:
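
A sketch of that check, assuming the standard Lambda Runtime API paths for init and invocation errors. The function name error_url and the fallback host value are made up for the demo; AWS_LAMBDA_RUNTIME_API is the environment variable the runtime normally receives.

```bash
#!/usr/bin/env bash
# AWS_LAMBDA_RUNTIME_API is provided to the runtime; the fallback is only for this demo.
RUNTIME_API="${AWS_LAMBDA_RUNTIME_API:-127.0.0.1:9001}"

# Print the Runtime API error URL for the given request ID (which may be empty).
error_url() {
  local request_id="$1"
  if [ -z "${request_id}" ]; then
    # No request ID yet: the failure happened during initialization.
    echo "http://${RUNTIME_API}/2018-06-01/runtime/init/error"
  else
    # A request ID exists: report the failure against that invocation.
    echo "http://${RUNTIME_API}/2018-06-01/runtime/invocation/${request_id}/error"
  fi
}

error_url ""
error_url "abc-123"
```

The resulting URL would then be the target of the POST request that reports the error.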

Until I got a proper REQUEST_ID value I was obviously going to keep getting the error type init.

Then I had all kinds of strange, wonky, weird errors that were very difficult to debug as they seemed to make the whole system return wacky results. Bash is not the greatest when it comes to quotes and double quotes. In order to validate that each and every value was correctly set before I attempted to send the API request, I ended up printing them out:

If there is a problem with any single value, I’ll get it on the line that caused the error or see that it is incorrect above.
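
One way to print a set of values before using them is a small helper that takes variable names. This is a sketch only; the helper name print_values and the variable names are placeholders, and it relies on Bash indirect expansion. It should not be pointed at variables that hold secrets.

```bash
#!/usr/bin/env bash
# Print each named variable, or a marker if it is unset or empty.
print_values() {
  local name
  for name in "$@"; do
    # ${!name} expands the variable whose name is stored in "name".
    printf '%s=[%s]\n' "${name}" "${!name:-<unset or empty>}"
  done
}

REQUEST_ID="abc-123"
ERROR_TYPE="invocation"
unset ERROR_MSG

print_values REQUEST_ID ERROR_TYPE ERROR_MSG
```

The brackets make stray whitespace or embedded quotes in a value easy to spot.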

By the way I noticed all the URLs were sending HTTP instead of HTTPS requests. If I leave it like that, my requests will be sent in plain text within AWS. For local testing, I do not have an HTTPS certificate set up. Be aware of that risk. In any case, I set the scheme based on whether or not I am actually inside the Lambda environment as shown above now:

I also set a DEBUG variable and use that in a file where I set credentials. This file will be different by the time I check it in due to the above cross-file variable revelations.

What I’m doing here is checking to see if AWS_ACCESS_KEY_ID is set or not. If it is not, I can hardcode some values that I pulled out of a Lambda function assuming the proper role. If they are set, then I can print out the credentials so I can copy and paste them here easily for local testing.
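
A rough sketch of that logic for local testing only. The function name setup_credentials, the placeholder values, and gating the printout on DEBUG are all assumptions; printing credentials writes secrets to the logs, so this does not belong in production.

```bash
#!/usr/bin/env bash
# Local testing only. All values are placeholders; never commit real credentials.
DEBUG="${DEBUG:-false}"   # assumed to be set by the outer script

setup_credentials() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    # Nothing in the environment: fall back to pasted temporary credentials.
    export AWS_ACCESS_KEY_ID="PASTE_TEMPORARY_ACCESS_KEY_ID"
    export AWS_SECRET_ACCESS_KEY="PASTE_TEMPORARY_SECRET_ACCESS_KEY"
    export AWS_SESSION_TOKEN="PASTE_TEMPORARY_SESSION_TOKEN"
    echo "using hard-coded test credentials"
  elif [ "${DEBUG}" = "true" ]; then
    # Credentials already present: print them for copying.
    # This writes secrets to the logs, so keep it out of production.
    echo "AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}"
    echo "AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:-}"
    echo "AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN:-}"
  fi
}

# Demo in a subshell so the caller's environment is left alone.
( unset AWS_ACCESS_KEY_ID; setup_credentials )
```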

I can source this file in a function that needs credentials. I could choose only to source the file in DEBUG mode.

The problem is that my DEBUG mode is set in an outer file and I want to make sure that it is always that same value. There are no real guarantees in BASH. It’s not the best programming language to be using but it’s fast for my proof of concept for all this stuff.

I write about how I’m obtaining credentials in the next post — and some risks associated with my method. Do not do this in production.

When I finally got everything set correctly I entered errors at various points in the code and tested. One was a divide by zero error in the file above at the top.

I run the local test environment.

I open the second terminal window as explained in my prior posts and test the function locally.

I get the expected error on that screen:

Over in my RIE screen I can see that all the values are set as expected.

Now to test in Lambda. Push the container up to ECR. Redeploy. Test.

Yay, nice error message:

I also get the rest of the output at the bottom of the screen:

The one thing I notice above is that the scheme is still http instead of https. I’ll mess around with that more later. I presume Lambda supports https here.

This whole post was triggered by simply wanting to run two AWS CLI commands and get proper errors returned. As I mentioned at the start, the prior error handler was stopping after the first command.

Let’s remove the error and retest.

Initially my local test says the token has expired. That’s good. We don’t want long-lived tokens hanging around in test files.

I can grab some valid credentials out of my Lambda invocation above if I push the new container up to Lambda and run it.

The first time I run the container it is very slow. I don’t know if this is because I’m running code in a container or what. I tested this same container previously and it took like 10 minutes on the first attempt. The second time I ran it, it took three seconds. I haven’t changed anything except the error handling.

What is odd is that now I’m getting an error message that says the Lambda function can’t reach the STS service after 5 minutes.

This is strange since I haven’t changed any of the code that accesses STS or retrieves secrets, or my networking. I was only working on my error handler. The only thing I can think of is that Amazon is blocking the get-caller-identity call in Lambda. I don’t really need it in Lambda, only locally, so I can skip it when I’m not using the hard-coded credentials in my code. I make that change and redeploy the function.

I get back all my secrets from list secrets and the specific secret I’m after with GetSecretValue.

Next I can snag the credentials and test locally.

When I test locally, everything works, including my STS call.

The function:

Phew! That took a while. All that because I wanted to call two lines of AWS CLI code. Stay tuned as I work through cloning a GitHub repo to AWS CodeCommit using GitHub commands.


About Teri Radichel:
~~~~~~~~~~~~~~~~~~~~
⭐️ Author: Cybersecurity Books
⭐️ Presentations: Presentations by Teri Radichel
⭐️ Recognition: SANS Award, AWS Security Hero, IANS Faculty
⭐️ Certifications: SANS ~ GSE 240
⭐️ Education: BA Business, Master of Software Engineering, Master of Infosec
⭐️ Company: Penetration Tests, Assessments, Phone Consulting ~ 2nd Sight Lab
