Teri Radichel

Troubleshooting VPC Endpoints

ACM.318 When you cannot access AWS Services or your response time slows down after deploying VPC endpoints and how to fix it

Part of my series on Automating Cybersecurity Metrics. Lambda. Network Security. GitHub Security. Container Security. Deploying a Static Website. The Code.

In the last post, I wrote about how to use Secrets Manager in a private network with a VPC endpoint.

I got it to work but the response was too slow, and when I ran another Secrets Manager command, it got even worse. It was taking something like 10 minutes after adding three AWS CLI commands. Clearly something was wrong — but it was still working. What?

In this post, I thought I was simply going to add GetSecretValue to my function but as it turns out I had a whole slew of other issues along the way — and I’m going to explain what they are, and how to fix them. For the slow networking, I think AWS could prevent that for customers, and I’ll explain how.

What happened was that I added calls to a new service, STS, in my function. The AWS GetSecretValue method also requires KMS. I'm not sure whether that is a factor here, but if it is, that makes three AWS service calls. For local testing I'm also calling the method to invoke Lambda functions, so four AWS service calls in total.

Then, I blocked all public access as explained in this post because there’s just a lot of junk hitting my EC2 instance and blocking it at the NACL improves performance and reduces risk.

Note also that using VPC Endpoints should reduce latency as there are fewer hops. In theory. But here’s what happened.

I set up a VPC Endpoint manually for Lambda and Secrets Manager in my VPC where I’m testing the Lambda function via an EC2 instance and the RIE.

As I’ve shown repeatedly, testing outside of Lambda helps you resolve networking issues more easily and that is even more apparent after this post.

I couldn’t get the call to list secrets to work from the local testing environment.

DNS problems

Here was problem number one. The AWS Secrets Manager list-secrets command wasn’t working in my local test environment, so I tried making a DNS request to the AWS service.

dig kms.us-east-2.amazonaws.com

That failed.

Make sure private DNS is enabled on the endpoint, VPC, and resources.

I reviewed all my DNS settings on the endpoint, VPC and EC2 instance as described in this post and I could find no issues.

If you do not set up your DNS correctly, your traffic may still be traversing the public Internet.
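One quick way to see whether a service name resolves to the endpoint or to the public service is to check whether the returned addresses are private. The following is a minimal sketch, assuming the VPC uses private (RFC 1918) address space; the function names are made up for illustration and the sample addresses in the usage note are hypothetical.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str, port: int = 443) -> list[str]:
    """Return the distinct IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def split_private_public(addresses: list[str]) -> tuple[list[str], list[str]]:
    """Split addresses into (private, public) lists, preserving input order."""
    private, public = [], []
    for address in addresses:
        if ipaddress.ip_address(address).is_private:
            private.append(address)
        else:
            public.append(address)
    return private, public
```

On the EC2 instance, something like `split_private_public(resolve_ipv4("kms.us-east-2.amazonaws.com"))` should put every address in the private list when private DNS is working for that service. Any address in the public list suggests the name is still resolving to the public service.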

I can see that the endpoint has the correct DNS name:

Private DNS is enabled.

As instructed above, my VPC settings are correct.

DNS is enabled on my EC2 instance.

Validating the Security Groups

Then I added the code with the other service calls. And then my container took 10 minutes to run! At this point I’m getting a bit frustrated. I had checked my networking multiple times, including the security groups.

Make sure your security groups allow access to the endpoint and from the endpoint to AWS Services.

Recall that you need a VPC endpoint interface security group that allows:

  • Inbound access on TCP 443 from the security group ID of the resource that needs to access the service
  • Outbound 443 tcp to the AWS service (I just used 0.0.0.0/0 and figure AWS will keep things restricted on their side since I don’t know the specific IP addresses for the service — but we could try to change that to a prefix list, if available. There aren’t many right now.)

That security group looked good.

Then I checked that my EC2 Security group had:

  • Outbound access to the Interface Endpoint Security Group ID.

That security group looked good.
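The two security group checks above can also be done on rule data rather than by eye. This is only a sketch: it assumes the rules are dictionaries shaped like the `IpPermissions` and `IpPermissionsEgress` lists in the EC2 DescribeSecurityGroups response, it only looks at rules that reference another security group (not CIDR ranges or prefix lists), and the function names and group IDs are hypothetical.

```python
def allows_tcp_443(permission: dict) -> bool:
    """True if a single rule covers TCP port 443 ("-1" means all protocols)."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":
        return True
    if protocol != "tcp":
        return False
    from_port = permission.get("FromPort")
    to_port = permission.get("ToPort")
    if from_port is None or to_port is None:
        return False
    return from_port <= 443 <= to_port


def group_is_referenced(permissions: list, group_id: str) -> bool:
    """True if any rule covering TCP 443 names group_id as its peer group."""
    for permission in permissions:
        if not allows_tcp_443(permission):
            continue
        pairs = permission.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == group_id for pair in pairs):
            return True
    return False
```

Pass the endpoint security group's inbound rules together with the resource's group ID, then the resource security group's outbound rules together with the endpoint's group ID. Both calls should return True.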

What. Is. The. Problem?

VPC Endpoint Subnets

Now what?

I started comparing the VPC Endpoint that wasn’t working to the one I deployed with CloudFormation that clearly was working. Here’s the odd thing. Secrets Manager was working in the local VPC previously but now it was failing. It must have been just traversing the Internet Gateway and I did not realize it, because somehow, there were no subnets assigned to my VPC Endpoint.

Make sure your VPC Endpoint has subnets assigned.

This is where blocking all public access except what you explicitly need in your NACLs is helpful. You will find out if something is misconfigured!

Also, this is why getting a working, reusable configuration written in code and using it consistently will help prevent errors and security problems. May I suggest micro-templates? (A term I made up and the method I have been using throughout this series to create reusable building blocks of CloudFormation templates.)

I had manually configured the VPC Endpoint for Secrets Manager and thought for sure I had assigned subnets. But when I compared my working and non-working environments, I saw in the console that the subnets on the VPC endpoint were missing.

VPC Dashboard > Click Endpoints on left > Click on endpoint > Subnets
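The same check can be scripted so a missing subnet assignment doesn't go unnoticed. This is a rough sketch, assuming the input is a list of dictionaries shaped like the `VpcEndpoints` array in the EC2 DescribeVpcEndpoints response. The function name and the IDs in any sample data are hypothetical.

```python
def interface_endpoints_missing_subnets(endpoints: list) -> list:
    """Return the IDs of interface endpoints that have no subnets assigned."""
    missing = []
    for endpoint in endpoints:
        if endpoint.get("VpcEndpointType") != "Interface":
            continue  # gateway endpoints attach to route tables, not subnets
        if not endpoint.get("SubnetIds"):
            missing.append(endpoint.get("VpcEndpointId"))
    return missing
```

An empty result means every interface endpoint in the list has at least one subnet. It says nothing about whether those are the right subnets.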

There were three subnets in my VPC so I selected them all.

So now I have three subnets assigned to my endpoint.

My DNS query now works and returns the expected results.

Problem solved, right? My call to Secrets Manager was working. But it was slooow. Hmm.

Policies — hold that thought

I had reviewed and manually adjusted all the policies in the local test environment. More on those policies in the next post. I’m focused only on networking here, as it’s enough ground to cover in one post. Just know that I wasted a lot of time looking at CloudTrail events and data events and fiddling with policies that were not the problem.

Make sure endpoint, IAM, and Resource policies are correct.
Check CloudTrail for access errors. (Remember to add the error column.)
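When scanning CloudTrail for access errors, it can help to filter down to the records that carry an error at all. A minimal sketch, assuming the events have already been loaded as dictionaries in the CloudTrail record format (with `eventSource`, `eventName`, `errorCode`, and `errorMessage` fields). The function name is made up for illustration.

```python
def failed_events(events: list, event_source: str = "") -> list:
    """Return (eventName, errorCode, errorMessage) for records with an errorCode.

    If event_source is given, only records from that service are returned.
    """
    failures = []
    for event in events:
        if "errorCode" not in event:
            continue  # successful calls carry no errorCode field
        if event_source and event.get("eventSource") != event_source:
            continue
        failures.append(
            (event.get("eventName"), event["errorCode"], event.get("errorMessage", ""))
        )
    return failures
```

For example, `failed_events(records, "secretsmanager.amazonaws.com")` would list only the failed Secrets Manager calls. If it comes back empty, as it effectively did here, the problem is probably not in the policies.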

Specific subnet assignments matter (apparently?)

At some point the whole problem was ticking me off. I went away for a while because I had spent way too long on this and it was blocking me from adding a couple of lines of code to my function. It felt like an insane amount of time for like three lines of code.

There should be some obvious logs or error messages telling you what the problem was. I added some feedback to the AWS VPC Resource Map to explain what I was trying to figure out and left.

I may not be the fastest but I probably will beat most people on persistence when it comes to solving a problem. I came back and decided to run tcpdump on my EC2 instance to see the traffic when I made the service calls. And then the problem was immediately obvious.

This is another reason you should want to run your Lambda functions in a local test environment that matches where you will deploy the Lambda function. It’s easier to validate things like this.

Here we see a bunch of requests (SYN packets) repeatedly getting no response.

Then the connections start to try to access another IP with no response…

And then it hits me.

I look up the subnet for the failed requests. It is in one of the three subnets I assigned to the endpoint, but not the subnet where the EC2 instance exists.
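Looking up which subnet an IP from the packet capture belongs to is easy to script. Here is a small sketch using Python's standard `ipaddress` module. The subnet IDs and CIDR ranges in the usage note are hypothetical placeholders, not the ones from this environment.

```python
import ipaddress


def subnet_for_ip(address: str, subnets: dict):
    """Return the ID of the subnet whose CIDR contains address, or None."""
    ip = ipaddress.ip_address(address)
    for subnet_id, cidr in subnets.items():
        if ip in ipaddress.ip_network(cidr):
            return subnet_id
    return None
```

With `subnets = {"subnet-a": "10.0.0.0/24", "subnet-b": "10.0.1.0/24"}`, calling `subnet_for_ip("10.0.1.57", subnets)` returns `"subnet-b"`. Comparing that result to the subnet of the EC2 instance shows whether the failing IP sits in a different subnet.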

So basically the DNS response contains three IPs, one for each subnet, and the client loops through them trying to connect. Two out of the three IPs fail. If you are unlucky and all four function calls happen to try the two inaccessible IPs first, too bad for you. You are going to be waiting a very long time (depending on how your network is configured).
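To confirm which of the returned IPs actually accept connections, without waiting for the operating system's full connection timeout, each address can be probed with a short timeout. This is a sketch only: the function names and the 2-second default are arbitrary choices, and a successful TCP connection does not prove the service call itself would succeed.

```python
import socket
import time


def probe(address: str, port: int = 443, timeout: float = 2.0) -> bool:
    """True if a TCP connection to address:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return False


def probe_all(addresses: list, port: int = 443, timeout: float = 2.0) -> dict:
    """Map each address to (reachable, seconds_taken)."""
    results = {}
    for address in addresses:
        started = time.monotonic()
        reachable = probe(address, port, timeout)
        results[address] = (reachable, round(time.monotonic() - started, 2))
    return results
```

Feeding `probe_all` the three IPs from the DNS response should show one reachable address and two that use up the full timeout, which matches the unanswered SYN packets in the capture.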

So the obvious fix on the customer side is that the EC2 instance subnet needs to be the only subnet assigned to the VPC Endpoint interface.

Make sure your VPC Endpoint and Resources are in the same subnet

However, this doesn’t really make sense, because you have to pay for every VPC Endpoint interface you instantiate and AWS recommends creating a hub VPC where the interfaces exist and using peering connections to allow instances in separate VPCs and subnets to access the Interfaces. So is it slow when you do that or does it even work?

It hit me later. Ah yes, I would have to give my EC2 instance security group access to all three endpoints. But I was looking at the VPC Flow logs and I couldn’t find any rejected traffic. Maybe I missed it?

But my recommendation to AWS is that the DNS query should only return the IP addresses of endpoints accessible to the resource (my EC2 instance) that is trying to reach the interface. Yes, I know that’s not how DNS works, but if you control the whole setup you can return the appropriate values.

If an IP address is not accessible, don’t return it in the DNS response, or have the endpoint connection somehow validate the correct IP to use if multiple are returned.

Also, make this visible and apparent in the logs and the VPC Resource Map so it is easier to troubleshoot. I think there should be another column in the list of ENIs for VPC Flow logs that indicates if that ENI is initiating traffic that is getting blocked, and then which resource is blocking it. More on that later.

In any case, once I removed all but the subnet my EC2 instance is in from the VPC Endpoint configuration, the extraneous failed requests went away and all was right again in the land of Lambda local testing.

Make sure your resources and VPC Endpoints are in the same subnet.

To summarize, here’s a graphic showing all the networking-related things you need to check if troubleshooting VPC Endpoints.

This took a LOT OF TIME to resolve and was not what I intended to be doing yesterday so I hope it helps someone.

Moving on to other problems…

Teri Radichel | © 2nd Sight Lab 2023
