AWS EKS Best Practices

A checklist for Cloud Engineers to live by

In this guide, we’ll explore the best practices to focus on when working with Amazon Elastic Kubernetes Service (EKS) and how to optimize application workloads, harden security configurations, and simplify cluster operations while making the most out of AWS’s powerful cloud infrastructure.

#1 — Enhance Network Security

✅ Block SSH/RDP remote access to EKS cluster node groups

Disabling SSH/RDP remote access to your EKS cluster node groups largely prevents unauthorized access and potential breaches. It also lowers the risk of bad actors taking over your infrastructure and keeps your EKS cluster resources and sensitive data safe.

To achieve this with the AWS CLI, omit the --remote-access option from the create-nodegroup command when creating the EKS cluster node group.

# With the --remote-access option (SSH access enabled):

aws eks create-nodegroup \
  --region us-east-1 \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup-1 \
  --instance-types m5.large \
  --subnets subnet-xxxxxx subnet-yyyyy \
  --remote-access ec2SshKey="my-ssh-key-1",sourceSecurityGroups="sg-XXXXX" \
  --node-role arn:aws:iam::XYXYXYXY:role/my-eks-node-role

# After removing the --remote-access option (no SSH access):

aws eks create-nodegroup \
  --region us-east-1 \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup-1 \
  --instance-types m5.large \
  --subnets subnet-xxxxxx subnet-yyyyy \
  --node-role arn:aws:iam::XYXYXYXY:role/my-eks-node-role

However, if you really need remote access, enable it on a case-by-case basis while taking extra precautions like using strong authentication, ensuring secure network connections through security groups, and regularly checking access logs for any suspicious activity.
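To audit node groups that already exist, one option is to inspect the remoteAccess block returned by describe-nodegroup. The sketch below is a minimal example; the function name nodegroup_remote_access is hypothetical, and the cluster and node group names are the same placeholders used above.

```shell
# Hypothetical helper: prints the remoteAccess settings of a node group.
# The output should be "null" when no SSH key was configured for the group.
nodegroup_remote_access() {
  local cluster="$1" nodegroup="$2"
  aws eks describe-nodegroup \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" \
    --query 'nodegroup.remoteAccess' \
    --output json
}

# Usage (placeholders from the example above):
#   nodegroup_remote_access my-cluster my-nodegroup-1
```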

✅ Block Public Access to EKS Cluster Endpoint

When you launch a new EKS cluster, EKS creates a public endpoint for the Kubernetes API server by default, so that Kubernetes management tools (e.g. kubectl) can communicate with your cluster. Requests to this endpoint are still authenticated and authorized, but because it is reachable from the internet, it enlarges the attack surface of your EKS cluster.

As a best practice, disable public access to the EKS cluster endpoint by passing endpointPublicAccess=false to the update-cluster-config command. Set endpointPrivateAccess=true at the same time to keep private access to the cluster (e.g. kubectl commands running from an EC2 bastion host within the VPC) for cluster management operations.

# Disable public access to the EKS cluster endpoint and enable private access only
# (publicAccessCidrs only applies while the public endpoint is enabled, so it is omitted)

aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true

For advanced configurations, read more about Amazon EKS cluster endpoint access control.
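To confirm which access modes a cluster currently has, the endpoint settings can be read back with describe-cluster. This is a minimal sketch; the function name cluster_endpoint_access is hypothetical, and the cluster name and region are placeholders.

```shell
# Hypothetical helper: shows the endpoint access settings of a cluster.
cluster_endpoint_access() {
  local cluster="$1" region="$2"
  aws eks describe-cluster \
    --region "$region" \
    --name "$cluster" \
    --query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess,publicCidrs:publicAccessCidrs}' \
    --output json
}

# Usage (placeholders):
#   cluster_endpoint_access my-cluster us-east-1
```

A hardened cluster should report public as false and private as true.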

✅ Restrict unnecessary ingress traffic using EKS Security Groups

Avoid opening all ports within EKS security groups. Open ports let attackers use port scanners and probing techniques to identify applications and services, and then launch attacks such as brute-forcing. In most instances, permitting inbound traffic solely on TCP port 443 (HTTPS) is sufficient.

Use the describe-security-groups command to list the inbound/ingress rules of a security group, and the revoke-security-group-ingress command to remove any unnecessary rule. If TCP port 443 (HTTPS) is not open, use the authorize-security-group-ingress command to add the missing ingress rule.

# Check inbound/ingress rules

aws ec2 describe-security-groups \
  --region us-east-1 \
  --group-ids sg-xxxxx \
  --query 'SecurityGroups[*].IpPermissions'

# Revoke non-compliant ingress rules (e.g. revoke SSH traffic on TCP port 22)

aws ec2 revoke-security-group-ingress \
  --region us-east-1 \
  --group-id sg-xxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0

# Allow incoming traffic on TCP port 443

aws ec2 authorize-security-group-ingress \
  --region us-east-1 \
  --group-id sg-xxxxx \
  --protocol tcp \
  --port 443 \
  --cidr 10.10.1.0/24
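To find candidates for cleanup across a whole region, describe-security-groups can filter on the CIDR of inbound rules. The sketch below lists groups with at least one rule open to 0.0.0.0/0; the function name world_open_security_groups is hypothetical, and each reported group still needs a manual review of its individual rules.

```shell
# Hypothetical helper: lists the IDs of security groups in a region that
# have at least one inbound rule open to the whole internet (0.0.0.0/0).
world_open_security_groups() {
  local region="$1"
  aws ec2 describe-security-groups \
    --region "$region" \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].GroupId' \
    --output text
}

# Usage (placeholder region):
#   world_open_security_groups us-east-1
```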

✅ Harden IAM Role Policies of EKS Cluster Node Groups

An IAM role is assigned to every worker node in the EKS cluster node group in order to run kubelet and interact with various other APIs. This IAM role eliminates the need for individual credentials on each node and simplifies providing fine-grained permissions. Following the principle of least privilege, ensure that these IAM roles have only the permissions needed for the tasks they perform.

The following commands remove a non-compliant inline policy from the node role and attach a managed policy in its place.

# Remove an inline policy (for a managed policy, use detach-role-policy with --policy-arn)

aws iam delete-role-policy \
  --role-name my-node-group-role \
  --policy-name my-old-policy

# Attach a managed policy (attach-role-policy takes a policy ARN, not a policy name)

aws iam attach-role-policy \
  --role-name my-node-group-role \
  --policy-arn arn:aws:iam::XYXYXYXY:policy/my-new-policy

✅ Restrict Kubernetes RBAC

Limit permissions not only in IAM but also in Kubernetes RBAC, following the principle of least privilege and reducing the attack surface. In particular, minimise the permissions granted through the aws-auth ConfigMap and through Kubernetes roles and clusterroles, so that a compromised credential can do as little damage as possible.
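As an illustration of a narrowly scoped grant, the sketch below creates a namespaced, read-only role for pods, binds it to a single user, and then checks what that user may do. The function names, the role name pod-reader, and the namespace and user values are all hypothetical placeholders.

```shell
# Hypothetical helper: grants a user read-only access to pods in one namespace.
grant_pod_reader() {
  local namespace="$1" user="$2"
  kubectl create role pod-reader \
    --namespace "$namespace" \
    --verb=get,list,watch \
    --resource=pods
  kubectl create rolebinding pod-reader-binding \
    --namespace "$namespace" \
    --role=pod-reader \
    --user="$user"
}

# Hypothetical helper: asks the API server whether a user may perform an action.
can_user() {
  local namespace="$1" user="$2" verb="$3" resource="$4"
  kubectl auth can-i "$verb" "$resource" --namespace "$namespace" --as "$user"
}

# Usage (placeholders):
#   grant_pod_reader dev jane
#   can_user dev jane list pods      # expected to answer "yes"
#   can_user dev jane delete pods    # expected to answer "no"
```

Preferring a namespaced role over a clusterrole keeps the grant confined to one namespace.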

✅ Authenticate Kubernetes API calls by integrating with an OpenID Connect identity provider

OpenID Connect (OIDC) provides a secure and flexible way to authenticate users within applications and systems. An OIDC identity provider can be used as an alternative to IAM for authenticating users to your EKS cluster. Once authentication is configured, create Kubernetes roles and clusterroles to define permissions, then bind them to the OIDC identities using Kubernetes rolebindings and clusterrolebindings. Note that you can associate only one OIDC identity provider with your cluster. For instructions, read more about authenticating users for your cluster from an OpenID Connect identity provider.
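The association itself can be done with the associate-identity-provider-config command. The sketch below is a minimal example; the function name, the config name my-oidc, and the choice of the email and groups claims are assumptions for illustration, and the issuer URL and client ID must come from your own identity provider.

```shell
# Hypothetical helper: associates an OIDC identity provider with a cluster.
# "my-oidc", "email" and "groups" are illustrative values, not requirements.
associate_oidc_provider() {
  local cluster="$1" issuer_url="$2" client_id="$3"
  aws eks associate-identity-provider-config \
    --cluster-name "$cluster" \
    --oidc "identityProviderConfigName=my-oidc,issuerUrl=${issuer_url},clientId=${client_id},usernameClaim=email,groupsClaim=groups"
}

# Usage (placeholders):
#   associate_oidc_provider my-cluster https://issuer.example.com kubernetes
```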

✅ Use EKS CNI policy (AWS-managed) to access networking resources

Attach the AmazonEKS_CNI_Policy AWS-managed policy to the IAM role used by your EKS cluster node groups to manage networking resources. This policy allows the VPC CNI plugin (amazon-vpc-cni-k8s) to list, describe, and modify VPC ENIs (Elastic Network Interfaces) on behalf of the cluster, which pod networking within the EKS environment depends on. For additional instructions, read more about configuring the Amazon VPC CNI plugin for Kubernetes.

# Attach policy

aws iam attach-role-policy \
  --role-name AmazonEKSVPCCNIRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

✅ Use ECR read-only policy (AWS-managed) to access ECR repositories

Attach the AmazonEC2ContainerRegistryReadOnly AWS-managed policy for EKS cluster node groups to grant permissions to only read and retrieve container images from ECR repositories, without allowing any unnecessary operations on ECR.

# Attach policy

aws iam attach-role-policy \
  --role-name AmazonEKSECRReadRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

✅ Use EKS Cluster policy (AWS-managed) to manage AWS resources

Attach the AmazonEKSClusterPolicy AWS-managed policy to the EKS cluster role to provide Kubernetes with the permissions it requires to manage resources on your behalf. Because the policy is AWS-managed, AWS keeps it up to date as the EKS service and its integrations with other AWS services evolve.

# Attach policy

aws iam attach-role-policy \
  --role-name AWSEKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

✅ Enable Envelope Encryption for EKS Kubernetes Secrets using KMS

By default, all Kubernetes secrets are stored in plain text in etcd, the Kubernetes backend database. Anyone with access to the Kubernetes master can see the secrets by looking them up in the backend. This is a significant vulnerability. To add an extra layer of security, implement envelope encryption (i.e. encrypting a key with another key) for these Kubernetes secrets using KMS keys. With envelope encryption, each plaintext secret is encrypted with a Data Encryption Key (DEK), and the DEK is in turn encrypted with a KMS key (the kms:Encrypt operation) before being stored in etcd. KMS supports customer-managed keys (CMKs), AWS-managed keys, and AWS-owned keys for encryption; in general, CMKs are the most recommended option.

On AWS EKS, KMS can tie in directly with the Kubernetes Secrets, using envelope encryption and storing the master key in KMS. Source: in4it.com

To implement this strategy, read more about using envelope encryption with AWS KMS keys and using EKS encryption provider support for defense-in-depth.
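On an existing cluster, secrets encryption can be switched on with the associate-encryption-config command. The sketch below is a minimal example; the function name enable_secrets_encryption is hypothetical, and the key ARN is a placeholder for a KMS key you have already created. Treat the change as one-way, since EKS does not let you disable secrets encryption once it is enabled.

```shell
# Hypothetical helper: enables envelope encryption of Kubernetes secrets
# on an existing cluster, using the given KMS key.
enable_secrets_encryption() {
  local cluster="$1" key_arn="$2"
  aws eks associate-encryption-config \
    --cluster-name "$cluster" \
    --encryption-config "[{\"resources\":[\"secrets\"],\"provider\":{\"keyArn\":\"${key_arn}\"}}]"
}

# Usage (placeholders):
#   enable_secrets_encryption my-cluster arn:aws:kms:us-east-1:XYXYXYXY:key/my-key-id
```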

#2 — Enable logging & monitoring

✅ Setup EKS control plane logging

Ensure control plane logging is enabled for all EKS clusters, so that API server logs, audit logs, authenticator logs (specific to AWS EKS), controller manager logs, and scheduler logs are published to AWS CloudWatch Logs. Each of these log types corresponds to a crucial component within the Kubernetes control plane. For instructions, read more about enabling and disabling control plane logs.

# Enable AWS EKS control plane logging

aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

Components of an EKS cluster that generate logs. Source: densify.com

✅ Setup EKS Audit Log Monitoring in GuardDuty

Auditing activities on EKS clusters for suspicious changes using a tool like GuardDuty is an important security measure. GuardDuty supports security monitoring features, including monitoring Kubernetes audit logs from EKS clusters and analysing them for potentially malicious and suspicious activity. It consumes Kubernetes audit log events directly from the Amazon EKS control plane logging feature and captures chronological activities from users, applications using the Kubernetes API, and the control plane.

# Enable EKS Audit Log Monitoring

aws guardduty update-detector \
  --detector-id xxxxxxxxxxx \
  --features '[{"Name" : "EKS_AUDIT_LOGS", "Status" : "ENABLED"}]'

Additionally, you can also consider external EKS monitoring tools like TrendMicro Cloud Conformity’s Real-Time Threat Monitoring and Analysis (RTMA) engine, which actively identifies Amazon EKS configuration adjustments within your AWS account and ensures timely audits and detection of changes at the AWS EKS service level.

✅ Setup CloudTrail logging for EKS API calls

Ensure that CloudTrail logging is activated in every account and region where you run EKS clusters, so that all calls to the Amazon EKS API are captured and documented. CloudTrail records important cluster operations (e.g. CreateCluster, DeleteCluster) and generates a detailed log entry for each event, including the IAM identity responsible for the action and the credentials used. Calls made to the Kubernetes API server itself are not recorded by CloudTrail; they appear in the audit logs of EKS control plane logging. For exact steps and instructions, read more about Logging Amazon EKS API calls with AWS CloudTrail.
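For a quick look at recent EKS activity, CloudTrail's event history can be queried by event source. The sketch below is a minimal example; the function name recent_eks_api_events is hypothetical, and the default of 10 results is an arbitrary choice.

```shell
# Hypothetical helper: lists recent Amazon EKS API calls recorded by CloudTrail.
# The second argument (number of events) defaults to 10.
recent_eks_api_events() {
  local region="$1" max_results="${2:-10}"
  aws cloudtrail lookup-events \
    --region "$region" \
    --lookup-attributes AttributeKey=EventSource,AttributeValue=eks.amazonaws.com \
    --max-results "$max_results" \
    --query 'Events[].{time:EventTime,name:EventName,user:Username}' \
    --output table
}

# Usage (placeholder region):
#   recent_eks_api_events us-east-1 20
```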

#3 — Maintain a healthy EKS cluster

✅ Enable readiness and liveness probes for all pods

Readiness probes determine if a pod is ready to serve traffic. When a pod is not ready, it is removed from the endpoints of its Services, but it remains running. Readiness probes are crucial for avoiding sending traffic to pods that are still initializing or experiencing issues.

Liveness probes verify if a container is alive and functioning correctly. If a liveness probe fails, the kubelet restarts the container. Liveness probes are essential for detecting and recovering from situations where a container becomes unresponsive or enters a faulty state while running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web-container
        image: nginx:latest
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 15

Implementing these readiness and liveness probes in Kubernetes is crucial for maintaining application health and ensuring high availability. By defining these probes, Kubernetes can automatically check the responsiveness of pods and take corrective actions when necessary.

✅ Enable pod anti-affinity to ensure spreading pod replicas across multiple worker nodes

Deploying pod workloads with multiple replicas spread across multiple worker nodes is crucial for ensuring high availability and fault tolerance in Kubernetes clusters. By utilizing the Kubernetes Anti-Affinity feature, pods are automatically scheduled across different worker nodes, minimizing the risk of a single node failure affecting all application pods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-container
        image: nginx:latest

In the above example, the podAntiAffinity field specifies that pods with the label app: web must be placed on different worker nodes (topologyKey: "kubernetes.io/hostname"). Because the rule is required rather than preferred, a replica stays Pending if no node without an app: web pod is available, so the cluster needs at least as many schedulable nodes as replicas. By deploying multiple replicas across multiple nodes, Kubernetes ensures resilience to node failures and enhances the overall availability and reliability of the application.

✅ Enable CPU & Memory resource requests and limits for pods

Applying appropriate resource requests and limits to every pod is vital for optimizing resource utilization and maintaining cluster stability in AWS EKS. Requests tell the scheduler how much CPU and memory to reserve for a pod, while limits cap what its containers may consume. Without proper allocation, resource waste can accumulate over time, leading to inefficiencies and performance bottlenecks. The Vertical Pod Autoscaler (VPA) add-on can help automate this process by adjusting resource requests based on historical usage data. VPA has traditionally needed to evict pods to apply changes, a limitation that in-place pod resizing in newer Kubernetes releases aims to address. Some tools also complement Kubernetes autoscaling with machine-learning analysis of real-time capacity utilization for finer-grained resource management.

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  containers:
  - name: web-container
    image: nginx:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi

✅ Deploy worker nodes across multiple Availability Zones

Configuring worker nodes to deploy across multiple Availability Zones is critical for enhancing the resilience and availability of AWS EKS clusters. By spreading worker nodes across zones, the impact of a single zone outage is mitigated, preventing complete cluster downtime. This is achieved either by configuring an AWS Auto Scaling Group (ASG) to span multiple Availability Zones or, as in the following eksctl configuration, by creating one node group per zone.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: multi-asgs
  region: us-west-2

nodeGroups:
  - name: ng1
    instanceType: m5.xlarge
    availabilityZones:
      - us-west-2a
  - name: ng2
    instanceType: m5.xlarge
    availabilityZones:
      - us-west-2b
  - name: ng3
    instanceType: m5.xlarge
    availabilityZones:
      - us-west-2c

✅ Keep the Kubernetes version of the EKS cluster up-to-date

Ensure all EKS clusters run on the latest stable version of Kubernetes. This approach provides access to the latest features, design updates, bug fixes, enhanced security, and improved performance. Ideally, these version checks should happen regularly (e.g. quarterly, since Kubernetes releases a new minor version roughly every four months). For Kubernetes versions compatible with EKS, read more about Amazon EKS Kubernetes versions.

# Check cluster version

aws eks describe-cluster \
  --region us-east-1 \
  --name my-cluster \
  --query 'cluster.version'

# Update cluster version (EKS upgrades one minor version at a time)

aws eks update-cluster-version \
  --region us-east-1 \
  --name my-cluster \
  --kubernetes-version 1.24

Failing to update Kubernetes versions on time also leads to higher costs. Starting April 1, 2024, a cluster running a Kubernetes version in extended support is charged a total of $0.60 per cluster per hour instead of the usual $0.10, which adds up to more than $400 per month. This is an unnecessary cost, and regularly updating the Kubernetes versions on schedule is the way to go.

Extended support for older Kubernetes versions on EKS costs 6x the standard price. Source: aws.amazon.com/blogs
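The monthly figures follow directly from the hourly prices quoted above. The short calculation below is a sketch that assumes an average month of 730 hours and works in cents to stay within shell integer arithmetic; the function name monthly_cluster_cost is hypothetical.

```shell
# Assumption: an average month has about 730 hours.
hours_per_month=730

# Hypothetical helper: takes an hourly price in cents and prints the
# monthly cost per cluster in whole dollars.
monthly_cluster_cost() {
  echo $(( $1 * hours_per_month / 100 ))
}

echo "Standard support: \$$(monthly_cluster_cost 10) per cluster per month"   # prints $73
echo "Extended support: \$$(monthly_cluster_cost 60) per cluster per month"   # prints $438
```

The difference of $365 per cluster per month is paid purely for staying on an old version.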

✅ Match the CoreDNS add-on version with the EKS cluster’s Kubernetes version

When launching a new EKS cluster, 2 CoreDNS replicas are deployed by default (regardless of node count) for high availability. Since these CoreDNS pods serve as the cluster DNS and provide name resolution for all pods in the cluster, their version must be kept up to date and compatible with the Kubernetes version of the cluster.

The CoreDNS version can be checked and updated to a suitable value using the describe-addon and update-addon commands.

# Check CoreDNS add-on version

aws eks describe-addon --cluster-name my-cluster --addon-name coredns

# Update CoreDNS add-on version

aws eks update-addon \
  --region us-east-1 \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version v1.11.1-eksbuild.6 \
  --resolve-conflicts PRESERVE

To find compatible version pairs, read more about working with the CoreDNS Amazon EKS add-on.
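The compatible versions can also be listed from the CLI with describe-addon-versions. The sketch below is a minimal example; the function name coredns_versions_for is hypothetical, and the Kubernetes version passed in is a placeholder for your cluster's version.

```shell
# Hypothetical helper: lists the CoreDNS add-on versions that EKS offers
# for a given Kubernetes version.
coredns_versions_for() {
  local k8s_version="$1"
  aws eks describe-addon-versions \
    --addon-name coredns \
    --kubernetes-version "$k8s_version" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}

# Usage (placeholder version):
#   coredns_versions_for 1.29
```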

Conclusion

By following these guidelines, you ensure your EKS environment is secure, highly available, and optimized for performance. Over the past years, the AWS team has been super innovative and has released a plethora of new features on the EKS ecosystem. Embrace these practices to unlock the full potential of AWS EKS and drive success in your cloud-native journey.
