State of Managed Kubernetes 2021

EKS vs. AKS vs. GKE from a Developer’s Perspective (2021 Edition)

Kubernetes turns seven on June 7th, with the stable release now at v1.21. As Kubernetes came to rule the container orchestration world, running containers in production quickly became the norm: more than 23% of CNCF survey respondents run more than 5,000 containers in their organization, up 109% from 2016. However, managing Kubernetes remains a difficult task, which is driving demand for managed Kubernetes offerings from the major cloud vendors.

Last year, I published my second managed Kubernetes service comparison piece, detailing the differences between Amazon’s EKS, Microsoft’s AKS, and Google’s GKE from a developer’s point of view. A lot has changed since then, both in what the vendors offer and in my personal experience using these services. So here’s the 2021 edition of the state of managed Kubernetes.

“The challenge of scaling Kubernetes, the complexity of managing the control plane, the API layer, the database — that isn’t for the faint of heart.”

— Deepak Singh, VP of Compute Services at AWS

Amazon EKS

According to Flexera’s 2021 State of Cloud report, AWS still leads the container orchestration market share with 51% of the respondents using Amazon EKS and ECS compared to 43% for AKS and 31% for GKE.

I was unable to find the exact breakdown between ECS and EKS users, but according to Datadog’s latest Container Report, EKS has the lowest usage of the three when compared to the dominance GKE and AKS show within their respective cloud platforms.

Although these survey results may be skewed by sample size, anecdotally, they match my experience with EKS. AWS groups ECS, ECR, Fargate, and EKS under a single “Containers” offering, and tends to lag behind AKS and GKE in Kubernetes-specific features such as support for the latest Kubernetes version, beta features (e.g. vertical pod autoscaler, custom kubelet arguments), and managed options (e.g. node auto-repair, automated upgrades, dashboard). For example, until the EKS team finally started supporting Kubernetes taints for managed node groups in May, I stayed away from using managed nodes to run specialized pods (e.g. databases, long-running operations) on tainted nodes. Other developers and engineers have shared similar frustrations with EKS lagging behind other vendors, which pushed them to a different cloud vendor or to self-managed clusters, as seen in the GitHub thread below:

[EKS] [request]: Managed Node Groups support for node taints · Issue #864 · aws/containers-roadmap

github.com
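To make the taint use case concrete, here is a minimal Python sketch of how a pod's tolerations are matched against a node's taints. The matching rule follows upstream Kubernetes semantics as I understand them; the class names, the `schedulable` helper, and the `dedicated=database` taint are illustrative assumptions, not EKS or Kubernetes API objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with "Exists" tolerates any key
    operator: str = "Equal"  # "Equal" (default) or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False  # unknown operator


def schedulable(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A pod fits only if every hard taint is matched by some toleration.

    "PreferNoSchedule" is a soft preference, so it never blocks scheduling.
    """
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint reserving a node group for database pods.
db_taint = Taint(key="dedicated", value="database", effect="NoSchedule")
db_toleration = Toleration(key="dedicated", value="database", effect="NoSchedule")
```

With these definitions, `schedulable([db_taint], [])` is false for an ordinary pod, while `schedulable([db_taint], [db_toleration])` is true for the database pod, which is the isolation that tainted node groups are meant to provide.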

I don’t have insight into the inner workings of the AWS product teams, but I would attribute this slower velocity to a few factors:

AWS’s focus is split across its various container offerings. There has been a lot of work on AWS Fargate (e.g. support for EFS, encrypted storage, integration with Container Insights) and AWS ECS (e.g. support for batch jobs, Fluent Bit). If Amazon were to focus on EKS alone, velocity might pick up.
Similarly, AWS must support several EKS interfaces (e.g. eksctl, CloudFormation, AWS CLI) when pushing out new features, as opposed to having a single interface (e.g. Azure CLI, gcloud CLI) and letting third parties (e.g. Terraform) implement the changes later.
AWS has always preferred a more hands-off approach, giving both the flexibility and the responsibility to the customer. For example, EKS does not ship with an ALB ingress controller, logging agents that integrate with CloudWatch, or node autoscalers by default. Whereas GKE and AKS embed those essential Kubernetes operational components in the managed service, EKS relies on the customer to modify the bare offering to make their cluster “production-ready.”

In general, my biggest disappointment with EKS has been how little of the operational burden managed node groups actually offload. They do automate the provisioning and lifecycle management of nodes, but not in the same sense as GKE, which can repair nodes or upgrade versions based on release channels. If you need to run your own AMI (e.g. to install SSM Agent or Amazon Inspector as per EKS security best practices), upgrading node groups has always been tricky: creating a new node group and draining the old one has been easier than updating in place, which is work I expected a managed node group to handle for me in the background.
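The replace-and-drain workaround can be sketched as an ordered list of commands. This is only a rough Python sketch that builds the command lines without running them: the `replacement_plan` helper and the cluster and group names are hypothetical, it assumes the nodes carry the `eks.amazonaws.com/nodegroup` label that managed node groups apply, and a custom AMI would normally be supplied through an eksctl config file or launch template, which is omitted here.

```python
from typing import List


def replacement_plan(cluster: str, old_group: str, new_group: str) -> List[List[str]]:
    """Build the commands for swapping one managed node group for another.

    Nothing is executed here; the caller decides whether to print the
    commands or pass each one to subprocess.run().
    """
    old_nodes = f"eks.amazonaws.com/nodegroup={old_group}"
    return [
        # 1. Bring up the replacement group first so pods have somewhere to go.
        ["eksctl", "create", "nodegroup", "--cluster", cluster,
         "--name", new_group, "--managed"],
        # 2. Stop new pods from landing on the old nodes.
        ["kubectl", "cordon", "--selector", old_nodes],
        # 3. Evict the running pods so they reschedule onto the new group.
        ["kubectl", "drain", "--selector", old_nodes,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        # 4. Remove the now-empty old group.
        ["eksctl", "delete", "nodegroup", "--cluster", cluster,
         "--name", old_group],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for command in replacement_plan("demo-cluster", "workers-v1", "workers-v2"):
        print(" ".join(command))
```

Keeping the plan as data makes it easy to review the commands before anything touches the cluster, which matters when the old group is running stateful workloads.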

But most of these inconveniences are a one-time setup pain that can be solved with a shared Terraform module or cluster setup script. Given AWS’s continued dominance in the public cloud market, EKS lagging or lacking features here and there may not be enough to warrant organizations switching to a different cloud provider, especially when they have other legacy applications or rely on a managed database or data pipeline on AWS.

The more interesting developments on EKS in my mind are EKS Anywhere and the open-sourced eks-distro, both aimed at capturing the growing hybrid-cloud market. With container usage growing, the untapped potential lies in converting on-prem or hybrid-cloud customers to use managed EKS in those environments.

Stephen O’Grady, the co-founder of the analyst firm RedMonk, also sees an industry-wide shift towards managed offerings:

“When enterprises consider something strategic, the initial inclination is to run it themselves. Then they realize over time as they acclimate that not only is it not giving them any competitive advantage, it is more likely than not the vendors can run it better than they can. Is every enterprise going down this route? Not yet, but the appetite and direction of travel seems clear.”

If Amazon is able to capture the growing market of large enterprises that are opening up to the idea of public cloud while also running Kubernetes in their own data centers alongside some workloads on AWS, its container service offering could help further cement its lead in the public cloud market.

Azure AKS

AKS continues to be the quickest to support newer versions of Kubernetes (including container runtimes) and also pushes customers to upgrade the fastest (it only supports the last three minor versions). The Azure team also significantly improved the developer experience by supporting auto-upgrades for nodes and planned maintenance windows (in preview) to close the gap on GKE’s lead. While a financially backed SLA is no longer a competitive edge now that other vendors have followed suit, Azure is still the only cloud vendor to provide a free managed control plane (vs. $0.10/hr for EKS and GKE). The only missing non-Kubernetes feature I can think of is a hardened OS image optimized for running containers. Google provides Container-Optimized OS, Amazon provides Bottlerocket, but AKS currently only supports Ubuntu or Windows Server.
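To put the control plane fee in perspective, here is a rough back-of-the-envelope calculation in Python. The hourly rates are the ones quoted above; the 730-hour month and the function name are my own assumptions, and the sketch ignores free tiers, credits, and node costs.

```python
# Assumption: an average month is 8,760 hours / 12 = 730 hours.
HOURS_PER_MONTH = 730

# Hourly control plane fee per cluster, in USD, as quoted in this article.
CONTROL_PLANE_HOURLY_USD = {"AKS": 0.00, "EKS": 0.10, "GKE": 0.10}


def monthly_control_plane_cost(provider: str, clusters: int = 1) -> float:
    """Approximate monthly control plane fee for a number of clusters."""
    hourly = CONTROL_PLANE_HOURLY_USD[provider]
    return round(hourly * HOURS_PER_MONTH * clusters, 2)


if __name__ == "__main__":
    for provider in CONTROL_PLANE_HOURLY_USD:
        print(provider, monthly_control_plane_cost(provider, clusters=5))
```

At roughly $73 per cluster per month, the fee is negligible for a single production cluster but starts to add up for teams that spin up many small clusters per environment or per tenant.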

Since Azure is not the primary cloud vendor that my company uses, I don’t have much experience with AKS at scale, but the general impression is that AKS has quickly caught up with GKE and that Microsoft has been investing heavily in this space. The most interesting part of AKS in my mind is how the developer experience will grow as a result of Microsoft’s acquisitions over the years.

First, Helm (which came to Microsoft through its 2017 acquisition of Deis) continues to dominate as the go-to method for packaging Kubernetes applications. Then we have VS Code, a popular text editor developed by Microsoft, which supports a feature called “Bridge to Kubernetes” that lets users run and debug code as if it were running inside Kubernetes, using local tunneling. While this feature works with any Kubernetes distribution, it has native integrations with AKS, Azure CLI, and Helm.

Combined with the increasing trend of using Kubernetes for local development, the potential for Microsoft to leverage GitHub to provide a more seamless development and CI/CD experience with AKS could be where the developer ecosystem evolves next. While there are some application management platforms built on top of Kubernetes, like Shipa, most organizations struggle to provide a unified development experience that can go from local development (e.g. minikube, kind) to production environments with little overhead. If I can simply check in code after testing it with Bridge to Kubernetes, without having to configure a Kubernetes-specific CI/CD tool such as Argo or Tekton, then AKS may seriously become a deciding factor between Azure and AWS in the future.

Google GKE

With GKE the clear leader in terms of developer experience and number of features supported (managed Istio, Knative, vertical pod autoscaling, managed nodes, DNS cache), Google doubled down on its managed approach by announcing GKE Autopilot. GKE Autopilot is a happy medium between Cloud Run (serverless container offering) and GKE standard (infrastructure as a service). Autopilot takes over the underlying node management operations (e.g. autoscaling, resource optimization, node upgrades) and provides a compute platform for developers to deploy containerized applications. It also implements security best practices like Shielded GKE Nodes and Workload Identity, as well as limiting container privileges. Best of all, GKE Autopilot only charges for the resources pods use (e.g. per vCPU-second), similar to a serverless service, rather than for the nodes provisioned.
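The difference between the two billing models can be illustrated with a small Python sketch. The prices below are hypothetical placeholders, not Google's actual rates, and the sketch only considers vCPU (memory and storage are left out); the function names are mine.

```python
def node_billed_cost(nodes: int, node_hourly_usd: float, hours: float) -> float:
    """Node-based billing: pay for every provisioned node, busy or idle."""
    return nodes * node_hourly_usd * hours


def pod_billed_cost(vcpu_requested: float, vcpu_hourly_usd: float, hours: float) -> float:
    """Pod-based billing: pay only for the vCPU the running pods request."""
    return vcpu_requested * vcpu_hourly_usd * hours


# Hypothetical prices, chosen only to show the shape of the trade-off.
NODE_HOURLY_USD = 0.20  # one 4-vCPU node, i.e. 0.05 per vCPU
VCPU_HOURLY_USD = 0.07  # per requested vCPU, priced at a premium over raw nodes


def cheaper_model(vcpu_requested: float, nodes: int = 3, hours: float = 730) -> str:
    """Compare both models for a cluster of `nodes` 4-vCPU nodes."""
    node_cost = node_billed_cost(nodes, NODE_HOURLY_USD, hours)
    pod_cost = pod_billed_cost(vcpu_requested, VCPU_HOURLY_USD, hours)
    return "pod-billed" if pod_cost < node_cost else "node-billed"


if __name__ == "__main__":
    print(cheaper_model(4))   # lightly loaded cluster
    print(cheaper_model(11))  # nearly full cluster
```

Under these made-up numbers, a lightly loaded cluster is cheaper when billed per pod, while a cluster packed close to capacity is cheaper when billed per node, so the appeal of the pod-based model depends heavily on how well you would otherwise utilize your nodes.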

I have only used GKE Autopilot on a few personal projects, so I can’t comment on operations at scale, but here are a few things I noticed:

As with any serverless offering, there is a non-negligible startup cost (i.e. if the application you are deploying requires provisioning new nodes, it takes some time before the container is ready). If the application is bursty in terms of resource usage but cannot scale down (e.g. stateful blockchain nodes), GKE Autopilot may not be the best choice, as you may need to pre-provision nodes in advance.
Some Helm charts didn’t work out of the box due to security restrictions imposed by GKE. This is one of the strengths of Autopilot (i.e. secure by default), but it may be an additional burden for teams looking for a simple switch if their Helm charts require more privileged access (e.g. running privileged GitLab runners for Docker-in-Docker e2e tests).

I still recommend GKE (either GKE Autopilot or GKE standard) to anyone who is looking to start a Kubernetes-based project and is not tied to another cloud vendor. It still provides the best managed experience, with operational components included by default, a polished management console, and great logging/monitoring integration.

But as seen in Flexera’s State of the Cloud Report mentioned in the EKS section, GKE still lags behind AWS and Azure in terms of adoption and usage, probably in proportion to the general cloud market share. Even though I prefer to use GKE, if other parts of the organization are using AWS or Azure, it’s hard to convince those stakeholders to switch to a new cloud on the strength of its Kubernetes offering alone.

However, the one silver lining for GKE might be the growth of multi-cloud and the shift towards Kubernetes being the base compute layer. Google Cloud has invested heavily in Anthos (Google’s managed application platform running on top of Kubernetes) to dive into the multi-cloud market. If Google continues to execute and reduce the operational burden as it has with GKE Autopilot, running Anthos on AWS or Azure (or, more realistically, in hybrid or multi-cloud environments) could become a possibility for smaller companies without dedicated Kubernetes teams.

Personally, I think Microsoft’s investment in improving developer experience is the more promising sell, but if Google can combine Anthos with its strengths in AI and ML, it could capture an interesting segment of the market, as I have written previously:

Why BigQuery Omni is a Big Deal

Google Cloud’s bet on an open platform is starting to materialize with Anthos and BigQuery Omni.

medium.com

Overall, while GKE holds the lead in 2021, EKS and AKS have come much closer to feature parity and now have established communities of their own.