Control Plane for Distributed Kubernetes PaaS
Authors: Ankur Singla, Harshad Nakil
This blog is the first in a series that covers various aspects of what it took for us to build and operate our SaaS service:
- Control plane for distributed Kubernetes PaaS
- Global service mesh for distributed applications
- Platform security of distributed infrastructure, apps, and data
- Application and network security of distributed clusters
- Observability across a globally distributed platform
- Operations and SRE of a globally distributed platform
- Golang service framework for distributed microservices
As we described in our earlier blog, our customers are building complex and diverse sets of business solutions — like smart manufacturing, video forensics for public safety, algorithmic trading, telco 5G networks — and thus we need to deliver an always-on, connected, and reliable experience for these applications and their end-users.
Since these applications could be running in multiple clusters across cloud providers or customers’ edge locations, our platform team had to build a distributed control plane and a PaaS service to deploy, secure, and operate multiple multi-tenant Kubernetes clusters. This distributed control plane has delivered many operational, scaling, and performance benefits that we will cover in our presentation (video link) — e.g. how to manage thousands of edge K8s clusters with GitOps — and also as a separate blog post in the coming weeks.
TL;DR (Summary)
- We could not find a simple-to-use solution in the market that could solve the problem of deploying, securing, and operating multiple application clusters that are distributed across cloud providers, private clouds, or multiple edge locations.
- We could not find a robust Kubernetes distribution or PaaS (e.g. OpenShift, Cloud Foundry) that provided the comprehensive set of security and operational services needed for distributed clusters: for example, PKI-based identity, RBAC and user-access management, secrets and key management across cloud providers, multi-cluster service mesh, observability and audit logs, and application and network security.
- Anthos (Google), Azure Arc (Microsoft), and Rancher are multi-cluster management stations that package multiple different services; our analysis was that they would not have solved the operational, scaling, security, and multi-tenancy requirements that we had for application and infrastructure services across multiple clusters.
- We had to build our own distributed control plane for our managed PaaS that is built on top of Kubernetes. We started with vanilla Kubernetes and then made significant changes to deliver platform services needed by our DevOps and SRE teams. In addition, we had to build a control plane to manage large numbers of distributed clusters and deliver multi-tenancy across heterogeneous infrastructure (in edge, our network, and multiple cloud providers).
Kubernetes for App Management: Why & How
We chose Kubernetes (K8s) as the core of our platform for managing distributed applications because it provides a rich set of functionality without being overly prescriptive, which gives us the flexibility to innovate on the things that we believe matter to our customers. We used it as the foundation on which to start building our service, and with the growing popularity of K8s, it is also easier to find developers and operators who are familiar with it.
That said, deploying and managing a large number of production-grade Kubernetes clusters across a hybrid environment (multiple clouds, network PoPs, and edge locations) is not easy, as there are no out-of-the-box solutions for Kubernetes that can:
- Harmonize heterogeneous infrastructure resources with automated clustering, scaling, and zero-touch provisioning; this was especially painful at the edge and in our network PoPs
- Provide high-performance and reliable connectivity across disparate locations — especially when crossing cloud providers and coming from edge locations
- Solve the security problem of data-in-transit, data-at-rest, secrets, keys and network…all backed by a uniform PKI identity that works across edge, network and cloud
- Provide true multi-tenancy — tenant isolation and security guarantees — with the ability to run production and development workloads for internal and customer needs on the same clusters
- Provide observability and operations across distributed clusters that ties into centralized policy and intent, without the need for building complex logs and metrics collection
After several proofs-of-concept with multiple cloud providers and open-source platforms (such as GKE, AKS, EKS, and RKE, as well as OpenShift and Cloud Foundry), we realized that none of them could meet all five of the requirements above. As a result, we decided to build our own PaaS, starting with “vanilla” Kubernetes and making several additions for identity, networking, security, multi-tenancy, logging, metrics, etc. While we use Kubernetes to meet our internal needs, we had to make some hard decisions, like not exposing these Kubernetes clusters directly to our internal users and/or customers to run their workloads (more on that later, as multi-tenancy was a key objective for us).
In addition to multiple new features that we needed to add, there was also a need to run our workloads/services alongside customer workloads in many locations across the edge, our network, and public/private clouds. This meant that we had to build additional capabilities to manage multiple clusters in multiple environments…all connected using our global network and our distributed application gateways to provide zero-trust and application-level connectivity across these clusters.
The Hard Part: Multi-Tenancy and Multi-Cluster for Kubernetes
Building and operating applications running in a single Kubernetes cluster is a non-trivial task, even if consuming a cloud provider-managed cluster. This is why it is common for DevOps and SRE teams to minimize their overhead and not deal with the complexities of many clusters. It is quite common to see teams build one large Kubernetes cluster and put all types of resources within the same cluster. While this seems great because they can simplify operations and run the cluster for maximal compute efficiency and cost, this is not the best idea for several reasons. First, the needs for production workloads are very different from dev-test and from staging — unstable development workloads can potentially cause problems for more stable production workloads.
In addition to the needs of varied workloads, K8s security and isolation limitations are another driver for multi-cluster. A typical approach to K8s security and resource isolation is to spin up an independent cluster for each tenant in a multi-tenant model. While this may be feasible in the cloud, running multiple clusters is not possible at the edge: edge sites have limited compute and storage resources, and constrained network bandwidth for sending the logs and metrics of each additional cluster to the central cloud.
To deal with the problem of multiple Kubernetes clusters, we evaluated Rancher for centralized management of our Kubernetes clusters (when we started, Anthos and Azure Arc did not exist) as well as KubeFed. The two approaches available at that time (and the situation is still the same today) were:
- Multi-cluster management (e.g. Rancher) from a central console would have given us the ability to deploy multiple clusters in any location and perform lifecycle management operations like upgrades, rollbacks, etc. Some of these systems also offered the ability to address an individual cluster with automation for configuration and deployment of applications.
- Kubernetes cluster federation (KubeFed) deploys a federation control plane that can make multiple physical clusters look like one cluster. This project was just getting started at the time we looked, and even today it is only in alpha stage.
After the recent announcements of GCP Anthos and Azure Arc, we re-evaluated our original decision to build a distributed control plane, and the conclusion was that even these two new offerings could not have solved two critical problems with distributed clusters. The two key capabilities that we needed for our platform were:
- Managing multiple clusters as-a-fleet to solve the problem of performing operations across all or a logical group of clusters — operations like configuration, deployment, metrics, etc. This is critical as we want to reduce operations overhead for our SRE teams, improve debug-ability for our DevOps, and improve the scalability of our system
- Ability to carve up an individual physical Kubernetes cluster for multi-tenancy without needing to spin up physical clusters — this is especially critical in resource-constrained environments where we don’t want to add new physical clusters just for multi-tenancy
To solve these two problems, we had to come up with a new technique — distributed control plane — to solve the operational overhead of “multiple” clusters and provide an equivalent of “multiple clusters” for multi-tenancy in resource-constrained environments.
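To make the fleet idea concrete, here is a minimal sketch of selecting a logical group of clusters by label and running one operation against all of them. The `Cluster` type, the label keys, and the `select_fleet` and `apply_to_fleet` helpers are hypothetical names invented for this illustration; they are not the control plane's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    """A physical cluster known to the control plane (hypothetical model)."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def select_fleet(clusters: List[Cluster], selector: Dict[str, str]) -> List[Cluster]:
    """Return clusters whose labels contain every key/value pair in the selector."""
    return [
        c for c in clusters
        if all(c.labels.get(k) == v for k, v in selector.items())
    ]


def apply_to_fleet(
    clusters: List[Cluster],
    selector: Dict[str, str],
    operation: Callable[[Cluster], str],
) -> Dict[str, str]:
    """Run one operation against every selected cluster; collect results by name."""
    return {c.name: operation(c) for c in select_fleet(clusters, selector)}


# Example sites; the label keys and values are made up for this sketch.
CLUSTERS = [
    Cluster("pa2-par", {"kind": "network-pop", "region": "eu"}),
    Cluster("ams9-ams", {"kind": "network-pop", "region": "eu"}),
    Cluster("ny8-nyc", {"kind": "network-pop", "region": "us"}),
]

# One call configures every cluster in the "eu" group.
eu_result = apply_to_fleet(CLUSTERS, {"region": "eu"}, lambda c: f"configured {c.name}")
```

The point of the selector is that an operator names a group once instead of addressing each cluster individually.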
Distributed Control Plane: How We Achieved Multi-Cluster Kubernetes
Our platform team decided to build a distributed control plane for Kubernetes that exposes Kubernetes APIs for our team’s use. However, these APIs come from “virtual” clusters that only exist in our control plane: a virtual K8s (vK8s) API server for a virtual K8s cluster (as shown in Figure 1). This control plane maps the intent of the user to multiple physical Kubernetes clusters running in our edge, our network PoPs, and public cloud locations. These physical clusters are only accessible to our distributed control plane, and not to any individual tenant/user.

This control plane provides each tenant with one or more “virtual” application clusters where they can deploy their application(s); based on configuration, the control plane replicates and manages each application across multiple physical Kubernetes clusters. In addition to configuration and deployment operations, monitoring operations also follow this “virtual” cluster, without the need to build tooling to collect and dissect data from multiple physical clusters.
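One way to picture the monitoring side is a roll-up of per-site status into a single view for the virtual cluster, so that the user never has to query a physical cluster directly. The sketch below assumes a hypothetical `SiteStatus` record and `rollup` helper; the field names and the sample numbers are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SiteStatus:
    """Replica status reported by one physical cluster (hypothetical record)."""
    site: str
    desired: int
    ready: int


def rollup(statuses: List[SiteStatus]) -> Dict[str, object]:
    """Combine per-site status into the single view a virtual cluster could expose."""
    desired = sum(s.desired for s in statuses)
    ready = sum(s.ready for s in statuses)
    return {
        "desired": desired,
        "ready": ready,
        "healthy": ready == desired,
        # Sites that still have replicas missing, for quick drill-down.
        "degraded_sites": sorted(s.site for s in statuses if s.ready < s.desired),
    }


# Sample data: one site is missing a replica.
view = rollup([
    SiteStatus("pa2-par", desired=2, ready=2),
    SiteStatus("ny8-nyc", desired=2, ready=1),
    SiteStatus("ams9-ams", desired=2, ready=2),
])
```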
Let’s take a sample UI application called productpage, where the user intent is to run it distributed across 3 locations (pa2-par, ny8-nyc, and ams9-ams) with 2 replicas in each of them. The user creates a vK8s object and attaches it to a virtual cluster, which immediately provisions a vK8s API server that can be used with standard kubectl.
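The mapping from this single intent to per-site work can be sketched as a simple fan-out. The `fan_out` function and the shape of the per-site request are assumptions made for illustration; the real control plane would also have to handle scheduling, retries, and reconciliation.

```python
from typing import Dict, List


def fan_out(app: str, replicas_per_site: int, sites: List[str]) -> Dict[str, dict]:
    """Expand one user intent into one deployment request per physical site."""
    if replicas_per_site < 1:
        raise ValueError("replicas_per_site must be at least 1")
    return {
        site: {"app": app, "site": site, "replicas": replicas_per_site}
        for site in sites
    }


# The productpage intent: 2 replicas in each of the 3 locations.
plan = fan_out("productpage", 2, ["pa2-par", "ny8-nyc", "ams9-ams"])
total_replicas = sum(request["replicas"] for request in plan.values())
```

Here the user states the intent once, and the plan contains one request per location, for six replicas in total.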
As the next step, the user downloads the kubeconfig for this virtual cluster and creates a standard YAML file to describe a K8s deployment for productpage.
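A plain Deployment for productpage could look like the manifest generated below. This is a sketch under stated assumptions: the `app` label and the container image are placeholders, and any platform-specific fields that tie the Deployment to particular locations are left out because they are not described here. The manifest is emitted as JSON, which kubectl accepts in the same way as YAML.

```python
import json


def productpage_deployment(replicas: int = 2) -> dict:
    """Build a plain apps/v1 Deployment for productpage."""
    labels = {"app": "productpage"}  # assumed label, chosen for this sketch
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "productpage", "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "productpage",
                        "image": "example.com/productpage:latest",  # placeholder image
                    }],
                },
            },
        },
    }


manifest = productpage_deployment()
manifest_json = json.dumps(manifest, indent=2)
```

Written to a file, the output could be applied with `kubectl apply -f` while pointing kubectl at the downloaded kubeconfig through its `--kubeconfig` flag.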

