The article provides a comprehensive guide to creating a highly available Kubernetes cluster with minimal vendor dependencies, focusing on flexibility and control for small to medium-sized clusters.
Abstract
The article "Create Highly Available Kubernetes cluster with Minimal Dependencies" outlines the process of setting up a Kubernetes cluster that is both highly available and flexible enough to move between cloud providers. It emphasizes the use of basic resources and excludes vendor-specific features to maintain minimal dependencies. The author shares their experiences in building such a cluster, detailing the requirements for both single-master and highly available clusters, including the necessary software components like networking, persistent volumes, monitoring, logging, Ingress, and backup solutions. The guide covers the selection of a cloud provider, domain name management, DNS provisioning, and the specific configurations for components like HAProxy, Keepalived, and kubeadm. It also delves into the deployment of storage solutions with rook-ceph, monitoring with Prometheus Operator, logging with Elasticsearch and Kibana, and using Traefik as an Ingress controller. Finally, the article discusses the importance of external storage and backup strategies using NFS, Minio, and Velero to ensure the cluster's data is secure and recoverable.
Opinions
The author prefers a vendor-independent approach to maintain flexibility and control over the cluster.
Terraform is the recommended tool for host provisioning due to its ability to manage cloud resources consistently.
Calico is chosen as the pod network add-on, but the author notes the necessity to adjust its default settings to avoid conflicts with the cloud provider's private IP range.
The author finds the official kubeadm documentation somewhat lacking in detail and suggests additional resources for configuring an HA cluster.
Rook-ceph is favored for persistent storage due to its container-native design and lack of vendor lock-in.
Kube-prometheus is the monitoring solution of choice, with customization recommended for better integration into the cluster.
Fluent-bit is preferred over Fluentd for logging due to its lighter resource footprint.
Traefik is recommended as an Ingress controller for its simplicity and performance, with a preference for running it in cluster mode for HA setups.
Velero is highlighted as a crucial tool for cluster backups, with Minio as the S3-compatible storage solution for storing backups externally.
The author values the ability to expose services securely through Ingress, advocating for basic auth protection for Prometheus, Alertmanager, and Grafana web UIs.
The article suggests that while the provided guide is comprehensive, it is essential to tailor the cluster setup to the specific needs and environments of the services it will host.
Create Highly Available Kubernetes cluster with Minimal Dependencies
Components to build a close-to-production grade cluster
Introduction
Kubernetes is a popular topic nowadays. There are numerous articles about it on Medium and elsewhere. The big-name cloud providers have such good Kubernetes support that you can create a usable cluster in minutes. There is also kubernetes-the-hard-way for building everything from scratch.
What if you are like me: using a (relatively) small cloud provider, and in need of both the flexibility to switch vendors at any time and detailed control over what runs on your cluster?
In this article I describe my experience creating a Kubernetes cluster on a self-managed cloud. I intend to use only the most basic resources, i.e. excluding any vendor-specific resources, to create a close-to-production grade cluster.
I will go straight into the details. There are lots of good introductions to Kubernetes, and I found this one nice. This article helped me get an understanding of HA Kubernetes.
I only cover small clusters in this article, which suits my use cases.
Highly Available or Single Master?
It depends on the need.
If the applications running on the cluster require HA (high availability), the cluster, as the underlying infrastructure, must be highly available too.
If the service can tolerate some outages, a single-master cluster can also work very well.
The 2 diagrams below describe a minimal highly available cluster, which is composed of 8 host machines, and a single-master cluster, which is composed of 4 host machines.
8 hosts for HA cluster
4 hosts for single master cluster
List of required resources
A cloud provider to supply virtual hosts
CPU, Memory, Storage
Networking, IP addresses (public, private)
Floating IP address if HA is needed
Domain Name provider
DNS provider
A minimal single master cluster can be created with the resources below.
A minimal highly available cluster can be created with the resources below.
A cluster is created to provide services. The resources needed to support the services running on the cluster require additional estimation. It’s easy to up-size the masters/workers, or to add additional workers to the cluster. Of course I mean it’s only easy to a certain extent, like dozens of workers.
I won’t cover auto-scaling the worker nodes here, i.e. automatically adding/removing worker nodes to/from the cluster.
Software Components
The minimal cluster is composed of the software components below to meet the basic requirements: networking, persistent volumes, monitoring, logging, Ingress and backup.
There are lots of choices to make, from architecture to components, and my decisions are mostly based on one precondition: minimal vendor dependencies.
To list them all here:
Host provisioning
Haproxy
Bootstrap
Networking
Persistent volume
Monitoring
Logging
Ingress
Backup
Virtual Host provisioning
I’m not going to describe host provisioning in full detail, as it depends a lot on the cloud provider, the choice of Linux flavor and the security requirements.
The goal is to prepare the virtual hosts for the cluster, make them safe, make them controllable, and do all of this automatically.
In my own experience, I use Terraform to manage the hosts and create identical “bare” hosts. After provisioning, I collect all the hosts with their private IPs in a “resource file”, named for example “clustera_hosts”, like below.
Allow full ssh access from a dedicated “bastion host” only.
All hosts know about each other and open all ports among themselves on the private IP network interface.
Use iptables to block all network access for both public/private IPs, except 80/443.
So these hosts are isolated as a group: they can communicate with each other, but access from other tenants of the cloud provider and from the public Internet is blocked, similar to an AWS VPC.
To create a highly available cluster, I use 2 hosts running haproxy and keepalived in front of the cluster. These 2 hosts have additional floating IP addresses for both the private network interface and the public network interface.
The floating IP address is a vendor dependency that I don’t think there is a way to bypass. If the cloud provider is capable of providing any HA service, this is a must-have.
Sample haproxy configuration
Sample keepalived configuration
Bootstrap — kubeadm
The official document has detailed steps to bootstrap both a single master cluster and an HA cluster.
There is one thing I’d like to mention about bootstrapping the cluster: it’s important to plan the network beforehand. Several network ranges are involved:
Host’s public IP range, which is provided by the cloud and not related here
Host’s private IP range, which is provided by the cloud, and it’s the “physical” network used by the cluster
Pod network IP range, i.e. pod-network-cidr, which is for the pods in the cluster to communicate
Service cluster IP range, i.e. service-cluster-ip-range, which is for the services in the cluster, default “10.96.0.0/12”
The last 3 IP ranges MUST be different and must not overlap. There is a good video to explain the idea. Normally these 3 IP ranges should fall within the “Private IPv4 network addresses” (link).
In my case, the cloud provider’s private IP range is 192.168.0.0, so the other 2 IP ranges can’t use the same range. This has to be considered together with the next step.
2 parameters are used in the command: the first defines the IP range for the Pod network, the second tells the Apiserver to listen on the private network IP address. If necessary, define “--service-cidr” too.
The official document doesn’t provide enough detail about how to write kubeadm-config.yaml. I found this guide and this godoc useful if you need to fine-tune the content.
In the sample kubeadm-config.yaml below, “podSubnet” is equivalent to the parameter “pod-network-cidr”, and “advertiseAddress” is equivalent to the parameter “apiserver-advertise-address” of the single master cluster bootstrap command.
After bootstrapping and before adding other masters/workers, a pod-network add-on must be installed so that pods can communicate with each other.
There are several choices and it’s better to plan which one to use beforehand. The official document has the list of available choices. It takes a lot of reading and consideration to make the right choice. There is a comparison article here. The add-on must be chosen based on the actual environment and the requirements of the services.
For me I chose “calico” at the time simply because it’s the first in the choice list 😸
HOWEVER, the easy choice gave me a hard time. This is because the default pod network for calico is “192.168.0.0/16” and it conflicts with my private IP range. So I had to do some tuning, to tell calico that I want the pod network to be “10.0.0.0/16”.
Long story short, below is the diff of my changes. Some changes reduce the logging from calico. It’s important to define “IP_AUTODETECTION_METHOD” as “can-reach=192.168.4.130”, which is the private IP of haproxy, to make sure the pod network goes through the private network interface, not the public one.
$ diff calico.yaml calico.yaml.ori
27c27
< "log_level": "warn",
---
> "log_level": "info",
599,601d598
< # use private ip address
< - name: IP_AUTODETECTION_METHOD
<   value: "can-reach=192.168.4.130"
615c612
<   value: "10.0.0.0/16"
---
>   value: "192.168.0.0/16"
627c624
<   value: "warning"
---
>   value: "info"
630,631d626
< - name: BGP_LOGSEVERITYSCREEN
<   value: "warn"
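The kind of range conflict described above can be caught before bootstrapping. Here is a minimal sketch in Python using only the standard `ipaddress` module. The helper name `find_overlaps` is mine, and the `/16` prefix for the host private range is an assumption, since only the base address 192.168.0.0 is given here.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of range names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


calico_default = {
    "host-private": "192.168.0.0/16",  # assumption: prefix length is not stated
    "pod-network": "192.168.0.0/16",   # calico's default pod network
    "service": "10.96.0.0/12",         # default service cluster IP range
}
# Same plan, but with the pod network moved as in the diff above.
tuned = {**calico_default, "pod-network": "10.0.0.0/16"}

print(find_overlaps(calico_default))  # reports the host/pod conflict
print(find_overlaps(tuned))           # empty list, nothing overlaps
```

Running a check like this against the planned ranges is cheap compared with re-tuning a pod network after the cluster is up.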
Add other nodes
After the pod network is up, it’s easy to follow the official guide here or here to add other nodes to the cluster.
Storage — rook-ceph
Persistent volumes are the next resource to add. I’m going to do monitoring and logging inside the same cluster, and the services that I plan to run on the cluster require persistent storage too. So I need persistent volumes ready before adding anything else.
Storage is a complex topic. The official document has a list of choices, and I chose rook-ceph as it has no vendor dependency and runs entirely inside the cluster.
Rook’s documentation is easy to follow and their Slack channel is helpful too.
For monitoring I use kube-prometheus, which bundles the following components:
The Prometheus Operator
Highly available Prometheus
Highly available Alertmanager
Prometheus node-exporter
Prometheus Adapter for Kubernetes Metrics APIs
kube-state-metrics
Grafana
I believe the proper way to use kube-prometheus is to follow the customization steps.
Below are my steps to prepare an environment and do the customization.
My example.jsonnet to use rook-ceph as Prometheus’s storage.
Here I use 10G of storage for 30 days of retention in one of the clusters, and the volume is normally about 50% full. This figure comes from experiment and should be adjusted to the use case.
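The arithmetic behind such a sizing can be written down as a small helper. This is only a rough sketch: `estimate_volume_gb` is a hypothetical function of mine, and it assumes the ingest rate stays constant, which real metrics rarely do.

```python
def estimate_volume_gb(observed_used_gb, observed_days, target_days,
                       target_utilization=0.5):
    """Scale observed usage linearly to a new retention period.

    target_utilization is how full the volume should be at steady state;
    0.5 leaves the same headroom as the figures quoted in the text.
    """
    return observed_used_gb * target_days / observed_days / target_utilization


# Figures from the text: a 10G volume that is about 50% full after 30 days.
used_gb = 10 * 0.5

print(estimate_volume_gb(used_gb, 30, 30))  # same retention, same headroom
print(estimate_volume_gb(used_gb, 30, 90))  # tripled retention
```

The first call simply reproduces the 10G volume from the observed usage; the second shows how the same ingest rate would translate to a longer retention.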
It’s important to expose the prometheus/alertmanager/grafana web UIs via Ingress. I chose basic auth to protect them from anonymous access.
This doc has steps to do it for Nginx Ingress. As I’m going to use Traefik Ingress, here is my example.
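Whichever Ingress controller is used, basic auth ends up needing htpasswd-style `user:hash` lines stored in a secret. As a rough illustration (not the setup referred to above), the sketch below builds one such line in the `{SHA}` scheme, the same format `htpasswd -s` produces. The helper name and the credentials are placeholders of mine. SHA1 is weak, so a bcrypt hash from `htpasswd -B` is the better choice where it is supported.

```python
import base64
import hashlib


def htpasswd_sha1_line(user, password):
    """Build one htpasswd line in the {SHA} scheme (base64 of the SHA1 digest)."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"


# "admin" / "change-me" are placeholder credentials for illustration only.
print(htpasswd_sha1_line("admin", "change-me"))
```

The printed line is what would go into the auth file that the secret is created from.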
Logging — Elasticsearch and Kibana
I’m using a combination of Kubernetes’ addon + fluent-bit + Curator-cleanup from kubernetes-elasticsearch-cluster. I call it fluent-bit-elasticsearch for logging. To actually run Elasticsearch reliably, the minimal worker node size of 4G memory is not enough and needs to be doubled.
The 2-pod Elasticsearch installation from the addon is not as powerful as the 7-pod cluster from here, but I think it’s good enough for a small Kubernetes cluster. I use the curator-cleanup jobs to remove logs older than 90 days, and I replace fluentd with the lightweight fluent-bit.
In es-statefulset.yaml, volumeClaimTemplates is required. The volume size needs some estimation and experimentation. I use a 40G volume to retain 90 days of logs.
Ingress Controllers are the portal to the services running on the cluster.
For a single master cluster, I deploy the Traefik Ingress Controller following the example on the master node, as it’s very simple to point the whole DNS domain to the master’s public IP address. I think this is acceptable for small size cluster with low network traffic, and it actually works great in one of my clusters.
For an HA cluster, multiple Traefik instances in cluster mode are a must-have. The official Traefik documentation on this is very brief. I shared some details about creating a Traefik cluster as Ingress Controller here. The basic idea is to deploy a consul cluster as the KV store for Traefik, put haproxy in front of the Traefik cluster, and point the DNS domain to haproxy.
In my HA cluster I deploy the Traefik cluster on the master nodes, again because the expected network traffic for the cluster is low. For a high-traffic cluster I think it’s worth deploying more Traefik instances on worker nodes. In that case an automatic update of the haproxy backend is needed.
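One way to approach such an automatic update is to regenerate the backend section from the current list of nodes running Traefik. The sketch below only covers the rendering step. The function name, backend name and IP addresses are hypothetical, and discovering the nodes and reloading haproxy are left out.

```python
def render_backend(name, node_ips, port):
    """Render an haproxy backend section with one health-checked server per node."""
    lines = [f"backend {name}", "    mode tcp", "    balance roundrobin"]
    # Sort for a stable file, so an unchanged node list produces no diff.
    for index, ip in enumerate(sorted(node_ips)):
        lines.append(f"    server traefik-{index} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical private IPs of the nodes currently running Traefik.
print(render_backend("traefik_https", ["192.168.4.22", "192.168.4.21"], 443))
```

Comparing the rendered text with the section already on disk is a simple way to decide whether a reload is needed at all.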
External Storage and Backup — NFS, minio and velero
To run a production-level cluster, backup is a must-have. Backup consists of 2 parts: the etcd cluster and the services on the cluster. I only describe the latter, using Velero (previously Ark) to back up cluster resources and persistent volumes.
To do a backup we of course need a destination to store the backups, and it can’t be inside the cluster itself. Velero supports various storage providers; to stay vendor-independent I chose Minio, an S3-compatible storage service that runs locally on the cluster. Minio uses a persistent volume provided over NFS from another host outside of the cluster, as the diagram below shows.
Backup architecture
The latest version of velero has removed the sample files for the minio deployment. I’m using the previous version 0.11.1, where config/minio still has them. It only needs one change to use the NFS volume.
diff --git a/velero/config/minio/00-minio-deployment.yaml b/velero/config/minio/00-minio-deployment.yaml
index bd262b7..9dad144 100644
--- a/velero/config/minio/00-minio-deployment.yaml
+++ b/velero/config/minio/00-minio-deployment.yaml
@@ -30,12 +30,14 @@ spec:
     spec:
       volumes:
       - name: storage
-        emptyDir: {}
+        persistentVolumeClaim:
+          # Name of the PVC created earlier
+          claimName: minio
       - name: config
         emptyDir: {}
       containers:
       - name: minio
$ kubectl -n velero get pod
NAME                     READY   STATUS    RESTARTS   AGE
minio-757cdf7d7d-qfnqx   1/1     Running   1          3d20h
restic-78hf4             1/1     Running   6          94d
restic-g4txc             1/1     Running   4          94d
restic-krm28             1/1     Running   4          94d
velero-6f6f6999b-qctcw   1/1     Running   1          3d20h
To create backup plans and specify which volumes to back up:
# backup a specific namespace test daily and keep 7 days
velero schedule create test-daily --schedule="@daily" --include-namespaces test --ttl 168h0m0s
# To backup a persistent volume, annotate the pod
kubectl -n test annotate pod/test-0 backup.velero.io/backup-volumes=test-volume
Actually it’s better to annotate the services that require persistent volume backup with ‘backup.velero.io/backup-volumes’ directly, like this example.
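The `--ttl` value in the schedule command above is an hours-based duration, so 7 days becomes 168h. When several schedules with different retention periods are needed, a small helper can build the commands. This is a sketch using only the flags shown above; the function names are mine.

```python
def ttl_for_days(days):
    """Format a retention period in days as an hours-based duration string."""
    return f"{days * 24}h0m0s"


def schedule_command(name, namespace, keep_days, schedule="@daily"):
    """Build the argv for `velero schedule create` with the flags used above."""
    return [
        "velero", "schedule", "create", name,
        f"--schedule={schedule}",
        "--include-namespaces", namespace,
        "--ttl", ttl_for_days(keep_days),
    ]


# Reproduces the daily schedule for the "test" namespace, kept for 7 days.
print(" ".join(schedule_command("test-daily", "test", 7)))
```

The returned list can be passed as-is to `subprocess.run` on a host where the velero CLI is configured.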
Summary
This article is already too long, yet it still only briefly covers the components of a Kubernetes cluster.
I hope this is helpful if you would like to create Kubernetes from scratch and put it into production.