I was recently catching up on container and Kubernetes tech, and I was quite overwhelmed by the number of features, technologies, projects, standards and products.
So, to help me make sense of it, I wrote down some notes, summarizing and mapping the bits. As usual, a lot of it is about the infra.
But first, a quick word on Docker. Docker was the king of containers in the early days (2013 onwards), offering a monolithic (all-in-one) solution for working with containers. Since mid-2015, strong market pressures have led to industry standardization of each layer, with Docker code typically used as the basis.
1.1 Container images
The container image (disk) format spec is defined by OCI (Open Container Initiative, backed by the Linux Foundation).
The image is implemented as copy-on-write layers on top of a base image (a standard Linux distro, a lightweight one like Alpine Linux, or Windows). The layers are union-mounted at runtime (ex: overlayfs).
Typically, the image is defined by a Dockerfile that describes the set of steps needed to build a container image. You could use docker build to build it, though there are multiple “next-gen” alternatives and variations, like Google’s Kaniko (builds images inside an unprivileged k8s container) and BuildKit.
The container images are stored in a container registry like the public Docker Hub, a local Docker Registry, Google Container Registry (gcr.io) etc.
All can be accessed by a standard API and tools.
Harbor (CNCF, originally VMware, site, youtube)
Another example of a container registry “that stores, signs, and scans container images for vulnerabilities”, with enterprise security.
1.2 Container runtimes (technical)
1.2.1 Low-level container runtimes
The container runtime spec is defined by OCI, with
runc (donated by Docker) as a reference implementation.
A (low-level) container runtime is responsible for creating and destroying containers. For example, to start a container, it may create Linux cgroups, namespaces and security limits, set up the container’s networking and storage, and start the container.
- runc (sometimes just called “Docker”)
The common, standard runtime (from Docker). Fast container startup with a “good enough” isolation, but does NOT protect against actively hostile code in a container trying to break free.
- runhcs (link)
“runc for Windows”, calling the Windows builtin Host Compute Service (HCS). Can use Hyper-V isolation (Linux / Windows images) or process container (Windows images).
- Kata containers (link)
A secure container runtime, using lightweight virtual machines.
A merge of Intel Clear Containers and Hyper.sh RunV, stewarded by the OpenStack Foundation (OSF).
(can leverage the qemu / nemu / firecracker hypervisors – see below)
- gVisor (link, announce May 2018)
A sandboxed container runtime from Google (runsc).
Basically, it uses an unprivileged userspace kernel (written in Go) that intercepts all system calls, isolating the real kernel.
- Nabla containers (link)
A new sandboxed container runtime from IBM.
1.2.2 High-level container runtimes
A high-level container runtime manages the entire lifecycle of a container: for example, downloading its image from a container repo, managing local images, monitoring running containers etc. It delegates the work of actually starting / stopping a container to a low-level container runtime (as above).
containerd (CNCF) is the most common high-level container runtime. It originated from Docker and it supports multiple container environments including Docker and Kubernetes.
See also Kubernetes CRI runtimes and CRI-O below
1.3 Hypervisors (technical)
If you are going to use a container runtime based on virtualization, which virtualization technology (hypervisors) to pick?
- QEMU
The most common open source hypervisor for VMs.
- NEMU (site, context)
Nemu (by Intel) is a stripped-down version of qemu, with only the bare minimum for modern cloud workloads, and is exclusively KVM-based.
- Firecracker (site, announce Nov 2018, lwn)
A bare-metal KVM-based Virtual Machine Monitor (VMM) written in Rust. Open sourced by AWS – this is the hypervisor behind AWS Lambda. Provides isolation and fast startup (< 125ms). Optimized for functions (no live migration, snapshots etc).
Also, rust-vmm is a set of Rust libraries for hypervisor developers (new and fast-evolving).
1.4 Container orchestration
Going beyond a single container, container orchestration tech builds a container cluster from a group of hosts, and provides higher-level functionality like fault tolerance, autoscaling, rolling upgrades, load balancing, ops, secrets etc.
I’ll be focusing on Kubernetes in the rest of this post.
- Kubernetes (k8s)
Container orchestration originally from Google (a next-gen Borg).
- Docker Swarm
Container orchestration from Docker.
- Marathon
The container orchestrator of DC/OS (a Mesos-based cluster).
2. Kubernetes (k8s)
2.1 Kubernetes Pod
In Kubernetes, a pod is the lowest-level unit of deployment. It is made of either a single container, or a few containers that are deployed together as a unit. Each pod has a unique IP address, and access to storage resources.
A pod is meant to hold a single instance of an app – to scale out, deploy multiple pods.
You rarely run pods directly. Instead, you work with higher-level abstractions such as Deployment Controllers, Services or Jobs (see below).
If a pod has multiple containers, they are always co-located (on the same host), co-scheduled, and run in a shared context (same cgroup etc). They also share the same IP address and port space, can reach each other over localhost or local “Unix” IPC (shared memory and semaphores), and share access to the storage volumes.
A common use case is to inject a “sidecar” container into a pod, transparently adding auxiliary functionality, for example around networking, logging etc (see Service Proxy / Envoy below).
kubelet monitors its pods’ containers using both liveness probes (should the container be restarted?) and readiness probes (is the container ready to accept traffic?). A pod is ready only when all its containers are ready (otherwise its service will not route traffic to it).
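The two probe types can be sketched in a pod manifest like this (a minimal sketch; the image, paths and timings are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.25           # hypothetical image
      ports:
        - containerPort: 80
      livenessProbe:              # failure -> kubelet restarts the container
        httpGet: {path: /healthz, port: 80}
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:             # failure -> pod removed from service endpoints
        httpGet: {path: /ready, port: 80}
        periodSeconds: 5
```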
2.2 Kubernetes Controllers
Controllers maintain a group of pods over time.
- ReplicaSet (RS)
“guarantee the availability of a specified number of identical Pods”. Basically, it maintains a group of stateless pods, by checking the expected vs. actual number of pods, and killing or launching pods.
Its config includes:
- a target number of pod replicas.
- a label-based selector to check the current number of replicas.
- a pod template for starting new pods.
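The three config elements above might look like this in a ReplicaSet manifest (a minimal sketch; names and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
spec:
  replicas: 3                   # target number of pod replicas
  selector:
    matchLabels: {app: web}     # label-based selector to count current replicas
  template:                     # pod template for starting new pods
    metadata:
      labels: {app: web}        # must match the selector above
    spec:
      containers:
        - name: app
          image: nginx:1.25     # hypothetical image
```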
- Deployment
A ReplicaSet plus an update policy: how to switch to a new app version (new container image). Default: rolling upgrades.
It supports updates, rollbacks, scaling, pause/resume etc.
Here is an app rolling upgrade example:
- The user updates the deployment to point to a new image.
- The deployment creates a new ReplicaSet for the new version.
- The deployment adds pods to the new RS and removes them from the old RS (for example, +1, -1, +1, -1).
- Eventually, the old empty RS is removed (back to a single RS).
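A Deployment doing such a rolling upgrade could be sketched as follows (hypothetical names; the maxSurge / maxUnavailable knobs control the +1 / -1 pacing):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout (+1)
      maxUnavailable: 1    # at most one pod down during the rollout (-1)
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: app
          image: nginx:1.25   # changing this tag triggers a rolling upgrade
```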
- Pod Autoscaling
Use the built-in Horizontal Pod Autoscaler with a Deployment (or RS) to define autoscaling based on metrics like CPU. It basically updates the target number of replicas in a RS based on the observed metrics.
For example, you can set min/max pods, and CPU threshold.
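As a sketch, an autoscaler with min/max pods and a CPU threshold might look like this (hypothetical names, using the simple autoscaling/v1 API):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:                       # which Deployment / RS to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70    # scale up above 70% average CPU
```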
- StatefulSet
“manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods”.
Used for stateful apps – using persistent volumes, stable network identity etc.
Some examples: WordPress/MySQL, Cassandra, ZooKeeper…
See also Kubernetes Operators below for richer examples.
- Job
“a job creates one or more Pods and ensures that a specified number of them successfully terminate”
There are variations, like multiple parallel pods and a scheduled version (CronJob).
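A scheduled job (CronJob) can be sketched like this (hypothetical names; the batch/v1 CronJob API of current k8s versions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 3 * * *"              # every day at 03:00 (cron syntax)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # rerun the pod until it succeeds
          containers:
            - name: report
              image: busybox
              command: ["sh", "-c", "echo generating report"]
```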
2.3 Kubernetes CRI runtimes (technical)
Each Kubernetes node runs a management agent called kubelet. As Kubernetes wanted to support multiple container runtimes, beyond Docker, it decided to define a (gRPC-based) plugin API to talk to container runtimes, called CRI (Container Runtime Interface).
Nowadays, containerd can natively talk with kubelet over CRI, but in the past, it required two intermediate daemons (later only one). In response, the community (led by Red Hat) developed CRI-O (site, CNCF) – a lightweight container engine tailored exclusively to k8s. It also supports all the low-level runtimes. So, today there are two alternative high-level runtimes which are practically the same, and the k8s community is moving to CRI-O.
k8s today supports having multiple container runtimes in a single cluster.
For example: runc vs. kata-qemu vs. kata-firecracker (on the same node).
For Windows, you need separate worker nodes running Windows Server 2019.
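Picking a runtime per pod is done with a RuntimeClass resource. A sketch, assuming a node whose CRI runtime is configured with a kata handler (names are hypothetical; the RuntimeClass API graduated to node.k8s.io/v1 in later k8s versions):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata          # hypothetical name, referenced by pods below
handler: kata         # must match a handler configured in the node's CRI runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata   # run this pod under the Kata runtime instead of runc
  containers:
    - name: app
      image: nginx:1.25    # hypothetical image
```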
2.4 Kubernetes Networking and Services
2.4.1 Kubernetes Networking
Each Kubernetes pod is assigned a unique, routable IP address (Pod IP). This removes the need for node-level port mapping, and allows pod-to-pod communication across nodes in the k8s cluster.
Technically, the implementation creates a flat, NAT-less network, including:
- Within a node
The host network includes a bridge device in its network namespace. Each container network namespace is connected to that bridge using a virtual Ethernet (veth) pair.
- Between the nodes
The worker nodes’ bridges are connected together, typically using some overlay network. The implementation is provided by a CNI (Container Network Interface) plugin – a common standard network API with a rich ecosystem of conforming plugins, like Flannel, Calico and Cilium.
A different example – on AWS EKS, pod IPs are actually allocated from the VPC (not using an overlay network).
2.4.2 Kubernetes Services
A service controls how to expose an app / microservice over the network, providing a single, stable IP address (cluster IP), while the underlying pods (with their pod IP addresses) may come and go. It also provides load balancing across the relevant pods. The service resource includes a label-based pod selector to identify its pods.
The kube-proxy process, which runs on every node, tracks the cluster’s services and endpoints, and routes traffic for the services to their underlying pods.
It has three proxy modes (implementations) – userspace (old), iptables (default, better), ipvs (newer, best).
Publishing services (ServiceTypes)
How to expose a service to the world, inside or outside Kubernetes?
- ClusterIP (default)
Do not expose the service outside the k8s cluster.
Access to the clusterIP will be routed (by the kube-proxy on the request origin node) to one of the available pods of the service.
- NodePort
All the cluster nodes will expose the same fixed port for this service. An internal ClusterIP service is automatically created, and external access to NodeIP:NodePort will be routed to that ClusterIP.
- LoadBalancer
Create and use an external load balancer (from the cloud provider). That load balancer will route traffic to an automatically created NodePort (and ClusterIP). So, external traffic entering k8s is spread across all nodes.
- ExternalName
Returns a specific external (DNS) hostname.
- Headless (ClusterIP: None)
Creates a headless service: it has a DNS entry that points to all pod IPs (if the service is defined with a selector), or to a specific external IP address.
For services that use a selector to pick pods, k8s also automatically maintains Endpoint objects (basically IP:port pairs).
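As a sketch, a NodePort service and a headless service for the same set of pods (hypothetical names and ports):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web            # label-based pod selector
  ports:
    - port: 80          # port on the ClusterIP
      targetPort: 8080  # container port on the pods
      nodePort: 30080   # fixed port exposed on every node (30000-32767 range)
---
apiVersion: v1
kind: Service
metadata:
  name: web-headless
spec:
  clusterIP: None       # headless: DNS returns the pod IPs directly
  selector:
    app: web
  ports:
    - port: 80
```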
2.4.3 Ingress / Ingress Controllers
While Ingress is a different k8s resource from Services, it acts like a generalization of them. An Ingress Controller is the backend that fulfills Ingress resources. Ingress connects apps to external traffic. It can both create external resources like load balancers, and run code inside the k8s cluster (in its own pods), applying some logic to the incoming traffic, typically HTTP(S).
For example, Ingress may fan-out different endpoints to different backend services, apply custom load-balancing logic, turn on TLS etc.
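A fan-out plus TLS setup could be sketched like this, using the (newer) networking.k8s.io/v1 Ingress API (hostnames, backend services and the TLS secret are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  tls:
    - hosts: [example.com]
      secretName: example-tls       # hypothetical secret holding the cert/key
  rules:
    - host: example.com
      http:
        paths:
          - path: /api              # fan-out: /api -> the api service
            pathType: Prefix
            backend:
              service:
                name: api
                port: {number: 80}
          - path: /                 # everything else -> the web service
            pathType: Prefix
            backend:
              service:
                name: web
                port: {number: 80}
```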
The most common ingress controller is ingress-nginx, but there are many others, such as HAProxy, Contour etc.
See also Service Meshes below
2.5 Service meshes
A service mesh adds “brains” to the k8s network layer (data plane), so each service you deploy can be dumb (at least regarding networking).
The common service mesh functionality includes:
- Traffic policy
Authenticating services, providing an authorization layer (think AWS IAM for your own services), transparently encrypting network traffic between services (mTLS for HTTP).
- Traffic telemetry
Collecting fine-grained, standardized metrics across all services.
For example, latency, throughput and errors per HTTP endpoint.
- Traffic management
Smarter load balancing, for example client-side load balancing, or shifting 1% of the traffic to the canary deployment.
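For example, with Istio, shifting 1% of the traffic to a canary could be sketched with a VirtualService (hypothetical names; the subsets would be defined in a matching DestinationRule, and the API version varies by Istio release):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts: [web]               # the service this routing rule applies to
  http:
    - route:
        - destination:
            host: web
            subset: stable   # hypothetical subset (see DestinationRule)
          weight: 99
        - destination:
            host: web
            subset: canary
          weight: 1          # shift 1% of traffic to the canary deployment
```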
- Istio
A rich and complex service mesh, led by Google / IBM / Lyft etc.
It leverages Envoy as its service proxy (see below).
- Linkerd (CNCF, site)
An ultralight k8s-specific service mesh “that just works”.
Can be deployed incrementally (service-by-service).
Linkerd 1.x was based on “Twitter stack” (Scala, Finagle, Netty, JVM).
Linkerd 2.x (a total rewrite) is based on Go (control plane) and Rust (data plane; service proxy), dramatically reducing both complexity and footprint. Here is a good article for context.
2.5.1 Service Proxy
Service meshes are typically built on top of a service proxy, which is the component that actually hijacks the pod network traffic (data plane network), analyzing it and acting on it.
A service proxy is typically automatically injected as a sidecar container into newly deployed pods (by implementing an Admission Controller on the k8s API server). It routes the pod traffic to itself using iptables rules.
2.5.2 Random bits on top of a service mesh
- Flagger
A k8s operator that automates the promotion of canary deployments, using service meshes (for traffic shifting) and Prometheus metrics (for analysis).
- Kiali
Visualizes the topology, health etc. of services on the Istio service mesh.
2.6 Building and packaging
Helm (CNCF, site) is a package manager for k8s. Its packages (charts) describe k8s resources (services, deployments etc). Installing a chart creates the relevant resources (deploying an instance of the app).
Helm 3 is coming later this year, as a better but totally incompatible version.
There are also a lot of k8s CI/CD projects.
Check out Jenkins X (GitOps for k8s), and this Google post discussing CDF (the new Continuous Delivery Foundation), Tekton (shared CI/CD building blocks), Spinnaker (CD platform) and Kayenta (automated canary testing).
2.7 Serverless on Kubernetes
- Knative
A Kubernetes-native serverless framework, led by Google. Build and run serverless apps on k8s. Working towards a stable 1.0 API.
Google also offers Knative as a service called Cloud Run, either fully managed or on top of your GKE cluster.
- Build – in-cluster build system, source code to containers, as CRDs.
- Eventing – eventing framework, CloudEvents format compliant.
Includes event sources, brokers, triggers (easily consume events, specify filters), an EventType registry, channels (persistence) etc.
- Serving – autoscaling (based on requests or CPU), scale-to-zero, gradual rollouts of new revisions etc. Based on k8s and Istio.
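A minimal Knative Service sketch, showing the Serving side (using the serving.knative.dev/v1 API of later releases; the image is a hello-world sample):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:               # each change to the template creates a new revision
    spec:
      containers:
        - name: app
          image: gcr.io/knative-samples/helloworld-go
          env:
            - name: TARGET
              value: "world"
# Knative routes traffic to revisions, autoscales on request load,
# and scales to zero when idle.
```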
- KEDA
By Red Hat and Microsoft: Kubernetes-based Event-Driven Autoscaling.
Allows Azure Functions to run on k8s, adds multiple Azure event sources (and a few others, like Kafka), provides autoscaling based on input queue size (like Kafka consumer backlog or an Azure Queue), and enables direct event consumption from the source (not decoupled via HTTP).
Red Hat angle – supported on Red Hat OpenShift, their k8s offering.
K8S Cluster Autoscaler (link)
Dynamically provisions new nodes (from the underlying cloud provider) when pods cannot be scheduled due to limited resources. Can also scale down nodes, with some limits (avoiding disruption).
Kubernetes configuration management
Helm vs. kustomize vs. jsonnet vs. ksonnet (discontinued) vs. Replicated Ship vs. Helm 3
Velero (formerly Heptio Ark)
disaster recovery / data migration for k8s apps
Grafana Loki (site)
“Like Prometheus, but for logs”.
Shares Prometheus labels, does only lightweight aggregation, and is integrated with Grafana.
A full, pre-built on-prem k8s stack, including Istio, Knative etc.
- Rook (CNCF)
A storage orchestrator for k8s, supporting multiple storage engines (like Ceph).
- Fluentd (CNCF)
Logging: logs as distributed streams of data (a unified logging layer).
EFK = Elasticsearch, Fluentd, Kibana.
cost optimization / visibility / allocation / recommendations