Making sense of all those container technologies

I was recently catching up on container and Kubernetes tech, and I was quite overwhelmed by the number of features, technologies, projects, standards and products.
So, to help me make sense of it all, I wrote down some notes, summarizing and mapping the bits. As usual, a lot of it is about the infra.

But first, a quick word on Docker. Docker was the king of containers in the early days (2013 onwards), offering a monolithic (all-in-one) solution for working with containers. Since mid-2015, strong market pressures have led to industry standardization of each layer, with Docker code typically being used as the basis.

1. Containers

1.1 Container images

The container image (disk) format spec is defined by OCI (Open Container Initiative, backed by the Linux Foundation).

The image is implemented as copy-on-write layers on top of a base image (a standard Linux distro, a lightweight one like Alpine Linux, or Windows). The layers are union-mounted at runtime (e.g., overlayfs).

Typically, the image is defined by a Dockerfile that describes the set of steps needed to build a container image. You can use docker build to build it, though there are multiple “next-gen” alternatives and variations, like Google’s Kaniko (builds images inside an unprivileged k8s container) and BuildKit.
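As an illustration, a minimal Dockerfile might look like this (the base image, package and file names are placeholder examples, not from any real project):

```dockerfile
# Start from a lightweight base image; each instruction below
# adds another copy-on-write layer on top of it.
FROM alpine:3.9

# Install a runtime dependency (one layer).
RUN apk add --no-cache python3

# Copy the application code in (another layer).
COPY app.py /app/app.py

# Default command to run when a container starts from this image.
CMD ["python3", "/app/app.py"]
```

Running docker build -t myapp:1.0 . in the Dockerfile’s directory produces a tagged image; the alternative builders mentioned above (Kaniko, BuildKit) consume the same Dockerfile format.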

Container images are stored in a container registry, like the public Docker Hub, a local Docker Registry, Google Container Registry, etc.
All of them can be accessed with a standard API and tools.
Harbor (CNCF, originally VMware, site, youtube)
Another example of a container registry – one “that stores, signs, and scans container images for vulnerabilities”, with enterprise security features.

1.2 Container runtimes (technical)

1.2.1 Low-level container runtimes

The container runtime spec is defined by OCI, with runc (donated by Docker) as a reference implementation.
A (low-level) container runtime is responsible for creating and destroying containers. For example, to start a container, it may create Linux cgroups, namespaces and security limits, set the container networking and storage and start the container.

  • runc (sometimes just called “Docker”)
    The common, standard runtime (from Docker). Fast container startup with “good enough” isolation, but it does NOT protect against actively hostile code in a container trying to break free.
  • runhcs (link)
    “runc for Windows”, calling the Windows built-in Host Compute Service (HCS). Can use Hyper-V isolation (Linux / Windows images) or process containers (Windows images).
  • Kata containers (link)
    A secure container runtime, using lightweight virtual machines.
    A merge of Intel Clear Containers and runV, stewarded by the OpenStack Foundation (OSF).
    (can leverage the QEMU / NEMU / Firecracker hypervisors – see below)
  • gVisor (link, announce May 2018)
    A sandboxed container runtime from Google (runsc).
    Basically, it uses an unprivileged userspace kernel (written in Go) that intercepts all system calls, shielding the real host kernel.
  • Nabla containers (link)
    A new sandboxed container runtime from IBM.

1.2.2 High-level container runtimes

A high-level container runtime manages the entire lifecycle of a container – for example, downloading its image from a container registry, managing local images, monitoring running containers, etc. It delegates the work of actually starting / stopping a container to a low-level container runtime (as above).

containerd (CNCF) is the most common high-level container runtime. It originated from Docker and it supports multiple container environments including Docker and Kubernetes.

See also Kubernetes CRI runtimes and CRI-O below

1.3 Hypervisors (technical)

If you are going to use a container runtime based on virtualization, which virtualization technology (hypervisor) should you pick?

  • QEMU
    The most common open source hypervisor for VMs.
  • NEMU (site, context)
    NEMU (by Intel) is a stripped-down version of QEMU, with only the bare minimum for modern cloud workloads, and is exclusively KVM-based.
  • Firecracker (site, announce Nov 2018, lwn)
    A bare-metal, KVM-based Virtual Machine Monitor (VMM) written in Rust. Open sourced by AWS – this is the hypervisor behind AWS Lambda. Provides isolation and fast startup (< 125ms). Optimized for functions (no live migration, snapshots, etc.)

Also, rust-vmm is a set of Rust libraries for hypervisor developers (new and fast-evolving).

1.4 Container orchestration

Going beyond a single container, container orchestration tech builds a container cluster from a group of hosts, and provides higher-level functionality like fault tolerance, autoscaling, rolling upgrades, load balancing, ops tooling, secrets management, etc.
I’ll be focusing on Kubernetes in the rest of this post.

  • Kubernetes
    Container orchestration originally from Google (next-gen Borg).
  • Docker Swarm
    Container orchestration from Docker.
  • Marathon
    Container orchestrator of DCOS (Mesos-based cluster).

2. Kubernetes (k8s)

2.1 Kubernetes Pod

In Kubernetes, a pod is the lowest-level unit of deployment. It is made of either a single container, or a few containers that are deployed together as a unit. Each pod has a unique IP address, and access to storage resources.
A pod is meant to hold a single instance of an app – to scale out, deploy multiple pods.
You rarely run pods directly. Instead, you work with higher-level abstractions such as Deployment Controllers, Services or Jobs (see below).

If a pod has multiple containers, they are always co-located (on the same host), co-scheduled, and run in a shared context (same cgroup etc). They also share the same IP address and port space, can reach each other over localhost or local “Unix” IPC (shared memory and semaphores), and share access to the storage volumes.
A common use case is to inject a “sidecar” container into a pod, transparently adding auxiliary functionality, for example around networking, logging etc (see Service Proxy / Envoy below).
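As a sketch, a two-container pod with such a sidecar could be declared like this (all names and images are hypothetical). Both containers run in the shared pod context, so they can also share a volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar        # hypothetical pod name
spec:
  containers:
  - name: app                   # the main application container
    image: example/app:1.0      # placeholder image
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: logs
      mountPath: /var/log/app   # the app writes its logs here...
  - name: log-shipper           # the injected "sidecar" container
    image: example/log-shipper:1.0
    volumeMounts:
    - name: logs
      mountPath: /var/log/app   # ...and the sidecar reads them from the same volume
  volumes:
  - name: logs
    emptyDir: {}                # pod-scoped scratch volume shared by both containers
```

Since both containers share the pod’s network namespace, the sidecar could equally intercept or serve traffic over localhost.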

Pod healthchecks
kubelet monitors its pods’ containers using both liveness probes (should the container be restarted?) and readiness probes (is the container ready to accept traffic?). A pod is ready only when all of its containers are ready (otherwise, its service will not route traffic to it).
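For example, a container spec fragment might declare both probes like this (paths, port and timings are illustrative):

```yaml
containers:
- name: app
  image: example/app:1.0      # placeholder image
  livenessProbe:              # failing this causes kubelet to restart the container
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10   # give the app time to start before probing
    periodSeconds: 5
  readinessProbe:             # failing this removes the pod from service endpoints
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```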

2.2 Kubernetes Controllers

Controllers maintain a group of pods over time.

  • ReplicaSet (RS)
    “guarantee the availability of a specified number of identical Pods”. Basically, it maintains a group of stateless pods, by checking the expected vs. actual number of pods, and killing or launching pods.
    Its config includes:
    1. a target number of pod replicas.
    2. a label-based selector to check the current number of replicas.
    3. a pod template for starting new pods.
  • Deployment
    A ReplicaSet plus an update policy – how to switch to a new app version (new container image). Default: rolling upgrades.
    It supports updates, rollbacks, scaling, pause/resume etc.
    Here is an app rolling upgrade example:
    1. The user updates the deployment to point to a new image.
    2. The deployment creates a new ReplicaSet for the new version.
    3. The deployment adds pods to the new RS and removes them from the old RS (for example, +1, -1, +1, -1).
    4. Eventually, the old empty RS is removed (back to a single RS).
  • Pod Autoscaling
    Use the built-in Horizontal Pod Autoscaler with a Deployment (or RS) to define autoscaling based on metrics like CPU. It basically updates the target number of replicas in a RS based on the observed metrics.
    For example, you can set min/max pods, and CPU threshold.
  • StatefulSet
    “manages the deployment and scaling of a set of Pods, and provides guarantees about (1) the ordering and (2) the uniqueness of these Pods”.
    Used for stateful apps – using persistent volumes, stable network identity etc.
    Some examples: WordPress/MySQL, Cassandra, ZooKeeper
    See also Kubernetes Operators below for richer examples.
  • Jobs
    “a job creates one or more Pods and ensures that a specified number of them successfully terminate”
    There are variations, like multiple parallel pods, and a scheduled variant (CronJob).
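Putting the pieces above together, a minimal Deployment (names and image are placeholders) combines the ReplicaSet fields – replica count, label selector, pod template – with an update strategy:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                     # 1. target number of pod replicas
  selector:
    matchLabels:
      app: web                    # 2. label-based selector to find current replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # at most one extra pod during a rollout
      maxUnavailable: 1
  template:                       # 3. pod template for starting new pods
    metadata:
      labels:
        app: web                  # must match the selector above
    spec:
      containers:
      - name: web
        image: example/web:1.0    # updating this tag triggers a rolling upgrade
```

Changing the image tag and re-applying this manifest triggers the rolling upgrade flow described above (a new ReplicaSet is created and pods are shifted over, +1/-1 at a time).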

2.3 Kubernetes CRI runtimes (technical)

Each Kubernetes node runs a management agent called kubelet. As Kubernetes wanted to support multiple container runtimes, beyond Docker, it decided to define a (gRPC-based) plugin API to talk to container runtimes, called CRI (Container Runtime Interface).

Nowadays, containerd can natively talk with kubelet over CRI, but in the past, it required two intermediate daemons (later only one). In response, the community (led by Red Hat) developed CRI-O (site, CNCF) – a lightweight container engine tailored exclusively to k8s. It also supports all the low-level runtimes. So, today there are two alternative high-level runtimes which are practically the same, and the k8s community is moving to CRI-O.

Kubernetes RuntimeClass
k8s today supports multiple container runtimes in a single cluster.
For example: runc vs. kata-qemu vs. kata-firecracker (on the same node).
For Windows, you need separate worker nodes running Windows Server 2019.
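For example, here is a sketch of a RuntimeClass plus a pod selecting it (the handler name depends on how the node’s CRI runtime is configured, so treat these names as assumptions):

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata                # the name pods refer to
handler: kata               # must match a handler configured in the node's CRI runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: kata    # run this pod with the Kata runtime instead of the default runc
  containers:
  - name: app
    image: example/app:1.0  # placeholder image
```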

2.4 Kubernetes Networking and Services

2.4.1 Kubernetes Networking

Each Kubernetes pod is assigned a unique, routable IP address (the pod IP). This removes the need for node-level port mapping, and allows pod-to-pod communication across nodes in the k8s cluster.

Technically, the implementation creates a flat, NAT-less network, including:

  1. Within a node
    The host network includes a bridge device on its network namespace. Each container network namespace is connected to that bridge using a veth pair.
  2. Between the nodes
    The worker nodes’ bridges are connected together, typically using some overlay network. The implementation is provided by a CNI (Container Network Interface) plugin – a common standard network API with a rich ecosystem of conforming plugins, like Flannel, Calico and Cilium.
    A different example – on AWS EKS, pod IPs are actually allocated from the VPC (not using an overlay network).

2.4.2 Kubernetes Services

A service controls how to expose an app / microservice over the network, providing a single, stable IP address (cluster IP), while the underlying pods (with their pod IP addresses) may come and go. It also provides load balancing across the relevant pods. The service resource includes a label-based pod selector to identify its pods.

The kube-proxy process, which runs on every node, tracks the cluster services and endpoints, and routes traffic for the services to their underlying pods.
It has three proxy modes (implementations) – userspace (old), iptables (the default, better), ipvs (newer, best).

Publishing services (ServiceTypes)
How to expose a service to the world, inside or outside Kubernetes?

  • ClusterIP (default)
    Do not expose the service outside the k8s cluster.
    Access to the clusterIP will be routed (by the kube-proxy on the request origin node) to one of the available pods of the service.
  • NodePort
    All the cluster nodes will expose the same fixed port for this service. An internal ClusterIP service is automatically created, and external access to the NodeIP:NodePort will be routed to that ClusterIP.
  • LoadBalancer
    Create and use an external load balancer (from the cloud provider). That load balancer will route traffic to an automatically created NodePort (and ClusterIP). So, external traffic entry to k8s is spread across all nodes.
  • ExternalName
    Returns a specific external (DNS) hostname.
  • None
    Creates a headless service – it has a DNS entry that points to all pod IPs (if the service is defined with a selector), or to a specific external IP address.

For services that use a selector to pick pods, k8s also automatically maintains Endpoints objects (basically, the IP:port pairs).
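A minimal Service definition (names and ports are placeholders) ties these pieces together – the selector picks the pods, and the type picks how the service is published:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort            # also creates the internal ClusterIP automatically
  selector:
    app: web                # label-based pod selector (drives the Endpoints object)
  ports:
  - port: 80                # the service (ClusterIP) port
    targetPort: 8080        # the container port on the selected pods
    nodePort: 30080         # fixed port exposed on every cluster node
```

Omitting the type field gives the default ClusterIP behavior; setting it to LoadBalancer layers a cloud load balancer on top of the NodePort, as described above.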

2.4.3 Ingress / Ingress Controllers

While Ingress is a separate k8s resource from services, it looks like a generalization of them. An Ingress Controller is the backend that fulfills Ingress resources as needed. Ingress connects apps to external traffic. It can both create external resources like load balancers, and also run code inside the k8s cluster (in its own pods), applying some logic to the incoming traffic, typically HTTP(S).
For example, an Ingress may fan out different endpoints to different backend services, apply custom load-balancing logic, turn on TLS, etc.
The most common ingress controller is ingress-nginx, but there are many others, such as HAProxy, Contour, etc.
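For example, a fan-out Ingress with TLS might look like this (host names, secret and service names are all hypothetical; the API group was still v1beta1 at the time of writing):

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-ingress
spec:
  tls:
  - hosts:
    - shop.example.com
    secretName: shop-tls        # TLS terminated at the ingress, using this cert secret
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /api              # fan-out: different paths go to different services
        backend:
          serviceName: api-svc
          servicePort: 80
      - path: /
        backend:
          serviceName: web-svc
          servicePort: 80
```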

See also Service Meshes below

2.5 Service meshes

A service mesh adds “brains” to the k8s network layer (the data plane), so each service you deploy can stay dumb (at least regarding networking).
The common service mesh functionality includes:

  • Traffic policy
    Authenticating services, providing an authorization layer (think AWS IAM for your own services), and transparently encrypting network traffic between services (mTLS for HTTP).
  • Traffic telemetry
    Collecting fine-grained, standardized metrics across all services.
    For example, latency, throughput and errors per HTTP endpoint.
  • Traffic management
    Smarter load balancing, for example client-side load balancing, or shifting 1% of the traffic to the canary deployment.

Istio (site)
A rich and complex service mesh, led by Google / IBM / Lyft etc.
Leveraging Envoy as service proxy (see below).

Linkerd (CNCF, site)
An ultralight k8s-specific service mesh “that just works”.
Can be deployed incrementally (service-by-service).
Linkerd 1.x was based on “Twitter stack” (Scala, Finagle, Netty, JVM).
Linkerd 2.x (a total rewrite) is based on Go (control plane) and Rust (data plane; service proxy), dramatically reducing both complexity and footprint. Here is a good article for context.

See also a recent Istio vs. Linkerd performance analysis.

There is also a recent standardization effort called SMI (Service Mesh Interface, site, announce), providing a standard interface for meshes on k8s.

2.5.1 Service Proxy

Service meshes are typically built on top of a service proxy, which is the component that actually hijacks the pod network traffic (data plane network), analyzing it and acting on it.

A service proxy is typically automatically injected as a sidecar container in newly deployed pods (by implementing an Admission Controller on the k8s API server). It routes the pod traffic to itself using iptables rules.

Envoy (CNCF, site) is a popular service proxy used by multiple meshes.
Contour (site) is a smarter k8s ingress controller with Envoy integration.

2.5.2 Random bits on top of a service mesh

Flagger (site)
A k8s operator that automates the promotion of canary deployments, using service meshes (for traffic shifting) and Prometheus metrics (for analysis).

Kiali (site)
Visualizes the topology, health, etc. of services on the Istio service mesh.

2.6 Building and packaging

Helm (CNCF, site) is a package manager for k8s. Its packages (charts) describe k8s resources (services, deployments etc). Installing a chart creates the relevant resources (deploying an instance of the app).
Helm 3 is coming later this year, as a better but totally incompatible version.

There are also a lot of k8s CI/CD projects.
Check out Jenkins X (GitOps for k8s), and this Google post discussing CDF (the new Continuous Delivery Foundation), Tekton (shared CI/CD building blocks), Spinnaker (CD platform) and Kayenta (automated canary testing).

2.7 Serverless on Kubernetes

Knative (site)
A Kubernetes-native serverless framework, led by Google, to build and run serverless apps on k8s. Working towards a stable 1.0 API.
Google also offers Knative as a service called Cloud Run, either fully managed or on top of your GKE cluster.

  • Build – in-cluster build system, source code to containers, as CRDs.
  • Eventing – eventing framework, CloudEvents format compliant.
    Includes event sources, brokers, triggers (easily consume events, specify filters), EventType registry, channels (persistency) etc.
  • Serving – autoscaling (based on requests or CPU), scale-to-zero, gradual rollouts of new revisions etc. Based on k8s and Istio.

KEDA (announce)
By Red Hat and Microsoft: Kubernetes-based Event-Driven Autoscaling.
Allows Azure Functions to run on k8s, adds multiple Azure event sources (and a few others, like Kafka), provides autoscaling based on input queue size (like Kafka consumer backlog or Azure Queue length), and enables direct event consumption from the source (not decoupled over HTTP).
Red Hat angle – supported on Red Hat OpenShift, their k8s offering.




Kubernetes operators

A design pattern for stateful apps, automating their lifecycle.
Operator Hub is a marketplace for operators.
For example, Strimzi (site) is an operator for running Kafka on k8s.

K8S Cluster Autoscaler (link)
Dynamically provisions new nodes (from the underlying cloud provider) when pods cannot be scheduled due to limited resources. Can also scale down nodes, with some limits (avoiding disruption).

Kubernetes configuration management
Helm vs. kustomize vs. jsonnet vs. ksonnet (discontinued) vs. Replicated Ship vs. Helm 3

kubeflow (site, announce)
A machine learning toolkit for k8s, “making deployment of ML workflows on k8s simple, portable and scalable”.

Velero (formerly Heptio Ark)
disaster recovery / data migration for k8s apps

Grafana Loki (site)
“Like Prometheus, but for logs”.
It shares Prometheus labels, does only lightweight aggregation, and is integrated with Grafana.

Anthos (site)
A full, pre-built on-prem k8s stack, including Istio, Knative, etc.

A storage orchestrator for k8s, supporting multiple storage engines (like Ceph).

Logging
Logs are distributed streams of data (a unified logging layer). A common stack is EFK – Elasticsearch, Fluentd, Kibana.

kubecost (site)
cost optimization / visibility / allocation / recommendations

Random notes on Apache Pulsar / Apache BookKeeper

This is a placeholder for my notes from researching these technologies. It might save you a bit of time if you plan to look into these.

  • Apache BookKeeper ➝ an alternative to Apache Kafka’s core (“topics”).
  • Apache Pulsar ➝ built on top of BookKeeper; adds multi-tenancy, multi-region support, non-Java clients, tiered storage (offload to S3), etc., plus:
    • Pulsar Functions ➝ Kafka Streams
    • Pulsar IO ➝ Kafka Connect (including Debezium)
    • Pulsar SQL ➝ KSQL (using Presto)
    • Pulsar Schema Registry
  • ➝ the company that commercializes it all. It recently added a cloud service offering (on Google Cloud). Great blog.

Apache BookKeeper

Apache BookKeeper provides replicated, durable storage of (append-only) log streams, with low-latency reads and writes (“<5ms”). In their words, “a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads”.

It was originally developed by Yahoo! as part of an HDFS NameNode HA alternative solution, open sourced as an Apache ZooKeeper sub-project in 2011, and graduated to a top-level project in 2015.


  • Record / log entry ➝ basic unit of data.
  • Ledger ➝ a persisted sequence of (append-only) log entries. It has only a single writer and is bounded – eventually sealed when the writer dies or explicitly asks for sealing.
  • Log stream ➝ unbounded stream, uses multiple ledgers, rotated based on a time or size rolling policy.
  • Namespace ➝ a tenant, logical grouping of multiple streams, sharing some policies.
  • Bookie ➝ a single server storing and serving ledger fragments.
  • Ensemble ➝ the collection of bookies that handles a specific ledger. A subset of all the bookies in the BookKeeper cluster.
  • ZooKeeper ➝ the metadata store, for coordination and metadata. (etcd support, for Kubernetes, seems to be committed for 4.9.0)
  • Ledger API ➝ low-level API
  • DistributedLog API ➝ “Log Stream API”. A higher-level, streaming-oriented API. It is a BookKeeper sub-project that was originally an independent open-source project by Twitter. It seems dead – zero activity on its mailing lists, likely since Twitter has moved to Kafka.

Apache Pulsar

  • Pulsar Functions – lightweight functions for stream processing, which can run on the Pulsar cluster or on Kubernetes.

NOTE – I’m pausing here, will return in the future if relevant

So what is serverless?

“Serverless? but there is a physical server underneath”

You might have encountered some cynical remarks about the term “Serverless”. I believe this actually comes from confusion about which type of server is being removed, and also from different people having different cloud backgrounds.

Let’s say you have an app that exposes a REST API. It runs all the time, waiting for requests and processing them as they arrive. This daemon may expose a dozen REST endpoints. It likely has a main loop that waits for HTTP requests from clients, runs the corresponding business logic, and returns a JSON response or some HTTP error.

This app acts as a server, as in “client/server” architecture.
Serverless is about removing that server layer (the always-on “main loop”), which means wiring client requests directly to the business logic functions, without constantly paying for an up-and-running server. You “pay-as-you-go” for the actual resource consumption of your business logic code, not for reserving capacity by the hour.

How does it work?

Technically, if you convert the above app to serverless, you will provide:

  1. The “business logic” functions
    Instead of packaging your entire app, you package each of your inner business logic functions, and upload (only) them to your function service (sometimes called FaaS / Function as a Service), for example to the AWS Lambda service.
  2. The REST glue
    You declare a mapping between HTTP endpoints (and methods) and the functions, so calling an endpoint triggers the corresponding function. For example, a POST to a customer endpoint may invoke the create_customer() function. The details vary between cloud providers, but on AWS you set it up using the Amazon API Gateway service.
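To make this concrete, here is a sketch of such a business-logic function in the AWS Lambda Python handler style. The event shape follows the usual gateway-to-Lambda mapping, but the in-memory “database” and all names here are purely illustrative:

```python
import json

# Hypothetical in-memory "database", standing in for a real data store.
CUSTOMERS = {}

def create_customer(event, context=None):
    """Business logic invoked directly by the FaaS runtime.

    `event` carries the HTTP request as mapped by the API gateway;
    note there is no always-on server loop anywhere around this code.
    """
    body = json.loads(event["body"])
    customer_id = str(len(CUSTOMERS) + 1)
    CUSTOMERS[customer_id] = {"id": customer_id, "name": body["name"]}
    # Return an HTTP-style response for the gateway to pass back to the client.
    return {"statusCode": 201, "body": json.dumps(CUSTOMERS[customer_id])}
```

The mapping from the POST endpoint to create_customer lives in the gateway configuration, not in the code – that is exactly the “REST glue” from step 2.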

Behind the scenes, your cloud provider will transparently create servers to run your functions – for example, creating JVMs with their own “main()” for Java functions. It will automatically add or remove such servers based on the current number of concurrent executions of your functions, will reuse those servers to save startup time, will stop them all when a function has not been called for a while, etc.

Continue reading

Exploring the HDFS Default Value Behaviour

In this post I’ll explore how the HDFS default values really work, which I found to be quite surprising and non-intuitive, so there’s a good lesson here.

In my case, I have my local virtual Hadoop cluster, and I have a different client outside the cluster. I’ll just put a file from the external client into the cluster in two configurations.

All the nodes of my Hadoop (1.2.1) cluster have the same hdfs-site.xml file with the same (non-default) value for dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x) of 134217728, which is 128MB. In my external node, I also have the hadoop executables with a minimal hdfs-site.xml.

First, I have set dfs.block.size to 268435456 (256MB) in my client hdfs-site.xml and copied a 400MB file to HDFS:

 ./hadoop fs -copyFromLocal /sw/400MB.file /user/ofir

Checking its block size from the NameNode:

./hadoop fsck /user/ofir/400MB.file -files -blocks -racks
FSCK started by root from / for path /user/ofir/400MB.file at Thu Jan 30 22:21:29 UTC 2014
/user/ofir/400MB.file 419430400 bytes, 2 block(s):  OK
0. blk_-5656069114314652598_27957 len=268435456 repl=3 [/rack03/, /rack03/, /rack02/]
1. blk_3668240125470951962_27957 len=150994944 repl=3 [/rack03/, /rack03/, /rack01/]

 Total size:    419430400 B
 Total dirs:    0
 Total files:    1
 Total blocks (validated):    2 (avg. block size 209715200 B)
 Minimally replicated blocks:    2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    1
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        6
 Number of racks:        3
FSCK ended at Thu Jan 30 22:21:29 UTC 2014 in 1 milliseconds

So far, looks good – the first block is 256MB.

Continue reading

Exploring HDFS Block Placement Policy

In this post I’ll cover how to see where each data block is actually placed. I’m using my fully-distributed Hadoop cluster on my laptop for exploration and the network topology definitions from my previous post – you’ll need a multi-node cluster to reproduce it.


When a file is written to HDFS, it is split up into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case, left as the default, which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (in most of this post, set to 3, which is the default). Each copy of a block is called a replica.
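For reference, the two parameters sit in hdfs-site.xml like this (the values shown are the defaults discussed here, using the Hadoop 1.x parameter names):

```xml
<!-- hdfs-site.xml (Hadoop 1.x names; dfs.block.size became dfs.blocksize in 2.x) -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>   <!-- 64MB, the default block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- default number of replicas per block -->
  </property>
</configuration>
```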

Regarding my setup – I have nine nodes spread over three racks. Each rack has one admin node and two worker nodes (running DataNode and TaskTracker), using Apache Hadoop 1.2.1. Please note that in my configuration, there is a simple mapping between host name, IP address and rack number – host name is hadoop[rack number][node number], and the IP address is [rack number][node number].

Where should the NameNode place each replica?  Here is the theory:

The default block placement policy is as follows:

  • Place the first replica somewhere – either a random node (if the HDFS client is outside the Hadoop/DataNode cluster) or on the local node (if the HDFS client is running on a node inside the cluster).
  • Place the second replica in a different rack.
  • Place the third replica in the same rack as the second replica
  • If there are more replicas – spread them across the rest of the racks.

That placement algorithm is a decent default compromise – it provides protection against a rack failure while somewhat minimizing inter-rack traffic.

So, let’s put some files on HDFS and see if it behaves like the theory in my setup – try it on yours to validate.

Continue reading

Exploring The Hadoop Network Topology

In this post I’ll cover how to define and debug a simple Hadoop network topology.


Hadoop is designed to run on large clusters of commodity servers – in many cases spanning many physical racks of servers. A physical rack is in many cases a single point of failure (for example, having typically a single switch for lower cost), so HDFS tries to place block replicas on more than one rack. Also, there is typically more bandwidth within a rack than between the racks, so the software on the cluster (HDFS and MapReduce / YARN) can take it into account. This leads to a question:

How does the NameNode know the network topology?

By default, the NameNode has no idea which node is in which rack. It therefore by default assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack “/default-rack“.

So, we have to teach Hadoop our cluster network topology – the way the nodes are grouped into racks. Hadoop supports a pluggable rack topology implementation, controlled by the parameter topology.node.switch.mapping.impl in core-site.xml, which specifies a Java class implementation. The default implementation uses a user-provided script, specified in the same config file – a script that gets a list of IP addresses or host names and returns a list of rack names. The script will get up to topology.script.number.args parameters per invocation – by default, up to 100 requests per invocation (in Hadoop2: net.topology.script.number.args).
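As a sketch, a topology script for a naming scheme like mine (host names of the form hadoop[rack][node]) could look like this. The naming pattern and rack labels are specific to my setup, so treat this as an illustration rather than a general-purpose script:

```python
#!/usr/bin/env python
# Hypothetical rack-topology script: Hadoop passes a batch of host names
# or IP addresses as arguments, and expects one rack name per argument.
import re
import sys

def rack_of(host):
    # Host names follow the pattern hadoop<rack><node>, e.g. hadoop21
    # is node 1 in rack 2; map it to a rack label like /rack02.
    m = re.match(r"hadoop(\d)\d$", host)
    if m:
        return "/rack0" + m.group(1)
    # Anything we cannot map falls back to Hadoop's default rack.
    return "/default-rack"

if __name__ == "__main__":
    # One rack name per input argument, space-separated on one line.
    print(" ".join(rack_of(h) for h in sys.argv[1:]))
```

The script path goes into core-site.xml, and the NameNode invokes it in batches as described above.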

Continue reading

Creating a virtualized fully-distributed Hadoop cluster using Linux Containers

TL;DR Why and how I created a working 9-node Hadoop cluster on my laptop

In this post I’ll cover why I wanted to have a decent multi-node Hadoop cluster on my laptop, why I chose not to use virtualbox/VMware player, what is LXC (Linux Containers) and how did I set it up. The last part is a bit specific to my desktop O/S (Ubuntu 13.10).

Why install a fully-distributed Hadoop cluster on my laptop?

Hadoop has a “laptop mode” called pseudo-distributed mode. In that mode, you run a single copy of each service (for example, a single HDFS namenode and a single HDFS datanode), all listening under localhost.

This feature is rather useful for basic development and testing – if all you want is to write some code that uses the services and check whether it runs correctly (or at all). However, if your focus is the platform itself – an admin perspective – then that mode is quite limited. Trying out many of Hadoop’s features requires a real cluster – for example replication, failure testing, rack awareness, HA (of every service type) etc.

Why VirtualBox or VMware Player are not a good fit?

When you want to run several or many virtual machines, the overhead of this desktop virtualization software is big. Specifically, as they run a full-blown guest O/S inside them, they:

  1. Consume plenty of RAM per guest – always a premium on laptops.
  2. Consume plenty of disk space – which is limited, especially on SSD.
  3. Take quite a while to start / stop

To run many guests on a laptop, a more lightweight infrastructure is needed.

What is LXC (Linux Containers)

Continue reading

The end of the monolithic database engine dream

It seems I like calling my posts “The End of…” 🙂

Anyway, Rob Klopp wrote an interesting post titled “Specialized Databases vs. Swiss Army Knives“. In it, he argues with Stonebraker’s claim that the database market will split into three to six categories of databases, each with its own players. Rob counters by saying that data is typically used in several ways, so it is cumbersome to have several specialized databases instead of one decent one (like Hana, of course…).

I have a somewhat different perspective. In the past, let’s say ten years ago, I was sure that the Oracle database was the right thing to throw at any database challenge, and I think many in the industry shared that feeling (each with his/her favorite database, of course). There was a belief that a single database engine could be smart enough, flexible enough and powerful enough to handle almost everything.

That belief is now history. As I will show, it is now well understood and acknowledged by all the major vendors that a general-purpose database engine just can’t compete in all high-end niches. HOWEVER, the existing vendors are, as always, adapting. All of them are extending their databases to offer multiple database engines inside their product, each for a different use case.

The leader here seems to be Microsoft SQL Server. SQL Server 2014 (currently at CTP2) comes with three separate database engines. In addition to the existing engine, they introduced Hekaton – an in-memory OLTP engine that looks very promising. They also delivered a brand new implementation of their columnar format – now called clustered columnstore index – which is now fully updatable and is actually not an index – it is a primary table storage format with all the usual plumbing (delta trees with a tuple mover process when enough rows have accumulated).

Continue reading

Industry Standard SQL-on-Hadoop benchmarking?

Earlier today, a witty comment I made on Twitter led to a long and heated discussion about vendor exaggerations regarding SQL-on-Hadoop relative performance. It started as a link to this post by Hyeong-jun Kim, CTO and Chief Architect at Gruter. That post discusses some of the ways vendors exaggerate, and suggests verifying their claims with your own data and queries.

Anyway, Milind Bhandarkar from Pivotal suggested that joining an industry-standard benchmarking effort might be the right thing. I disagree, and wanted to elaborate on that:

Industry Standard SQL-on-Hadoop benchmark won’t improve a thing

  • These benchmarks (at least in the SQL space) don’t help users pick a technology. I’ve never heard of a customer who picked a solution because it led the performance or price/performance list of TPC-C or TPC-H.
    Customers will always benchmark on their own data and their own workload…
  • …And they do so because in this space, many small variations in data (ex: data distribution within a column) or workload (SQL features, concurrency, transaction type mix) will have a dramatic impact on the results.
  • So, the vendors who write the benchmark will fight to the death to have the benchmark highlight their (existing) strengths. If the draft benchmark shows them at the bottom, they’ll withdraw and bad-mouth the benchmark for not incorporating their suggestions.
  • Of course, the closed-source players still won’t allow publishing benchmark results – the “DeWitt Clause” (right, Milind? What about HAWQ?)
  • And even with a standard, all vendors will still use micro-benchmarks to highlight the full extent of new features and optimizations, so the rebuttals and flame wars will not magically end.

What I think is the right thing for users

Since vendors (and users) will not agree on a single benchmark, the next best thing is that each player will develop their own benchmark(s), but will make it easily reproducible.

For example, share the dataset, the SQLs, the specific pre-processing if any, the specific non-default config options if any, the exact SW versions involved, the exact HW config and hopefully – provide the script that runs the whole benchmark.
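As a sketch of what such a disclosure could look like in practice, here is a hypothetical machine-readable manifest in Python – every field name, dataset, version and setting below is illustrative, not taken from any real benchmark:

```python
import json

# Hypothetical disclosure manifest for a reproducible benchmark run.
# The idea: every knob that can affect the results -- data, queries,
# pre-processing, config overrides, software and hardware -- is captured
# in one file that anyone can inspect and re-run from.

manifest = {
    "dataset": {
        "name": "lineitem_1tb",                  # illustrative
        "url": "https://example.com/lineitem",   # illustrative
        "format": "parquet",
    },
    "preprocessing": [
        "partition by l_shipdate",
        "sort within partitions by l_orderkey",
    ],
    "queries": ["q01.sql", "q05.sql", "q17.sql"],
    "config_overrides": {"executor_memory": "16g", "shuffle_partitions": 400},
    "software": {"engine": "ExampleSQL", "version": "1.4.2"},
    "hardware": {"nodes": 8, "cores_per_node": 16, "ram_gb": 128,
                 "storage": "local SSD"},
    "run_script": "run_benchmark.sh",
}

with open("benchmark_disclosure.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Publishing something like this alongside the numbers is what turns a marketing claim into a result others can actually reproduce and vary.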

This would allow the community and ecosystem – users, prospects and vendors – to:

  • Quickly reproduce the results (maybe on a somewhat different config).
  • Play with different variations to learn how stable the conclusions are (different file formats, data set sizes, SQL and parameter variations and many more).
  • Share their result in the open, using the same disclosure, for everyone to learn from it or respond to it.

Short version – community over committee.

Why that also won’t happen

Sharing a detailed, easily reproducible report is great for smart users who want to educate themselves and choose the right product.

However, there is nearly zero incentive for vendors and projects to share it, especially for the leaders. Why? Because they are terrified that a competitor will use it to show that they are way faster… That could be the ultimate marketing fail (a small niche player can afford it, since they have little to lose).

Some other reasonable excuses – there could be a dependency on internal testing frameworks, scripts or non-public datasets, or not enough resources to clean up and document each micro-benchmark or to follow up with everyone, etc.

Also, such disclosure may prevent marketing / management from highlighting some results out of context (or add wishful thinking)… Not sure many are willing to report to their board – “sales are down since we told everyone that our product is not yet good enough”.
For example – I haven’t yet seen a player claiming “Great performance! We are now 0.7x faster than the current performance leader!”. Somehow you’ll need to claim that you are great, even if only for some specific scenario under specific (undisclosed?) constraints.


To sum up:

  • You can’t really optimize picking the right tool by building a universal performance benchmark (and picking is not only about performance).
  • Human interests (commercial and others) will generate friction as everyone competes for mindshare. Social pressure might sometimes push competitors toward civilized technical discussion, but only rarely.
  • When sharing performance numbers, please be nice – share what you can and don’t mislead.
  • As a user, try to learn and verify before making big commitments, and be prepared to screw up occasionally…

Big Data PaaS part 1 – Amazon Redshift

Last month I attended AWS Summit in Tel-Aviv. It was a great event, one of the largest local tech events I’ve seen. Anyway, a session on Redshift got me thinking.

When Amazon Redshift was announced (Nov 2012), I was working at Greenplum. I remember that while Redshift pricing was impressive, I mostly dismissed it at the time due to some perceived functionality gaps and its “limited beta” status. Back to last month’s session, I came to see what’s new and was taken by surprise by the tone of the presentation.

As a database expert, I thought I would hear about the implementation details – the sharding mechanism, the join strategies, the various types of columnar encoding and compression, the pipelined data flow, the failover design etc… While a few of these were briefly mentioned, they were definitely not the focus. Instead, the main focus was the value of getting your data warehouse in a PaaS platform.

So what is the value? It is a combination of several things. Lower cost is of course a big one – it allows smaller organizations to have a decent data warehouse. But it is much more than cost – being a PaaS means you can immediately start working without worrying about everything from a complex, time-consuming POC to all the infrastructure work of a successful production-grade deployment. Just provision a production cluster for a day or two, try it out on your data and queries, and if you like the performance (and cost) – simply continue to use it. It allows a very safe play – or as they put it, a “low cost of failure”. In addition, a PaaS environment can handle all the ugly infrastructure tasks around MPP databases or appliances that typically block adoption – backups, DR, connectivity etc.
Let me elaborate on these points:

I guess anyone who has done a DW POC can relate to the following painful process. You have to book the POC many weeks in advance, pick from several uneasy options (a remote POC, flying to the vendor’s central site, or coordinating a delivery to your data center), pick a small subset of data and queries to test in advance, skip your native BI/ETL tools, rely on the vendor’s experts to run and tune the POC, finish with a lot of open questions as you ran out of time, pick something, pay a lot of money, wait a couple of months for delivery, start migration / testing / re-tuning (or even DW design and implementation), and several months later you will likely realize how good (or bad) your choice really was… This is always a very high-risk maneuver, and the ability to provision a production environment, try it out and, if you like it, just keep it without a huge upfront commitment is a very refreshing concept.

Same goes for many “annoying” infrastructure bits. For example – backup, recovery and DR. Typically these will not be tested in a real-world configuration as part of a POC due to various constraints, and the real tradeoff between feasible options in cost, RPO and RTO will not be seriously evaluated – or in many cases even considered. Having a working backup included with Redshift – meaning, the backup functionality with the underlying infrastructure and storage – is again huge. Another similar one is patching – it is nice that it’s Amazon’s problem.

Last of these is scaling. With an appliance, scaling out is painful. You order an (expensive) upgrade, wait a couple of months for it to be installed, then try to roll it out (which likely involves many hours of storage re-balancing), then pray it works. Obviously you can’t scale down to reduce costs. With Redshift, they provision a second cluster (while your production is switched to read-only mode) and switch you over to the new one “likely within an hour”. Of course, with Redshift you can scale up or down on demand, or even turn off the cluster when not needed (but there are some pricing gotchas, as there is a strong pricing bias toward 3-year reserved instances).
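The reserved-instance bias is easy to see with some back-of-the-envelope arithmetic. All dollar figures below are made up for illustration – check current AWS pricing before drawing any conclusions:

```python
# Back-of-the-envelope comparison of on-demand vs. 3-year reserved
# cluster pricing. All prices here are hypothetical, for illustration
# only -- they are NOT actual AWS rates.

HOURS_PER_YEAR = 24 * 365

on_demand_per_node_hour = 0.85        # hypothetical on-demand rate
reserved_upfront_per_node = 3000.0    # hypothetical 3-year upfront fee
reserved_per_node_hour = 0.21         # hypothetical reserved hourly rate


def three_year_cost(nodes, utilization=1.0):
    # On-demand: you pay only for the hours the cluster is actually up.
    hours_used = 3 * HOURS_PER_YEAR * utilization
    on_demand = nodes * on_demand_per_node_hour * hours_used
    # Reserved: the commitment is paid for the full term regardless of use.
    reserved = nodes * (reserved_upfront_per_node
                        + reserved_per_node_hour * 3 * HOURS_PER_YEAR)
    return on_demand, reserved


# At full utilization, reserved wins by a wide margin...
od, rs = three_year_cost(nodes=4)
print(f"24x7:  on-demand ${od:,.0f} vs reserved ${rs:,.0f}")
# ...but if the cluster runs only ~6 hours a day (and is shut down the
# rest of the time), on-demand comes out cheaper.
od_low, rs_low = three_year_cost(nodes=4, utilization=0.25)
print(f"6h/day: on-demand ${od_low:,.0f} vs reserved ${rs_low:,.0f}")
```

So the “turn it off when not needed” flexibility mostly pays off if you stay on-demand – once you commit to a 3-year reservation, you are paying whether the cluster runs or not.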

If you’re interested, you can find Guy Ernest’s presentation here. During his presentation, most slides were hidden as he and Intel had only 45 minutes – I see now that the full slide deck actually does have some slides full of implementation details 🙂

BTW – another thing I was curious about: can the Amazon team significantly evolve the Redshift code base given its external origin (ParAccel)? I assumed that it is not a trivial task. Well, I just read last week that AWS is releasing a big functionality upgrade for Redshift (plus some more bits), so I think that one is also off the table.

It would be interesting to see if Redshift would now gain more traction, especially as it got more integrated into the AWS offering and workflow.