This is a placeholder for my notes from researching these technologies. It might save you a bit of time if you plan to look into these.
- Apache BookKeeper ➝ an alternative to Apache Kafka core (“topics”).
- Apache Pulsar ➝ built on top of BookKeeper; adds multi-tenancy, multi-region, non-Java clients, tiered storage (offload to S3), etc., plus:
- Pulsar Functions ➝ Kafka Streams
- Pulsar IO ➝ Kafka Connect (including Debezium)
- Pulsar SQL ➝ KSQL (using Presto)
- Pulsar Schema Registry
- streaml.io ➝ the company that commercializes it all. Recently added a cloud service offering (on Google Cloud). Great blog.
Apache BookKeeper provides replicated, durable storage of (append-only) log streams, with low-latency reads and writes (“<5ms”). In their words, “a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads”.
It was originally developed by Yahoo! as part of an alternative HA solution for the HDFS NameNode, open sourced as an Apache ZooKeeper sub-project in 2011, and graduated to a top-level project in 2015.
- Record / log entry ➝ basic unit of data.
- Ledger ➝ a persisted sequence of (append-only) log entries. It has only a single writer and is bounded – eventually sealed when the writer dies or explicitly asks for sealing.
- Log stream ➝ unbounded stream, uses multiple ledgers, rotated based on a time or size rolling policy.
- Namespace ➝ a tenant, logical grouping of multiple streams, sharing some policies.
- Bookie ➝ a single server storing and serving ledger fragments.
- Ensemble ➝ the collection of bookies that handle a specific ledger. A subset of all the bookies in the BookKeeper cluster.
- ZooKeeper ➝ the metadata store, used for coordination and metadata. (etcd support for Kubernetes seems to have been committed in 4.9.0)
- Ledger API ➝ low-level API
- DistributedLog API ➝ “Log Stream API”. A higher-level, streaming-oriented API. It is a BookKeeper sub-project that was originally an independent open-source project by Twitter. It seems dead: zero activity on its mailing lists, likely since Twitter has moved to Kafka…
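To make the ledger / log stream relationship from the concepts above a bit more concrete, here is a toy in-memory sketch. It is my own illustration, not the BookKeeper or DistributedLog API: the `Ledger` and `LogStream` classes and the `max_entries` rolling policy are made-up names, and there is no replication, bookies, or metadata store involved.

```python
class Ledger:
    """Bounded, append-only sequence of entries with a single writer (toy model)."""

    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.entries = []
        self.sealed = False

    def append(self, data):
        # A sealed ledger is immutable: no further appends are accepted.
        if self.sealed:
            raise RuntimeError("ledger %d is sealed" % self.ledger_id)
        self.entries.append(data)
        return len(self.entries) - 1  # entry id within this ledger

    def seal(self):
        self.sealed = True


class LogStream:
    """Unbounded stream stitched together from a sequence of ledgers (toy model)."""

    def __init__(self, max_entries):
        # Hypothetical size-based rolling policy: roll after max_entries entries.
        self.max_entries = max_entries
        self.ledgers = [Ledger(0)]

    def append(self, data):
        current = self.ledgers[-1]
        if len(current.entries) >= self.max_entries:
            # Roll: seal the full ledger and open a new one.
            current.seal()
            current = Ledger(current.ledger_id + 1)
            self.ledgers.append(current)
        return current.ledger_id, current.append(data)

    def read_all(self):
        # Reading the stream means reading its ledgers in order.
        for ledger in self.ledgers:
            yield from ledger.entries


stream = LogStream(max_entries=2)
positions = [stream.append(x) for x in ["a", "b", "c", "d", "e"]]
```

Each append returns a (ledger id, entry id) pair, so a position in the unbounded stream is always expressed in terms of one of its bounded ledgers.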
- Pulsar Functions – lightweight functions for stream processing; they can run on the Pulsar cluster or on Kubernetes.
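To give a feel for how lightweight these are: as far as I can tell from the Pulsar docs, a Python "native" function is just a module exposing a `process` function, which is called for each incoming message and whose return value is published to the output topic. A minimal sketch (the exclamation-mark logic is my own example, and deploying it to a cluster is not shown):

```python
def process(input):
    # Called once per message; the returned value goes to the output topic.
    return "{}!".format(input)


# Local sanity check, no Pulsar cluster needed.
result = process("hello")
print(result)
```

Because the function has no dependency on Pulsar itself, it can be unit-tested like any other Python function.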
NOTE – I’m pausing here, will return in the future if relevant