Random notes on Apache Pulsar / Apache BookKeeper

This is a placeholder for my notes from researching these technologies. It might save you a bit of time if you plan to look into these.
TL;DR:

  • Apache BookKeeper ➝ an alternative to Apache Kafka core (“topics”).
  • Apache Pulsar ➝ built on top of BookKeeper, adds multi-tenancy, multi-region, non-Java clients, tiered storage (offload to S3) etc., plus:
    • Pulsar Functions ➝ Kafka Streams
    • Pulsar IO ➝ Kafka Connect (including Debezium)
    • Pulsar SQL ➝ KSQL (using Presto)
    • Pulsar Schema Registry
  • streaml.io ➝ the company that commercializes it all. Recently added a cloud service offering (on Google Cloud). Great blog.

Apache BookKeeper

Apache BookKeeper provides replicated, durable storage of (append-only) log streams, with low-latency reads and writes (“<5ms”). In their words, “a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads”.

It was originally developed by Yahoo! as part of an alternative solution for HDFS NameNode HA, open sourced as an Apache ZooKeeper sub-project in 2011, and graduated to a top-level project in 2015.

Terminology

  • Record / log entry ➝ basic unit of data.
  • Ledger ➝ a persisted sequence of (append-only) log entries. It has only a single writer and is bounded – eventually sealed when the writer dies or explicitly asks for sealing.
  • Log stream ➝ unbounded stream, uses multiple ledgers, rotated based on a time or size rolling policy.
  • Namespace ➝ a tenant, logical grouping of multiple streams, sharing some policies.
  • Bookie ➝ a single server storing and serving ledger fragments.
  • Ensemble ➝ the collection of bookies that handle a specific ledger. A subset of all the bookies in the BookKeeper cluster.
  • ZooKeeper ➝ the metadata store, used for coordination and metadata. (etcd support for Kubernetes seems to have been committed in 4.9.0)
  • Ledger API ➝ low-level API
  • DistributedLog API ➝ “Log Stream API”. A higher-level, streaming-oriented API. It is a BookKeeper sub-project that was originally an independent open-source project by Twitter. Seems dead: zero activity on its mailing lists, likely since Twitter has moved to Kafka.
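
To make the ledger / log stream relationship above concrete, here is a toy in-memory sketch in Python. It is not the BookKeeper API: the names Ledger, LogStream, LedgerSealedError and max_entries_per_ledger are all made up for illustration, and the "size" rolling policy is simplified to an entry count. It only shows the idea that a ledger is append-only and bounded (sealed at some point), while a log stream stays unbounded by rolling over to a fresh ledger.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class LedgerSealedError(Exception):
    """Raised when something tries to append to a sealed ledger."""


@dataclass
class Ledger:
    """Hypothetical stand-in for a ledger: append-only, bounded, sealable."""
    ledger_id: int
    entries: List[bytes] = field(default_factory=list)
    sealed: bool = False

    def append(self, record: bytes) -> int:
        if self.sealed:
            raise LedgerSealedError(f"ledger {self.ledger_id} is sealed")
        self.entries.append(record)
        return len(self.entries) - 1  # entry id within this ledger

    def seal(self) -> None:
        self.sealed = True


class LogStream:
    """Hypothetical unbounded stream built from a sequence of ledgers."""

    def __init__(self, max_entries_per_ledger: int = 3) -> None:
        # Assumption: roll by entry count; a real policy would use time or bytes.
        self.max_entries = max_entries_per_ledger
        self.ledgers: List[Ledger] = [Ledger(ledger_id=0)]

    def append(self, record: bytes) -> Tuple[int, int]:
        current = self.ledgers[-1]
        if len(current.entries) >= self.max_entries:
            # Rolling policy hit: seal the full ledger and open a new one.
            current.seal()
            current = Ledger(ledger_id=current.ledger_id + 1)
            self.ledgers.append(current)
        entry_id = current.append(record)
        return (current.ledger_id, entry_id)

    def read_all(self) -> List[bytes]:
        # Reading the stream means reading its ledgers in order.
        return [entry for ledger in self.ledgers for entry in ledger.entries]


if __name__ == "__main__":
    stream = LogStream(max_entries_per_ledger=2)
    for payload in (b"a", b"b", b"c", b"d", b"e"):
        print(stream.append(payload))
    print([ledger.sealed for ledger in stream.ledgers])
```

In this sketch each record is addressed by a (ledger id, entry id) pair, and only the last ledger of the stream is ever open for writes.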

Apache Pulsar

  • Pulsar Functions – lightweight functions for stream processing, can run on the Pulsar cluster or on Kubernetes.
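
To give a feel for how lightweight these are: as far as I can tell from the Pulsar docs, the simplest ("language-native") Python function is just a module with a process function that takes one message and returns the value to publish. The sketch below assumes that form; the exclamation-mark logic is only an illustration, and deploying it to a cluster (topics, packaging) is not shown here.

```python
def process(input):
    # Called once per incoming message; the return value is what would be
    # published to the function's output topic.
    return "{}!".format(input)


if __name__ == "__main__":
    # Local sanity check, no Pulsar cluster involved.
    print(process("hello"))
```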

NOTE – I’m pausing here, will return in the future if relevant
