Chapter 2
Pravega System Architecture
Beneath the promises of infinite scalability and strong consistency lies Pravega's sophisticated distributed architecture: a harmonious orchestration of logical abstractions, persistent storage, and coordination services. This chapter unfolds the inner workings of Pravega's architecture, offering a rare, behind-the-scenes view into how its components interact, adapt, and recover at scale. Prepare to dissect each subsystem with a critical lens, gaining knowledge crucial for both systems designers and operators who demand reliability and efficiency from their streaming platforms.
2.1 Logical and Physical Architecture Overview
Pravega's architecture is fundamentally shaped by a clear separation of logical components and their physical deployment, enabling scalable, fault-tolerant streaming storage. At a high level, the system consists of three primary logical components: the client library, the controller, and the segment store. Each serves distinct responsibilities that collectively support continuous stream storage and scalable data processing.
The client library resides on the application side, providing a high-level abstraction for streams and segments. It offers APIs for reading and writing event streams while handling stream management, segment routing, and interaction with the controller. The client library maintains segment metadata and performs asynchronous I/O operations, exposing streaming semantics directly to application logic without requiring awareness of the underlying infrastructure topology.
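To make the client-side abstraction concrete, the sketch below writes a single event through the Java client library. It assumes a locally running Pravega with its controller listening on tcp://localhost:9090 and an existing scope "examples" containing a stream "my-stream"; the endpoint and names are placeholders, not defaults you must use.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;
import java.util.concurrent.CompletableFuture;

public class HelloPravegaWriter {
    public static void main(String[] args) {
        // Controller endpoint; the client discovers segment stores from here.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();

        // The factory caches segment routing information and manages connections.
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "my-stream",
                     new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {

            // Writes are asynchronous; the future completes once the event
            // has been durably acknowledged by the owning segment store.
            CompletableFuture<Void> ack = writer.writeEvent("routing-key", "hello pravega");
            ack.join();
        }
    }
}
```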
The controller embodies the control plane and coordinates all cluster-wide management operations. Logically centralized, the controller manages the stream lifecycle, including creation, scaling, and truncation of streams. It tracks stream segment metadata, mapping stream segments to segment store instances, provides concurrency control through epoch management, and enforces consistency across distributed nodes. The controller relies on coordination systems such as Apache ZooKeeper for distributed consensus and uses tiered caching of metadata to keep control-plane interactions low-latency.
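The following sketch illustrates control-plane operations driven through the controller: creating a scope and a stream with an event-rate scaling policy via the Java admin API. The scope and stream names, the endpoint, and the scaling thresholds are illustrative choices, not recommendations.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

import java.net.URI;

public class StreamLifecycleExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();

        // StreamManager issues control-plane RPCs to the controller.
        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            streamManager.createScope("examples");

            // Auto-scaling policy: target roughly 100 events/s per segment,
            // split hot segments in two, and never drop below 3 segments.
            StreamConfiguration streamConfig = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 3))
                    .build();

            streamManager.createStream("examples", "my-stream", streamConfig);
        }
    }
}
```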
The segment store forms the core data plane component responsible for durable storage and retrieval of stream segments. Distributed across multiple physical nodes, it handles append-only writes and random reads efficiently, leveraging tiered storage options including local disk and scalable object stores. The segment store exposes a low-level segment abstraction optimized for managing large, dynamically sized objects, providing atomicity and durability guarantees. Each segment store instance manages a subset of segments, enabling load distribution and fault isolation.
Physically, these logical components may be deployed in varying architectures, ranging from single-node clusters for development and testing to complex geo-distributed setups powering production environments. In a single-node development cluster, all components, including the client library, controller, and segment store, often run within a single Java Virtual Machine or container. This minimalist deployment facilitates rapid iteration and debugging but limits scalability and fault tolerance.
In contrast, a multi-node cluster represents a common production deployment model where components are physically separated across a cluster of machines or containers. Typically, multiple segment store instances are horizontally scaled and distributed to balance load and achieve data locality. Controllers are deployed as a redundant cluster, often using leader election protocols to maintain active coordination roles and avoid single points of failure. Clients run external to the cluster, connecting remotely via network endpoints to invoke control plane operations and segment read/write RPCs.
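As a rough illustration of how an external client might be pointed at a production controller endpoint, the sketch below builds a ClientConfig with TLS enabled. The hostname, certificate path, and chosen options are placeholders, and the available builder options can vary between Pravega versions.

```java
import io.pravega.client.ClientConfig;

import java.net.URI;

public class RemoteClientConfigExample {
    public static void main(String[] args) {
        // The "tls://" scheme selects an encrypted connection to the controller;
        // the client then opens separate connections to segment stores for
        // data-plane reads and writes.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tls://controller.prod.example.com:9090"))
                .trustStore("/etc/pravega/certs/ca-cert.pem")   // CA bundle used to verify the server
                .validateHostName(true)
                .build();

        System.out.println("Configured controller: " + clientConfig.getControllerURI());
    }
}
```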
For geo-distributed deployments, Pravega can be configured to replicate streams across geographically distributed data centers or cloud regions, ensuring high availability and disaster recovery. Control plane components synchronize metadata using strong consistency protocols, while segment stores integrate with global storage backends, leveraging replication and erasure coding to sustain data durability. Network partitioning mechanisms and advanced routing logic in the client library allow seamless failover and multi-region transparency.
Mapping logical components to physical resources is essential for optimizing performance and fault isolation. The controller cluster, being metadata-heavy and latency-sensitive, is typically deployed on nodes with high CPU and memory capacity but does not require large disk storage. Segment store nodes prioritize disk I/O throughput and network bandwidth, as they perform heavy data ingestion and retrieval. Physical separation of these roles prevents resource contention and aids targeted scaling, for example by adding segment store nodes to handle higher streaming throughput independently of controller scaling.
Moreover, robust fault tolerance entails isolating failures within the physical nodes hosting specific logical services. Crashes or slowdowns in a segment store instance must not disrupt controller operations or client responsiveness. Through careful deployment planning, such as node affinity, resource limits, and isolated network namespaces, Pravega ensures recovery and availability without cascading faults.
The logical-to-physical mapping in Pravega's architecture supports a modular, extensible streaming storage system. The client library abstracts complexities for application developers; the controller governs consistent cluster management; and the segment store sustains scalable, durable data ingestion. Various deployment models, from single-node development to multi-site production clusters, effectively map these components onto infrastructure to balance throughput, latency, fault isolation, and scalability. This separation and flexibility underscore Pravega's adaptability for modern streaming data applications across diverse operational domains.
2.2 Stream Segment Store Internals
The Stream Segment Store constitutes the core data persistence layer within Pravega, architected to deliver highly efficient, durable, and scalable stream storage. It embodies a meticulously engineered log-structured storage system designed to sustain high-throughput, low-latency access patterns while ensuring fault tolerance and data integrity even under adverse conditions.
At the foundational level, the segment store organizes data into append-only segments, which represent the primitive unit of storage and manipulation. Each segment corresponds to a linear sequence of bytes, enabling efficient sequential writes and reads crucial for streaming workloads. Internally, the store adopts a log-structured approach, writing all mutations (appends, truncations, and attribute updates) in a strictly sequential manner to an underlying durable log. This design minimizes disk seek operations, improving write throughput and reducing write amplification.
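A simplified way to picture this primitive is as an append-only byte sequence addressed by offset. The interface below is an illustrative model of those operations, not Pravega's actual internal API.

```java
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative model of the segment store's primitive unit of storage:
 * an append-only byte sequence addressed by offset. This is a simplified
 * sketch, not Pravega's internal interface.
 */
public interface AppendOnlySegment {

    /** Appends bytes at the tail and returns the offset at which they were written. */
    CompletableFuture<Long> append(byte[] data);

    /** Reads up to maxLength bytes starting at the given offset. */
    CompletableFuture<byte[]> read(long offset, int maxLength);

    /** Discards all data before the given offset (prefix truncation). */
    CompletableFuture<Void> truncate(long beforeOffset);

    /** Marks the segment as sealed; no further appends are accepted. */
    CompletableFuture<Void> seal();
}
```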
Central to the consistency and durability guarantees is the employment of a write-ahead log (WAL) mechanism. Before any segment modification is considered committed, the corresponding write entry is first flushed to the WAL on stable storage. The WAL functions as an immutable, sequential record of all changes, supporting crash recovery by replaying completed operations or rolling back incomplete ones following failures. This ensures that no acknowledged data mutation is lost and that the system can restore a consistent segment state even after abrupt shutdowns.
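The essence of the WAL discipline, persist the log record to stable storage before acknowledging or applying the mutation, can be sketched as follows. In a real Pravega deployment the durable log is a replicated service rather than a local file, and the record framing here is invented purely for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Minimal write-ahead log sketch: records are forced to stable storage
 * before the corresponding mutation is acknowledged or applied.
 * Record framing is deliberately simplistic (length-prefixed blobs).
 */
public final class WriteAheadLog implements AutoCloseable {

    private final FileChannel channel;

    public WriteAheadLog(Path logFile) throws IOException {
        this.channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Durably appends one record; returns only after the bytes reach stable storage. */
    public void logAndSync(byte[] record) throws IOException {
        ByteBuffer frame = ByteBuffer.allocate(Integer.BYTES + record.length);
        frame.putInt(record.length).put(record).flip();
        while (frame.hasRemaining()) {
            channel.write(frame);
        }
        channel.force(true);   // flush data and metadata before acknowledging the write
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```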
To mitigate the storage overhead and performance degradation inherent in continuous appending, the segment store integrates automatic compaction strategies. Over time, segments accumulate obsolete or logically deleted data, such as truncated prefixes or overwritten attribute values. The compaction process reorganizes the on-disk layout by discarding these redundant regions and condensing valid data into contiguous blocks. It balances the trade-off between compaction frequency and system throughput by employing heuristics based on segment size, fragmentation ratio, and system load, thereby reducing read amplification and improving space reclamation without impairing foreground operations.
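One way to express such a heuristic is a simple trigger that weighs reclaimable space against current load, as in the sketch below; the thresholds and the fragmentation measure are invented for exposition and are not Pravega's actual policy.

```java
/**
 * Illustrative compaction trigger: compact only when enough space is
 * reclaimable and the system is not under heavy foreground load.
 */
public final class CompactionPolicy {

    private final double fragmentationThreshold; // e.g. 0.40 = 40% dead bytes
    private final long minSegmentBytes;          // skip segments too small to matter
    private final double maxSystemLoad;          // defer compaction under heavy load

    public CompactionPolicy(double fragmentationThreshold, long minSegmentBytes, double maxSystemLoad) {
        this.fragmentationThreshold = fragmentationThreshold;
        this.minSegmentBytes = minSegmentBytes;
        this.maxSystemLoad = maxSystemLoad;
    }

    /**
     * @param totalBytes bytes currently occupied on disk by the segment
     * @param deadBytes  bytes that are truncated or superseded
     * @param systemLoad normalized load in [0, 1]
     */
    public boolean shouldCompact(long totalBytes, long deadBytes, double systemLoad) {
        if (totalBytes < minSegmentBytes || systemLoad > maxSystemLoad) {
            return false;                       // not worth it, or the system is too busy
        }
        double fragmentation = (double) deadBytes / totalBytes;
        return fragmentation >= fragmentationThreshold;
    }
}
```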
Durability is further enhanced by segment-level replication and checkpointing mechanisms. Data replication can be configured across multiple storage nodes, enabling seamless failover and high availability. Checkpoints capture consistent snapshots of segment metadata and data offsets at safe states, facilitating rapid recovery without the need to replay the entire WAL history. These checkpoints operate alongside the WAL to dramatically reduce recovery time while preserving strong durability semantics.
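The interplay of checkpoints and the WAL during recovery can be summarized as: restore the latest checkpoint, then replay only the log suffix written after it. The sketch below uses hypothetical stand-in types to show that flow.

```java
/**
 * Recovery outline: restore the most recent checkpoint, then replay only
 * the WAL records written after it. All types here are hypothetical
 * stand-ins for the segment store's metadata structures.
 */
public final class RecoveryExample {

    interface Checkpoint {
        long lastAppliedLogOffset();               // WAL position covered by this snapshot
        SegmentState restoreState();               // segment metadata and data offsets
    }

    interface DurableLog {
        Iterable<LogRecord> readFrom(long offset); // records written after the checkpoint
    }

    interface SegmentState {
        void apply(LogRecord record);              // re-applies one mutation
    }

    interface LogRecord { }

    static SegmentState recover(Checkpoint checkpoint, DurableLog log) {
        SegmentState state = checkpoint.restoreState();
        // Only the WAL suffix after the checkpoint is replayed,
        // which is what keeps recovery time short.
        for (LogRecord record : log.readFrom(checkpoint.lastAppliedLogOffset())) {
            state.apply(record);
        }
        return state;
    }
}
```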
The segment store achieves low-latency read semantics through precise concurrency control and caching layers. Writes are first buffered in memory and asynchronously flushed to the log, enabling prompt acknowledgment to clients. Read operations consult in-memory indices mapping segment offsets to physical storage locations, avoiding costly scans and enhancing random-access performance...
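A minimal model of such an in-memory read index is a sorted map from each flushed block's starting offset to its physical location, resolved with a floor lookup; the PhysicalLocation type below is a hypothetical placeholder, and the structure is a simplification of the real index.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Simplified in-memory read index: maps the starting logical offset of each
 * flushed block to its physical location. A floor lookup finds the block
 * containing any requested offset without scanning the segment.
 */
public final class SegmentReadIndex {

    /** Hypothetical pointer to where a block of segment data physically lives. */
    public record PhysicalLocation(String file, long fileOffset, int length) { }

    private final TreeMap<Long, PhysicalLocation> blocksByStartOffset = new TreeMap<>();

    /** Registers a flushed block that begins at the given logical segment offset. */
    public void addBlock(long segmentOffset, PhysicalLocation location) {
        blocksByStartOffset.put(segmentOffset, location);
    }

    /** Locates the block containing the requested logical offset, or null if unknown. */
    public PhysicalLocation locate(long segmentOffset) {
        Map.Entry<Long, PhysicalLocation> entry = blocksByStartOffset.floorEntry(segmentOffset);
        if (entry == null) {
            return null;
        }
        long blockStart = entry.getKey();
        PhysicalLocation block = entry.getValue();
        // The requested offset must fall within the block's extent.
        return segmentOffset < blockStart + block.length() ? block : null;
    }
}
```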