Chapter 2
Apache Hudi: Architecture, Features, and Design Principles
At the heart of efficient data lake ingestion lies Apache Hudi, architected to bring transactional power and analytic agility to open storage environments. In this chapter, we peel back the layers of Hudi's system design: its core table abstractions, advanced write semantics, and the internal mechanisms that deliver performance and reliability at scale. Through a detailed exploration of file layouts, metadata handling, compaction strategies, and the landscape of competing table formats, you'll discover the foundational principles that set Hudi apart for enterprise-grade workloads.
2.1 Overview of Hudi's Table Abstractions
Apache Hudi provides two primary table abstractions that underpin its data management capabilities: Copy-on-Write (COW) and Merge-on-Read (MOR). Each abstraction is designed with specific architectural characteristics that influence performance, storage layout, and query latency, optimized for different workload patterns and operational requirements in large-scale data management systems.
The Copy-on-Write (COW) table type delivers data freshness and read efficiency by applying updates through a complete rewrite of the affected base data files. In essence, when a record is inserted or updated, Hudi rewrites the data file containing that record and atomically replaces it in storage. The result is a read-optimized Parquet layout that eliminates on-the-fly merge work during query execution, so COW tables offer low-latency reads well suited to analytics and BI workloads. The trade-off is write amplification and higher ingestion latency: an entire file must be rewritten whenever any record in it changes, which becomes expensive at high update rates or with large partitions.
Conversely, the Merge-on-Read (MOR) table type introduces a hybrid storage and query model that balances write and read performance by decoupling the handling of base data and incremental changes. MOR maintains data in two forms: read-optimized base files in Parquet format and a series of delta log files that capture incremental updates in a row-oriented, log-structured format (typically Avro). During reads, Hudi merges the base files and the delta logs on the fly, presenting a consistent, up-to-date view of the data. This approach reduces write amplification by allowing frequent updates to be appended as incremental logs without rewriting entire data files. It comes at the cost of increased read latency and complexity: query engines must merge base and incremental data, and compaction must periodically fold logs into base files to maintain query performance.
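To make the distinction concrete, the sketch below shows how the table type is selected at write time through Spark DataSource options. It is a minimal illustration, assuming Spark with the Hudi bundle on the classpath; the table name, field names, and paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-table-types").getOrCreate()

    # Illustrative input batch of order records.
    df = spark.read.json("/data/incoming/orders")

    hudi_options = {
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.partitionpath.field": "order_date",
        "hoodie.datasource.write.precombine.field": "updated_at",
        # COPY_ON_WRITE rewrites affected base Parquet files on every update;
        # MERGE_ON_READ appends updates to delta logs and merges them at read time.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("/lake/orders"))

Switching the table type to COPY_ON_WRITE leaves the rest of the pipeline unchanged; the choice primarily shifts cost between write amplification and query-time merging.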
The architectural distinction between COW and MOR extends to how timelines and commit protocols orchestrate the transactional integrity and consistency within Hudi tables. At its core, Hudi maintains a timeline, a chronological sequence of events capturing commits, compactions, cleanups, and rollbacks. This timeline is fundamental in guaranteeing ACID semantics across distributed writes by coordinating multi-step operations in a fault-tolerant manner.
Hudi's commit protocol stages each write through requested, inflight, and completed instant states, which is central to ensuring atomicity and durability. When a write operation is initiated, Hudi first records a tentative (inflight) commit file on the timeline capturing the metadata and affected file groups. Once the data writes succeed and are validated, the commit is finalized atomically by transitioning that marker to a completed state in the timeline metadata. If the operation fails before finalization, the system identifies and rolls back the incomplete commit so that no partial writes corrupt the dataset. This mechanism, combined with timeline awareness, enforces snapshot isolation, preventing readers from observing partial or inconsistent data states.
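As a rough illustration of this staged protocol, a single commit leaves a trail of instant files under the table's .hoodie directory as it moves from requested to inflight to completed. The names below are schematic; exact suffixes vary by action type and Hudi release.

    .hoodie/<instant_time>.commit.requested   (commit planned, no data written yet)
    .hoodie/<instant_time>.commit.inflight    (data files being written and validated)
    .hoodie/<instant_time>.commit             (finalized atomically; snapshot now visible)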
Timeline metadata also empowers schema evolution and time travel capabilities in Hudi. Schema modifications, such as adding or evolving fields, are integrated as incremental commits, and the timeline tracks the progressive schema versions. Readers querying the table can access consistent snapshots adhering to particular schema versions corresponding to their chosen commit instant. Time travel leverages the timeline's sequential commit history to provide access to historical snapshots of data by querying the dataset at a specified commit or timestamp. This facilitates auditing, reproducibility, and rollback scenarios without implementing external snapshot storage.
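A time-travel read can be expressed directly against the timeline. The following sketch assumes the same hypothetical orders table as earlier and an arbitrary completed commit instant; the as.of.instant option tells the Hudi Spark reader which snapshot to reconstruct.

    # Read the table as of a past commit instant (time travel).
    # The instant value is illustrative; any completed commit on the timeline works.
    historical = (spark.read.format("hudi")
                  .option("as.of.instant", "20240301120000123")
                  .load("/lake/orders"))

    historical.createOrReplaceTempView("orders_asof")
    spark.sql("SELECT COUNT(*) AS orders_at_that_instant FROM orders_asof").show()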
The implementation of these abstractions and protocols in Hudi is optimized for cloud-native, large-scale environments where distributed consistency and latency trade-offs must be carefully balanced. COW tables excel in scenarios demanding low-latency analytical queries with limited update operations, whereas MOR tables favor workloads with frequent incremental updates and stream ingestion at the expense of more complex query-time merges and compaction overhead.
In summary, Hudi's table abstractions, Copy-on-Write and Merge-on-Read, serve as foundational architectural pillars tailored for varying data ingestion and query patterns. Their integration with the commit protocol and timeline mechanisms ensures ACID compliance, seamless schema evolution, and temporal data access. Understanding these distinctions is crucial for designing and optimizing data pipelines that harness Hudi's capabilities in real-world, large-scale data lake environments.
2.2 Write Semantics and Transactional Guarantees
Apache Hudi supports a range of ingestion modes, each tailored to a different data arrival and update scenario: inserts, upserts, and bulk imports. These modes address the requirements of near-real-time analytics, incremental data processing, and large-scale batch ingestion, while providing strong transactional guarantees.
Ingestion Modes: Inserts, Upserts, and Bulk Imports
Inserts correspond to the straightforward ingestion of new data without modifying existing records. This mode suits append-only datasets where new data continuously arrives, and no updates to historical data are necessary. Insert operations in Hudi commit new data as independent file slices within partitions, preserving data immutability.
Upserts combine the functionality of inserts and updates, enabling modification of existing records alongside the insertion of new ones. Upserts are essential in scenarios where data changes gradually over time, such as corrections, late-arriving data, or merges from multiple source systems. Underlying upsert logic performs record-level deduplication and identification of pre-existing keys by leveraging Hudi's indexing mechanisms.
Bulk imports facilitate high-throughput ingestion of large static datasets for creating or replacing entire partitions. This mode bypasses record-level indexing and merging to prioritize speed over incremental update capabilities, making it suitable for initial loads or batch refreshes of data.
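The three modes map onto a single writer-side switch, hoodie.datasource.write.operation. The sketch below reuses the hypothetical orders table from earlier; the input DataFrames (new_events_df, changed_records_df, initial_load_df) are placeholders for whatever the pipeline produces.

    def write_orders(df, operation):
        """Write a DataFrame to the orders table using the given Hudi operation."""
        (df.write.format("hudi")
           .option("hoodie.table.name", "orders")
           .option("hoodie.datasource.write.recordkey.field", "order_id")
           .option("hoodie.datasource.write.partitionpath.field", "order_date")
           .option("hoodie.datasource.write.precombine.field", "updated_at")
           # insert: append-only, no lookup against existing keys
           # upsert: index lookup plus merge of changed records
           # bulk_insert: high-throughput initial/batch load, skips per-record indexing
           .option("hoodie.datasource.write.operation", operation)
           .mode("append")
           .save("/lake/orders"))

    write_orders(new_events_df, "insert")
    write_orders(changed_records_df, "upsert")
    write_orders(initial_load_df, "bulk_insert")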
Transactional Model in Hudi
To provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees across these ingestion modes, Hudi employs a sophisticated transactional model built on top of distributed data lakes. The transaction framework ensures that writers achieve atomic and consistent commits even in highly concurrent environments.
Atomicity is maintained through the timeline abstraction and atomic commit protocols. Each ingestion operation, whether insert or upsert, is encapsulated as a commit instant on Hudi's timeline. The commit instant carries a unique, monotonically increasing timestamp-based version that orchestrates the metadata and data file writes atomically. Data written by an operation becomes visible only after its commit completes successfully; partial writes from failed operations are never exposed to readers.
Consistency guarantees are rooted in the copy-on-write (COW) and merge-on-read (MOR) storage types. During a commit, new data files or delta log files are written alongside the existing data but never overwrite it in place. The atomic commit then switches the active pointers in the timeline to the newly written files, producing a consistent snapshot view on read. Readers therefore see stable, consistent snapshots even while ingestion is in progress.
Isolation within Hudi is achieved through snapshot isolation semantics. Each reader observes a stable version of the dataset corresponding to a specific commit instant. Concurrent writers operate on different commit instants, coordinated via the timeline to avoid write-write conflicts, and the timeline enforces serialization by permitting only one active commit instant per writer at any time. This prevents data corruption and non-repeatable reads during concurrent ingestion.
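These guarantees surface to readers as distinct query types over the same table. The sketch below, again against the hypothetical orders table, contrasts a snapshot read, a read-optimized read on a MOR table, and an incremental read that returns only records committed after a chosen instant.

    # Snapshot query: merged, consistent view as of the latest completed commit.
    snapshot_df = (spark.read.format("hudi")
                   .option("hoodie.datasource.query.type", "snapshot")
                   .load("/lake/orders"))

    # Read-optimized query (MOR tables): base files only, skipping the log merge.
    ro_df = (spark.read.format("hudi")
             .option("hoodie.datasource.query.type", "read_optimized")
             .load("/lake/orders"))

    # Incremental query: only records committed after the given instant.
    incr_df = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20240301120000123")
               .load("/lake/orders"))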
Implementation of Transaction ...