Apache Iceberg: The Definitive Guide

Name: Apache Iceberg: The Definitive Guide
Brand: O'Reilly
Price: 58.99 EUR
Availability: OnlineOnly

Tomer Shiran, Jason Hughes, Alex Merced (Authors)

O'Reilly (Publisher)

Published on 2 May 2024

344 pages

E-Book

PDF with Adobe-DRM

978-1-0981-4859-1 (ISBN)

€58.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Intro
Copyright
Table of Contents
Foreword by Gerrit Kazmaier
Foreword by Raghu Ramakrishnan
Foreword by Rick Sears
Preface
About This Book
Why We Wrote This Book
What You Will Find Inside
How to Use This Book
Feedback and Questions
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Fundamentals of Apache Iceberg
Chapter 1. Introduction to Apache Iceberg
How Did We Get Here? A Brief History
Foundational Components of a System Designed for OLAP Workloads
Bringing It All Together
The Data Warehouse
A Brief History
Pros and Cons of a Data Warehouse
The Data Lake
A Brief History
Pros and Cons of a Data Lake
Should I Run Analytics on a Data Lake or a Data Warehouse?
The Data Lakehouse
What Is a Table Format?
Hive: The Original Table Format
Modern Data Lake Table Formats
What Is Apache Iceberg?
How Apache Iceberg Came to Be
The Apache Iceberg Architecture
Key Features of Apache Iceberg
Conclusion
Chapter 2. The Architecture of Apache Iceberg
The Data Layer
Datafiles
Delete Files
The Metadata Layer
Manifest Files
Manifest Lists
Metadata Files
Puffin Files
The Catalog
Conclusion
Chapter 3. Lifecycle of Write and Read Queries
Writing Queries in Apache Iceberg
Create the Table
Insert the Query
Merge Query
Reading Queries in Apache Iceberg
The SELECT Query
The Time-Travel Query
Conclusion
Chapter 4. Optimizing the Performance of Iceberg Tables
Compaction
Hands-on with Compaction
Compaction Strategies
Automating Compaction
Sorting
Z-order
Partitioning
Hidden Partitioning
Partition Evolution
Other Partitioning Considerations
Copy-on-Write Versus Merge-on-Read
Copy-on-Write
Merge-on-Read
Configuring COW and MOR
Other Considerations
Metrics Collection
Rewriting Manifests
Optimizing Storage
Write Distribution Mode
Object Storage Considerations
Datafile Bloom Filters
Conclusion
Chapter 5. Iceberg Catalogs
Requirements of an Iceberg Catalog
Catalog Comparison
The Hadoop Catalog
The Hive Catalog
The AWS Glue Catalog
The Nessie Catalog
The REST Catalog
The JDBC Catalog
Other Catalogs
Catalog Migration
Using the Apache Iceberg Catalog Migration CLI
Using an Engine
Conclusion
Part II. Hands-on with Apache Iceberg
Chapter 6. Apache Spark
Configuration
Configuring Apache Iceberg and Spark
Configuring the Catalogs
Starting Spark with All the Configurations (AWS Glue Example)
Data Definition Language Operations
CREATE TABLE
ALTER TABLE
Alter a Table with Iceberg's Spark SQL Extensions
DROP TABLE
Reading Data
The Select All Query
The Filter Rows Query
Aggregation Queries
Using Window Functions
Writing Data
INSERT INTO
MERGE INTO
INSERT OVERWRITE
DELETE FROM
UPDATE
Iceberg Table Maintenance Procedures
Expire Snapshots
Rewrite Datafiles
Rewrite Manifests
Remove Orphan Files
Conclusion
Chapter 7. Dremio's SQL Query Engine
Configuration
Data Definition Language Operations
CREATE TABLE
ALTER TABLE
DROP TABLE
Reading Data
Using the SELECT Query
Filtering Rows
Using Aggregated Queries
Using Window Functions
Writing Data
INSERT INTO
COPY INTO
MERGE INTO
DELETE
UPDATE
Iceberg Table Maintenance
Expire Snapshots
Rewrite Datafiles
Rewrite Manifests
Conclusion
Chapter 8. AWS Glue
Configuration
Creating a Glue Database
Configuring the Glue ETL Job
Create a Table Using the Glue Data Catalog
Read the Table
Insert the Data
Conclusion
Chapter 9. Apache Flink
Configuration
Prerequisites
Start the Flink Cluster and Flink SQL Client
Data Definition Language Operations
CREATE CATALOG
CREATE DATABASE
CREATE TABLE
ALTER TABLE
DROP TABLE
Reading Data
Flink SQL Batch Read
Flink SQL Streaming Read
Metadata Table
Writing Data
INSERT INTO
INSERT OVERWRITE
UPSERT
Flink DataFrame and Table API with Apache Iceberg Tables
Prerequisites
Configuring the Flink Job
Starting the Cluster and Building the Package
Running the Job
Conclusion
Part III. Apache Iceberg in Practice
Chapter 10. Apache Iceberg in Production
Apache Iceberg Metadata Tables
The history Metadata Table
The metadata_log_entries Metadata Table
The snapshots Metadata Table
The files Metadata Table
The manifests Metadata Table
The partitions Metadata Table
The all_data_files Metadata Table
The all_manifests Metadata Table
The refs Metadata Table
The entries Metadata Table
Using the Metadata Tables in Conjunction
Isolation of Changes with Branches
Table Branching and Tagging
Catalog Branching and Tagging
Multitable Transactions
Rolling Back Changes
Rolling Back at the Table Level
Rolling Back at the Catalog Level
Conclusion
Chapter 11. Streaming with Apache Iceberg
Streaming with Spark
Streaming into Iceberg with Spark
Streaming from Iceberg with Spark
Streaming with Flink
Streaming into Iceberg with Flink
Example of Streaming into Iceberg with Flink
Streaming with Kafka Connect
The Iceberg Kafka Sink
Streaming with AWS
Conclusion
Chapter 12. Governance and Security
Securing Datafiles
Securing Files: Best Practices
Hadoop Distributed File System
Amazon Simple Storage Service
Azure Data Lake Storage
Google Cloud Storage
Securing and Governing at the Semantic Layer
Semantic Layer Best Practices
Dremio
Trino
Securing and Governing at the Catalog Level
Nessie
Tabular
AWS Glue and Lake Formation
Additional Security and Governance Considerations
Conclusion
Chapter 13. Migrating to Apache Iceberg
Migration Considerations
Three-Step In-Place Migration Plan
Four-Phase Shadow Migration Plan
Migrating Hive Tables to Apache Iceberg
The Snapshot Procedure
The Migrate Procedure
Migrating Delta Lake to Apache Iceberg
Migrating Apache Hudi to Apache Iceberg
Migrating Individual Files to Apache Iceberg
Using the add_files Procedure
Migrating from Delta Lake or Apache Hudi Without Preserving History
Migrating from Anywhere by Rewriting Data
Migrating Data to a New Iceberg Table
Migrating Data into an Existing Iceberg Table
Conclusion
Chapter 14. Real-World Use Cases of Apache Iceberg
Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg
WAP Using Iceberg's Branching Feature
Running BI Workloads on the Data Lake
Land the Raw Data into the Data Lake
Curate Virtual Data Marts/Data Products
Create a Reflection to Accelerate Our Dashboard
Connect Our View to Our BI Tool
Benefits of Running BI Workloads on the Data Lake
Implementing Change Data Capture with Apache Iceberg
Create Apache Iceberg Tables
Apply Updates from Operational Systems
Create the Change Log View to Capture Changes
Merge Changed Data in the Aggregated Table
Conclusion
Index
About the Authors
Colophon

Schweitzer Fachinformationen