
Hadoop Application Architectures
Description
Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through the architectural considerations necessary to tie those components together into a complete, tailored application based on your particular use case.
To reinforce those lessons, the book's second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you're designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing
Contents
- Intro
- Copyright
- Table of Contents
- Foreword
- Preface
- A Note About the Code Examples
- Who Should Read This Book
- Why We Wrote This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Part I. Architectural Considerations for Hadoop Applications
- Chapter 1. Data Modeling in Hadoop
- Data Storage Options
- Standard File Formats
- Hadoop File Types
- Serialization Formats
- Columnar Formats
- Compression
- HDFS Schema Design
- Location of HDFS Files
- Advanced HDFS Schema Design
- HDFS Schema Design Summary
- HBase Schema Design
- Row Key
- Timestamp
- Hops
- Tables and Regions
- Using Columns
- Using Column Families
- Time-to-Live
- Managing Metadata
- What Is Metadata?
- Why Care About Metadata?
- Where to Store Metadata?
- Examples of Managing Metadata
- Limitations of the Hive Metastore and HCatalog
- Other Ways of Storing Metadata
- Conclusion
- Chapter 2. Data Movement
- Data Ingestion Considerations
- Timeliness of Data Ingestion
- Incremental Updates
- Access Patterns
- Original Source System and Data Structure
- Transformations
- Network Bottlenecks
- Network Security
- Push or Pull
- Failure Handling
- Level of Complexity
- Data Ingestion Options
- File Transfers
- Considerations for File Transfers versus Other Ingest Methods
- Sqoop: Batch Transfer Between Hadoop and Relational Databases
- Flume: Event-Based Data Collection and Processing
- Kafka
- Data Extraction
- Conclusion
- Chapter 3. Processing Data in Hadoop
- MapReduce
- MapReduce Overview
- Example for MapReduce
- When to Use MapReduce
- Spark
- Spark Overview
- Overview of Spark Components
- Basic Spark Concepts
- Benefits of Using Spark
- Spark Example
- When to Use Spark
- Abstractions
- Pig
- Pig Example
- When to Use Pig
- Crunch
- Crunch Example
- When to Use Crunch
- Cascading
- Cascading Example
- When to Use Cascading
- Hive
- Hive Overview
- Example of Hive Code
- When to Use Hive
- Impala
- Impala Overview
- Speed-Oriented Design
- Impala Example
- When to Use Impala
- Conclusion
- Chapter 4. Common Hadoop Processing Patterns
- Pattern: Removing Duplicate Records by Primary Key
- Data Generation for Deduplication Example
- Code Example: Spark Deduplication in Scala
- Code Example: Deduplication in SQL
- Pattern: Windowing Analysis
- Data Generation for Windowing Analysis Example
- Code Example: Peaks and Valleys in Spark
- Code Example: Peaks and Valleys in SQL
- Pattern: Time Series Modifications
- Use HBase and Versioning
- Use HBase with a RowKey of RecordKey and StartTime
- Use HDFS and Rewrite the Whole Table
- Use Partitions on HDFS for Current and Historical Records
- Data Generation for Time Series Example
- Code Example: Time Series in Spark
- Code Example: Time Series in SQL
- Conclusion
- Chapter 5. Graph Processing on Hadoop
- What Is a Graph?
- What Is Graph Processing?
- How Do You Process a Graph in a Distributed System?
- The Bulk Synchronous Parallel Model
- BSP by Example
- Giraph
- Read and Partition the Data
- Batch Process the Graph with BSP
- Write the Graph Back to Disk
- Putting It All Together
- When Should You Use Giraph?
- GraphX
- Just Another RDD
- GraphX Pregel Interface
- vprog()
- sendMessage()
- mergeMessage()
- Which Tool to Use?
- Conclusion
- Chapter 6. Orchestration
- Why We Need Workflow Orchestration
- The Limits of Scripting
- The Enterprise Job Scheduler and Hadoop
- Orchestration Frameworks in the Hadoop Ecosystem
- Oozie Terminology
- Oozie Overview
- Oozie Workflow
- Workflow Patterns
- Point-to-Point Workflow
- Fan-Out Workflow
- Capture-and-Decide Workflow
- Parameterizing Workflows
- Classpath Definition
- Scheduling Patterns
- Frequency Scheduling
- Time and Data Triggers
- Executing Workflows
- Conclusion
- Chapter 7. Near-Real-Time Processing with Hadoop
- Stream Processing
- Apache Storm
- Storm High-Level Architecture
- Storm Topologies
- Tuples and Streams
- Spouts and Bolts
- Stream Groupings
- Reliability of Storm Applications
- Exactly-Once Processing
- Fault Tolerance
- Integrating Storm with HDFS
- Integrating Storm with HBase
- Storm Example: Simple Moving Average
- Evaluating Storm
- Trident
- Trident Example: Simple Moving Average
- Evaluating Trident
- Spark Streaming
- Overview of Spark Streaming
- Spark Streaming Example: Simple Count
- Spark Streaming Example: Multiple Inputs
- Spark Streaming Example: Maintaining State
- Spark Streaming Example: Windowing
- Spark Streaming Example: Streaming versus ETL Code
- Evaluating Spark Streaming
- Flume Interceptors
- Which Tool to Use?
- Low-Latency Enrichment, Validation, Alerting, and Ingestion
- NRT Counting, Rolling Averages, and Iterative Processing
- Complex Data Pipelines
- Conclusion
- Part II. Case Studies
- Chapter 8. Clickstream Analysis
- Defining the Use Case
- Using Hadoop for Clickstream Analysis
- Design Overview
- Storage
- Ingestion
- The Client Tier
- The Collector Tier
- Processing
- Data Deduplication
- Sessionization
- Analyzing
- Orchestration
- Conclusion
- Chapter 9. Fraud Detection
- Continuous Improvement
- Taking Action
- Architectural Requirements of Fraud Detection Systems
- Introducing Our Use Case
- High-Level Design
- Client Architecture
- Profile Storage and Retrieval
- Caching
- HBase Data Definition
- Delivering Transaction Status: Approved or Denied?
- Ingest
- Path Between the Client and Flume
- Near-Real-Time and Exploratory Analytics
- Near-Real-Time Processing
- Exploratory Analytics
- What About Other Architectures?
- Flume Interceptors
- Kafka to Storm or Spark Streaming
- External Business Rules Engine
- Conclusion
- Chapter 10. Data Warehouse
- Using Hadoop for Data Warehousing
- Defining the Use Case
- OLTP Schema
- Data Warehouse: Introduction and Terminology
- Data Warehousing with Hadoop
- High-Level Design
- Data Modeling and Storage
- Ingestion
- Data Processing and Access
- Aggregations
- Data Export
- Orchestration
- Conclusion
- Appendix A. Joins in Impala
- Broadcast Joins
- Partitioned Hash Join
- Index
- About the Authors
System Requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see E-Book Help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see E-Book Help).
- E-book reader: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)
The PDF file format displays each book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, PDFs are unfortunately rather annoying because they require too much scrolling.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the e-book. You must therefore prepare your reading hardware before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID after installing it.
You can find more information in our E-Book Help.