
Hadoop: The Definitive Guide
Contents
- Cover
- Copyright
- Table of Contents
- Foreword
- Preface
- Administrative Notes
- What's New in the Fourth Edition?
- What's New in the Third Edition?
- What's New in the Second Edition?
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Part I. Hadoop Fundamentals
- Chapter 1. Meet Hadoop
- Data!
- Data Storage and Analysis
- Querying All Your Data
- Beyond Batch
- Comparison with Other Systems
- Relational Database Management Systems
- Grid Computing
- Volunteer Computing
- A Brief History of Apache Hadoop
- What's in This Book?
- Chapter 2. MapReduce
- A Weather Dataset
- Data Format
- Analyzing the Data with Unix Tools
- Analyzing the Data with Hadoop
- Map and Reduce
- Java MapReduce
- Scaling Out
- Data Flow
- Combiner Functions
- Running a Distributed MapReduce Job
- Hadoop Streaming
- Ruby
- Python
- Chapter 3. The Hadoop Distributed Filesystem
- The Design of HDFS
- HDFS Concepts
- Blocks
- Namenodes and Datanodes
- Block Caching
- HDFS Federation
- HDFS High Availability
- The Command-Line Interface
- Basic Filesystem Operations
- Hadoop Filesystems
- Interfaces
- The Java Interface
- Reading Data from a Hadoop URL
- Reading Data Using the FileSystem API
- Writing Data
- Directories
- Querying the Filesystem
- Deleting Data
- Data Flow
- Anatomy of a File Read
- Anatomy of a File Write
- Coherency Model
- Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Chapter 4. YARN
- Anatomy of a YARN Application Run
- Resource Requests
- Application Lifespan
- Building YARN Applications
- YARN Compared to MapReduce 1
- Scheduling in YARN
- Scheduler Options
- Capacity Scheduler Configuration
- Fair Scheduler Configuration
- Delay Scheduling
- Dominant Resource Fairness
- Further Reading
- Chapter 5. Hadoop I/O
- Data Integrity
- Data Integrity in HDFS
- LocalFileSystem
- ChecksumFileSystem
- Compression
- Codecs
- Compression and Input Splits
- Using Compression in MapReduce
- Serialization
- The Writable Interface
- Writable Classes
- Implementing a Custom Writable
- Serialization Frameworks
- File-Based Data Structures
- SequenceFile
- MapFile
- Other File Formats and Column-Oriented Formats
- Part II. MapReduce
- Chapter 6. Developing a MapReduce Application
- The Configuration API
- Combining Resources
- Variable Expansion
- Setting Up the Development Environment
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Writing a Unit Test with MRUnit
- Mapper
- Reducer
- Running Locally on Test Data
- Running a Job in a Local Job Runner
- Testing the Driver
- Running on a Cluster
- Packaging a Job
- Launching a Job
- The MapReduce Web UI
- Retrieving the Results
- Debugging a Job
- Hadoop Logs
- Remote Debugging
- Tuning a Job
- Profiling Tasks
- MapReduce Workflows
- Decomposing a Problem into MapReduce Jobs
- JobControl
- Apache Oozie
- Chapter 7. How MapReduce Works
- Anatomy of a MapReduce Job Run
- Job Submission
- Job Initialization
- Task Assignment
- Task Execution
- Progress and Status Updates
- Job Completion
- Failures
- Task Failure
- Application Master Failure
- Node Manager Failure
- Resource Manager Failure
- Shuffle and Sort
- The Map Side
- The Reduce Side
- Configuration Tuning
- Task Execution
- The Task Execution Environment
- Speculative Execution
- Output Committers
- Chapter 8. MapReduce Types and Formats
- MapReduce Types
- The Default MapReduce Job
- Input Formats
- Input Splits and Records
- Text Input
- Binary Input
- Multiple Inputs
- Database Input (and Output)
- Output Formats
- Text Output
- Binary Output
- Multiple Outputs
- Lazy Output
- Database Output
- Chapter 9. MapReduce Features
- Counters
- Built-in Counters
- User-Defined Java Counters
- User-Defined Streaming Counters
- Sorting
- Preparation
- Partial Sort
- Total Sort
- Secondary Sort
- Joins
- Map-Side Joins
- Reduce-Side Joins
- Side Data Distribution
- Using the Job Configuration
- Distributed Cache
- MapReduce Library Classes
- Part III. Hadoop Operations
- Chapter 10. Setting Up a Hadoop Cluster
- Cluster Specification
- Cluster Sizing
- Network Topology
- Cluster Setup and Installation
- Installing Java
- Creating Unix User Accounts
- Installing Hadoop
- Configuring SSH
- Configuring Hadoop
- Formatting the HDFS Filesystem
- Starting and Stopping the Daemons
- Creating User Directories
- Hadoop Configuration
- Configuration Management
- Environment Settings
- Important Hadoop Daemon Properties
- Hadoop Daemon Addresses and Ports
- Other Hadoop Properties
- Security
- Kerberos and Hadoop
- Delegation Tokens
- Other Security Enhancements
- Benchmarking a Hadoop Cluster
- Hadoop Benchmarks
- User Jobs
- Chapter 11. Administering Hadoop
- HDFS
- Persistent Data Structures
- Safe Mode
- Audit Logging
- Tools
- Monitoring
- Logging
- Metrics and JMX
- Maintenance
- Routine Administration Procedures
- Commissioning and Decommissioning Nodes
- Upgrades
- Part IV. Related Projects
- Chapter 12. Avro
- Avro Data Types and Schemas
- In-Memory Serialization and Deserialization
- The Specific API
- Avro Datafiles
- Interoperability
- Python API
- Avro Tools
- Schema Resolution
- Sort Order
- Avro MapReduce
- Sorting Using Avro MapReduce
- Avro in Other Languages
- Chapter 13. Parquet
- Data Model
- Nested Encoding
- Parquet File Format
- Parquet Configuration
- Writing and Reading Parquet Files
- Avro, Protocol Buffers, and Thrift
- Parquet MapReduce
- Chapter 14. Flume
- Installing Flume
- An Example
- Transactions and Reliability
- Batching
- The HDFS Sink
- Partitioning and Interceptors
- File Formats
- Fan Out
- Delivery Guarantees
- Replicating and Multiplexing Selectors
- Distribution: Agent Tiers
- Delivery Guarantees
- Sink Groups
- Integrating Flume with Applications
- Component Catalog
- Further Reading
- Chapter 15. Sqoop
- Getting Sqoop
- Sqoop Connectors
- A Sample Import
- Text and Binary File Formats
- Generated Code
- Additional Serialization Systems
- Imports: A Deeper Look
- Controlling the Import
- Imports and Consistency
- Incremental Imports
- Direct-Mode Imports
- Working with Imported Data
- Imported Data and Hive
- Importing Large Objects
- Performing an Export
- Exports: A Deeper Look
- Exports and Transactionality
- Exports and SequenceFiles
- Further Reading
- Chapter 16. Pig
- Installing and Running Pig
- Execution Types
- Running Pig Programs
- Grunt
- Pig Latin Editors
- An Example
- Generating Examples
- Comparison with Databases
- Pig Latin
- Structure
- Statements
- Expressions
- Types
- Schemas
- Functions
- Macros
- User-Defined Functions
- A Filter UDF
- An Eval UDF
- A Load UDF
- Data Processing Operators
- Loading and Storing Data
- Filtering Data
- Grouping and Joining Data
- Sorting Data
- Combining and Splitting Data
- Pig in Practice
- Parallelism
- Anonymous Relations
- Parameter Substitution
- Further Reading
- Chapter 17. Hive
- Installing Hive
- The Hive Shell
- An Example
- Running Hive
- Configuring Hive
- Hive Services
- The Metastore
- Comparison with Traditional Databases
- Schema on Read Versus Schema on Write
- Updates, Transactions, and Indexes
- SQL-on-Hadoop Alternatives
- HiveQL
- Data Types
- Operators and Functions
- Tables
- Managed Tables and External Tables
- Partitions and Buckets
- Storage Formats
- Importing Data
- Altering Tables
- Dropping Tables
- Querying Data
- Sorting and Aggregating
- MapReduce Scripts
- Joins
- Subqueries
- Views
- User-Defined Functions
- Writing a UDF
- Writing a UDAF
- Further Reading
- Chapter 18. Crunch
- An Example
- The Core Crunch API
- Primitive Operations
- Types
- Sources and Targets
- Functions
- Materialization
- Pipeline Execution
- Running a Pipeline
- Stopping a Pipeline
- Inspecting a Crunch Plan
- Iterative Algorithms
- Checkpointing a Pipeline
- Crunch Libraries
- Further Reading
- Chapter 19. Spark
- Installing Spark
- An Example
- Spark Applications, Jobs, Stages, and Tasks
- A Scala Standalone Application
- A Java Example
- A Python Example
- Resilient Distributed Datasets
- Creation
- Transformations and Actions
- Persistence
- Serialization
- Shared Variables
- Broadcast Variables
- Accumulators
- Anatomy of a Spark Job Run
- Job Submission
- DAG Construction
- Task Scheduling
- Task Execution
- Executors and Cluster Managers
- Spark on YARN
- Further Reading
- Chapter 20. HBase
- HBasics
- Backdrop
- Concepts
- Whirlwind Tour of the Data Model
- Implementation
- Installation
- Test Drive
- Clients
- Java
- MapReduce
- REST and Thrift
- Building an Online Query Application
- Schema Design
- Loading Data
- Online Queries
- HBase Versus RDBMS
- Successful Service
- HBase
- Praxis
- HDFS
- UI
- Metrics
- Counters
- Further Reading
- Chapter 21. ZooKeeper
- Installing and Running ZooKeeper
- An Example
- Group Membership in ZooKeeper
- Creating the Group
- Joining a Group
- Listing Members in a Group
- Deleting a Group
- The ZooKeeper Service
- Data Model
- Operations
- Implementation
- Consistency
- Sessions
- States
- Building Applications with ZooKeeper
- A Configuration Service
- The Resilient ZooKeeper Application
- A Lock Service
- More Distributed Data Structures and Protocols
- ZooKeeper in Production
- Resilience and Performance
- Configuration
- Further Reading
- Part V. Case Studies
- Chapter 22. Composable Data at Cerner
- From CPUs to Semantic Integration
- Enter Apache Crunch
- Building a Complete Picture
- Integrating Healthcare Data
- Composability over Frameworks
- Moving Forward
- Chapter 23. Biological Data Science: Saving Lives with Software
- The Structure of DNA
- The Genetic Code: Turning DNA Letters into Proteins
- Thinking of DNA as Source Code
- The Human Genome Project and Reference Genomes
- Sequencing and Aligning DNA
- ADAM, A Scalable Genome Analysis Platform
- Literate programming with the Avro interface description language (IDL)
- Column-oriented access with Parquet
- A simple example: k-mer counting using Spark and ADAM
- From Personalized Ads to Personalized Medicine
- Join In
- Chapter 24. Cascading
- Fields, Tuples, and Pipes
- Operations
- Taps, Schemes, and Flows
- Cascading in Practice
- Flexibility
- Hadoop and Cascading at ShareThis
- Summary
- Appendix A. Installing Apache Hadoop
- Prerequisites
- Installation
- Configuration
- Standalone Mode
- Pseudodistributed Mode
- Fully Distributed Mode
- Appendix B. Cloudera's Distribution Including Apache Hadoop
- Appendix C. Preparing the NCDC Weather Data
- Appendix D. The Old and New Java MapReduce APIs
- Index
- About the Author
System Requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many others (not Kindle)
The PDF format displays each book page identically on any hardware. PDF is therefore well suited to complex layouts such as those used in textbooks and technical books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, PDFs are unfortunately rather tedious to read because they require a lot of scrolling.
This title uses "hard" copy protection in the form of Adobe DRM. If the necessary prerequisites are not met, you will not be able to open the e-book, so you need to set up your reading hardware before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID after installing it.