
Hadoop: The Definitive Guide
Contents
- Intro
- Table of Contents
- Foreword
- Preface
- Administrative Notes
- What's in This Book?
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Chapter 1. Meet Hadoop
- Data!
- Data Storage and Analysis
- Comparison with Other Systems
- RDBMS
- Grid Computing
- Volunteer Computing
- A Brief History of Hadoop
- The Apache Hadoop Project
- Chapter 2. MapReduce
- A Weather Dataset
- Data Format
- Analyzing the Data with Unix Tools
- Analyzing the Data with Hadoop
- Map and Reduce
- Java MapReduce
- A test run
- The new Java MapReduce API
- Scaling Out
- Data Flow
- Combiner Functions
- Specifying a combiner function
- Running a Distributed MapReduce Job
- Hadoop Streaming
- Ruby
- Python
- Hadoop Pipes
- Compiling and Running
- Chapter 3. The Hadoop Distributed Filesystem
- The Design of HDFS
- HDFS Concepts
- Blocks
- Namenodes and Datanodes
- The Command-Line Interface
- Basic Filesystem Operations
- Hadoop Filesystems
- Interfaces
- Thrift
- C
- FUSE
- WebDAV
- Other HDFS Interfaces
- The Java Interface
- Reading Data from a Hadoop URL
- Reading Data Using the FileSystem API
- FSDataInputStream
- Writing Data
- FSDataOutputStream
- Directories
- Querying the Filesystem
- File metadata: FileStatus
- Listing files
- File patterns
- PathFilter
- Deleting Data
- Data Flow
- Anatomy of a File Read
- Anatomy of a File Write
- Coherency Model
- Consequences for application design
- Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
- Using Hadoop Archives
- Limitations
- Chapter 4. Hadoop I/O
- Data Integrity
- Data Integrity in HDFS
- LocalFileSystem
- ChecksumFileSystem
- Compression
- Codecs
- Compressing and decompressing streams with CompressionCodec
- Inferring CompressionCodecs using CompressionCodecFactory
- Native libraries
- Compression and Input Splits
- Using Compression in MapReduce
- Compressing map output
- Serialization
- The Writable Interface
- WritableComparable and comparators
- Writable Classes
- Writable wrappers for Java primitives
- Text
- BytesWritable
- NullWritable
- ObjectWritable and GenericWritable
- Writable collections
- Implementing a Custom Writable
- Implementing a RawComparator for speed
- Custom comparators
- Serialization Frameworks
- Serialization IDL
- File-Based Data Structures
- SequenceFile
- Writing a SequenceFile
- Reading a SequenceFile
- Displaying a SequenceFile with the command-line interface
- Sorting and merging SequenceFiles
- The SequenceFile Format
- MapFile
- Writing a MapFile
- Reading a MapFile
- Converting a SequenceFile to a MapFile
- Chapter 5. Developing a MapReduce Application
- The Configuration API
- Combining Resources
- Variable Expansion
- Configuring the Development Environment
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Writing a Unit Test
- Mapper
- Reducer
- Running Locally on Test Data
- Running a Job in a Local Job Runner
- Fixing the mapper
- Testing the Driver
- Running on a Cluster
- Packaging
- Launching a Job
- The MapReduce Web UI
- The jobtracker page
- The job page
- Retrieving the Results
- Debugging a Job
- The tasks page
- The task details page
- Handling malformed data
- Using a Remote Debugger
- Tuning a Job
- Profiling Tasks
- The HPROF profiler
- Other profilers
- MapReduce Workflows
- Decomposing a Problem into MapReduce Jobs
- Running Dependent Jobs
- Chapter 6. How MapReduce Works
- Anatomy of a MapReduce Job Run
- Job Submission
- Job Initialization
- Task Assignment
- Task Execution
- Streaming and Pipes
- Progress and Status Updates
- Job Completion
- Failures
- Task Failure
- Tasktracker Failure
- Jobtracker Failure
- Job Scheduling
- The Fair Scheduler
- Shuffle and Sort
- The Map Side
- The Reduce Side
- Configuration Tuning
- Task Execution
- Speculative Execution
- Task JVM Reuse
- Skipping Bad Records
- The Task Execution Environment
- Streaming environment variables
- Task side-effect files
- Chapter 7. MapReduce Types and Formats
- MapReduce Types
- The Default MapReduce Job
- The default Streaming job
- Keys and values in Streaming
- Input Formats
- Input Splits and Records
- FileInputFormat
- FileInputFormat input paths
- FileInputFormat input splits
- Small files and CombineFileInputFormat
- Preventing splitting
- File information in the mapper
- Processing a whole file as a record
- Text Input
- TextInputFormat
- KeyValueTextInputFormat
- NLineInputFormat
- XML
- Binary Input
- SequenceFileInputFormat
- SequenceFileAsTextInputFormat
- SequenceFileAsBinaryInputFormat
- Multiple Inputs
- Database Input (and Output)
- Output Formats
- Text Output
- Binary Output
- SequenceFileOutputFormat
- SequenceFileAsBinaryOutputFormat
- MapFileOutputFormat
- Multiple Outputs
- An example: Partitioning data
- MultipleOutputFormat
- MultipleOutputs
- Lazy Output
- Database Output
- Chapter 8. MapReduce Features
- Counters
- Built-in Counters
- User-Defined Java Counters
- Dynamic counters
- Readable counter names
- Retrieving counters
- User-Defined Streaming Counters
- Sorting
- Preparation
- Partial Sort
- An application: Partitioned MapFile lookups
- Total Sort
- Secondary Sort
- Java code
- Streaming
- Joins
- Map-Side Joins
- Reduce-Side Joins
- Side Data Distribution
- Using the Job Configuration
- Distributed Cache
- Usage
- How it works
- The DistributedCache API
- MapReduce Library Classes
- Chapter 9. Setting Up a Hadoop Cluster
- Cluster Specification
- Network Topology
- Rack awareness
- Cluster Setup and Installation
- Installing Java
- Creating a Hadoop User
- Installing Hadoop
- Testing the Installation
- SSH Configuration
- Hadoop Configuration
- Configuration Management
- Control scripts
- Master node scenarios
- Environment Settings
- Memory
- Java
- System logfiles
- SSH settings
- Important Hadoop Daemon Properties
- HDFS
- MapReduce
- Hadoop Daemon Addresses and Ports
- Other Hadoop Properties
- Cluster membership
- Service-level authorization
- Buffer size
- HDFS block size
- Reserved storage space
- Trash
- Task memory limits
- Job scheduler
- Post Install
- Benchmarking a Hadoop Cluster
- Hadoop Benchmarks
- Benchmarking HDFS with TestDFSIO
- Benchmarking MapReduce with Sort
- Other benchmarks
- User Jobs
- Hadoop in the Cloud
- Hadoop on Amazon EC2
- Setup
- Launching a cluster
- Running a MapReduce job
- Terminating a cluster
- Chapter 10. Administering Hadoop
- HDFS
- Persistent Data Structures
- Namenode directory structure
- The filesystem image and edit log
- Secondary namenode directory structure
- Datanode directory structure
- Safe Mode
- Entering and leaving safe mode
- Audit Logging
- Tools
- dfsadmin
- Filesystem check (fsck)
- Datanode block scanner
- balancer
- Monitoring
- Logging
- Setting log levels
- Getting stack traces
- Metrics
- FileContext
- GangliaContext
- NullContextWithUpdateThread
- CompositeContext
- Java Management Extensions
- Maintenance
- Routine Administration Procedures
- Metadata backups
- Data backups
- Filesystem check (fsck)
- Filesystem balancer
- Commissioning and Decommissioning Nodes
- Commissioning new nodes
- Decommissioning old nodes
- Upgrades
- HDFS data and metadata upgrades
- Chapter 11. Pig
- Installing and Running Pig
- Execution Types
- Local mode
- Hadoop mode
- Running Pig Programs
- Grunt
- Pig Latin Editors
- An Example
- Generating Examples
- Comparison with Databases
- Pig Latin
- Structure
- Statements
- Expressions
- Types
- Schemas
- Validation and nulls
- Schema merging
- Functions
- User-Defined Functions
- A Filter UDF
- Leveraging types
- An Eval UDF
- A Load UDF
- Using a schema
- Advanced loading with Slicer
- Data Processing Operators
- Loading and Storing Data
- Filtering Data
- FOREACH .. GENERATE
- STREAM
- Grouping and Joining Data
- JOIN
- COGROUP
- CROSS
- GROUP
- Sorting Data
- Combining and Splitting Data
- Pig in Practice
- Parallelism
- Parameter Substitution
- Dynamic parameters
- Parameter substitution processing
- Chapter 12. HBase
- HBasics
- Backdrop
- Concepts
- Whirlwind Tour of the Data Model
- Regions
- Locking
- Implementation
- HBase in operation
- Installation
- Test Drive
- Clients
- Java
- MapReduce
- REST and Thrift
- REST
- Thrift
- Example
- Schemas
- Loading Data
- Optimization notes
- Web Queries
- HBase Versus RDBMS
- Successful Service
- HBase
- Use Case: HBase at streamy.com
- Very large items tables
- Very large sort merges
- Life with HBase
- Praxis
- Versions
- Love and Hate: HBase and HDFS
- UI
- Metrics
- Schema Design
- Joins
- Row keys
- Chapter 13. ZooKeeper
- Installing and Running ZooKeeper
- An Example
- Group Membership in ZooKeeper
- Creating the Group
- Joining a Group
- Listing Members in a Group
- ZooKeeper command-line tools
- Deleting a Group
- The ZooKeeper Service
- Data Model
- Ephemeral znodes
- Sequence numbers
- Watches
- Operations
- APIs
- Watch triggers
- ACLs
- Implementation
- Consistency
- Sessions
- Time
- States
- Building Applications with ZooKeeper
- A Configuration Service
- The Resilient ZooKeeper Application
- InterruptedException
- KeeperException
- A reliable configuration service
- A Lock Service
- The herd effect
- Recoverable exceptions
- Unrecoverable exceptions
- Implementation
- More Distributed Data Structures and Protocols
- BookKeeper
- ZooKeeper in Production
- Resilience and Performance
- Configuration
- Chapter 14. Case Studies
- Hadoop Usage at Last.fm
- Last.fm: The Social Music Revolution
- Hadoop at Last.fm
- Generating Charts with Hadoop
- The Track Statistics Program
- Calculating the number of unique listeners
- Summing the track totals
- Merging the results
- Summary
- Hadoop and Hive at Facebook
- Introduction
- Hadoop at Facebook
- History
- Use cases
- Data architecture
- Hadoop configuration
- Hypothetical Use Case Studies
- Advertiser insights and performance
- Ad hoc analysis and product feedback
- Data analysis
- Hive
- Overview
- Data organization
- Query language
- Data pipelines using Hive
- Problems and Future Work
- Fair sharing
- Space management
- Scribe-HDFS integration
- Improvements to Hive
- Nutch Search Engine
- Background
- Data Structures
- CrawlDb
- LinkDb
- Segments
- Selected Examples of Hadoop Data Processing in Nutch
- Link inversion
- Generation of fetchlists
- Fetcher: A multi-threaded MapRunner in action
- Indexer: Using custom OutputFormat
- Summary
- Log Processing at Rackspace
- Requirements/The Problem
- Logs
- Brief History
- Choosing Hadoop
- Collection and Storage
- Log collection
- Log storage
- MapReduce for Logs
- Processing
- Merging for near-term search
- Archiving for analysis
- Cascading
- Fields, Tuples, and Pipes
- Operations
- Taps, Schemes, and Flows
- Cascading in Practice
- Flexibility
- Hadoop and Cascading at ShareThis
- Summary
- TeraByte Sort on Apache Hadoop
- Appendix A. Installing Apache Hadoop
- Prerequisites
- Installation
- Configuration
- Standalone Mode
- Pseudo-Distributed Mode
- Configuring SSH
- Formatting the HDFS filesystem
- Starting and stopping the daemons
- Fully Distributed Mode
- Appendix B. Cloudera's Distribution for Hadoop
- Prerequisites
- Standalone Mode
- Pseudo-Distributed Mode
- Fully Distributed Mode
- Hadoop-Related Packages
- Appendix C. Preparing the NCDC Weather Data
- Index
System Requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book readers: Bookeen, Kobo, PocketBook, Sony, Tolino, and many others (not Kindle)
The PDF format displays a book page identically on every device. PDF is therefore well suited to complex layouts such as those used in textbooks and technical books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, however, PDFs are tedious to read because they require too much scrolling.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so you must prepare your reading device before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID after installing it.