
Pro Apache Hadoop
Content
- Intro
- Contents at a Glance
- Contents
- About the Authors
- About the Technical Reviewer
- Acknowledgments
- Introduction
- Chapter 1: Motivation for Big Data
- What Is Big Data?
- Key Idea Behind Big Data Techniques
- Data Is Distributed Across Several Nodes
- Applications Are Moved to the Data
- Data Is Processed Local to a Node
- Sequential Reads Preferred Over Random Reads
- An Example
- Big Data Programming Models
- Massively Parallel Processing (MPP) Database Systems
- In-Memory Database Systems
- MapReduce Systems
- Bulk Synchronous Parallel (BSP) Systems
- Big Data and Transactional Systems
- How Much Can We Scale?
- A Compute-Intensive Example
- Amdahl's Law
- Business Use-Cases for Big Data
- Summary
- Chapter 2: Hadoop Concepts
- Introducing Hadoop
- Introducing the MapReduce Model
- Components of Hadoop
- Hadoop Distributed File System (HDFS)
- Block Storage Nature of Hadoop Files
- File Metadata and NameNode
- Mechanics of an HDFS Write
- Mechanics of an HDFS Read
- Mechanics of an HDFS Delete
- Ensuring HDFS Reliability
- Secondary NameNode
- TaskTracker
- JobTracker
- Hadoop 2.0
- Components of YARN
- Container
- Node Manager
- Resource Manager
- Application Master
- Anatomy of a YARN Request
- HDFS High Availability
- Summary
- Chapter 3: Getting Started with the Hadoop Framework
- Types of Installation
- Stand-Alone Mode
- Pseudo-Distributed Cluster
- Multinode Cluster Installation
- Preinstalled Using Amazon Elastic MapReduce
- Setting up a Development Environment with a Cloudera Virtual Machine
- Components of a MapReduce program
- Your First Hadoop Program
- Prerequisites to Run Programs in Local Mode
- WordCount Using the Old API
- Building the Application
- Running WordCount in Cluster Mode
- WordCount Using the New API
- Building the Application
- Running WordCount in Cluster Mode
- Third-Party Libraries in Hadoop Jobs
- Summary
- Chapter 4: Hadoop Administration
- Hadoop Configuration Files
- Configuring Hadoop Daemons
- Precedence of Hadoop Configuration Files
- Diving into Hadoop Configuration Files
- core-site.xml
- hdfs-*.xml
- mapred-site.xml
- yarn-site.xml
- Memory Allocations in YARN
- Scheduler
- Capacity Scheduler
- Configuring Capacity Guarantees Across Organizational Groups
- Enforcing Capacity Limits
- Enforcing Access Control Limits
- Updating Capacity Limit changes in the Cluster
- Capacity Scheduler Scenario
- Fair Scheduler
- Fair Scheduler Configuration
- yarn-site.xml Configurations
- Allocation File Format and Configurations
- Determine Dominant Resource Share in drf Policy
- Slaves File
- Rack Awareness
- Providing Hadoop with Network Topology
- Cluster Administration Utilities
- Check the HDFS
- Command-Line HDFS Administration
- HDFS Cluster Health Report
- Add/Remove Nodes
- Placing the HDFS in Safemode
- Rebalancing HDFS Data
- Copying Large Amounts of Data from the HDFS
- Summary
- Chapter 5: Basics of MapReduce Development
- Hadoop and Data Processing
- Reviewing the Airline Dataset
- Preparing the Development Environment
- Preparing the Hadoop System
- MapReduce Programming Patterns
- Map-Only Jobs (SELECT and WHERE Queries)
- Problem Definition: SELECT Clause
- run( ) method
- SelectClauseMapper Class
- Running the SELECT Clause Job in the Development Environment
- Running the SELECT Clause Job on the Cluster
- View SELECT Clause Job Results
- Recapping Key Hadoop Features Explored with SELECT
- Problem Definition: WHERE Clause
- WhereClauseMapper Class
- Running the WHERE Clause Job on the Cluster
- GenericOptionsParser Revisited
- Key Hadoop Features Explored with the WHERE Clause
- Map and Reduce Jobs (Aggregation Queries)
- Problem Definition: GROUP BY and SUM Clauses
- run( ) method
- AggregationMapper Class
- Reduce Phase
- Running the GROUP BY and SUM Clause Job in the Cluster
- Results for the Entire Dataset
- Recapping the Key Hadoop Features Explored with GROUP BY and SUM
- Improving Aggregation Performance Using the Combiner
- Problem Definition: Optimized Aggregators
- AggregationMapper Class from the AggregationWithCombinerMRJob Class
- run( ) Method
- Running the Combiner-Based Aggregation Job in the Cluster
- Key Aggregation Hadoop Features Explored with Aggregation
- Role of the Partitioner
- Problem Definition: Split Airline Data by Month
- run( ) Method
- Partitioner
- SplitByMonthMapper Class
- SortByMonthAndDayOfWeekReducer Class
- MonthPartitioner Class
- Run the Partitioner Job in the Cluster
- Key Hadoop Features Explored with Partitioner
- Bringing it All Together
- Summary
- Chapter 6: Advanced MapReduce Development
- MapReduce Programming Patterns
- Introduction to Hadoop I/O
- Writable and WritableComparable Interfaces
- Problem Definition: Sorting
- Primary Challenge: Total Order Sorting
- Implementing a Custom Key Class: MonthDoWWritable
- Implementing a Custom Value Class: DelaysWritable
- Sorting MapReduce Program
- run() method
- SortAscMonthDescWeekMapper Class
- MonthDoWPartitioner Class
- SortAscMonthDescWeekReducer Class
- Run the Sorting Job on the Cluster
- Sorting with Writable-Only Keys
- Recapping the Key Hadoop Features Explored with Sorting
- Problem Definition: Analyzing Consecutive Records
- Key Components Supporting Secondary Sort
- Custom Key Class: ArrivalFlightKey
- Custom Partitioner: ArrivalFlightKeyBasedPartioner
- Sorting Comparator Class: ArrivalFlightKeySortingComparator
- Grouping Comparator Class: ArrivalFlightKeyGroupingComparator
- Mapper Class: AnalyzeConsecutiveDelaysMapper
- Reducer Class: AnalyzeConsecutiveDelaysReducer
- run( ) Method
- Implementing Secondary Sort without a Grouping Comparator
- Running the Secondary Sort Job on the Cluster
- Recapping the Key Hadoop Features Explored with Secondary Sort
- Problem Definition: Join Using MapReduce
- Handling Multiple Inputs: MultipleInputs Class
- Mapper Classes for Multiple Inputs
- Custom Partitioner: CarrierCodeBasedPartioner
- Implementing the Join in the Reducer
- Sorting Comparator Class: CarrierSortComparator
- Grouping Comparator Class: CarrierGroupComparator
- Reducer Class: JoinReducer
- Running the MapReduce Join Job on the Cluster
- Key Hadoop Features Explored with MapReduce
- Problem Definition: Join Using Map-Only jobs
- DistributedCache-Based Solution
- run( ) method
- MapSideJoinMapper Class
- Running the Map-Only Join Job on the Cluster
- Recapping the Key Hadoop Features Explored with Map-Only Join
- Writing to Multiple Output Files in a Single MR Job
- Collecting Statistics Using Counters
- Summary
- Chapter 7: Hadoop Input/Output
- Compression Schemes
- What Can Be Compressed?
- Compression Schemes
- Enabling Compression
- Inside the Hadoop I/O processes
- InputFormat
- Anatomy of InputSplit
- Anatomy of InputFormat
- TextInputFormat
- OutputFormat
- TextOutputFormat
- Custom OutputFormat: Conversion from Text to XML
- Run the Text-to-XML Job on the Cluster
- Custom InputFormat: Consuming a Custom XML file
- Anatomy of the org.apache.hadoop.mapreduce.Mapper.run() method
- CompositeInputFormat and Large Joins
- How Map-Side Join Works with Sorted Datasets
- Hadoop Files
- SequenceFile
- SequenceFileOutputFormat: Creating a SequenceFile
- SequenceFileInputFormat: Reading from a SequenceFile
- Compression and SequenceFiles
- Sequence File Header
- Sync Marker and Splittable Nature of SequenceFiles
- RECORD Compression
- BLOCK Compression
- MapFiles
- Map Files and Distributed Cache
- Avro Files
- FlightDelay.avsc: Avro Schema File
- Job to Convert from Text Format to Avro Format
- Job to Convert from Avro Format to Text Format
- Summary
- Chapter 8: Testing Hadoop Programs
- Revisiting the Word Counter
- Introducing MRUnit
- Installing MRUnit
- MRUnit Core Classes
- Writing an MRUnit Test Case
- Testing Counters
- Features of MRUnit
- Limitations of MRUnit
- Testing with LocalJobRunner
- setUp( ) method
- Limitations of LocalJobRunner
- Testing with MiniMRCluster
- Setting up the Development Environment
- Example for MiniMRCluster
- Limitations of MiniMRCluster
- Testing MR Jobs with Access Network Resources
- Summary
- Chapter 9: Monitoring Hadoop
- Writing Log Messages in Hadoop MapReduce Jobs
- Viewing Log Messages in Hadoop MapReduce Jobs
- User Log Management in Hadoop 2.x
- Log Storage in Hadoop 2.x
- Log Management Improvements
- Viewing Logs Using Web-Based UI
- Command-Line Interface
- Log Retention
- Hadoop Cluster Performance Monitoring
- Using YARN REST APIs
- Managing the Hadoop Cluster Using Vendor Tools
- Ambari Architecture
- Summary
- Chapter 10: Data Warehousing Using Hadoop
- Apache Hive
- Installing Hive
- Hive Architecture
- Metastore
- Compiler Basics
- Hive Concepts
- Databases
- Tables
- Views
- Partitions
- Buckets
- Indexes
- Serializer/Deserializer interface
- HiveQL Compiler Details
- Data Definition Language
- Data Manipulation Language
- Language Limitations
- External Interfaces
- CLI
- Beeline
- JDBC
- Hive Scripts
- Performance
- MapReduce Integration
- Reading from Hive External Tables
- Writing to Hive External Tables
- Creating Partitions
- User-Defined Functions
- User-Defined Aggregate Functions and Table Functions
- Impala
- Impala Architecture
- Impala Features
- Impala Limitations
- Shark
- Shark/Spark Architecture
- Summary
- Chapter 11: Data Processing Using Pig
- An Introduction to Pig
- Running Pig
- Executing in the Grunt Shell
- Executing a Pig Script
- Embedded Java Program
- Pig Latin
- Comments in a Pig Script
- Execution of Pig Statements
- Pig Commands
- Loading and Storing
- Diagnostic Functions
- Relational Functions
- Functions
- Macro Functions
- The SPLIT Function
- User-Defined Functions
- Eval Functions Invoked in the Mapper
- Eval Functions Invoked in the Reducer
- Aggregation Functions Using the EvalFunc.exec( ) Method
- Aggregation Functions Using the Accumulator Interface
- Aggregation Functions Using the Algebraic Interface
- How Does Pig Decide Which Interface to Use?
- Writing and Using a Custom FilterFunc
- Comparison of Pig versus Hive
- Crunch API
- How Crunch Differs from Pig
- Sample Crunch Pipeline
- Consuming Input Files in Crunch
- Supporting Various InputFormat Types
- Tokenizing String Instances
- Writing to the Output Folder
- Supporting Various OutputFormat Types
- Executing the Pipeline
- Summary
- Chapter 12: HCatalog and Hadoop in the Enterprise
- HCatalog and Enterprise Data Warehouse Users
- HCatalog: A Brief Technical Background
- HCatalog Command-Line Interface
- WebHCat
- HCatalog Interface for MapReduce
- HCatalog Interface for Pig
- HCatalog Notification Interface
- Security and Authorization in HCatalog
- Bringing It All Together
- Summary
- Chapter 13: Log Analysis Using Hadoop
- Log File Analysis Applications
- Web Analytics
- Security Compliance and Forensics
- Monitoring and Alerts
- Internet of Things
- Analysis Steps
- Load
- Refine
- Visualize
- Apache Flume
- Core Concepts
- Netflix Suro
- Cloud Solutions
- Summary
- Chapter 14: Building Real-Time Systems Using HBase
- What Is HBase?
- Typical HBase Use-Case Scenarios
- HBase Data Model
- HBase Logical or Client-Side View
- Differences Between HBase and RDBMSs
- RDBMSs
- HBase (or NoSQL Database)
- HBase Tables
- HBase Cells
- HBase Column Family
- HBase Commands and APIs
- Getting a Command List: help Command
- Creating a Table: create Command
- Adding Rows to a Table: put Command
- Retrieving Rows from the Table: get Command
- Reading Multiple Rows: scan Command
- Counting the Rows in the Table: count Command
- Deleting Rows: delete Command
- Truncating a Table: truncate Command
- Dropping a Table: drop Command
- Altering a Table: alter Command
- HBase Architecture
- HBase Components
- ZooKeeper
- HBase Master
- Region Server
- Write Ahead Log (WAL)
- Block Cache
- Region
- MemStore
- HFile
- Catalog Tables in the HBase Master
- Bloom Filters
- Compaction and Splits in HBase
- Region Splits
- Compaction
- HBase Minor Compaction
- HBase Major Compaction
- HBase Configuration: An Overview
- hbase-default.xml and hbase-site.xml
- HBase Application Design
- Tall vs. Wide vs. Narrow Table Design
- Row Key Design
- HBase Operations Using Java API
- HBase Treats Everything as Bytes
- Create an HBase Table
- Administrative Functions Using HBaseAdmin
- Accessing Data Using the Java API
- Get
- Put
- Delete
- Scan
- HBase MapReduce Integration
- A MapReduce Job to Read an HBase Table
- HBase and MapReduce Clusters
- Scenario I: Frequent MapReduce Jobs Against HBase Tables
- Scenario II: HBase and MapReduce have Independent SLAs
- Summary
- Chapter 15: Data Science with Hadoop
- Hadoop Data Science Methods
- Apache Hama
- Bulk Synchronous Parallel Model
- Hama Hello World!
- Monte Carlo Methods
- K-Means Clustering
- Apache Spark
- Resilient Distributed Datasets (RDDs)
- Monte Carlo with Spark
- KMeans with Spark
- RHadoop
- Summary
- Chapter 16: Hadoop in the Cloud
- Economics
- Self-Hosted Cluster
- Cloud-Hosted Cluster
- Elasticity
- On Demand
- Bid Pricing
- Hybrid Cloud
- Logistics
- Ingress/Egress
- Data Retention
- Security
- Cloud Usage Models
- Cloud Providers
- Amazon Web Services
- Simple Storage Service
- Elastic MapReduce
- Elastic Compute Cloud
- Google Cloud Platform
- Microsoft Azure
- Choosing a Cloud Vendor
- Case Study: Amazon Web Services
- Elastic MapReduce
- Elastic Compute Cloud
- Summary
- Chapter 17: Building a YARN Application
- YARN: A General-Purpose Distributed System
- YARN: A Quick Review
- Creating a YARN Application
- POM Configuration
- DownloadService.java Class
- Client.java
- Steps to Launch the Application Master from the Client
- Creating the YarnClient
- Configuring the Application
- Launching the Application Master
- Monitoring the Application
- ApplicationMaster.java
- Communication Protocol between Application Master and Resource Manager: Application Master Protocol
- Node Manager Communication Protocol: Container Management Protocol
- Steps to Launch the Worker Tasks
- Initializing the Application Master Protocol and the Container Management Protocol
- Registering Application Master with the Resource Manager
- Configuring Container Specifications
- Requesting Containers from the Resource Manager
- Launching Containers on the Task Nodes
- Waiting for Containers to Finish the Worker Tasks
- Unregistering Application Master from the Resource Manager
- Executing the Application Master
- Launch the Application in Un-Managed Mode
- Launch the Application in Managed Mode
- Summary
- Appendix A: Installing Hadoop
- Installing Hadoop 2.2.0 on Windows
- Preparing the Installation Environment
- Building Hadoop 2.2.0 for Windows
- Installing Hadoop 2.2.0 for Windows
- Configuring Hadoop 2.2.0
- core-site.xml configuration
- hdfs-site.xml configuration
- yarn-site.xml configuration
- mapred-site.xml configuration
- Preparing the Hadoop Cluster
- Starting HDFS
- Starting MapReduce (YARN)
- Verifying that the Cluster Is Running
- Testing the Cluster
- Installing Hadoop 2.2.0 on Linux
- Appendix B: Using Maven with Eclipse
- A Quick Introduction to Maven
- Creating a Maven Project
- Using Maven with Eclipse
- Installing the m2e Maven Eclipse Plug-in
- Creating a Maven Project from Eclipse
- Building a Maven Project from Eclipse
- Appendix C: Apache Ambari
- Hadoop Components Supported by Apache Ambari
- Installing Apache Ambari
- Trying the Ambari Sandbox on Your OS
- Index
System requirements
File format: PDF
Copy protection: Watermark-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many more (Kindle: limited support only).
The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Watermark-DRM, a "soft" copy protection. This means that there are no technical restrictions to prevent illegal distribution. However, there is a personalised watermark embedded in the eBook that can be used to identify the purchaser of the eBook in the event of misuse and to provide evidence for legal purposes.
For more information, see our eBook Help page.