Pro Apache Hadoop

Name: Pro Apache Hadoop
Brand: APress
Price: 46.99 EUR
Availability: OnlineOnly

Jason Venner, Sameer Wadkar, Madhu Siddalingaiah (Authors)

APress

2nd Edition

Published on 18 September 2014

XXII, 444 pages

E-Book

PDF with digital watermarking

978-1-4302-4864-4 (ISBN)

€46.99 incl. 7% VAT

System requirements

for PDF with digital watermarking

E-Book Single Licence

Available for download

Content

Intro
Contents at a Glance
Contents
About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Motivation for Big Data
What Is Big Data?
Key Idea Behind Big Data Techniques
Data Is Distributed Across Several Nodes
Applications Are Moved to the Data
Data Is Processed Local to a Node
Sequential Reads Preferred Over Random Reads
An Example
Big Data Programming Models
Massively Parallel Processing (MPP) Database Systems
In-Memory Database Systems
MapReduce Systems
Bulk Synchronous Parallel (BSP) Systems
Big Data and Transactional Systems
How Much Can We Scale?
A Compute-Intensive Example
Amdahl's Law
Business Use-Cases for Big Data
Summary
Chapter 2: Hadoop Concepts
Introducing Hadoop
Introducing the MapReduce Model
Components of Hadoop
Hadoop Distributed File System (HDFS)
Block Storage Nature of Hadoop Files
File Metadata and NameNode
Mechanics of an HDFS Write
Mechanics of an HDFS Read
Mechanics of an HDFS Delete
Ensuring HDFS Reliability
Secondary NameNode
TaskTracker
JobTracker
Hadoop 2.0
Components of YARN
Container
Node Manager
Resource Manager
Application Master
Anatomy of a YARN Request
HDFS High Availability
Summary
Chapter 3: Getting Started with the Hadoop Framework
Types of Installation
Stand-Alone Mode
Pseudo-Distributed Cluster
Multinode Cluster Installation
Preinstalled Using Amazon Elastic MapReduce
Setting up a Development Environment with a Cloudera Virtual Machine
Components of a MapReduce program
Your First Hadoop Program
Prerequisites to Run Programs in Local Mode
WordCount Using the Old API
Building the Application
Running WordCount in Cluster Mode
WordCount Using the New API
Building the Application
Running WordCount in Cluster Mode
Third-Party Libraries in Hadoop Jobs
Summary
Chapter 4: Hadoop Administration
Hadoop Configuration Files
Configuring Hadoop Daemons
Precedence of Hadoop Configuration Files
Diving into Hadoop Configuration Files
core-site.xml
hdfs-*.xml
mapred-site.xml
yarn-site.xml
Memory Allocations in YARN
Scheduler
Capacity Scheduler
Configuring Capacity Guarantees Across Organizational Groups
Enforcing Capacity Limits
Enforcing Access Control Limits
Updating Capacity Limit changes in the Cluster
Capacity Scheduler Scenario
Fair Scheduler
Fair Scheduler Configuration
yarn-site.xml Configurations
Allocation File Format and Configurations
Determine Dominant Resource Share in drf Policy
Slaves File
Rack Awareness
Providing Hadoop with Network Topology
Cluster Administration Utilities
Check the HDFS
Command-Line HDFS Administration
HDFS Cluster Health Report
Add/Remove Nodes
Placing the HDFS in Safemode
Rebalancing HDFS Data
Copying Large Amounts of Data from the HDFS
Summary
Chapter 5: Basics of MapReduce Development
Hadoop and Data Processing
Reviewing the Airline Dataset
Preparing the Development Environment
Preparing the Hadoop System
MapReduce Programming Patterns
Map-Only Jobs (SELECT and WHERE Queries)
Problem Definition: SELECT Clause
run( ) method
SelectClauseMapper Class
Running the SELECT Clause Job in the Development Environment
Running the SELECT Clause Job on the Cluster
View SELECT Clause Job Results
Recapping Key Hadoop Features Explored with SELECT
Problem Definition: WHERE Clause
WhereClauseMapper Class
Running the WHERE Clause Job on the Cluster
GenericOptionsParser Revisited
Key Hadoop Features Explored with the WHERE Clause
Map and Reduce Jobs (Aggregation Queries)
Problem Definition: GROUP BY and SUM Clauses
run( ) method
AggregationMapper Class
Reduce Phase
Running the GROUP BY and SUM Clause Job in the Cluster
Results for the Entire Dataset
Recapping the Key Hadoop Features Explored with GROUP BY and SUM
Improving Aggregation Performance Using the Combiner
Problem Definition: Optimized Aggregators
AggregationMapper Class from the AggregationWithCombinerMRJob Class
run( ) Method
Running the Combiner-Based Aggregation Job in the Cluster
Key Aggregation Hadoop Features Explored with Aggregation
Role of the Partitioner
Problem Definition: Split Airline Data by Month
run( ) Method
Partitioner
SplitByMonthMapper Class
SortByMonthAndDayOfWeekReducer Class
MonthPartitioner Class
Run the Partitioner Job in the Cluster
Key Hadoop Features Explored with Partitioner
Bringing it All Together
Summary
Chapter 6: Advanced MapReduce Development
MapReduce Programming Patterns
Introduction to Hadoop I/O
Writable and WritableComparable Interfaces
Problem Definition: Sorting
Primary Challenge: Total Order Sorting
Implementing a Custom Key Class: MonthDoWWritable
Implementing a Custom Value Class: DelaysWritable
Sorting MapReduce Program
run() method
SortAscMonthDescWeekMapper Class
MonthDoWPartitioner Class
SortAscMonthDescWeekReducer Class
Run the Sorting Job on the Cluster
Sorting with Writable-Only Keys
Recapping the Key Hadoop Features Explored with Sorting
Problem Definition: Analyzing Consecutive Records
Key Components Supporting Secondary Sort
Custom Key Class: ArrivalFlightKey
Custom Partitioner: ArrivalFlightKeyBasedPartioner
Sorting Comparator Class: ArrivalFlightKeySortingComparator
Grouping Comparator Class: ArrivalFlightKeyGroupingComparator
Mapper Class: AnalyzeConsecutiveDelaysMapper
Reducer Class: AnalyzeConsecutiveDelaysReducer
run( ) Method
Implementing Secondary Sort without a Grouping Comparator
Running the Secondary Sort Job on the Cluster
Recapping the Key Hadoop Features Explored with Secondary Sort
Problem Definition: Join Using MapReduce
Handling Multiple Inputs: MultipleInputs Class
Mapper Classes for Multiple Inputs
Custom Partitioner: CarrierCodeBasedPartioner
Implementing the Join in the Reducer
Sorting Comparator Class: CarrierSortComparator
Grouping Comparator Class: CarrierGroupComparator
Reducer Class: JoinReducer
Running the MapReduce Join Job on the Cluster
Key Hadoop Features Explored with MapReduce
Problem Definition: Join Using Map-Only jobs
DistributedCache-Based Solution
run( ) method
MapSideJoinMapper Class
Running the Map-Only Join Job on the Cluster
Recapping the Key Hadoop Features Explored with Map-Only Join
Writing to Multiple Output Files in a Single MR Job
Collecting Statistics Using Counters
Summary
Chapter 7: Hadoop Input/Output
Compression Schemes
What Can Be Compressed?
Compression Schemes
Enabling Compression
Inside the Hadoop I/O processes
InputFormat
Anatomy of InputSplit
Anatomy of InputFormat
TextInputFormat
OutputFormat
TextOutputFormat
Custom OutputFormat: Conversion from Text to XML
Run the Text-to-XML Job on the Cluster
Custom InputFormat: Consuming a Custom XML file
Anatomy of the org.apache.hadoop.mapreduce.Mapper.run() method
CompositeInputFormat and Large Joins
How Map-Side Join Works with Sorted Datasets
Hadoop Files
SequenceFile
SequenceFileOutputFormat: Creating a SequenceFile
SequenceFileInputFormat: Reading from a SequenceFile
Compression and SequenceFiles
Sequence File Header
Sync Marker and Splittable Nature of SequenceFiles
RECORD Compression
BLOCK Compression
MapFiles
Map Files and Distributed Cache
Avro Files
FlightDelay.avsc: Avro Schema File
Job to Convert from Text Format to Avro Format
Job to Convert from Avro Format to Text Format
Summary
Chapter 8: Testing Hadoop Programs
Revisiting the Word Counter
Introducing MRUnit
Installing MRUnit
MRUnit Core Classes
Writing an MRUnit Test Case
Testing Counters
Features of MRUnit
Limitations of MRUnit
Testing with LocalJobRunner
setUp( ) method
Limitations of LocalJobRunner
Testing with MiniMRCluster
Setting up the Development Environment
Example for MiniMRCluster
Limitations of MiniMRCluster
Testing MR Jobs with Access Network Resources
Summary
Chapter 9: Monitoring Hadoop
Writing Log Messages in Hadoop MapReduce Jobs
Viewing Log Messages in Hadoop MapReduce Jobs
User Log Management in Hadoop 2.x
Log Storage in Hadoop 2.x
Log Management Improvements
Viewing Logs Using Web-Based UI
Command-Line Interface
Log Retention
Hadoop Cluster Performance Monitoring
Using YARN REST APIs
Managing the Hadoop Cluster Using Vendor Tools
Ambari Architecture
Summary
Chapter 10: Data Warehousing Using Hadoop
Apache Hive
Installing Hive
Hive Architecture
Metastore
Compiler Basics
Hive Concepts
Databases
Tables
Views
Partitions
Buckets
Indexes
Serializer/Deserializer interface
HiveQL Compiler Details
Data Definition Language
Data Manipulation Language
Language Limitations
External Interfaces
CLI
Beeline
JDBC
Hive Scripts
Performance
MapReduce Integration
Reading from Hive External Tables
Writing to Hive External Tables
Creating Partitions
User-Defined Functions
User-Defined Aggregate Functions and Table Functions
Impala
Impala Architecture
Impala Features
Impala Limitations
Shark
Shark/Spark Architecture
Summary
Chapter 11: Data Processing Using Pig
An Introduction to Pig
Running Pig
Executing in the Grunt Shell
Executing a Pig Script
Embedded Java Program
Pig Latin
Comments in a Pig Script
Execution of Pig Statements
Pig Commands
Loading and Storing
Diagnostic Functions
Relational Functions
Functions
Macro Functions
The SPLIT Function
User-Defined Functions
Eval Functions Invoked in the Mapper
Eval Functions Invoked in the Reducer
Aggregation Functions Using the EvalFunc.exec( ) Method
Aggregation Functions Using the Accumulator Interface
Aggregation Functions Using the Algebraic Interface
How Does Pig Decide Which Interface to Use?
Writing and Using a Custom FilterFunc
Comparison of Pig versus Hive
Crunch API
How Crunch Differs from Pig
Sample Crunch Pipeline
Consuming Input Files in Crunch
Supporting Various InputFormat Types
Tokenizing String Instances
Writing to the Output Folder
Supporting Various OutputFormat Types
Executing the Pipeline
Summary
Chapter 12: HCatalog and Hadoop in the Enterprise
HCatalog and Enterprise Data Warehouse Users
HCatalog: A Brief Technical Background
HCatalog Command-Line Interface
WebHCat
HCatalog Interface for MapReduce
HCatalog Interface for Pig
HCatalog Notification Interface
Security and Authorization in HCatalog
Bringing It All Together
Summary
Chapter 13: Log Analysis Using Hadoop
Log File Analysis Applications
Web Analytics
Security Compliance and Forensics
Monitoring and Alerts
Internet of Things
Analysis Steps
Load
Refine
Visualize
Apache Flume
Core Concepts
Netflix Suro
Cloud Solutions
Summary
Chapter 14: Building Real-Time Systems Using HBase
What Is HBase?
Typical HBase Use-Case Scenarios
HBase Data Model
HBase Logical or Client-Side View
Differences Between HBase and RDBMSs
RDBMSs
HBase (or NoSQL Database)
HBase Tables
HBase Cells
HBase Column Family
HBase Commands and APIs
Getting a Command List: help Command
Creating a Table: create Command
Adding Rows to a Table: put Command
Retrieving Rows from the Table: get Command
Reading Multiple Rows: scan Command
Counting the Rows in the Table: count Command
Deleting Rows: delete Command
Truncating a Table: truncate Command
Dropping a Table: drop Command
Altering a Table: alter Command
HBase Architecture
HBase Components
ZooKeeper
HBase Master
Region Server
Write Ahead Log (WAL)
Block Cache
Region
MemStore
HFile
Catalog Tables in the HBase Master
Bloom Filters
Compaction and Splits in HBase
Region Splits
Compaction
HBase Minor Compaction
HBase Major Compaction
HBase Configuration: An Overview
hbase-default.xml and hbase-site.xml
HBase Application Design
Tall vs. Wide vs. Narrow Table Design
Row Key Design
HBase Operations Using Java API
HBase Treats Everything as Bytes
Create an HBase Table
Administrative Functions Using HBaseAdmin
Accessing Data Using the Java API
Get
Put
Delete
Scan
HBase MapReduce Integration
A MapReduce Job to Read an HBase Table
HBase and MapReduce Clusters
Scenario I: Frequent MapReduce Jobs Against HBase Tables
Scenario II: HBase and MapReduce have Independent SLAs
Summary
Chapter 15: Data Science with Hadoop
Hadoop Data Science Methods
Apache Hama
Bulk Synchronous Parallel Model
Hama Hello World!
Monte Carlo Methods
K-Means Clustering
Apache Spark
Resilient Distributed Datasets (RDDs)
Monte Carlo with Spark
KMeans with Spark
RHadoop
Summary
Chapter 16: Hadoop in the Cloud
Economics
Self-Hosted Cluster
Cloud-Hosted Cluster
Elasticity
On Demand
Bid Pricing
Hybrid Cloud
Logistics
Ingress/Egress
Data Retention
Security
Cloud Usage Models
Cloud Providers
Amazon Web Services
Simple Storage Service
Elastic MapReduce
Elastic Compute Cloud
Google Cloud Platform
Microsoft Azure
Choosing a Cloud Vendor
Case Study: Amazon Web Services
Elastic MapReduce
Elastic Compute Cloud
Summary
Chapter 17: Building a YARN Application
YARN: A General-Purpose Distributed System
YARN: A Quick Review
Creating a YARN Application
POM Configuration
DownloadService.java Class
Client.java
Steps to Launch the Application Master from the Client
Creating the YarnClient
Configuring the Application
Launching the Application Master
Monitoring the Application
ApplicationMaster.java
Communication Protocol between Application Master and Resource Manager: Application Master Protocol
Node Manager Communication Protocol: Container Management Protocol
Steps to Launch the Worker Tasks
Initializing the Application Master Protocol and the Container Management Protocol
Registering Application Master with the Resource Manager
Configuring Container Specifications
Requesting Containers from the Resource Manager
Launching Containers on the Task Nodes
Waiting for Containers to Finish the Worker Tasks
Unregistering Application Master from the Resource Manager
Executing the Application Master
Launch the Application in Un-Managed Mode
Launch the Application in Managed Mode
Summary
Appendix A: Installing Hadoop
Installing Hadoop 2.2.0 on Windows
Preparing the Installation Environment
Building Hadoop 2.2.0 for Windows
Installing Hadoop 2.2.0 for Windows
Configuring Hadoop 2.2.0
core-site.xml configuration
hdfs-site.xml configuration
yarn-site.xml configuration
mapred-site.xml configuration
Preparing the Hadoop Cluster
Starting HDFS
Starting MapReduce (YARN)
Verifying that the Cluster Is Running
Testing the Cluster
Installing Hadoop 2.2.0 on Linux
Appendix B: Using Maven with Eclipse
A Quick Introduction to Maven
Creating a Maven Project
Using Maven with Eclipse
Installing the m2e Maven Eclipse Plug-in
Creating a Maven Project from Eclipse
Building a Maven Project from Eclipse
Appendix C: Apache Ambari
Hadoop Components Supported by Apache Ambari
Installing Apache Ambari
Trying the Ambari Sandbox on Your OS
Index

Schweitzer Fachinformationen