Hadoop: Data Processing and Modelling

Name: Hadoop: Data Processing and Modelling | Data Processing and Modelling
Brand: Packt Publishing
Price: 70.99 EUR
Availability: OnlineOnly

Data Processing and Modelling

Sandeep Karanth, Gerald Turkington, Tanmay Deshpande (Authors)

Packt Publishing

Published on 15 April 2025

979 pages

E-Book

PDF with Adobe-DRM

System requirements

978-1-78712-045-7 (ISBN)

€70.99 incl. 7% VAT

E-Book Single Licence

Available for download

Persons

Karanth, Sandeep:

Sandeep Karanth is a technical architect who specializes in building and operationalizing software systems. He has more than 14 years of experience in the software industry, working on a gamut of products ranging from enterprise data applications to newer-generation mobile applications. He has primarily worked at Microsoft Corporation in Redmond, Microsoft Research in India, and is currently a cofounder at Scibler, architecting data intelligence products.

Turkington, Gerald:

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

Deshpande, Tanmay:

Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He has interest in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, cloud computing, and so on. He has vast experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB, published in August 2014, DynamoDB Cookbook, published in September 2015, Hadoop Real World Solutions Cookbook-Second Edition, published in March 2016, Hadoop: Data Processing and Modelling, published in August 2016, and Hadoop Blueprints, published in September 2016, all by Packt Publishing.

Content

Cover
Copyright
Credits
Preface
Table of Contents
Module 1: Hadoop Beginner's Guide
Chapter 1: What It's All About
Big data processing
Cloud computing with Amazon Web Services
Summary
Chapter 2: Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Using Elastic MapReduce
Comparison of local versus EMR Hadoop
Summary
Chapter 3: Understanding MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
Writing MapReduce programs
Walking through a run of WordCount
Hadoop-specific data types
Input/output
Summary
Chapter 4: Developing MapReduce Programs
Using languages other than Java with Hadoop
Analyzing a large dataset
Counters, status, and other output
Summary
Chapter 5: Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
Graph algorithms
Using language-independent data structures
Summary
Chapter 6: When Things Break
Failure
Summary
Chapter 7: Keeping Things Running
A note on EMR
Hadoop configuration properties
Setting up a cluster
Cluster access control
Managing the NameNode
Managing HDFS
MapReduce management
Scaling
Summary
Chapter 8: A Relational View on Data with Hive
Overview of Hive
Setting up Hive
Using Hive
Hive on Amazon Web Services
Summary
Chapter 9: Working with Relational Databases
Common data paths
Setting up MySQL
Getting data into Hadoop
Getting data out of Hadoop
AWS considerations
Summary
Chapter 10: Data Collection with Flume
A note about AWS
Data data everywhere...
Introducing Apache Flume
The bigger picture
Summary
Chapter 11: Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Other Apache projects
Other programming abstractions
AWS resources
Sources of information
Summary
Appendix: Pop Quiz Answers
Module 2: Hadoop Real-World Solutions Cookbook, Second Edition
Chapter 1: Getting Started with Hadoop 2.X
Introduction
Installing a single-node Hadoop Cluster
Installing a multi-node Hadoop cluster
Adding new nodes to existing Hadoop clusters
Executing the balancer command for uniform data distribution
Entering and exiting from the safe mode in a Hadoop cluster
Decommissioning DataNodes
Performing benchmarking on a Hadoop cluster
Chapter 2: Exploring HDFS
Introduction
Loading data from a local machine to HDFS
Exporting HDFS data to a local machine
Changing the replication factor of an existing file in HDFS
Setting the HDFS block size for all the files in a cluster
Setting the HDFS block size for a specific file in a cluster
Enabling transparent encryption for HDFS
Importing data from another Hadoop cluster
Recycling deleted data from trash to HDFS
Saving compressed data in HDFS
Chapter 3: Mastering Map Reduce Programs
Introduction
Writing the Map Reduce program in Java to analyze web log data
Executing the Map Reduce program in a Hadoop cluster
Adding support for a new writable data type in Hadoop
Implementing a user-defined counter in a Map Reduce program
Map Reduce program to find the top X
Map Reduce program to find distinct values
Map Reduce program to partition data using a custom partitioner
Writing Map Reduce results to multiple output files
Performing Reduce side Joins using Map Reduce
Unit testing the Map Reduce code using MRUnit
Chapter 4: Data Analysis Using Hive, Pig, and Hbase
Introduction
Storing and processing Hive data in a sequential file format
Storing and processing Hive data in the RC file format
Storing and processing Hive data in the ORC file format
Storing and processing Hive data in the Parquet file format
Performing FILTER By queries in Pig
Performing Group By queries in Pig
Performing Order By queries in Pig
Performing JOINS in Pig
Writing a user-defined function in Pig
Analyzing web log data using Pig
Performing the Hbase operation in CLI
Performing Hbase operations in Java
Executing the MapReduce programming with an Hbase Table
Chapter 5: Advanced Data Analysis Using Hive
Introduction
Processing JSON data in Hive using JSON SerDe
Processing XML data in Hive using XML SerDe
Processing Hive data in the Avro format
Writing a user-defined function in Hive
Performing table joins in Hive
Executing map side joins in Hive
Performing context Ngram in Hive
Call Data Record Analytics using Hive
Twitter sentiment analysis using Hive
Implementing Change Data Capture using Hive
Multiple table inserting using Hive
Chapter 6: Data Import/Export Using Sqoop and Flume
Introduction
Importing data from RDBMS to HDFS using Sqoop
Exporting data from HDFS to RDBMS
Using query operator in Sqoop import
Importing data using Sqoop in compressed format
Performing Atomic export using Sqoop
Importing data into Hive tables using Sqoop
Importing data into HDFS from Mainframes
Incremental import using Sqoop
Creating and executing Sqoop job
Importing data from RDBMS to Hbase using Sqoop
Importing Twitter data into HDFS using Flume
Importing data from Kafka into HDFS using Flume
Importing web logs data into HDFS using Flume
Chapter 7: Automation of Hadoop Tasks Using Oozie
Introduction
Implementing a Sqoop action job using Oozie
Implementing a Map Reduce action job using Oozie
Implementing a Java action job using Oozie
Implementing a Hive action job using Oozie
Implementing a Pig action job using Oozie
Implementing an e-mail action job using Oozie
Executing parallel jobs using Oozie (fork)
Scheduling a job in Oozie
Chapter 8: Machine Learning and Predictive Analytics Using Mahout and R
Introduction
Setting up the Mahout development environment
Creating an item-based recommendation engine using Mahout
Creating a user-based recommendation engine using Mahout
Using Predictive analytics on Bank Data using Mahout
Clustering text data using K-Means
Performing population Data Analytics using R
Performing Twitter Sentiment Analytics using R
Performing Predictive Analytics using R
Chapter 9: Integration with Apache Spark
Introduction
Running Spark standalone
Running Spark on YARN
Performing Olympics Athletes analytics using the Spark Shell
Creating Twitter trending topics using Spark Streaming
Twitter trending topics using Spark streaming
Analyzing Parquet files using Spark
Analyzing JSON data using Spark
Processing graphs using GraphX
Conducting predictive analytics using Spark MLlib
Chapter 10: Hadoop Use Cases
Introduction
Call Data Record analytics
Web log analytics
Sensitive data masking and encryption using Hadoop
Module 3: Mastering Hadoop
Chapter 1: Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop 2.X
Hadoop distributions
Summary
Chapter 2: Advanced MapReduce
MapReduce input
The Map task
The Reduce task
MapReduce output
MapReduce job counters
Handling data joins
Summary
Chapter 3: Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
Development and debugging aids
The advanced Pig operators
User-defined functions
Pig performance optimizations
Best practices
Summary
Chapter 4: Advanced Hive
The Hive architecture
Data types
File formats
The data model
Hive query optimizers
Advanced DML
UDF, UDAF, and UDTF
Summary
Chapter 5: Serialization and Hadoop I/O
Data serialization in Hadoop
Avro serialization
File formats
Compression
Summary
Chapter 6: YARN - Bringing Other Paradigms to Hadoop
The YARN architecture
Developing YARN applications
Monitoring YARN
Job scheduling in YARN
YARN commands
Summary
Chapter 7: Storm on YARN - Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Storm on YARN
Summary
Chapter 8: Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Summary
Chapter 9: HDFS Replacements
HDFS - advantages and drawbacks
Amazon AWS S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
Chapter 10: HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
HDFS high availability
HDFS block placement
Summary
Chapter 11: Hadoop Security
The security pillars
Authentication in Hadoop
Authorization in Hadoop
Data confidentiality in Hadoop
Audit logging in Hadoop
Summary
Chapter 12: Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
RHadoop
Summary
Chapter 13: Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Summary
Appendix: Bibliography

Schweitzer Fachinformationen
