
Hadoop: Data Processing and Modelling

Persons
Sandeep Karanth is a technical architect who specializes in building and operationalizing software systems. He has more than 14 years of experience in the software industry, working on a range of products from enterprise data applications to newer-generation mobile applications. He has worked primarily at Microsoft Corporation in Redmond and Microsoft Research in India, and is currently a cofounder at Scibler, architecting data intelligence products.
Turkington Gerald:
Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.
Deshpande Tanmay:
Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He is interested in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has extensive experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB, published in August 2014, DynamoDB Cookbook, published in September 2015, Hadoop Real-World Solutions Cookbook, Second Edition, published in March 2016, Hadoop: Data Processing and Modelling, published in August 2016, and Hadoop Blueprints, published in September 2016, all by Packt Publishing.
Content
- Cover
- Copyright
- Credits
- Preface
- Table of Contents
- Module 1: Hadoop Beginner's Guide
- Chapter 1: What It's All About
- Big data processing
- Cloud computing with Amazon Web Services
- Summary
- Chapter 2: Getting Hadoop Up and Running
- Hadoop on a local Ubuntu host
- Using Elastic MapReduce
- Comparison of local versus EMR Hadoop
- Summary
- Chapter 3: Understanding MapReduce
- Key/value pairs
- The Hadoop Java API for MapReduce
- Writing MapReduce programs
- Walking through a run of WordCount
- Hadoop-specific data types
- Input/output
- Summary
- Chapter 4: Developing MapReduce Programs
- Using languages other than Java with Hadoop
- Analyzing a large dataset
- Counters, status, and other output
- Summary
- Chapter 5: Advanced MapReduce Techniques
- Simple, advanced, and in-between
- Joins
- Graph algorithms
- Using language-independent data structures
- Summary
- Chapter 6: When Things Break
- Failure
- Summary
- Chapter 7: Keeping Things Running
- A note on EMR
- Hadoop configuration properties
- Setting up a cluster
- Cluster access control
- Managing the NameNode
- Managing HDFS
- MapReduce management
- Scaling
- Summary
- Chapter 8: A Relational View on Data with Hive
- Overview of Hive
- Setting up Hive
- Using Hive
- Hive on Amazon Web Services
- Summary
- Chapter 9: Working with Relational Databases
- Common data paths
- Setting up MySQL
- Getting data into Hadoop
- Getting data out of Hadoop
- AWS considerations
- Summary
- Chapter 10: Data Collection with Flume
- A note about AWS
- Data data everywhere...
- Introducing Apache Flume
- The bigger picture
- Summary
- Chapter 11: Where to Go Next
- What we did and didn't cover in this book
- Upcoming Hadoop changes
- Alternative distributions
- Other Apache projects
- Other programming abstractions
- AWS resources
- Sources of information
- Summary
- Appendix: Pop Quiz Answers
- Module 2: Hadoop Real-World Solutions Cookbook, Second Edition
- Chapter 1: Getting Started with Hadoop 2.X
- Introduction
- Installing a single-node Hadoop Cluster
- Installing a multi-node Hadoop cluster
- Adding new nodes to existing Hadoop clusters
- Executing the balancer command for uniform data distribution
- Entering and exiting from the safe mode in a Hadoop cluster
- Decommissioning DataNodes
- Performing benchmarking on a Hadoop cluster
- Chapter 2: Exploring HDFS
- Introduction
- Loading data from a local machine to HDFS
- Exporting HDFS data to a local machine
- Changing the replication factor of an existing file in HDFS
- Setting the HDFS block size for all the files in a cluster
- Setting the HDFS block size for a specific file in a cluster
- Enabling transparent encryption for HDFS
- Importing data from another Hadoop cluster
- Recycling deleted data from trash to HDFS
- Saving compressed data in HDFS
- Chapter 3: Mastering Map Reduce Programs
- Introduction
- Writing the Map Reduce program in Java to analyze web log data
- Executing the Map Reduce program in a Hadoop cluster
- Adding support for a new writable data type in Hadoop
- Implementing a user-defined counter in a Map Reduce program
- Map Reduce program to find the top X
- Map Reduce program to find distinct values
- Map Reduce program to partition data using a custom partitioner
- Writing Map Reduce results to multiple output files
- Performing Reduce side Joins using Map Reduce
- Unit testing the Map Reduce code using MRUnit
- Chapter 4: Data Analysis Using Hive, Pig, and HBase
- Introduction
- Storing and processing Hive data in a sequential file format
- Storing and processing Hive data in the RC file format
- Storing and processing Hive data in the ORC file format
- Storing and processing Hive data in the Parquet file format
- Performing FILTER By queries in Pig
- Performing Group By queries in Pig
- Performing Order By queries in Pig
- Performing JOINS in Pig
- Writing a user-defined function in Pig
- Analyzing web log data using Pig
- Performing the HBase operation in CLI
- Performing HBase operations in Java
- Executing the MapReduce programming with an HBase Table
- Chapter 5: Advanced Data Analysis Using Hive
- Introduction
- Processing JSON data in Hive using JSON SerDe
- Processing XML data in Hive using XML SerDe
- Processing Hive data in the Avro format
- Writing a user-defined function in Hive
- Performing table joins in Hive
- Executing map side joins in Hive
- Performing context Ngram in Hive
- Call Data Record Analytics using Hive
- Twitter sentiment analysis using Hive
- Implementing Change Data Capture using Hive
- Multiple table inserting using Hive
- Chapter 6: Data Import/Export Using Sqoop and Flume
- Introduction
- Importing data from RDBMS to HDFS using Sqoop
- Exporting data from HDFS to RDBMS
- Using query operator in Sqoop import
- Importing data using Sqoop in compressed format
- Performing Atomic export using Sqoop
- Importing data into Hive tables using Sqoop
- Importing data into HDFS from Mainframes
- Incremental import using Sqoop
- Creating and executing Sqoop job
- Importing data from RDBMS to HBase using Sqoop
- Importing Twitter data into HDFS using Flume
- Importing data from Kafka into HDFS using Flume
- Importing web logs data into HDFS using Flume
- Chapter 7: Automation of Hadoop Tasks Using Oozie
- Introduction
- Implementing a Sqoop action job using Oozie
- Implementing a Map Reduce action job using Oozie
- Implementing a Java action job using Oozie
- Implementing a Hive action job using Oozie
- Implementing a Pig action job using Oozie
- Implementing an e-mail action job using Oozie
- Executing parallel jobs using Oozie (fork)
- Scheduling a job in Oozie
- Chapter 8: Machine Learning and Predictive Analytics Using Mahout and R
- Introduction
- Setting up the Mahout development environment
- Creating an item-based recommendation engine using Mahout
- Creating a user-based recommendation engine using Mahout
- Using Predictive analytics on Bank Data using Mahout
- Clustering text data using K-Means
- Performing population Data Analytics using R
- Performing Twitter Sentiment Analytics using R
- Performing Predictive Analytics using R
- Chapter 9: Integration with Apache Spark
- Introduction
- Running Spark standalone
- Running Spark on YARN
- Performing Olympics Athletes analytics using the Spark Shell
- Creating Twitter trending topics using Spark Streaming
- Twitter trending topics using Spark streaming
- Analyzing Parquet files using Spark
- Analyzing JSON data using Spark
- Processing graphs using GraphX
- Conducting predictive analytics using Spark MLlib
- Chapter 10: Hadoop Use Cases
- Introduction
- Call Data Record analytics
- Web log analytics
- Sensitive data masking and encryption using Hadoop
- Module 3: Mastering Hadoop
- Chapter 1: Hadoop 2.X
- The inception of Hadoop
- The evolution of Hadoop
- Hadoop 2.X
- Hadoop distributions
- Summary
- Chapter 2: Advanced MapReduce
- MapReduce input
- The Map task
- The Reduce task
- MapReduce output
- MapReduce job counters
- Handling data joins
- Summary
- Chapter 3: Advanced Pig
- Pig versus SQL
- Different modes of execution
- Complex data types in Pig
- Compiling Pig scripts
- Development and debugging aids
- The advanced Pig operators
- User-defined functions
- Pig performance optimizations
- Best practices
- Summary
- Chapter 4: Advanced Hive
- The Hive architecture
- Data types
- File formats
- The data model
- Hive query optimizers
- Advanced DML
- UDF, UDAF, and UDTF
- Summary
- Chapter 5: Serialization and Hadoop I/O
- Data serialization in Hadoop
- Avro serialization
- File formats
- Compression
- Summary
- Chapter 6: YARN - Bringing Other Paradigms to Hadoop
- The YARN architecture
- Developing YARN applications
- Monitoring YARN
- Job scheduling in YARN
- YARN commands
- Summary
- Chapter 7: Storm on YARN - Low Latency Processing in Hadoop
- Batch processing versus streaming
- Apache Storm
- Storm on YARN
- Summary
- Chapter 8: Hadoop on the Cloud
- Cloud computing characteristics
- Hadoop on the cloud
- Amazon Elastic MapReduce (EMR)
- Summary
- Chapter 9: HDFS Replacements
- HDFS - advantages and drawbacks
- Amazon AWS S3
- Implementing a filesystem in Hadoop
- Implementing an S3 native filesystem in Hadoop
- Summary
- Chapter 10: HDFS Federation
- Limitations of the older HDFS architecture
- Architecture of HDFS Federation
- HDFS high availability
- HDFS block placement
- Summary
- Chapter 11: Hadoop Security
- The security pillars
- Authentication in Hadoop
- Authorization in Hadoop
- Data confidentiality in Hadoop
- Audit logging in Hadoop
- Summary
- Chapter 12: Analytics Using Hadoop
- Data analytics workflow
- Machine learning
- Apache Mahout
- Document analysis using Hadoop and Mahout
- RHadoop
- Summary
- Chapter 13: Hadoop for Microsoft Windows
- Deploying Hadoop on Microsoft Windows
- Summary
- Appendix: Bibliography
System requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).
The PDF file format displays each book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers and smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Adobe DRM, a "hard" form of copy protection. If the requirements above are not met, you will not be able to open the eBook, so prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.
For more information, see our eBook Help page.