Apache Spark 2: Data Processing and Real-Time Analytics

Name: Apache Spark 2: Data Processing and Real-Time Analytics | Master complex big data processing, stream analytics, and machine learning with Apache Spark
Brand: Packt Publishing
Price: 38.99 EUR
Availability: OnlineOnly

Master complex big data processing, stream analytics, and machine learning with Apache Spark

Romeo Kienzler (Author)

Packt Publishing

1st Edition

Published on 21 December 2018

616 pages

E-Book

ePUB with Adobe-DRM

978-1-78995-991-8 (ISBN)

€38.99 incl. 7% VAT

System requirements

for ePUB with Adobe-DRM

E-Book Single Licence

Available for download

Description

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework

Key Features

Master the art of real-time big data processing and machine learning
Explore a wide range of use-cases to analyze large data
Discover ways to optimize your work by using many features of Spark 2.x and Scala

Book Description

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform.

You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools.

By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.

This Learning Path includes content from the following Packt products:

Mastering Apache Spark 2.x by Romeo Kienzler
Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

What you will learn

Get to grips with all the features of Apache Spark 2.x
Perform highly optimized real-time big data processing
Use ML and DL techniques with Spark MLlib and third-party tools
Analyze structured and unstructured data using SparkSQL and GraphX
Understand tuning, debugging, and monitoring of big data applications
Build scalable and fault-tolerant streaming applications
Develop scalable recommendation engines

Who this book is for

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.

Content

Cover
Title Page
Copyright
About Packt
Contributors
Table of Contents
Preface
Chapter 1: A First Taste and What's New in Apache Spark V2
Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What's new in Apache Spark V2?
Cluster design
Cluster management
Local
Standalone
Apache YARN
Apache Mesos
Cloud-based deployments
Performance
The cluster structure
Hadoop Distributed File System
Data locality
Memory
Coding
Cloud
Summary
Chapter 3: Apache Spark Streaming
Overview
Errors and recovery
Checkpointing
Streaming sources
TCP stream
File streams
Flume
Kafka
Summary
Chapter 4: Structured Streaming
The concept of continuous applications
True unification - same code, same engine
Windowing
How streaming engines use windowing
How Apache Spark improves windowing
Increased performance with good old friends
How transparent fault tolerance and exactly-once delivery guarantee is achieved
Replayable sources can replay streams from a given offset
Idempotent sinks prevent data duplication
State versioning guarantees consistent results after reruns
Example - connection to a MQTT message broker
Controlling continuous applications
More on stream life cycle management
Summary
Chapter 5: Apache Spark MLlib
Architecture
The development environment
Classification with Naive Bayes
Theory on Classification
Naive Bayes in practice
Clustering with K-Means
Theory on Clustering
K-Means in practice
Artificial neural networks
ANN in practice
Summary
Chapter 6: Apache SparkML
What does the new API look like?
The concept of pipelines
Transformers
String indexer
OneHotEncoder
VectorAssembler
Pipelines
Estimators
RandomForestClassifier
Model evaluation
CrossValidation and hyperparameter tuning
CrossValidation
Hyperparameter tuning
Winning a Kaggle competition with Apache SparkML
Data preparation
Feature engineering
Testing the feature engineering pipeline
Training the machine learning model
Model evaluation
CrossValidation and hyperparameter tuning
Using the evaluator to assess the quality of the cross-validated and tuned model
Summary
Chapter 7: Apache SystemML
Why do we need just another library?
Why on Apache Spark?
The history of Apache SystemML
A cost-based optimizer for machine learning algorithms
An example - alternating least squares
Apache SystemML architecture
Language parsing
High-level operators are generated
How low-level operators are optimized on
Performance measurements
Apache SystemML in action
Summary
Chapter 8: Apache Spark GraphX
Overview
Graph analytics/processing with GraphX
The raw data
Creating a graph
Example 1 - counting
Example 2 - filtering
Example 3 - PageRank
Example 4 - triangle counting
Example 5 - connected components
Summary
Chapter 9: Spark Tuning
Monitoring Spark jobs
Spark web interface
Jobs
Stages
Storage
Environment
Executors
SQL
Visualizing Spark application using web UI
Observing the running and completed Spark jobs
Debugging Spark applications using logs
Logging with log4j with Spark
Spark configuration
Spark properties
Environmental variables
Logging
Common mistakes in Spark app development
Application failure
Slow jobs or unresponsiveness
Optimization techniques
Data serialization
Memory tuning
Memory usage and management
Tuning the data structures
Serialized RDD storage
Garbage collection tuning
Level of parallelism
Broadcasting
Data locality
Summary
Chapter 10: Testing and Debugging Spark
Testing in a distributed environment
Distributed environment
Issues in a distributed system
Challenges of software testing in a distributed environment
Testing Spark applications
Testing Scala methods
Unit testing
Testing Spark applications
Method 1: Using Scala JUnit test
Method 2: Testing Scala code using FunSuite
Method 3: Making life easier with Spark testing base
Configuring Hadoop runtime on Windows
Debugging Spark applications
Logging with log4j with Spark recap
Debugging the Spark application
Debugging Spark application on Eclipse as Scala debug
Debugging Spark jobs running as local and standalone mode
Debugging Spark applications on YARN or Mesos cluster
Debugging Spark application using SBT
Summary
Chapter 11: Practical Machine Learning with Spark Using Scala
Introduction
Apache Spark
Machine learning
Scala
Software versions and libraries used in this book
Configuring IntelliJ to work with Spark and run Spark ML sample codes
Getting ready
How to do it...
There's more...
See also
Running a sample ML code from Spark
Getting ready
How to do it...
Identifying data sources for practical machine learning
Getting ready
How to do it...
See also
Running your first program using Apache Spark 2.0 with the IntelliJ IDE
How to do it...
How it works...
There's more...
See also
How to add graphics to your Spark program
How to do it...
How it works...
There's more...
See also
Chapter 12: Spark's Three Data Musketeers for Machine Learning - Perfect Together
Introduction
RDDs - what started it all...
DataFrame - a natural evolution to unite API and SQL via a high-level API
Dataset - a high-level unifying Data API
Creating RDDs with Spark 2.0 using internal data sources
How to do it...
How it works...
Creating RDDs with Spark 2.0 using external data sources
How to do it...
How it works...
There's more...
See also
Transforming RDDs with Spark 2.0 using the filter() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the super useful flatMap() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with set operation APIs
How to do it...
How it works...
See also
RDD transformation/aggregation with groupBy() and reduceByKey()
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the zip() API
How to do it...
How it works...
See also
Join transformation with paired key-value RDDs
How to do it...
How it works...
There's more...
Reduce and grouping transformation with paired key-value RDDs
How to do it...
How it works...
See also
Creating DataFrames from Scala data structures
How to do it...
How it works...
There's more...
See also
Operating on DataFrames programmatically without SQL
How to do it...
How it works...
There's more...
See also
Loading DataFrames and setup from an external source
How to do it...
How it works...
There's more...
See also
Using DataFrames with standard SQL language - SparkSQL
How to do it...
How it works...
There's more...
See also
Working with the Dataset API using a Scala Sequence
How to do it...
How it works...
There's more...
See also
Creating and using Datasets from RDDs and back again
How to do it...
How it works...
There's more...
See also
Working with JSON using the Dataset API and SQL together
How to do it...
How it works...
There's more...
See also
Functional programming with the Dataset API using domain objects
How to do it...
How it works...
There's more...
See also
Chapter 13: Common Recipes for Implementing a Robust Machine Learning System
Introduction
Spark's basic statistical API to help you build your own algorithms
How to do it...
How it works...
There's more...
See also
ML pipelines for real-life machine learning applications
How to do it...
How it works...
There's more...
See also
Normalizing data with Spark
How to do it...
How it works...
There's more...
See also
Splitting data for training and testing
How to do it...
How it works...
There's more...
See also
Common operations with the new Dataset API
How to do it...
How it works...
There's more...
See also
Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
How to do it...
How it works...
There's more...
See also
LabeledPoint data structure for Spark ML
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster in Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster pre-Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
How to do it...
How it works...
There's more...
See also
New model export and PMML markup in Spark 2.0
How to do it...
How it works...
There's more...
See also
Regression model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Binary classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multiclass classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multilabel classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Using the Scala Breeze library to do graphics in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 14: Recommendation Engine that Scales with Spark
Introduction
Content filtering
Collaborative filtering
Neighborhood method
Latent factor models techniques
Setting up the required data for a scalable recommendation engine in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the movies data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the ratings data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Building a scalable recommendation engine using collaborative filtering in Spark 2.0
How to do it...
How it works...
There's more...
See also
Dealing with implicit input for training
Chapter 15: Unsupervised Clustering with Apache Spark 2.0
Introduction
Building a KMeans classifying system in Spark 2.0
How to do it...
How it works...
KMeans (Lloyd Algorithm)
KMeans++ (Arthur's algorithm)
KMeans|| (pronounced as KMeans Parallel)
There's more...
See also
Bisecting KMeans, the new kid on the block in Spark 2.0
How to do it...
How it works...
There's more...
See also
Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
How to do it...
How it works...
New GaussianMixture()
There's more...
See also
Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Latent Dirichlet Allocation (LDA) to classify documents and text into topics
How to do it...
How it works...
There's more...
See also
Streaming KMeans to classify data in near real-time
How to do it...
How it works...
There's more...
See also
Chapter 16: Implementing Text Analytics with Spark 2.0 ML Library
Introduction
Doing term frequency with Spark - everything that counts
How to do it...
How it works...
There's more...
See also
Displaying similar words with Spark using Word2Vec
How to do it...
How it works...
There's more...
See also
Downloading a complete dump of Wikipedia for a real-life Spark ML project
How to do it...
There's more...
See also
Using Latent Semantic Analysis for text analytics with Spark 2.0
How to do it...
How it works...
There's more...
See also
Topic modeling with Latent Dirichlet allocation in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 17: Spark Streaming and Machine Learning Library
Introduction
Structured streaming for near real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming DataFrames for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming Datasets for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming data and debugging with queueStream
How to do it...
How it works...
See also
Downloading and understanding the famous Iris data for unsupervised classification
How to do it...
How it works...
There's more...
See also
Streaming KMeans for a real-time on-line classifier
How to do it...
How it works...
There's more...
See also
Downloading wine quality data for streaming regression
How to do it...
How it works...
There's more...
Streaming linear regression for a real-time regression
How to do it...
How it works...
There's more...
See also
Downloading Pima Diabetes data for supervised classification
How to do it...
How it works...
There's more...
See also
Streaming logistic regression for an on-line classifier
How to do it...
How it works...
There's more...
See also
Other Books You May Enjoy
Index

Schweitzer Fachinformationen
