Apache Spark 2.x Machine Learning Cookbook

Name: Apache Spark 2.x Machine Learning Cookbook | Over 100 recipes to simplify machine learning model implementations with Spark
Brand: Packt Publishing
Availability: Online only

Over 100 recipes to simplify machine learning model implementations with Spark

Siamak Amirghodsi (Author)

Packt Publishing

Published on 8 July 2025

666 pages

E-Book

ePUB with Adobe-DRM

E-Book

PDF with Adobe-DRM

978-1-78217-460-8 (ISBN)

from €45.59

Available for download

Content

Cover
Title Page
Copyright
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Practical Machine Learning with Spark Using Scala
Introduction
Apache Spark
Machine learning
Scala
Software versions and libraries used in this book
Downloading and installing the JDK
Getting ready
How to do it...
Downloading and installing IntelliJ
Getting ready
How to do it...
Downloading and installing Spark
Getting ready
How to do it...
Configuring IntelliJ to work with Spark and run Spark ML sample codes
Getting ready
How to do it...
There's more...
See also
Running a sample ML code from Spark
Getting ready
How to do it...
Identifying data sources for practical machine learning
Getting ready
How to do it...
See also
Running your first program using Apache Spark 2.0 with the IntelliJ IDE
How to do it...
How it works...
There's more...
See also
How to add graphics to your Spark program
How to do it...
How it works...
There's more...
See also
Chapter 2: Just Enough Linear Algebra for Machine Learning with Spark
Introduction
Package imports and initial setup for vectors and matrices
How to do it...
There's more...
See also
Creating DenseVector and setup with Spark 2.0
How to do it...
How it works...
There's more...
See also
Creating SparseVector and setup with Spark
How to do it...
How it works...
There's more...
See also
Creating dense matrix and setup with Spark 2.0
Getting ready
How to do it...
How it works...
There's more...
See also
Using sparse local matrices with Spark 2.0
How to do it...
How it works...
There's more...
See also
Performing vector arithmetic using Spark 2.0
How to do it...
How it works...
See also
Performing matrix arithmetic using Spark 2.0
How to do it...
How it works...
Exploring RowMatrix in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring Distributed IndexedRowMatrix in Spark 2.0
How to do it...
How it works...
See also
Exploring distributed CoordinateMatrix in Spark 2.0
How to do it...
How it works...
See also
Exploring distributed BlockMatrix in Spark 2.0
How to do it...
How it works...
See also
Chapter 3: Spark's Three Data Musketeers for Machine Learning - Perfect Together
Introduction
RDDs - what started it all...
DataFrame - a natural evolution to unite API and SQL via a high-level API
Dataset - a high-level unifying Data API
Creating RDDs with Spark 2.0 using internal data sources
How to do it...
How it works...
Creating RDDs with Spark 2.0 using external data sources
How to do it...
How it works...
There's more...
See also
Transforming RDDs with Spark 2.0 using the filter() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the super useful flatMap() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with set operation APIs
How to do it...
How it works...
See also
RDD transformation/aggregation with groupBy() and reduceByKey()
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the zip() API
How to do it...
How it works...
See also
Join transformation with paired key-value RDDs
How to do it...
How it works...
There's more...
Reduce and grouping transformation with paired key-value RDDs
How to do it...
How it works...
See also
Creating DataFrames from Scala data structures
How to do it...
How it works...
There's more...
See also
Operating on DataFrames programmatically without SQL
How to do it...
How it works...
There's more...
See also
Loading DataFrames and setup from an external source
How to do it...
How it works...
There's more...
See also
Using DataFrames with standard SQL language - SparkSQL
How to do it...
How it works...
There's more...
See also
Working with the Dataset API using a Scala Sequence
How to do it...
How it works...
There's more...
See also
Creating and using Datasets from RDDs and back again
How to do it...
How it works...
There's more...
See also
Working with JSON using the Dataset API and SQL together
How to do it...
How it works...
There's more...
See also
Functional programming with the Dataset API using domain objects
How to do it...
How it works...
There's more...
See also
Chapter 4: Common Recipes for Implementing a Robust Machine Learning System
Introduction
Spark's basic statistical API to help you build your own algorithms
How to do it...
How it works...
There's more...
See also
ML pipelines for real-life machine learning applications
How to do it...
How it works...
There's more...
See also
Normalizing data with Spark
How to do it...
How it works...
There's more...
See also
Splitting data for training and testing
How to do it...
How it works...
There's more...
See also
Common operations with the new Dataset API
How to do it...
How it works...
There's more...
See also
Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
How to do it...
How it works...
There's more...
See also
LabeledPoint data structure for Spark ML
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster in Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster pre-Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
How to do it...
How it works...
There's more...
See also
New model export and PMML markup in Spark 2.0
How to do it...
How it works...
There's more...
See also
Regression model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Binary classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multiclass classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multilabel classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Using the Scala Breeze library to do graphics in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 5: Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I
Introduction
Fitting a linear regression line to data the old fashioned way
How to do it...
How it works...
There's more...
See also
Generalized linear regression in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with Lasso and L-BFGS in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with Lasso and 'auto' optimization selection in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with ridge regression and 'auto' optimization selection in Spark 2.0
How to do it...
How it works...
There's more...
See also
Isotonic regression in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Multilayer perceptron classifier in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Survival regression - parametric AFT model in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 6: Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II
Introduction
Linear regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Logistic regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Ridge regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Lasso regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Logistic regression with L-BFGS optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Support Vector Machine (SVM) with Spark 2.0
How to do it...
How it works...
There's more...
See also
Naive Bayes machine learning with Spark 2.0 MLlib
How to do it...
How it works...
There's more...
See also
Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0
Getting ready
How to do it...
How it works...
There's more...
PipeLine
Vectors
See also
Chapter 7: Recommendation Engine that Scales with Spark
Introduction
Content filtering
Collaborative filtering
Neighborhood method
Latent factor models techniques
Setting up the required data for a scalable recommendation engine in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the movies data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the ratings data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Building a scalable recommendation engine using collaborative filtering in Spark 2.0
How to do it...
How it works...
There's more...
See also
Dealing with implicit input for training
Chapter 8: Unsupervised Clustering with Apache Spark 2.0
Introduction
Building a KMeans classifying system in Spark 2.0
How to do it...
How it works...
KMeans (Lloyd Algorithm)
KMeans++ (Arthur's algorithm)
KMeans|| (pronounced as KMeans Parallel)
There's more...
See also
Bisecting KMeans, the new kid on the block in Spark 2.0
How to do it...
How it works...
There's more...
See also
Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
How to do it...
How it works...
New GaussianMixture()
There's more...
See also
Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Latent Dirichlet Allocation (LDA) to classify documents and text into topics
How to do it...
How it works...
There's more...
See also
Streaming KMeans to classify data in near real-time
How to do it...
How it works...
There's more...
See also
Chapter 9: Optimization - Going Down the Hill with Gradient Descent
Introduction
How do machines learn using an error-based system?
Optimizing a quadratic cost function and finding the minima using just math to gain insight
How to do it...
How it works...
There's more...
See also
Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch
How to do it...
How it works...
There's more...
See also
Coding Gradient Descent optimization to solve Linear Regression from scratch
How to do it...
How it works...
There's more...
See also
Normal equations as an alternative for solving Linear Regression in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 10: Building Machine Learning Systems with Decision Tree and Ensemble Models
Introduction
Ensemble models
Measures of impurity
Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0
How to do it...
There's more...
Building a classification system with Decision Trees in Spark 2.0
How to do it...
How it works...
There's more...
See also
Solving Regression problems with Decision Trees in Spark 2.0
How to do it...
How it works...
See also
Building a classification system with Random Forest Trees in Spark 2.0
How to do it...
How it works...
See also
Solving regression problems with Random Forest Trees in Spark 2.0
How to do it...
How it works...
See also
Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 11: Curse of High-Dimensionality in Big Data
Introduction
Feature selection versus feature extraction
Two methods of ingesting and preparing a CSV file for processing in Spark
How to do it...
How it works...
There's more...
See also
Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark
How to do it...
How it works...
There's more...
See also
Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark
How to do it...
How it works...
There's more...
See also
Chapter 12: Implementing Text Analytics with Spark 2.0 ML Library
Introduction
Doing term frequency with Spark - everything that counts
How to do it...
How it works...
There's more...
See also
Displaying similar words with Spark using Word2Vec
How to do it...
How it works...
There's more...
See also
Downloading a complete dump of Wikipedia for a real-life Spark ML project
How to do it...
There's more...
See also
Using Latent Semantic Analysis for text analytics with Spark 2.0
How to do it...
How it works...
There's more...
See also
Topic modeling with Latent Dirichlet allocation in Spark 2.0
How to do it...
How it works...
There's more...
See also
Chapter 13: Spark Streaming and Machine Learning Library
Introduction
Structured streaming for near real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming DataFrames for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming Datasets for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming data and debugging with queueStream
How to do it...
How it works...
See also
Downloading and understanding the famous Iris data for unsupervised classification
How to do it...
How it works...
There's more...
See also
Streaming KMeans for a real-time on-line classifier
How to do it...
How it works...
There's more...
See also
Downloading wine quality data for streaming regression
How to do it...
How it works...
There's more...
Streaming linear regression for a real-time regression
How to do it...
How it works...
There's more...
See also
Downloading Pima Diabetes data for supervised classification
How to do it...
How it works...
There's more...
See also
Streaming logistic regression for an on-line classifier
How to do it...
How it works...
There's more...
See also
Index

System requirements

File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).

The ePUB file format works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.

Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.

For more information, see our eBook Help page.

File format: PDF
Copy protection: Adobe-DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).

The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). Unfortunately, on the small screens of e-readers or smartphones, PDFs are rather annoying, requiring too much scrolling.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.

Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.

For more information, see our eBook Help page.

Schweitzer Fachinformationen
