
Machine Learning with Spark
Description
Authors
Rajdeep Dua has over 18 years' experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented on BigQuery and Google App Engine at the W3C conference in Hyderabad. He previously led developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Other references to his work can be found at Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.
Manpreet Singh Ghotra has more than 15 years' experience in software development for both enterprise and big data software. He currently works at Salesforce on developing a machine learning platform and APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the world's largest online retailers, working on transit time calculations using Apache Mahout, and on an R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.
Contents
- Cover
- Credits
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Getting Up and Running with Spark
- Installing and setting up Spark locally
- Spark clusters
- The Spark programming model
- SparkContext and SparkConf
- SparkSession
- The Spark shell
- Resilient Distributed Datasets
- Creating RDDs
- Spark operations
- Caching RDDs
- Broadcast variables and accumulators
- SchemaRDD
- Spark data frame
- The first step to a Spark program in Scala
- The first step to a Spark program in Java
- The first step to a Spark program in Python
- The first step to a Spark program in R
- SparkR DataFrames
- Getting Spark running on Amazon EC2
- Launching an EC2 Spark cluster
- Configuring and running Spark on Amazon Elastic MapReduce
- UI in Spark
- Machine learning algorithms supported by Spark
- Benefits of using Spark ML as compared to existing libraries
- Spark Cluster on Google Compute Engine - Dataproc
- Hadoop and Spark Versions
- Creating a Cluster
- Submitting a Job
- Summary
- Chapter 2: Math for Machine Learning
- Linear algebra
- Setting up the Scala environment in IntelliJ
- Setting up the Scala environment on the Command Line
- Fields
- Real numbers
- Complex numbers
- Vectors
- Vector spaces
- Vector types
- Vectors in Breeze
- Vectors in Spark
- Vector operations
- Hyperplanes
- Vectors in machine learning
- Matrix
- Types of matrices
- Matrix in Spark
- Distributed matrix in Spark
- Matrix operations
- Determinant
- Eigenvalues and eigenvectors
- Singular value decomposition
- Matrices in machine learning
- Functions
- Function types
- Functional composition
- Hypothesis
- Gradient descent
- Prior, likelihood, and posterior
- Calculus
- Differential calculus
- Integral calculus
- Lagrange multipliers
- Plotting
- Summary
- Chapter 3: Designing a Machine Learning System
- What is Machine Learning?
- Introducing MovieStream
- Business use cases for a machine learning system
- Personalization
- Targeted marketing and customer segmentation
- Predictive modeling and analytics
- Types of machine learning models
- The components of a data-driven machine learning system
- Data ingestion and storage
- Data cleansing and transformation
- Model training and testing loop
- Model deployment and integration
- Model monitoring and feedback
- Batch versus real time
- Data Pipeline in Apache Spark
- An architecture for a machine learning system
- Spark MLlib
- Performance improvements in Spark ML over Spark MLlib
- Comparing algorithms supported by MLlib
- Classification
- Clustering
- Regression
- MLlib supported methods and developer APIs
- Spark Integration
- MLlib vision
- MLlib versions compared
- Spark 1.6 to 2.0
- Summary
- Chapter 4: Obtaining, Processing, and Preparing Data with Spark
- Accessing publicly available datasets
- The MovieLens 100k dataset
- Exploring and visualizing your data
- Exploring the user dataset
- Count by occupation
- Movie dataset
- Exploring the rating dataset
- Rating count bar chart
- Distribution of the number of ratings
- Processing and transforming your data
- Filling in bad or missing data
- Extracting useful features from your data
- Numerical features
- Categorical features
- Derived features
- Transforming timestamps into categorical features
- Extract time of day
- Text features
- Simple text feature extraction
- Sparse Vectors from Titles
- Normalizing features
- Using ML for feature normalization
- Using packages for feature extraction
- TF-IDF
- IDF
- Word2Vec
- Skip-gram model
- Standard scaler
- Summary
- Chapter 5: Building a Recommendation Engine with Spark
- Types of recommendation models
- Content-based filtering
- Collaborative filtering
- Matrix factorization
- Explicit matrix factorization
- Implicit Matrix Factorization
- Basic model for Matrix Factorization
- Alternating least squares
- Extracting the right features from your data
- Extracting features from the MovieLens 100k dataset
- Training the recommendation model
- Training a model on the MovieLens 100k dataset
- Training a model using Implicit feedback data
- Using the recommendation model
- ALS Model recommendations
- User recommendations
- Generating movie recommendations from the MovieLens 100k dataset
- Inspecting the recommendations
- Item recommendations
- Generating similar movies for the MovieLens 100k dataset
- Inspecting the similar items
- Evaluating the performance of recommendation models
- ALS Model Evaluation
- Mean Squared Error
- Mean Average Precision at K
- Using MLlib's built-in evaluation functions
- RMSE and MSE
- MAP
- FP-Growth algorithm
- FP-Growth Basic Sample
- FP-Growth Applied to MovieLens Data
- Summary
- Chapter 6: Building a Classification Model with Spark
- Types of classification models
- Linear models
- Logistic regression
- Multinomial logistic regression
- Visualizing the StumbleUpon dataset
- Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
- StumbleUponExecutor
- Linear support vector machines
- The naive Bayes model
- Decision trees
- Ensembles of trees
- Random Forests
- Gradient-Boosted Trees
- Multilayer perceptron classifier
- Extracting the right features from your data
- Training classification models
- Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
- Using classification models
- Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
- Evaluating the performance of classification models
- Accuracy and prediction error
- Precision and recall
- ROC curve and AUC
- Improving model performance and tuning parameters
- Feature standardization
- Additional features
- Using the correct form of data
- Tuning model parameters
- Linear models
- Iterations
- Step size
- Regularization
- Decision trees
- Tuning tree depth and impurity
- The naive Bayes model
- Cross-validation
- Summary
- Chapter 7: Building a Regression Model with Spark
- Types of regression models
- Least squares regression
- Decision trees for regression
- Evaluating the performance of regression models
- Mean Squared Error and Root Mean Squared Error
- Mean Absolute Error
- Root Mean Squared Log Error
- The R-squared coefficient
- Extracting the right features from your data
- Extracting features from the bike sharing dataset
- Training and using regression models
- BikeSharingExecutor
- Training a regression model on the bike sharing dataset
- Linear regression
- Generalized linear regression
- Decision tree regression
- Ensembles of trees
- Random forest regression
- Gradient boosted tree regression
- Improving model performance and tuning parameters
- Transforming the target variable
- Impact of training on log-transformed targets
- Tuning model parameters
- Creating training and testing sets to evaluate parameters
- Splitting data for Decision tree
- The impact of parameter settings for linear models
- Iterations
- Step size
- L2 regularization
- L1 regularization
- Intercept
- The impact of parameter settings for the decision tree
- Tree depth
- Maximum bins
- The impact of parameter settings for the Gradient Boosted Trees
- Iterations
- MaxBins
- Summary
- Chapter 8: Building a Clustering Model with Spark
- Types of clustering models
- k-means clustering
- Initialization methods
- Mixture models
- Hierarchical clustering
- Extracting the right features from your data
- Extracting features from the MovieLens dataset
- K-means - training a clustering model
- Training a clustering model on the MovieLens dataset
- K-means - interpreting cluster predictions on the MovieLens dataset
- Interpreting the movie clusters
- K-means - evaluating the performance of clustering models
- Internal evaluation metrics
- External evaluation metrics
- Computing performance metrics on the MovieLens dataset
- Effect of iterations on WSSSE
- Bisecting K-means
- Bisecting K-means - training a clustering model
- WSSSE and iterations
- Gaussian Mixture Model
- Clustering using GMM
- Plotting the user and item data with GMM clustering
- GMM - effect of iterations on cluster boundaries
- Summary
- Chapter 9: Dimensionality Reduction with Spark
- Types of dimensionality reduction
- Principal components analysis
- Singular value decomposition
- Relationship with matrix factorization
- Clustering as dimensionality reduction
- Extracting the right features from your data
- Extracting features from the LFW dataset
- Exploring the face data
- Visualizing the face data
- Extracting facial images as vectors
- Loading images
- Converting to grayscale and resizing the images
- Extracting feature vectors
- Normalization
- Training a dimensionality reduction model
- Running PCA on the LFW dataset
- Visualizing the Eigenfaces
- Interpreting the Eigenfaces
- Using a dimensionality reduction model
- Projecting data using PCA on the LFW dataset
- The relationship between PCA and SVD
- Evaluating dimensionality reduction models
- Evaluating k for SVD on the LFW dataset
- Singular values
- Summary
- Chapter 10: Advanced Text Processing with Spark
- What's so special about text data?
- Extracting the right features from your data
- Term weighting schemes
- Feature hashing
- Extracting the tf-idf features from the 20 Newsgroups dataset
- Exploring the 20 Newsgroups data
- Applying basic tokenization
- Improving our tokenization
- Removing stop words
- Excluding terms based on frequency
- A note about stemming
- Feature Hashing
- Building a tf-idf model
- Analyzing the tf-idf weightings
- Using a tf-idf model
- Document similarity with the 20 Newsgroups dataset and tf-idf features
- Training a text classifier on the 20 Newsgroups dataset using tf-idf
- Evaluating the impact of text processing
- Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
- Text classification with Spark 2.0
- Word2Vec models
- Word2Vec with Spark MLlib on the 20 Newsgroups dataset
- Word2Vec with Spark ML on the 20 Newsgroups dataset
- Summary
- Chapter 11: Real-Time Machine Learning with Spark Streaming
- Online learning
- Stream processing
- An introduction to Spark Streaming
- Input sources
- Transformations
- Keeping track of state
- General transformations
- Actions
- Window operators
- Caching and fault tolerance with Spark Streaming
- Creating a basic streaming application
- The producer application
- Creating a basic streaming application
- Streaming analytics
- Stateful streaming
- Online learning with Spark Streaming
- Streaming regression
- A simple streaming regression program
- Creating a streaming data producer
- Creating a streaming regression model
- Streaming K-means
- Online model evaluation
- Comparing model performance with Spark Streaming
- Structured Streaming
- Summary
- Chapter 12: Pipeline APIs for Spark ML
- Introduction to pipelines
- DataFrames
- Pipeline components
- Transformers
- Estimators
- How pipelines work
- Machine learning pipeline with an example
- StumbleUponExecutor
- Summary
- Index
System requirements
File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many others (not Kindle)
The ePUB format is well suited to novels and non-fiction, that is, "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display.
Adobe DRM is a "hard" form of copy protection. If the necessary prerequisites are not met, you will not be able to open the e-book, so prepare your reading hardware before downloading.
Please note: after installing the reading software, we strongly recommend authorizing it with your personal Adobe ID.
Further information can be found in our e-book help.
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements: the same as for ePUB above.
The PDF format displays each book page identically on any hardware, which makes it suitable for the complex layouts used in textbooks and reference works (images, tables, columns, footnotes). On the small displays of e-readers and smartphones, however, PDFs are rather tedious because they require too much scrolling.