
Scaling Machine Learning with Spark
Description
Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
You will:
- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Scale machine learning with PyTorch and explore its internal architecture
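To give a flavor of the workflows the book covers, here is a minimal illustrative sketch (not taken from the book) that trains a Spark MLlib pipeline and tracks the run with MLflow. It assumes pyspark and mlflow are installed locally and uses a hypothetical toy dataset; names such as "f1", "f2", and the app name are placeholders.

import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Start a local Spark session (placeholder app name).
spark = SparkSession.builder.appName("mllib-mlflow-sketch").getOrCreate()

# Hypothetical toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 3.2, 1), (0.3, 0.2, 0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector and chain a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])

# Track the experiment with MLflow: log a parameter and the fitted model.
with mlflow.start_run():
    model = pipeline.fit(df)
    mlflow.log_param("maxIter", lr.getMaxIter())
    mlflow.spark.log_model(model, "model")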
Contents
- Cover
- Copyright
- Table of Contents
- Preface
- Who Should Read This Book?
- Do You Need Distributed Machine Learning?
- Navigating This Book
- What Is Not Covered
- The Environment and Tools
- The Tools
- The Datasets
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Chapter 1. Distributed Machine Learning Terminology and Concepts
- The Stages of the Machine Learning Workflow
- Tools and Technologies in the Machine Learning Pipeline
- Distributed Computing Models
- General-Purpose Models
- Dedicated Distributed Computing Models
- Introduction to Distributed Systems Architecture
- Centralized Versus Decentralized Systems
- Interaction Models
- Communication in a Distributed Setting
- Introduction to Ensemble Methods
- High Versus Low Bias
- Types of Ensemble Methods
- Distributed Training Topologies
- The Challenges of Distributed Machine Learning Systems
- Performance
- Resource Management
- Fault Tolerance
- Privacy
- Portability
- Setting Up Your Local Environment
- Chapters 2-6 Tutorials Environment
- Chapters 7-10 Tutorials Environment
- Summary
- Chapter 2. Introduction to Spark and PySpark
- Apache Spark Architecture
- Intro to PySpark
- Apache Spark Basics
- Software Architecture
- PySpark and Functional Programming
- Executing PySpark Code
- pandas DataFrames Versus Spark DataFrames
- Scikit-Learn Versus MLlib
- Summary
- Chapter 3. Managing the Machine Learning Experiment Lifecycle with MLflow
- Machine Learning Lifecycle Management Requirements
- What Is MLflow?
- Software Components of the MLflow Platform
- Users of the MLflow Platform
- MLflow Components
- MLflow Tracking
- MLflow Projects
- MLflow Models
- MLflow Model Registry
- Using MLflow at Scale
- Summary
- Chapter 4. Data Ingestion, Preprocessing, and Descriptive Statistics
- Data Ingestion with Spark
- Working with Images
- Working with Tabular Data
- Preprocessing Data
- Preprocessing Versus Processing
- Why Preprocess the Data?
- Data Structures
- MLlib Data Types
- Preprocessing with MLlib Transformers
- Preprocessing Image Data
- Save the Data and Avoid the Small Files Problem
- Descriptive Statistics: Getting a Feel for the Data
- Calculating Statistics
- Descriptive Statistics with Spark Summarizer
- Data Skewness
- Correlation
- Summary
- Chapter 5. Feature Engineering
- Features and Their Impact on Models
- MLlib Featurization Tools
- Extractors
- Selectors
- Example: Word2Vec
- The Image Featurization Process
- Understanding Image Manipulation
- Extracting Features with Spark APIs
- The Text Featurization Process
- Bag-of-Words
- TF-IDF
- N-Gram
- Additional Techniques
- Enriching the Dataset
- Summary
- Chapter 6. Training Models with Spark MLlib
- Algorithms
- Supervised Machine Learning
- Classification
- Regression
- Unsupervised Machine Learning
- Frequent Pattern Mining
- Clustering
- Evaluating
- Supervised Evaluators
- Unsupervised Evaluators
- Hyperparameters and Tuning Experiments
- Building a Parameter Grid
- Splitting the Data into Training and Test Sets
- Cross-Validation: A Better Way to Test Your Models
- Machine Learning Pipelines
- Constructing a Pipeline
- How Does Splitting Work with the Pipeline API?
- Persistence
- Summary
- Chapter 7. Bridging Spark and Deep Learning Frameworks
- The Two Clusters Approach
- Implementing a Dedicated Data Access Layer
- Features of a DAL
- Selecting a DAL
- What Is Petastorm?
- SparkDatasetConverter
- Petastorm as a Parquet Store
- Project Hydrogen
- Barrier Execution Mode
- Accelerator-Aware Scheduling
- A Brief Introduction to the Horovod Estimator API
- Summary
- Chapter 8. TensorFlow Distributed Machine Learning Approach
- A Quick Overview of TensorFlow
- What Is a Neural Network?
- TensorFlow Cluster Process Roles and Responsibilities
- Loading Parquet Data into a TensorFlow Dataset
- An Inside Look at TensorFlow's Distributed Machine Learning Strategies
- ParameterServerStrategy
- CentralStorageStrategy: One Machine, Multiple Processors
- MirroredStrategy: One Machine, Multiple Processors, Local Copy
- MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
- TPUStrategy
- What Things Change When You Switch Strategies?
- Training APIs
- Keras API
- Custom Training Loop
- Estimator API
- Putting It All Together
- Troubleshooting
- Summary
- Chapter 9. PyTorch Distributed Machine Learning Approach
- A Quick Overview of PyTorch Basics
- Computation Graph
- PyTorch Mechanics and Concepts
- PyTorch Distributed Strategies for Training Models
- Introduction to PyTorch's Distributed Approach
- Distributed Data-Parallel Training
- RPC-Based Distributed Training
- Communication Topologies in PyTorch (c10d)
- What Can We Do with PyTorch's Low-Level APIs?
- Loading Data with PyTorch and Petastorm
- Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
- The Enigma of Mismatched Data Types
- The Mystery of Straggling Workers
- How Does PyTorch Differ from TensorFlow?
- Summary
- Chapter 10. Deployment Patterns for Machine Learning Models
- Deployment Patterns
- Pattern 1: Batch Prediction
- Pattern 2: Model-in-Service
- Pattern 3: Model-as-a-Service
- Determining Which Pattern to Use
- Production Software Requirements
- Monitoring Machine Learning Models in Production
- Data Drift
- Model Drift, Concept Drift
- Distributional Domain Shift (the Long Tail)
- What Metrics Should I Monitor in Production?
- How Do I Measure Changes Using My Monitoring System?
- What It Looks Like in Production
- The Production Feedback Loop
- Deploying with MLlib
- Production Machine Learning Pipelines with Structured Streaming
- Deploying with MLflow
- Defining an MLflow Wrapper
- Deploying the Model as a Microservice
- Loading the Model as a Spark UDF
- How to Develop Your System Iteratively
- Summary
- Index
- About the Author
- Colophon
System Requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many more (not Kindle).
The PDF format displays each page of the book identically on any hardware. This makes a PDF well suited to complex layouts such as those used in textbooks and technical books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, however, PDFs can be tedious to read because they require a lot of scrolling.
Adobe DRM is a "hard" form of copy protection: if the necessary prerequisites are not met, you will not be able to open the e-book. You must therefore prepare your reading hardware before downloading.
Please note: after installing the reading software, we strongly recommend authorizing it with your personal Adobe ID!
Further information can be found in our e-book help.