
Scaling Machine Learning with Spark
Description
Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
You will:
- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Scale machine learning with PyTorch and explore its internal architecture
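To give a flavor of the workflows the book covers, here is a minimal illustrative sketch (not taken from the book) that trains a Spark MLlib pipeline and tracks the run with MLflow. It assumes pyspark and mlflow are installed locally and uses a hypothetical toy dataset; names such as "f1", "f2", and the app name are placeholders.

import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Start a local Spark session (placeholder app name).
spark = SparkSession.builder.appName("mllib-mlflow-sketch").getOrCreate()

# Hypothetical toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 3.2, 1), (0.3, 0.2, 0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector and chain a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])

# Track the experiment with MLflow: log a parameter and the fitted model.
with mlflow.start_run():
    model = pipeline.fit(df)
    mlflow.log_param("maxIter", lr.getMaxIter())
    mlflow.spark.log_model(model, "model")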
Contents
- Cover
- Copyright
- Table of Contents
- Preface
- Who Should Read This Book?
- Do You Need Distributed Machine Learning?
- Navigating This Book
- What Is Not Covered
- The Environment and Tools
- The Tools
- The Datasets
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Chapter 1. Distributed Machine Learning Terminology and Concepts
- The Stages of the Machine Learning Workflow
- Tools and Technologies in the Machine Learning Pipeline
- Distributed Computing Models
- General-Purpose Models
- Dedicated Distributed Computing Models
- Introduction to Distributed Systems Architecture
- Centralized Versus Decentralized Systems
- Interaction Models
- Communication in a Distributed Setting
- Introduction to Ensemble Methods
- High Versus Low Bias
- Types of Ensemble Methods
- Distributed Training Topologies
- The Challenges of Distributed Machine Learning Systems
- Performance
- Resource Management
- Fault Tolerance
- Privacy
- Portability
- Setting Up Your Local Environment
- Chapters 2-6 Tutorials Environment
- Chapters 7-10 Tutorials Environment
- Summary
- Chapter 2. Introduction to Spark and PySpark
- Apache Spark Architecture
- Intro to PySpark
- Apache Spark Basics
- Software Architecture
- PySpark and Functional Programming
- Executing PySpark Code
- pandas DataFrames Versus Spark DataFrames
- Scikit-Learn Versus MLlib
- Summary
- Chapter 3. Managing the Machine Learning Experiment Lifecycle with MLflow
- Machine Learning Lifecycle Management Requirements
- What Is MLflow?
- Software Components of the MLflow Platform
- Users of the MLflow Platform
- MLflow Components
- MLflow Tracking
- MLflow Projects
- MLflow Models
- MLflow Model Registry
- Using MLflow at Scale
- Summary
- Chapter 4. Data Ingestion, Preprocessing, and Descriptive Statistics
- Data Ingestion with Spark
- Working with Images
- Working with Tabular Data
- Preprocessing Data
- Preprocessing Versus Processing
- Why Preprocess the Data?
- Data Structures
- MLlib Data Types
- Preprocessing with MLlib Transformers
- Preprocessing Image Data
- Save the Data and Avoid the Small Files Problem
- Descriptive Statistics: Getting a Feel for the Data
- Calculating Statistics
- Descriptive Statistics with Spark Summarizer
- Data Skewness
- Correlation
- Summary
- Chapter 5. Feature Engineering
- Features and Their Impact on Models
- MLlib Featurization Tools
- Extractors
- Selectors
- Example: Word2Vec
- The Image Featurization Process
- Understanding Image Manipulation
- Extracting Features with Spark APIs
- The Text Featurization Process
- Bag-of-Words
- TF-IDF
- N-Gram
- Additional Techniques
- Enriching the Dataset
- Summary
- Chapter 6. Training Models with Spark MLlib
- Algorithms
- Supervised Machine Learning
- Classification
- Regression
- Unsupervised Machine Learning
- Frequent Pattern Mining
- Clustering
- Evaluating
- Supervised Evaluators
- Unsupervised Evaluators
- Hyperparameters and Tuning Experiments
- Building a Parameter Grid
- Splitting the Data into Training and Test Sets
- Cross-Validation: A Better Way to Test Your Models
- Machine Learning Pipelines
- Constructing a Pipeline
- How Does Splitting Work with the Pipeline API?
- Persistence
- Summary
- Chapter 7. Bridging Spark and Deep Learning Frameworks
- The Two Clusters Approach
- Implementing a Dedicated Data Access Layer
- Features of a DAL
- Selecting a DAL
- What Is Petastorm?
- SparkDatasetConverter
- Petastorm as a Parquet Store
- Project Hydrogen
- Barrier Execution Mode
- Accelerator-Aware Scheduling
- A Brief Introduction to the Horovod Estimator API
- Summary
- Chapter 8. TensorFlow Distributed Machine Learning Approach
- A Quick Overview of TensorFlow
- What Is a Neural Network?
- TensorFlow Cluster Process Roles and Responsibilities
- Loading Parquet Data into a TensorFlow Dataset
- An Inside Look at TensorFlow's Distributed Machine Learning Strategies
- ParameterServerStrategy
- CentralStorageStrategy: One Machine, Multiple Processors
- MirroredStrategy: One Machine, Multiple Processors, Local Copy
- MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
- TPUStrategy
- What Things Change When You Switch Strategies?
- Training APIs
- Keras API
- Custom Training Loop
- Estimator API
- Putting It All Together
- Troubleshooting
- Summary
- Chapter 9. PyTorch Distributed Machine Learning Approach
- A Quick Overview of PyTorch Basics
- Computation Graph
- PyTorch Mechanics and Concepts
- PyTorch Distributed Strategies for Training Models
- Introduction to PyTorch's Distributed Approach
- Distributed Data-Parallel Training
- RPC-Based Distributed Training
- Communication Topologies in PyTorch (c10d)
- What Can We Do with PyTorch's Low-Level APIs?
- Loading Data with PyTorch and Petastorm
- Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
- The Enigma of Mismatched Data Types
- The Mystery of Straggling Workers
- How Does PyTorch Differ from TensorFlow?
- Summary
- Chapter 10. Deployment Patterns for Machine Learning Models
- Deployment Patterns
- Pattern 1: Batch Prediction
- Pattern 2: Model-in-Service
- Pattern 3: Model-as-a-Service
- Determining Which Pattern to Use
- Production Software Requirements
- Monitoring Machine Learning Models in Production
- Data Drift
- Model Drift, Concept Drift
- Distributional Domain Shift (the Long Tail)
- What Metrics Should I Monitor in Production?
- How Do I Measure Changes Using My Monitoring System?
- What It Looks Like in Production
- The Production Feedback Loop
- Deploying with MLlib
- Production Machine Learning Pipelines with Structured Streaming
- Deploying with MLflow
- Defining an MLflow Wrapper
- Deploying the Model as a Microservice
- Loading the Model as a Spark UDF
- How to Develop Your System Iteratively
- Summary
- Index
- About the Author
- Colophon
System Requirements
File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many more (not Kindle).
The PDF format displays each page of the book identically on any hardware. This makes a PDF well suited to complex layouts such as those used in textbooks and technical books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, however, PDFs can be tedious to read because they require a lot of scrolling.
Adobe DRM is a "hard" form of copy protection: if the necessary prerequisites are not met, you will not be able to open the e-book. You must therefore prepare your reading hardware before downloading.
Please note: after installing the reading software, we strongly recommend authorizing it with your personal Adobe ID!
Further information can be found in our e-book help.