
Spark: The Definitive Guide
Description
Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.
You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.
- Get a gentle overview of big data and Spark
- Learn about DataFrames, SQL, and Datasets (Spark's core APIs) through worked examples
- Dive into Spark's low-level APIs, RDDs, and execution of SQL and DataFrames
- Understand how Spark runs on a cluster
- Debug, monitor, and tune Spark clusters and applications
- Learn the power of Structured Streaming, Spark's stream-processing engine
- Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Content
- Cover
- Copyright
- Table of Contents
- Preface
- About the Authors
- Who This Book Is For
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- Part I. Gentle Overview of Big Data and Spark
- Chapter 1. What Is Apache Spark?
- Apache Spark's Philosophy
- Context: The Big Data Problem
- History of Spark
- The Present and Future of Spark
- Running Spark
- Downloading Spark Locally
- Launching Spark's Interactive Consoles
- Running Spark in the Cloud
- Data Used in This Book
- Chapter 2. A Gentle Introduction to Spark
- Spark's Basic Architecture
- Spark Applications
- Spark's Language APIs
- Spark's APIs
- Starting Spark
- The SparkSession
- DataFrames
- Partitions
- Transformations
- Lazy Evaluation
- Actions
- Spark UI
- An End-to-End Example
- DataFrames and SQL
- Conclusion
- Chapter 3. A Tour of Spark's Toolset
- Running Production Applications
- Datasets: Type-Safe Structured APIs
- Structured Streaming
- Machine Learning and Advanced Analytics
- Lower-Level APIs
- SparkR
- Spark's Ecosystem and Packages
- Conclusion
- Part II. Structured APIs: DataFrames, SQL, and Datasets
- Chapter 4. Structured API Overview
- DataFrames and Datasets
- Schemas
- Overview of Structured Spark Types
- DataFrames Versus Datasets
- Columns
- Rows
- Spark Types
- Overview of Structured API Execution
- Logical Planning
- Physical Planning
- Execution
- Conclusion
- Chapter 5. Basic Structured Operations
- Schemas
- Columns and Expressions
- Columns
- Expressions
- Records and Rows
- Creating Rows
- DataFrame Transformations
- Creating DataFrames
- select and selectExpr
- Converting to Spark Types (Literals)
- Adding Columns
- Renaming Columns
- Reserved Characters and Keywords
- Case Sensitivity
- Removing Columns
- Changing a Column's Type (cast)
- Filtering Rows
- Getting Unique Rows
- Random Samples
- Random Splits
- Concatenating and Appending Rows (Union)
- Sorting Rows
- Limit
- Repartition and Coalesce
- Collecting Rows to the Driver
- Conclusion
- Chapter 6. Working with Different Types of Data
- Where to Look for APIs
- Converting to Spark Types
- Working with Booleans
- Working with Numbers
- Working with Strings
- Regular Expressions
- Working with Dates and Timestamps
- Working with Nulls in Data
- Coalesce
- ifnull, nullIf, nvl, and nvl2
- drop
- fill
- replace
- Ordering
- Working with Complex Types
- Structs
- Arrays
- split
- Array Length
- array_contains
- explode
- Maps
- Working with JSON
- User-Defined Functions
- Conclusion
- Chapter 7. Aggregations
- Aggregation Functions
- count
- countDistinct
- approx_count_distinct
- first and last
- min and max
- sum
- sumDistinct
- avg
- Variance and Standard Deviation
- skewness and kurtosis
- Covariance and Correlation
- Aggregating to Complex Types
- Grouping
- Grouping with Expressions
- Grouping with Maps
- Window Functions
- Grouping Sets
- Rollups
- Cube
- Grouping Metadata
- Pivot
- User-Defined Aggregation Functions
- Conclusion
- Chapter 8. Joins
- Join Expressions
- Join Types
- Inner Joins
- Outer Joins
- Left Outer Joins
- Right Outer Joins
- Left Semi Joins
- Left Anti Joins
- Natural Joins
- Cross (Cartesian) Joins
- Challenges When Using Joins
- Joins on Complex Types
- Handling Duplicate Column Names
- How Spark Performs Joins
- Communication Strategies
- Conclusion
- Chapter 9. Data Sources
- The Structure of the Data Sources API
- Read API Structure
- Basics of Reading Data
- Write API Structure
- Basics of Writing Data
- CSV Files
- CSV Options
- Reading CSV Files
- Writing CSV Files
- JSON Files
- JSON Options
- Reading JSON Files
- Writing JSON Files
- Parquet Files
- Reading Parquet Files
- Writing Parquet Files
- ORC Files
- Reading ORC Files
- Writing ORC Files
- SQL Databases
- Reading from SQL Databases
- Query Pushdown
- Writing to SQL Databases
- Text Files
- Reading Text Files
- Writing Text Files
- Advanced I/O Concepts
- Splittable File Types and Compression
- Reading Data in Parallel
- Writing Data in Parallel
- Writing Complex Types
- Managing File Size
- Conclusion
- Chapter 10. Spark SQL
- What Is SQL?
- Big Data and SQL: Apache Hive
- Big Data and SQL: Spark SQL
- Spark's Relationship to Hive
- How to Run Spark SQL Queries
- Spark SQL CLI
- Spark's Programmatic SQL Interface
- SparkSQL Thrift JDBC/ODBC Server
- Catalog
- Tables
- Spark-Managed Tables
- Creating Tables
- Creating External Tables
- Inserting into Tables
- Describing Table Metadata
- Refreshing Table Metadata
- Dropping Tables
- Caching Tables
- Views
- Creating Views
- Dropping Views
- Databases
- Creating Databases
- Setting the Database
- Dropping Databases
- Select Statements
- case.when.then Statements
- Advanced Topics
- Complex Types
- Functions
- Subqueries
- Miscellaneous Features
- Configurations
- Setting Configuration Values in SQL
- Conclusion
- Chapter 11. Datasets
- When to Use Datasets
- Creating Datasets
- In Java: Encoders
- In Scala: Case Classes
- Actions
- Transformations
- Filtering
- Mapping
- Joins
- Grouping and Aggregations
- Conclusion
- Part III. Low-Level APIs
- Chapter 12. Resilient Distributed Datasets (RDDs)
- What Are the Low-Level APIs?
- When to Use the Low-Level APIs?
- How to Use the Low-Level APIs?
- About RDDs
- Types of RDDs
- When to Use RDDs?
- Datasets and RDDs of Case Classes
- Creating RDDs
- Interoperating Between DataFrames, Datasets, and RDDs
- From a Local Collection
- From Data Sources
- Manipulating RDDs
- Transformations
- distinct
- filter
- map
- sort
- Random Splits
- Actions
- reduce
- count
- first
- max and min
- take
- Saving Files
- saveAsTextFile
- SequenceFiles
- Hadoop Files
- Caching
- Checkpointing
- Pipe RDDs to System Commands
- mapPartitions
- foreachPartition
- glom
- Conclusion
- Chapter 13. Advanced RDDs
- Key-Value Basics (Key-Value RDDs)
- keyBy
- Mapping over Values
- Extracting Keys and Values
- lookup
- sampleByKey
- Aggregations
- countByKey
- Understanding Aggregation Implementations
- Other Aggregation Methods
- CoGroups
- Joins
- Inner Join
- zips
- Controlling Partitions
- coalesce
- repartition
- repartitionAndSortWithinPartitions
- Custom Partitioning
- Custom Serialization
- Conclusion
- Chapter 14. Distributed Shared Variables
- Broadcast Variables
- Accumulators
- Basic Example
- Custom Accumulators
- Conclusion
- Part IV. Production Applications
- Chapter 15. How Spark Runs on a Cluster
- The Architecture of a Spark Application
- Execution Modes
- The Life Cycle of a Spark Application (Outside Spark)
- Client Request
- Launch
- Execution
- Completion
- The Life Cycle of a Spark Application (Inside Spark)
- The SparkSession
- Logical Instructions
- A Spark Job
- Stages
- Tasks
- Execution Details
- Pipelining
- Shuffle Persistence
- Conclusion
- Chapter 16. Developing Spark Applications
- Writing Spark Applications
- A Simple Scala-Based App
- Writing Python Applications
- Writing Java Applications
- Testing Spark Applications
- Strategic Principles
- Tactical Takeaways
- Connecting to Unit Testing Frameworks
- Connecting to Data Sources
- The Development Process
- Launching Applications
- Application Launch Examples
- Configuring Applications
- The SparkConf
- Application Properties
- Runtime Properties
- Execution Properties
- Configuring Memory Management
- Configuring Shuffle Behavior
- Environmental Variables
- Job Scheduling Within an Application
- Conclusion
- Chapter 17. Deploying Spark
- Where to Deploy Your Cluster to Run Spark Applications
- On-Premises Cluster Deployments
- Spark in the Cloud
- Cluster Managers
- Standalone Mode
- Spark on YARN
- Configuring Spark on YARN Applications
- Spark on Mesos
- Secure Deployment Configurations
- Cluster Networking Configurations
- Application Scheduling
- Miscellaneous Considerations
- Conclusion
- Chapter 18. Monitoring and Debugging
- The Monitoring Landscape
- What to Monitor
- Driver and Executor Processes
- Queries, Jobs, Stages, and Tasks
- Spark Logs
- The Spark UI
- Spark REST API
- Spark UI History Server
- Debugging and Spark First Aid
- Spark Jobs Not Starting
- Errors Before Execution
- Errors During Execution
- Slow Tasks or Stragglers
- Slow Aggregations
- Slow Joins
- Slow Reads and Writes
- Driver OutOfMemoryError or Driver Unresponsive
- Executor OutOfMemoryError or Executor Unresponsive
- Unexpected Nulls in Results
- No Space Left on Disk Errors
- Serialization Errors
- Conclusion
- Chapter 19. Performance Tuning
- Indirect Performance Enhancements
- Design Choices
- Object Serialization in RDDs
- Cluster Configurations
- Scheduling
- Data at Rest
- Shuffle Configurations
- Memory Pressure and Garbage Collection
- Direct Performance Enhancements
- Parallelism
- Improved Filtering
- Repartitioning and Coalescing
- User-Defined Functions (UDFs)
- Temporary Data Storage (Caching)
- Joins
- Aggregations
- Broadcast Variables
- Conclusion
- Part V. Streaming
- Chapter 20. Stream Processing Fundamentals
- What Is Stream Processing?
- Stream Processing Use Cases
- Advantages of Stream Processing
- Challenges of Stream Processing
- Stream Processing Design Points
- Record-at-a-Time Versus Declarative APIs
- Event Time Versus Processing Time
- Continuous Versus Micro-Batch Execution
- Spark's Streaming APIs
- The DStream API
- Structured Streaming
- Conclusion
- Chapter 21. Structured Streaming Basics
- Structured Streaming Basics
- Core Concepts
- Transformations and Actions
- Input Sources
- Sinks
- Output Modes
- Triggers
- Event-Time Processing
- Structured Streaming in Action
- Transformations on Streams
- Selections and Filtering
- Aggregations
- Joins
- Input and Output
- Where Data Is Read and Written (Sources and Sinks)
- Reading from the Kafka Source
- Writing to the Kafka Sink
- How Data Is Output (Output Modes)
- When Data Is Output (Triggers)
- Streaming Dataset API
- Conclusion
- Chapter 22. Event-Time and Stateful Processing
- Event Time
- Stateful Processing
- Arbitrary Stateful Processing
- Event-Time Basics
- Windows on Event Time
- Tumbling Windows
- Handling Late Data with Watermarks
- Dropping Duplicates in a Stream
- Arbitrary Stateful Processing
- Time-Outs
- Output Modes
- mapGroupsWithState
- flatMapGroupsWithState
- Conclusion
- Chapter 23. Structured Streaming in Production
- Fault Tolerance and Checkpointing
- Updating Your Application
- Updating Your Streaming Application Code
- Updating Your Spark Version
- Sizing and Rescaling Your Application
- Metrics and Monitoring
- Query Status
- Recent Progress
- Spark UI
- Alerting
- Advanced Monitoring with the Streaming Listener
- Conclusion
- Part VI. Advanced Analytics and Machine Learning
- Chapter 24. Advanced Analytics and Machine Learning Overview
- A Short Primer on Advanced Analytics
- Supervised Learning
- Recommendation
- Unsupervised Learning
- Graph Analytics
- The Advanced Analytics Process
- Spark's Advanced Analytics Toolkit
- What Is MLlib?
- High-Level MLlib Concepts
- MLlib in Action
- Feature Engineering with Transformers
- Estimators
- Pipelining Our Workflow
- Training and Evaluation
- Persisting and Applying Models
- Deployment Patterns
- Conclusion
- Chapter 25. Preprocessing and Feature Engineering
- Formatting Models According to Your Use Case
- Transformers
- Estimators for Preprocessing
- Transformer Properties
- High-Level Transformers
- RFormula
- SQL Transformers
- VectorAssembler
- Working with Continuous Features
- Bucketing
- Scaling and Normalization
- StandardScaler
- Working with Categorical Features
- StringIndexer
- Converting Indexed Values Back to Text
- Indexing in Vectors
- One-Hot Encoding
- Text Data Transformers
- Tokenizing Text
- Removing Common Words
- Creating Word Combinations
- Converting Words into Numerical Representations
- Word2Vec
- Feature Manipulation
- PCA
- Interaction
- Polynomial Expansion
- Feature Selection
- ChiSqSelector
- Advanced Topics
- Persisting Transformers
- Writing a Custom Transformer
- Conclusion
- Chapter 26. Classification
- Use Cases
- Types of Classification
- Binary Classification
- Multiclass Classification
- Multilabel Classification
- Classification Models in MLlib
- Model Scalability
- Logistic Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Model Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Random Forest and Gradient-Boosted Trees
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Naive Bayes
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Evaluators for Classification and Automating Model Tuning
- Detailed Evaluation Metrics
- One-vs-Rest Classifier
- Multilayer Perceptron
- Conclusion
- Chapter 27. Regression
- Use Cases
- Regression Models in MLlib
- Model Scalability
- Linear Regression
- Model Hyperparameters
- Training Parameters
- Example
- Training Summary
- Generalized Linear Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Training Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Example
- Random Forests and Gradient-Boosted Trees
- Model Hyperparameters
- Training Parameters
- Example
- Advanced Methods
- Survival Regression (Accelerated Failure Time)
- Isotonic Regression
- Evaluators and Automating Model Tuning
- Metrics
- Conclusion
- Chapter 28. Recommendation
- Use Cases
- Collaborative Filtering with Alternating Least Squares
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Evaluators for Recommendation
- Metrics
- Regression Metrics
- Ranking Metrics
- Frequent Pattern Mining
- Conclusion
- Chapter 29. Unsupervised Learning
- Use Cases
- Model Scalability
- k-means
- Model Hyperparameters
- Training Parameters
- Example
- k-means Metrics Summary
- Bisecting k-means
- Model Hyperparameters
- Training Parameters
- Example
- Bisecting k-means Summary
- Gaussian Mixture Models
- Model Hyperparameters
- Training Parameters
- Example
- Gaussian Mixture Model Summary
- Latent Dirichlet Allocation
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Conclusion
- Chapter 30. Graph Analytics
- Building a Graph
- Querying the Graph
- Subgraphs
- Motif Finding
- Graph Algorithms
- PageRank
- In-Degree and Out-Degree Metrics
- Breadth-First Search
- Connected Components
- Strongly Connected Components
- Advanced Tasks
- Conclusion
- Chapter 31. Deep Learning
- What Is Deep Learning?
- Ways of Using Deep Learning in Spark
- Deep Learning Libraries
- MLlib Neural Network Support
- TensorFrames
- BigDL
- TensorFlowOnSpark
- DeepLearning4J
- Deep Learning Pipelines
- A Simple Example with Deep Learning Pipelines
- Setup
- Images and DataFrames
- Transfer Learning
- Applying Popular Models
- Conclusion
- Part VII. Ecosystem
- Chapter 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
- PySpark
- Fundamental PySpark Differences
- Pandas Integration
- R on Spark
- SparkR
- sparklyr
- Conclusion
- Chapter 33. Ecosystem and Community
- Spark Packages
- An Abridged List of Popular Packages
- Using Spark Packages
- External Packages
- Community
- Spark Summit
- Local Meetups
- Conclusion
- Index
- About the Authors
- Colophon
System requirements
File format: PDF
Copy-Protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).
The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will not be able to open the eBook, so you will need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise your reading software with your personal Adobe ID after installing it.
For more information, see our eBook Help page.