Mastering Machine Learning with Spark 2.x

Harness the potential of machine learning, through spark

Alex Tellez, Max Pumperla, Michal Malohlava (Authors)

Packt Publishing

Published on 8. July 2025

340 pages

E-Book: ePUB with Adobe-DRM

E-Book: PDF with Adobe-DRM

978-1-78528-241-6 (ISBN)

from €45.59

Available for download

Persons

Malohlava Michal:
Michal Malohlava, creator of Sparkling Water, is a geek and developer, and a Java, Linux, and programming languages enthusiast who has been developing software for over 10 years. He obtained his PhD from Charles University in Prague in 2012 and completed a postdoctorate at Purdue University. During his studies, he was interested in the construction of not only distributed but also embedded and real-time component-based systems, using model-driven methods and domain-specific languages. He participated in the design and development of various systems, including the SOFA and Fractal component systems and the jPapabench control system. Now, his main interest is big data computation. He participates in the development of the H2O platform for advanced big data math and computation, and in its embedding into the Spark engine, published as a project called Sparkling Water.

Tellez Alex:
Alex Tellez is a lifelong data hacker and enthusiast with a passion for data science and its application to business problems. He has a wealth of experience working across multiple industries, including banking, health care, online dating, human resources, and online gaming. Alex has also given multiple talks at various AI/machine learning conferences, in addition to lectures at universities about neural networks. When he's not neck-deep in a textbook, Alex enjoys spending time with family, riding bikes, and using machine learning to feed his French wine curiosity!

Pumperla Max:
Max Pumperla is a data scientist and engineer specializing in deep learning and its applications. He currently works as a deep learning engineer at Skymind and is a co-founder of aetros.com. Max is the author and maintainer of several Python packages, including elephas, a distributed deep learning library using Spark. His open source footprint includes contributions to many popular machine learning libraries, such as keras, deeplearning4j, and hyperopt. He holds a PhD in algebraic geometry from the University of Hamburg.

Content

Cover
Copyright
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Large-Scale Machine Learning and Spark
Data science
The sexiest role of the 21st century - data scientist?
A day in the life of a data scientist
Working with big data
The machine learning algorithm using a distributed environment
Splitting of data into multiple machines
From Hadoop MapReduce to Spark
What is Databricks?
Inside the box
Introducing H2O.ai
Design of Sparkling Water
What's the difference between H2O and Spark's MLlib?
Data munging
Data science - an iterative process
Summary
Chapter 2: Detecting Dark Matter - The Higgs-Boson Particle
Type I versus type II error
Finding the Higgs-Boson particle
The LHC and data creation
The theory behind the Higgs-Boson
Measuring for the Higgs-Boson
The dataset
Spark start and data load
Labeled point vector
Data caching
Creating a training and testing set
What about cross-validation?
Our first model - decision tree
Gini versus Entropy
Next model - tree ensembles
Random forest model
Grid search
Gradient boosting machine
Last model - H2O deep learning
Build a 3-layer DNN
Adding more layers
Building models and inspecting results
Summary
Chapter 3: Ensemble Methods for Multi-Class Classification
Data
Modeling goal
Challenges
Machine learning workflow
Starting Spark shell
Exploring data
Missing data
Summary of missing value analysis
Data unification
Missing values
Categorical values
Final transformation
Modelling data with Random Forest
Building a classification model using Spark RandomForest
Classification model evaluation
Spark model metrics
Building a classification model using H2O RandomForest
Summary
Chapter 4: Predicting Movie Reviews Using NLP and Spark Streaming
NLP - a brief primer
The dataset
Dataset preparation
Feature extraction
Feature extraction method - bag-of-words model
Text tokenization
Declaring our stopwords list
Stemming and lemmatization
Featurization - feature hashing
Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
Let's do some (model) training!
Spark decision tree model
Spark Naive Bayes model
Spark random forest model
Spark GBM model
Super-learner model
Super learner
Composing all transformations together
Using the super-learner model
Summary
Chapter 5: Word2vec for Prediction and Clustering
Motivation of word vectors
Word2vec explained
What is a word vector?
The CBOW model
The skip-gram model
Fun with word vectors
Cosine similarity
Doc2vec explained
The distributed-memory model
The distributed bag-of-words model
Applying word2vec and exploring our data with vectors
Creating document vectors
Supervised learning task
Summary
Chapter 6: Extracting Patterns from Clickstream Data
Frequent pattern mining
Pattern mining terminology
Frequent pattern mining problem
The association rule mining problem
The sequential pattern mining problem
Pattern mining with Spark MLlib
Frequent pattern mining with FP-growth
Association rule mining
Sequential pattern mining with prefix span
Pattern mining on MSNBC clickstream data
Deploying a pattern mining application
The Spark Streaming module
Summary
Chapter 7: Graph Analytics with GraphX
Basic graph theory
Graphs
Directed and undirected graphs
Order and degree
Directed acyclic graphs
Connected components
Trees
Multigraphs
Property graphs
GraphX distributed graph processing engine
Graph representation in GraphX
Graph properties and operations
Building and loading graphs
Visualizing graphs with Gephi
Gephi
Creating GEXF files from GraphX graphs
Advanced graph processing
Aggregating messages
Pregel
GraphFrames
Graph algorithms and applications
Clustering
Vertex importance
GraphX in context
Summary
Chapter 8: Lending Club Loan Prediction
Motivation
Goal
Data
Data dictionary
Preparation of the environment
Data load
Exploration - data analysis
Basic clean up
Useless columns
String columns
Loan progress columns
Categorical columns
Text columns
Missing data
Prediction targets
Loan status model
Base model
The emp_title column transformation
The desc column transformation
Interest rate model
Using models for scoring
Model deployment
Stream creation
Stream transformation
Stream output
Summary
Index

System requirements

File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).

The ePUB file format works well for novels and non-fiction books, i.e. "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small display.
This eBook uses Adobe-DRM, a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook, so prepare your reading hardware before downloading.

Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.

For more information, see our eBook Help page.

File format: PDF
Copy protection: Adobe-DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).

The PDF file format always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs are awkward to read and require a lot of scrolling.
This eBook uses Adobe-DRM, a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook, so prepare your reading hardware before downloading.

Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.

For more information, see our eBook Help page.

Schweitzer Fachinformationen
