Scala: Guide for Data Science Professionals

Name: Scala: Guide for Data Science Professionals
Brand: Packt Publishing
Price: 88.49 EUR
Availability: OnlineOnly

Pascal Bugnion (Author)

Packt Publishing

Published on 24 February 2017

1100 pages

E-Book

ePUB with Adobe-DRM

978-1-78728-103-5 (ISBN)

€88.49 incl. 7% VAT

E-Book Single Licence

Available for download

Description

Scala will be a valuable tool to have on hand during your data science journey, for everything from data cleaning to cutting-edge machine learning.

About This Book

Build data science and data engineering solutions with ease
An in-depth look at each stage of the data analysis process, from reading and collecting data to distributed analytics
Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulations, and source code

Who This Book Is For

This learning path is perfect for those who are comfortable with Scala programming and now want to enter the field of data science. Some knowledge of statistics is expected.

What You Will Learn

Transfer and filter tabular data to extract features for machine learning
Read, clean, transform, and write data to both SQL and NoSQL databases
Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
Load data from HDFS and Hive with ease
Run streaming and graph analytics in Spark for exploratory analysis
Bundle and scale up Spark jobs by deploying them into a variety of cluster managers
Build dynamic workflows for scientific computing
Leverage open source libraries to extract patterns from time series
Master probabilistic models for sequential data

In Detail

Scala is especially good for analyzing large sets of data, as the scale of the task doesn't have any significant impact on performance. Scala's powerful functional libraries can interact with databases and build scalable frameworks, resulting in the creation of robust data pipelines.

The first module introduces you to Scala libraries to ingest, store, manipulate, process, and visualize data. Using real-world examples, you will learn how to design scalable architecture to process and model data, starting from simple concurrency constructs and progressing to actor systems and Apache Spark. After this, you will also learn how to build interactive visualizations with web frameworks.

Once you have become familiar with all the tasks involved in data science, you will explore data analytics with Scala in the second module. You'll see how Scala can be used to make sense of data through easy-to-follow recipes. You will learn about Bokeh bindings for exploratory data analysis and quintessential machine learning algorithms with the Spark ML library. You'll get a sufficient understanding of Spark Streaming, machine learning for streaming data, and Spark GraphX.

Armed with a firm understanding of data analysis, you will be ready to explore the most cutting-edge aspect of data science: machine learning. The final module teaches you the A to Z of machine learning with Scala. You'll explore Scala for dependency injections and implicits, which are used to write machine learning algorithms. You'll also explore machine learning topics such as clustering, dimensionality reduction, Naive Bayes, regression models, SVMs, neural networks, and more.

This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:

Scala for Data Science, Pascal Bugnion
Scala Data Analysis Cookbook, Arun Manivannan
Scala for Machine Learning, Patrick R. Nicolas

Style and approach

A complete package with all the information necessary to start building useful data engineering and data science solutions straight away. It contains a diverse set of recipes that cover the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala.

Content

Cover
Copyright
Credits
Preface
Table of Contents
Module 1: Scala for Data Science
Chapter 1: Scala and Data Science
Data science
Programming in data science
Why Scala?
When not to use Scala
Summary
References
Chapter 2: Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
An example - logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
Chapter 3: Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example - scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
Chapter 4: Parallel Collections and Futures
Parallel collections
Futures
Summary
References
Chapter 5: Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Creating a data access layer
Summary
References
Chapter 6: Slick - A Functional Interface for SQL
FEC data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References
Chapter 7: Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala - an exercise in pattern matching
Extraction using case classes
Concurrency and exception handling with futures
Authentication - adding HTTP headers
Summary
References
Chapter 8: Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
Chapter 9: Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
Chapter 10: Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
Building and running standalone programs
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
Chapter 11: Spark SQL and DataFrames
DataFrames - a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types - arrays, maps, and structs
Interacting with data sources
Standalone programs
Summary
References
Chapter 12: Distributed Machine Learning with MLlib
Introducing MLlib - Spam classification
Pipeline components
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
Chapter 13: Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Interacting with JSON
Querying external APIs and consuming JSON
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References
Chapter 14: Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Drawing plots with NVD3
Summary
References
Appendix: Pattern Matching and Extractors
Pattern matching in for comprehensions
Pattern matching internals
Extracting sequences
Summary
Reference
Module 2: Scala Data Analysis Cookbook
Chapter 1: Getting Started with Breeze
Introduction
Getting Breeze - the linear algebra library
Working with vectors
Working with matrices
Vectors and matrices with randomly distributed values
Reading and writing CSV files
Chapter 2: Getting Started with Apache Spark DataFrames
Introduction
Getting Apache Spark
Creating a DataFrame from CSV
Manipulating DataFrames
Creating a DataFrame from Scala case classes
Chapter 3: Loading and Preparing Data - DataFrame
Introduction
Loading more than 22 features into classes
Loading JSON into DataFrames
Storing data as Parquet files
Using the Avro data model in Parquet
Loading from RDBMS
Preparing data in DataFrames
Chapter 4: Data Visualization
Introduction
Visualizing using Zeppelin
Creating scatter plots with Bokeh-Scala
Creating a time series MultiPlot with Bokeh-Scala
Chapter 5: Learning from Data
Introduction
Supervised and unsupervised learning
Gradient descent
Predicting continuous values using linear regression
Binary classification using LogisticRegression and SVM
Binary classification using LogisticRegression with Pipeline API
Clustering using K-means
Feature reduction using principal component analysis
Chapter 6: Scaling Up
Introduction
Building the Uber JAR
Submitting jobs to the Spark cluster (local)
Running the Spark Standalone cluster on EC2
Running the Spark Job on Mesos (local)
Running the Spark Job on YARN (local)
Chapter 7: Going Further
Introduction
Using Spark Streaming to subscribe to a Twitter stream
Using Spark as an ETL tool
Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
Using GraphX to analyze Twitter data
Module 3: Scala for Machine Learning
Chapter 1: Getting Started
Mathematical notation for the curious
Why machine learning?
Why Scala?
Model categorization
Taxonomy of machine learning algorithms
Tools and frameworks
Source code
Let's kick the tires
Summary
Chapter 2: Hello World!
Modeling
Designing a workflow
Assessing a model
Summary
Chapter 3: Data Preprocessing
Time series
Moving averages
Fourier analysis
The Kalman filter
Alternative preprocessing techniques
Summary
Chapter 4: Unsupervised Learning
Clustering
Dimension reduction
Performance considerations
Summary
Chapter 5: Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Multivariate Bernoulli classification
Naïve Bayes and text mining
Pros and cons
Summary
Chapter 6: Regression and Regularization
Linear regression
Regularization
Numerical optimization
The logistic regression
Summary
Chapter 7: Sequential Data Models
Markov decision processes
The hidden Markov model (HMM)
Conditional random fields
CRF and text analytics
Comparing CRF and HMM
Performance consideration
Summary
Chapter 8: Kernel Models and Support Vector Machines
Kernel functions
The support vector machine (SVM)
Support vector classifier (SVC)
Anomaly detection with one-class SVC
Support vector regression (SVR)
Performance considerations
Summary
Chapter 9: Artificial Neural Networks
Feed-forward neural networks (FFNN)
The multilayer perceptron (MLP)
Evaluation
Benefits and limitations
Summary
Chapter 10: Genetic Algorithms
Evolution
Genetic algorithms and machine learning
Genetic algorithm components
Implementation
GA for trading strategies
Advantages and risks of genetic algorithms
Summary
Chapter 11: Reinforcement Learning
Introduction
Learning classifier systems
Summary
Chapter 12: Scalable Frameworks
Overview
Scala
Scalability with Actors
Akka
Apache Spark
Summary
Appendix A: Basic Concepts
Scala programming
Mathematics
Finances 101
Suggested online courses
References
Bibliography


Schweitzer Fachinformationen
