Apache Mahout Essentials

Packt Publishing Limited
  • 1st edition
  • published on 19 June 2015
  • 164 pages
E-book | ePUB with Adobe DRM | System requirements
978-1-78355-500-0 (ISBN)
Apache Mahout is a scalable machine learning library with algorithms for clustering, classification, and recommendations. It empowers users to analyze patterns in large, diverse, and complex datasets faster and more scalably.
This book is an all-inclusive guide to analyzing large and complex datasets using Apache Mahout. It explains complicated but very effective machine learning algorithms simply, in relation to real-world practical examples.
Starting from the fundamental concepts of machine learning and Apache Mahout, this book guides you through Apache Mahout's implementations of machine learning techniques including classification, clustering, and recommendations. During this exciting walkthrough, real-world applications, a diverse range of popular algorithms and their implementations, code examples, evaluation strategies, and best practices are given for each technique. Finally, you will learn data visualization techniques for Apache Mahout to bring your data to life.
  • English
  • Birmingham
  • Great Britain
978-1-78355-500-0 (9781783555000)
1783555009 (1783555009)
Jayani Withanawasam is an R&D engineer and senior software engineer at Zaizi Asia, where she focuses on applying machine learning techniques to provide smart content management solutions.
She is currently pursuing an MSc degree in artificial intelligence at the University of Moratuwa, Sri Lanka, and completed her BE in software engineering (with first-class honors) at the University of Westminster, UK.
She has more than 6 years of industry experience, and she has worked in areas such as machine learning, natural language processing, and semantic web technologies during her tenure.
She is passionate about working with semantic technologies and big data.
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Table of Contents
  • Preface
  • Chapter 1: Introducing Apache Mahout
  • Machine learning in a nutshell
  • Features
  • Supervised learning versus unsupervised learning
  • Machine learning applications
  • Information retrieval
  • Business
  • Market segmentation (clustering)
  • Stock market predictions (regression)
  • Health care
  • Using a mammogram for cancer tissue detection
  • Machine learning libraries
  • Open source or commercial
  • Scalability
  • Languages used
  • Algorithm support
  • Batch processing versus stream processing
  • The story so far
  • Apache Mahout
  • Setting up Apache Mahout
  • How does Apache Mahout work?
  • High-level design
  • Distribution
  • From Hadoop MapReduce to Spark
  • Problems with Hadoop MapReduce
  • In-memory data processing with Spark and H2O
  • Why is Mahout shifting from Hadoop MapReduce to Spark?
  • When is it appropriate to use Apache Mahout?
  • Summary
  • Chapter 2: Clustering
  • Unsupervised learning and clustering
  • Applications of clustering
  • Computer vision and image processing
  • Types of clustering
  • Hard clustering versus soft clustering
  • Flat clustering versus hierarchical clustering
  • Model-based clustering
  • K-Means clustering
  • Getting your hands dirty!
  • Running K-Means using Java programming
  • Data preparation
  • Understanding important parameters
  • Cluster visualization
  • Distance measure
  • Writing a custom distance measure
  • K-Means clustering with MapReduce
  • MapReduce in Apache Mahout
  • The map function
  • The reduce function
  • Additional clustering algorithms
  • Canopy clustering
  • Fuzzy K-Means
  • Streaming K-Means
  • The streaming step
  • The ball K-Means step
  • Spectral clustering
  • Dirichlet clustering
  • Text clustering
  • The vector space model and TF-IDF
  • N-grams and collocations
  • Preprocessing text with Lucene
  • Text clustering with the K-Means algorithm
  • Topic modeling
  • Optimizing clustering performance
  • Selecting the right features
  • Selecting the right algorithms
  • Selecting the right distance measure
  • Evaluating clusters
  • The initialization of centroids and the number of clusters
  • Tuning up parameters
  • Decision on infrastructure
  • Summary
  • Chapter 3: Regression and Classification
  • Supervised learning
  • Target variables and predictor variables
  • Predictive analytics techniques
  • Regression-based prediction
  • Model-based prediction
  • Tree-based prediction
  • Classification versus regression
  • Linear regression with Apache Spark
  • How does linear regression work?
  • A real-world example
  • The impact of smoking on mortality and different diseases
  • Linear regression with one variable and multiple variables
  • The integration of Apache Spark
  • Setting up Apache Spark with Apache Mahout
  • An example script
  • Distributed row matrix
  • An explanation of the code
  • Mahout references
  • The bias-variance trade-off
  • How to avoid over-fitting and under-fitting
  • Logistic regression with SGD
  • Logistic functions
  • Minimizing the cost function
  • Multinomial logistic regression versus binary logistic regression
  • A real-world example
  • An example script
  • Testing and evaluation
  • The confusion matrix
  • The area under the curve
  • The Naïve Bayes algorithm
  • The Bayes theorem
  • Text classification
  • Naive assumption and its pros and cons in text classification
  • Improvements that Apache Mahout has made to the Naïve Bayes classification
  • A text classification coding example using the 20 newsgroups example
  • Understand the 20 newsgroups dataset
  • Text classification using Naïve Bayes - A MapReduce implementation with Hadoop
  • Text classification using Naïve Bayes - the Spark implementation
  • The Markov chain
  • Hidden Markov Model
  • A real world example - Developing a POS tagger using HMM supervised learning
  • POS tagging
  • HMM for POS tagging
  • HMM implementation in Apache Mahout
  • HMM supervised learning
  • The important parameters
  • Returns
  • The Baum Welch Algorithm
  • A code example
  • The important parameters
  • The Viterbi evaluator
  • The Apache Mahout references
  • Summary
  • Chapter 4: Recommendations
  • Collaborative versus content-based filtering
  • Content-based filtering
  • Collaborative filtering
  • Hybrid filtering
  • User-based recommenders
  • A real-world example - movie recommendations
  • Data models
  • The similarity measure
  • The neighborhood
  • Recommenders
  • Evaluation techniques
  • The IR-based method (precision/recall)
  • Addressing the issues with inaccurate recommendation results
  • Item-based recommenders
  • Item-based recommenders with Spark
  • Matrix factorization-based recommenders
  • Alternating least squares
  • Singular value decomposition
  • Algorithm usage tips and tricks
  • Summary
  • Chapter 5: Apache Mahout in Production
  • Introduction
  • Apache Mahout with Hadoop
  • YARN with MapReduce 2.0
  • The resource manager
  • The application manager
  • A node manager
  • The application master
  • Containers
  • Managing storage with HDFS
  • The life cycle of a Hadoop application
  • Setting up Hadoop
  • Setting up Mahout in local mode
  • Prerequisites
  • Setting up Mahout in Hadoop distributed mode
  • Prerequisites
  • The pseudo-distributed mode
  • The fully-distributed mode
  • Monitoring Hadoop
  • Commands/scripts
  • Data nodes
  • Node managers
  • Web UIs
  • Setting up Mahout with Hadoop's fully-distributed mode
  • Troubleshooting Hadoop
  • Optimization tips
  • Summary
  • Chapter 6: Visualization
  • The significance of visualization in machine learning
  • D3.js
  • A visualization example for K-Means clustering
  • Summary
  • Index

File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)


Computer (Windows; MacOS X; Linux): Install the free software Adobe Digital Editions before downloading (see E-Book Help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see E-Book Help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many more (not Kindle)

The EPUB file format is very well suited to novels and non-fiction, that is, to "flowing" text without a complex layout. On e-readers or smartphones, line and page breaks adapt automatically to the small displays. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the e-book. You therefore need to prepare your reading hardware before downloading.

Further information is available in our E-Book Help.

Download (available immediately)

€21.81
incl. 19% VAT
Download / single license
