Architecting Modern Data Platforms

Name: Architecting Modern Data Platforms | A Guide to Enterprise Hadoop at Scale
Brand: O'Reilly
Price: 50.49 EUR
Availability: Online only

A Guide to Enterprise Hadoop at Scale

Jan Kunigk (Author)

O'Reilly (Publisher)

Published on 5 December 2018

636 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-4919-6922-9 (ISBN)

€50.49 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Intro
Copyright
Table of Contents
Foreword
Preface
Some Misconceptions
Some General Trends
Horizontal Scaling
Adoption of Open Source
Embracing Cloud Compute
Decoupled Compute and Storage
What Is This Book About?
Who Should Read This Book?
The Road Ahead
Conventions Used in This Book
O'Reilly Safari
How to Contact Us
Acknowledgments
Chapter 1. Big Data Technology Primer
A Tour of the Landscape
Core Components
Computational Frameworks
Analytical SQL Engines
Storage Engines
Ingestion
Orchestration
Summary
Part I. Infrastructure
Chapter 2. Clusters
Reasons for Multiple Clusters
Multiple Clusters for Resiliency
Multiple Clusters for Software Development
Multiple Clusters for Workload Isolation
Multiple Clusters for Legal Separation
Multiple Clusters and Independent Storage and Compute
Multitenancy
Requirements for Multitenancy
Sizing Clusters
Sizing by Storage
Sizing by Ingest Rate
Sizing by Workload
Cluster Growth
The Drivers of Cluster Growth
Implementing Cluster Growth
Data Replication
Replication for Software Development
Replication and Workload Isolation
Summary
Chapter 3. Compute and Storage
Computer Architecture for Hadoop
Commodity Servers
Server CPUs and RAM
Nonuniform Memory Access
CPU Specifications
RAM
Commoditized Storage Meets the Enterprise
Modularity of Compute and Storage
Everything Is Java
Replication or Erasure Coding?
Alternatives
Hadoop and the Linux Storage Stack
User Space
Important System Calls
The Linux Page Cache
Short-Circuit and Zero-Copy Reads
Filesystems
Erasure Coding Versus Replication
Discussion
Guidance
Low-Level Storage
Storage Controllers
Disk Layer
Server Form Factors
Form Factor Comparison
Guidance
Workload Profiles
Cluster Configurations and Node Types
Master Nodes
Worker Nodes
Utility Nodes
Edge Nodes
Small Cluster Configurations
Medium Cluster Configurations
Large Cluster Configurations
Summary
Chapter 4. Networking
How Services Use a Network
Remote Procedure Calls (RPCs)
Data Transfers
Monitoring
Backup
Consensus
Network Architectures
Small Cluster Architectures
Medium Cluster Architectures
Large Cluster Architectures
Network Integration
Reusing an Existing Network
Creating an Additional Network
Network Design Considerations
Layer 1 Recommendations
Layer 2 Recommendations
Layer 3 Recommendations
Summary
Chapter 5. Organizational Challenges
Who Runs It?
Is It Infrastructure, Middleware, or an Application?
Case Study: A Typical Business Intelligence Project
The Traditional Approach
Typical Team Setup
Compartmentalization of IT
Revised Team Setup for Hadoop in the Enterprise
Solution Overview with Hadoop
New Team Setup
Split Responsibilities
Do I Need DevOps?
Do I Need a Center of Excellence/Competence?
Summary
Chapter 6. Datacenter Considerations
Why Does It Matter?
Basic Datacenter Concepts
Cooling
Power
Network
Rack Awareness and Rack Failures
Failure Domain Alignment
Space and Racking Constraints
Ingest and Intercluster Connectivity
Software
Hardware
Replacements and Repair
Operational Procedures
Typical Pitfalls
Networking
Cluster Spanning
Summary
Part II. Platform
Chapter 7. Provisioning Clusters
Operating Systems
OS Choices
OS Configuration for Hadoop
Automated Configuration Example
Service Databases
Required Databases
Database Integration Options
Database Considerations
Hadoop Deployment
Hadoop Distributions
Installation Choices
Distribution Architecture
Installation Process
Summary
Chapter 8. Platform Validation
Testing Methodology
Useful Tools
Hardware Validation
CPU
Disks
Network
Hadoop Validation
HDFS Validation
General Validation
Validating Other Components
Operations Validation
Summary
Chapter 9. Security
In-Flight Encryption
TLS Encryption
SASL Quality of Protection
Enabling In-Flight Encryption
Authentication
Kerberos
LDAP Authentication
Delegation Tokens
Impersonation
Authorization
Group Resolution
Superusers and Supergroups
Hadoop Service Level Authorization
Centralized Security Management
HDFS
YARN
ZooKeeper
Hive
Impala
HBase
Solr
Kudu
Oozie
Hue
Kafka
Sentry
At-Rest Encryption
Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
HDFS Transparent Data Encryption
Encrypting Temporary Files
Summary
Chapter 10. Integration with Identity Management Providers
Integration Areas
Integration Scenarios
Scenario 1: Writing a File to HDFS
Scenario 2: Submitting a Hive Query
Scenario 3: Running a Spark Job
Integration Providers
LDAP Integration
Background
LDAP Security
Load Balancing
Application Integration
Linux Integration
Kerberos Integration
Kerberos Clients
KDC Integration
Certificate Management
Signing Certificates
Converting Certificates
Wildcard Certificates
Automation
Summary
Chapter 11. Accessing and Interacting with Clusters
Access Mechanisms
Programmatic Access
Command-Line Access
Web UIs
Access Topologies
Interaction Patterns
Proxy Access
Load Balancing
Edge Node Interactions
Access Security
Administration Gateways
Workbenches
Hue
Notebooks
Landing Zones
Summary
Chapter 12. High Availability
High Availability Defined
Lateral/Service HA
Vertical/Systemic HA
Measuring Availability
Percentages
Percentiles
Operating for HA
Monitoring
Playbooks and Postmortems
HA Building Blocks
Quorums
Load Balancing
Database HA
Ancillary Services
General Considerations
Separation of Master and Worker Processes
Separation of Identical Service Roles
Master Servers in Separate Failure Domains
Balanced Master Configurations
Optimized Server Configurations
High Availability of Cluster Services
ZooKeeper
HDFS
YARN
HBase
KMS
Hive
Impala
Solr
Kafka
Oozie
Hue
Other Services
Autoconfiguration
Summary
Chapter 13. Backup and Disaster Recovery
Context
Many Distributed Systems
Policies and Objectives
Failure Scenarios
Suitable Data Sources
Strategies
Data Types
Consistency
Validation
Summary
Data Replication
HBase
Cluster Management Tools
Kafka
Summary
Hadoop Cluster Backups
Subsystems
Case Study: Automating Backups with Oozie
Restore
Summary
Part III. Taking Hadoop to the Cloud
Chapter 14. Basics of Virtualization for Hadoop
Compute Virtualization
Virtual Machine Distribution
Anti-Affinity Groups
Storage Virtualization
Virtualizing Local Storage
SANs
Object Storage and Network-Attached Storage
Network Virtualization
Cluster Life Cycle Models
Summary
Chapter 15. Solutions for Private Clouds
OpenStack
Automation and Integration
Life Cycle and Storage
Isolation
Summary
OpenShift
Automation
Life Cycle and Storage
Isolation
Summary
VMware and Pivotal Cloud Foundry
Do It Yourself?
Automation
Isolation
Life Cycle Model
Summary
Object Storage for Private Clouds
EMC Isilon
Ceph
Summary
Chapter 16. Solutions in the Public Cloud
Key Things to Know
Cloud Providers
AWS
Microsoft Azure
Google Cloud Platform
Implementing Clusters
Instances
Storage and Life Cycle Models
Network Architecture
High Availability
Summary
Chapter 17. Automated Provisioning
Long-Lived Clusters
Configuration and Templating
Deployment Phases
Vendor Solutions
One-Click Deployments
Homegrown Automation
Hooking Into a Provisioning Life Cycle
Scaling Up and Down
Deploying with Security
Transient Clusters
Sharing Metadata Services
Summary
Chapter 18. Security in the Cloud
Assessing the Risk
Risk Model
Environmental Risks
Deployment Risks
Identity Provider Options for Hadoop
Option A: Cloud-Only Self-Contained ID Services
Option B: Cloud-Only Shared ID Services
Option C: On-Premises ID Services
Object Storage Security and Hadoop
Identity and Access Management
Amazon Simple Storage Service
GCP Cloud Storage
Microsoft Azure
Auditing
Encryption for Data at Rest
Requirements for Key Material
Options for Encryption in the Cloud
On-Premises Key Persistence
Encryption via the Cloud Provider
Encryption Feature and Interoperability Summary
Recommendations and Summary for Cloud Encryption
Encrypting Data in Flight in the Cloud
Perimeter Controls and Firewalling
GCP
AWS
Azure
Summary
Appendix A. Backup Onboarding Checklist
Backup Onboarding Checklist
Backup
Services
Cloudera Manager
HDFS
HBase
Hive/Impala
Sqoop
Oozie
Hue
Sentry
Index
About the Authors
Colophon

Schweitzer Fachinformationen