
Architecting Modern Data Platforms
Description
All about e-books | Answers to questions about e-books, copy protection, and file formats can be found in our info and help section.

Content
- Intro
- Copyright
- Table of Contents
- Foreword
- Preface
- Some Misconceptions
- Some General Trends
- Horizontal Scaling
- Adoption of Open Source
- Embracing Cloud Compute
- Decoupled Compute and Storage
- What Is This Book About?
- Who Should Read This Book?
- The Road Ahead
- Conventions Used in This Book
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- Chapter 1. Big Data Technology Primer
- A Tour of the Landscape
- Core Components
- Computational Frameworks
- Analytical SQL Engines
- Storage Engines
- Ingestion
- Orchestration
- Summary
- Part I. Infrastructure
- Chapter 2. Clusters
- Reasons for Multiple Clusters
- Multiple Clusters for Resiliency
- Multiple Clusters for Software Development
- Multiple Clusters for Workload Isolation
- Multiple Clusters for Legal Separation
- Multiple Clusters and Independent Storage and Compute
- Multitenancy
- Requirements for Multitenancy
- Sizing Clusters
- Sizing by Storage
- Sizing by Ingest Rate
- Sizing by Workload
- Cluster Growth
- The Drivers of Cluster Growth
- Implementing Cluster Growth
- Data Replication
- Replication for Software Development
- Replication and Workload Isolation
- Summary
- Chapter 3. Compute and Storage
- Computer Architecture for Hadoop
- Commodity Servers
- Server CPUs and RAM
- Nonuniform Memory Access
- CPU Specifications
- RAM
- Commoditized Storage Meets the Enterprise
- Modularity of Compute and Storage
- Everything Is Java
- Replication or Erasure Coding?
- Alternatives
- Hadoop and the Linux Storage Stack
- User Space
- Important System Calls
- The Linux Page Cache
- Short-Circuit and Zero-Copy Reads
- Filesystems
- Erasure Coding Versus Replication
- Discussion
- Guidance
- Low-Level Storage
- Storage Controllers
- Disk Layer
- Server Form Factors
- Form Factor Comparison
- Guidance
- Workload Profiles
- Cluster Configurations and Node Types
- Master Nodes
- Worker Nodes
- Utility Nodes
- Edge Nodes
- Small Cluster Configurations
- Medium Cluster Configurations
- Large Cluster Configurations
- Summary
- Chapter 4. Networking
- How Services Use a Network
- Remote Procedure Calls (RPCs)
- Data Transfers
- Monitoring
- Backup
- Consensus
- Network Architectures
- Small Cluster Architectures
- Medium Cluster Architectures
- Large Cluster Architectures
- Network Integration
- Reusing an Existing Network
- Creating an Additional Network
- Network Design Considerations
- Layer 1 Recommendations
- Layer 2 Recommendations
- Layer 3 Recommendations
- Summary
- Chapter 5. Organizational Challenges
- Who Runs It?
- Is It Infrastructure, Middleware, or an Application?
- Case Study: A Typical Business Intelligence Project
- The Traditional Approach
- Typical Team Setup
- Compartmentalization of IT
- Revised Team Setup for Hadoop in the Enterprise
- Solution Overview with Hadoop
- New Team Setup
- Split Responsibilities
- Do I Need DevOps?
- Do I Need a Center of Excellence/Competence?
- Summary
- Chapter 6. Datacenter Considerations
- Why Does It Matter?
- Basic Datacenter Concepts
- Cooling
- Power
- Network
- Rack Awareness and Rack Failures
- Failure Domain Alignment
- Space and Racking Constraints
- Ingest and Intercluster Connectivity
- Software
- Hardware
- Replacements and Repair
- Operational Procedures
- Typical Pitfalls
- Networking
- Cluster Spanning
- Summary
- Part II. Platform
- Chapter 7. Provisioning Clusters
- Operating Systems
- OS Choices
- OS Configuration for Hadoop
- Automated Configuration Example
- Service Databases
- Required Databases
- Database Integration Options
- Database Considerations
- Hadoop Deployment
- Hadoop Distributions
- Installation Choices
- Distribution Architecture
- Installation Process
- Summary
- Chapter 8. Platform Validation
- Testing Methodology
- Useful Tools
- Hardware Validation
- CPU
- Disks
- Network
- Hadoop Validation
- HDFS Validation
- General Validation
- Validating Other Components
- Operations Validation
- Summary
- Chapter 9. Security
- In-Flight Encryption
- TLS Encryption
- SASL Quality of Protection
- Enabling in-Flight Encryption
- Authentication
- Kerberos
- LDAP Authentication
- Delegation Tokens
- Impersonation
- Authorization
- Group Resolution
- Superusers and Supergroups
- Hadoop Service Level Authorization
- Centralized Security Management
- HDFS
- YARN
- ZooKeeper
- Hive
- Impala
- HBase
- Solr
- Kudu
- Oozie
- Hue
- Kafka
- Sentry
- At-Rest Encryption
- Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
- HDFS Transparent Data Encryption
- Encrypting Temporary Files
- Summary
- Chapter 10. Integration with Identity Management Providers
- Integration Areas
- Integration Scenarios
- Scenario 1: Writing a File to HDFS
- Scenario 2: Submitting a Hive Query
- Scenario 3: Running a Spark Job
- Integration Providers
- LDAP Integration
- Background
- LDAP Security
- Load Balancing
- Application Integration
- Linux Integration
- Kerberos Integration
- Kerberos Clients
- KDC Integration
- Certificate Management
- Signing Certificates
- Converting Certificates
- Wildcard Certificates
- Automation
- Summary
- Chapter 11. Accessing and Interacting with Clusters
- Access Mechanisms
- Programmatic Access
- Command-Line Access
- Web UIs
- Access Topologies
- Interaction Patterns
- Proxy Access
- Load Balancing
- Edge Node Interactions
- Access Security
- Administration Gateways
- Workbenches
- Hue
- Notebooks
- Landing Zones
- Summary
- Chapter 12. High Availability
- High Availability Defined
- Lateral/Service HA
- Vertical/Systemic HA
- Measuring Availability
- Percentages
- Percentiles
- Operating for HA
- Monitoring
- Playbooks and Postmortems
- HA Building Blocks
- Quorums
- Load Balancing
- Database HA
- Ancillary Services
- General Considerations
- Separation of Master and Worker Processes
- Separation of Identical Service Roles
- Master Servers in Separate Failure Domains
- Balanced Master Configurations
- Optimized Server Configurations
- High Availability of Cluster Services
- ZooKeeper
- HDFS
- YARN
- HBase
- KMS
- Hive
- Impala
- Solr
- Kafka
- Oozie
- Hue
- Other Services
- Autoconfiguration
- Summary
- Chapter 13. Backup and Disaster Recovery
- Context
- Many Distributed Systems
- Policies and Objectives
- Failure Scenarios
- Suitable Data Sources
- Strategies
- Data Types
- Consistency
- Validation
- Summary
- Data Replication
- HBase
- Cluster Management Tools
- Kafka
- Summary
- Hadoop Cluster Backups
- Subsystems
- Case Study: Automating Backups with Oozie
- Restore
- Summary
- Part III. Taking Hadoop to the Cloud
- Chapter 14. Basics of Virtualization for Hadoop
- Compute Virtualization
- Virtual Machine Distribution
- Anti-Affinity Groups
- Storage Virtualization
- Virtualizing Local Storage
- SANs
- Object Storage and Network-Attached Storage
- Network Virtualization
- Cluster Life Cycle Models
- Summary
- Chapter 15. Solutions for Private Clouds
- OpenStack
- Automation and Integration
- Life Cycle and Storage
- Isolation
- Summary
- OpenShift
- Automation
- Life Cycle and Storage
- Isolation
- Summary
- VMware and Pivotal Cloud Foundry
- Do It Yourself?
- Automation
- Isolation
- Life Cycle Model
- Summary
- Object Storage for Private Clouds
- EMC Isilon
- Ceph
- Summary
- Chapter 16. Solutions in the Public Cloud
- Key Things to Know
- Cloud Providers
- AWS
- Microsoft Azure
- Google Cloud Platform
- Implementing Clusters
- Instances
- Storage and Life Cycle Models
- Network Architecture
- High Availability
- Summary
- Chapter 17. Automated Provisioning
- Long-Lived Clusters
- Configuration and Templating
- Deployment Phases
- Vendor Solutions
- One-Click Deployments
- Homegrown Automation
- Hooking Into a Provisioning Life Cycle
- Scaling Up and Down
- Deploying with Security
- Transient Clusters
- Sharing Metadata Services
- Summary
- Chapter 18. Security in the Cloud
- Assessing the Risk
- Risk Model
- Environmental Risks
- Deployment Risks
- Identity Provider Options for Hadoop
- Option A: Cloud-Only Self-Contained ID Services
- Option B: Cloud-Only Shared ID Services
- Option C: On-Premises ID Services
- Object Storage Security and Hadoop
- Identity and Access Management
- Amazon Simple Storage Service
- GCP Cloud Storage
- Microsoft Azure
- Auditing
- Encryption for Data at Rest
- Requirements for Key Material
- Options for Encryption in the Cloud
- On-Premises Key Persistence
- Encryption via the Cloud Provider
- Encryption Feature and Interoperability Summary
- Recommendations and Summary for Cloud Encryption
- Encrypting Data in Flight in the Cloud
- Perimeter Controls and Firewalling
- GCP
- AWS
- Azure
- Summary
- Appendix A. Backup Onboarding Checklist
- Backup Onboarding Checklist
- Backup
- Services
- Cloudera Manager
- HDFS
- HBase
- Hive/Impala
- Sqoop
- Oozie
- Hue
- Sentry
- Index
- About the Authors
- Colophon
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePub file format works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small display.
This eBook uses Adobe-DRM, a "hard" form of copy protection. If the requirements above are not met, you will not be able to open the eBook, so prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.
For more information, see our ebook Help page.