Data Engineering Best Practices

Architect robust and cost-effective data solutions in the cloud era

Richard J. Schiller, David LaRochelle (Authors)

Packt Publishing Limited

1st Edition

Published on 18 March 2025

550 pages

E-Book

ePUB with Adobe-DRM

E-Book

ePUB without DRM

978-1-80324-736-6 (ISBN)

from €29.99

Available for download

Content

Cover
Title Page
Copyright and Credits
Contributors
Table of Contents
Preface
Chapter 1: Overview of the Business Problem Statement
What is the business problem statement?
Anti-patterns to avoid
Patterns in the future-proof architecture
Future-proofing is ...
Organization into zone considerations
Cloud limitations
The Intelligence Age
Use case definitions
The mission, the vision, and the strategy
Principles and the development life cycle
The architecture definition, best practices, and key considerations
The DataOps convergence
Summary
Chapter 2: A Data Engineer's Journey - Background Challenges
Challenge #1 - platform architectures change rapidly
Platform architectures in the 21st century
Impacts on business strategy
A flexible software development life cycle to manage platform risk
Challenge #2 - Total cost of ownership (TCO) is high
ETL architecture costs are high!
Buy versus build choices impact a solution's longevity
Challenge #3 - Evolving data repository patterns - identifying big rocks for data engineers
Intake, integration, and storage challenges in data engineering
Identifying the big rocks to be placed first into your design
Being able to handle technology hype
Summary
Chapter 3: A Data Engineer's Journey - IT's Vision and Mission
The vision
Develop the IT engineering vision
Vision summary
The mission and the IT strategy
IT's vision
IT's mission
IT mission summary
Principles, frameworks, and best practices
The architecture reflects the vision
Principles summary
Data engineering patterns for IT operability
What patterns are required and how are they specified?
Pattern summary
Summary
Chapter 4: Architecture Principles
Architecture principles overview
Architecture foundation
Data lake, mesh, and fabric
Data immutability
Third party tool, cloud platform-as-a-service (PaaS), and framework integrations
Data mesh principles
Data mesh metadata
Data semantics in the data mesh
Data mesh, security, and tech stack considerations
What are the key foundational takeaways?
Architecture principles in depth
Principle #1 - Data lake as a centerpiece? No, implement the data journey!
Principle #2 - A data lake's immutable data is to remain explorable
Principle #3 - A data lake's immutable data remains available for analytics
Principle #4 - A data lake's sources are discoverable
Principle #5 - A data lake's tooling should be consistent with the architecture
Principle #6 - A data mesh defines data to be governed by domain-driven ownership
Principle #7 - A data mesh defines the data and derives insights as a product
Principle #8 - A data mesh defines data, information, and insights to be self-service
Principle #9 - A data mesh implements a federated governance processing system
Principle #10 - Metadata is associated with datasets and is relevant to the business
Principle #11 - Dataset lineage and at-rest metadata is subject to life cycle governance
Principle #12 - Datasets and metadata require cataloging and discovery services
Principle #13 - Semantic metadata guarantees correct business understanding at all stages in the data journey
Principle #14 - Data big rock architecture choices (time series, correction processing, security, privacy, and so on) are to be handled in the design early
Principle #15 - Implement foundational capabilities in the architecture framework first
Summary
Chapter 5: Architecture Framework - Conceptual Architecture Best Practices
Conceptual architecture overview
Best practice organization
How does the conceptual architecture align with the logical architecture and physical architecture?
Conceptual architecture best practices
Conceptual architecture description
Conceptual architecture glossary
What are the data architecture's key issues identified in the conceptual architecture?
Best practice composition of the conceptual architecture
Conceptual to logical architecture mapping
Summary
Chapter 6: Architecture Framework - Logical Architecture Best Practices
Logical architecture overview
Organizing best practices
How does the logical architecture align with the conceptual and physical architecture?
Detailed capabilities of the ingestion zones
ETL data pipelines
Bronze standard datasets
Detailed capabilities of the transformation zones
Data quality features
Data lake house and warehouse
Gold and silver standard datasets
Detailed capabilities of the consumption zones
Data analytics
Accessing silver standard datasets from the consumption zone
Trade-offs between public cloud, on-premises, and multi-cloud
Cost of ingest or egress for cloud data
Cost of a dedicated network line to the point of service
Cost of provisioning
Cost of monitoring and observability
Hybrid or multi-cloud choices!
The benefits of a multi-cloud strategy
Summary
Chapter 7: Architecture Framework - Physical Architecture Best Practices
Physical architecture overview
Best practice organization
How does the physical architecture align with the logical and conceptual architecture?
How should the physical architecture align with the operational processes/capabilities of the solution?
Examples of physical reference architectures
Summary
Chapter 8: Software Engineering Best Practice Considerations
SBP 1 - follow the architecture!
The core value of architectural integrity
The downstream impact of deviating
Ensuring adherence in your data engineering team
Continuous evolution and architecture
Conclusion
SBP 2 - implement Agile methodology for your organization!
Introduction to Agile methodology
Agile principles and their significance in data engineering
Benefits of implementing Agile in data engineering
Challenges and considerations in Agile data engineering
Steps to implement Agile in data engineering
Tools and Agile practices tailored for data engineering
Conclusion
SBP 3 - generate objectives and key results (OKRs)!
Introduction and deep dive into OKRs
Crafting data-centric OKRs
Potential challenges with OKRs in data engineering
Reviewing and iterating on OKRs in a data context
SBP 4 - implement data as a product!
SBP 5 - implement shift left testing (SLT) processes!
Understanding SLT
Benefits of SLT in data engineering
Implementing shift left testing
Specific shift left testing strategies for data engineering
Challenges in shift left testing for data engineering
Tools and technologies to facilitate shift left in data engineering
Synergy with other data best practices
SBP 6 - implement the difficult first!
The philosophy of tackling the hard tasks first
How data engineers can prioritize difficult tasks
Implementing difficult data tasks
Synergy with other data best practices
Conclusion
SBP 7 - avoid premature optimization
The true cost of premature optimization
Recognizing and avoiding the trap in data engineering
Balancing performance needs and over-optimization in data engineering
Synergy with other data best practices
SBP 8 - automate cloud code snippet deployments with standard deployment scripted wrappers
The importance of deployment automation
The deployment model choices
Benefits of using scripted deployment wrappers
Version control - ensuring consistency and traceability
Relevance to data engineering in cloud environments
Practical implementation steps
Challenges and precautions
Synergy with other software and data best practices
SBP 9 - define and implement NFRs first
Distinguishing functional (FRs) from non-functional requirements (NFRs)
Relevance to data engineering
Key NFRs in cloud data engineering
Defining and implementing NFRs
Risks of neglecting early implementation of NFRs
SBP 10 - implement data journey journaling to facilitate future problem resolution
Relevance to data engineering
Challenges and considerations
SBP 11 - implement data journey pipelines that are experimental first!
Enabling data pipeline experimentation as datasets are readied
Releasing data like code
Challenges and considerations
SBP 12 - choose languages with solid reasoning
Key languages in data engineering and their roles
The pressures and limitations imposed by PaaS offerings
Pitfalls to avoid
SBP 13 - drive scripting and PaaS code with parameterization using a secure configuration management repository tool
The power of parameterization and configuration management
The growth of configuration complexity
Why parameterize?
Configuration management repositories and configuration management databases (CMDBs)
Best practices for secure configuration management
SBP 14 - be prepared to prune dead code over time
The accumulation of dead code in software and PaaS systems
The unique challenge of PaaS service configurations
Pruning dead code
SBP 15 - if it doesn't fit, don't force it; use a microservice
PaaS and its boundaries
Microservices as a contingency strategy
Challenges and considerations of this dual approach
Pitfalls to avoid
Summary
Chapter 9: Key Considerations for Agile SDLC Best Practices
Prevent Agile from being fragile
Agile methodology
Core principles of Agile from the Agile Manifesto
Agile and data engineering
Why Agile?
The impact of a unified Agile approach on team performance and system quality
Software Development Lifecycle (SDLC) processes
Objectives and Key Results (OKRs)
Agile methodology (tuned to the organization)
Business development strategy
Test/quality strategy
Operational strategy
Data security strategy
Summary
Chapter 10: Key Considerations for Quality Testing Best Practices
Quality testing overview
How not to test!
Evolving the test discipline
Test terminology
Key test definitions
Test driven development (TDD)
Acceptance test driven development (ATDD) versus developer test driven development (DTDD)
Behavioral driven development (BDD)
Shift left testing
Test framework example
Data wrangling and profiling
Deterministic data profiling
Machine learning driven data QA
Summary
Chapter 11: Key Considerations for IT Operational Service Best Practices
IT operational best practices overview
IT operational best practices - introduction
Service Level Agreements (SLAs)
Data contract service level agreements/data contract management
Continuous integration/continuous deployment (CI/CD)
Observability with proactive alerting
Automated data and system anomaly detection and remediation
Data system anomaly detection
Application Performance Monitoring
Blue/green versus continuous deployment trade-offs
Key takeaways
Summary
Chapter 12: Key Considerations for Data Service Best Practices
Data service best practices overview
Software and data engineering drivers for best practices
Zero trust versus defense in depth
National data localization
Privacy protection
Personally Identifiable Information (PII)
Data service engineering best practices
Data engineering best practice 1 - implement a data mesh, not just a data fabric
Data engineering best practice 2 - implement data pipelines (for analytics)
Data engineering best practice 3 - implement data pipelines (for machine learning)
Data engineering best practice 4 - use equivalent production and staging environments
Data engineering best practice 5 - a pipeline's concurrent threads should run and scale in a distributed manner
Data engineering best practice 6 - data pipelines should run as streams making use of PaaS services where possible
Data engineering best practice 7 - create DataOps standards for data pipeline development
Data engineering best practice 8 - implement tool selection criteria with weighted selection toward built-in integration with core components of the architecture
Data engineering best practice 9 - implement Pub/Sub models for economy of scale when supporting customers with the same dataset subscription
Data engineering best practice 10 - data wrangling tool selection to create a clean gold zone copy of datasets
Data engineering best practice 11 - a data catalog is part of an essential metadata implementation
Data engineering best practice 12 - define data owners, security, rights, and access for consumers upfront
Data engineering best practice 13 - train your experts and let them train and retrain others
Data engineering best practice 14 - handle errors gracefully
Data engineering best practice 15 - run multiple data pipeline tasks as a directed acyclic graph (DAG)
Summary
Chapter 13: Key Considerations for Management Best Practices
Niche focus areas - best practices overview
Data profiling
Heuristic data analysis
Gaps
Calendar trend deviations
Metadata strategy
Azure cloud provider perspective on metadata
A metadata tool provider's perspective on metadata
A knowledge engineering provider's perspective on metadata
Metadata perspectives summary
Summary
Chapter 14: Key Considerations for Data Delivery Best Practices
Data delivery best practices overview
Streaming data delivery
Data delivery with publishing/subscribing (pub/sub) methods
Data delivery streaming with examples
Consumable data delivery as a repository
Custom-built analytics sandbox with Confidential Compute
Using a third-party aggregator for analytics
Data delivery into cloud service provided areas
Using cloud provider's offerings
Bulk data delivery
Using various cloud provider offerings
Summary
Chapter 15: Other Considerations - Measures, Calculations, Restatements, and Data Science Best Practices
Overview of other data engineering problems for consideration
Data engineer statement of value
Modern data science/analysis workbench
What are the capabilities of a modern data science/analyst's workbench?
Using notebooks in production
Calculations and measures
Difficult analytics features
Key capabilities in the analytics workbench
Historical data correction processing
Notebooks
Notebook technologies
Summary
Chapter 16: Machine Learning Pipeline Best Practices and Processes
Machine learning (ML)/artificial intelligence (AI) overview
The current and future state of AI
Machine learning pipeline
Model governance and compliance
Why are models rarely deployed?
What is the MLOps model life cycle?
Technology necessities
Data annotation
Sampling the data lake
Managing data annotation
Real data versus created (synthetic) data
Model training
Model testing and evaluation
Smoke tests
ML asset deployment
A/B testing
Handling regressions
MLOps frameworks
Summary
Chapter 17: Takeaway Summary - Putting It All Together
People, technology, and processes
Working with other people
Develop your solutions' technology with clear processes
Focus on the process of information value creation
Other important challenges
Business realities
Security and privacy
Knowledge engineering
Artificial intelligence
Summary
Chapter 18: Appendix and Use Cases
Use cases overview
Technology background deep-dive
Prompt engineering frameworks with examples
Learn about new knowledge base generation tools with examples
High level use cases
Health Sciences Knowledge Graph (KG)
Life Sciences Semantic Information Engine (ELSSIE)
Microblog message analysis (Dataminr)
Summary
Index
About Packt
Other Books You May Enjoy
