
Data Engineering Best Practices

Persons
Richard J. Schiller is a chief architect, distinguished engineer, and startup entrepreneur with 40 years of experience delivering real-time large-scale data processing systems. He holds an MS in computer engineering from Columbia University's School of Engineering and Applied Science and a BA in computer science and applied mathematics. He has been involved with two prior successful startups and has co-authored three patents. He is a hands-on systems developer and innovator.
David Larochelle has been involved in data engineering for startups, Fortune 500 companies, and research institutes. He holds a BS in computer science from the College of William & Mary, a master's in computer science from the University of Virginia, and a master's in communication from the University of Pennsylvania. David's career spans over 20 years, and his strong background has enabled him to work in a wide range of organizations, including startups, established companies, and research labs.
Content
- Cover
- Title Page
- Copyright and Credits
- Contributors
- Table of Contents
- Preface
- Chapter 1: Overview of the Business Problem Statement
- What is the business problem statement?
- Anti-patterns to avoid
- Patterns in the future-proof architecture
- Future-proofing is .
- Organization into zone considerations
- Cloud limitations
- The Intelligence Age
- Use case definitions
- The mission, the vision, and the strategy
- Principles and the development life cycle
- The architecture definition, best practices, and key considerations
- The DataOps convergence
- Summary
- Chapter 2: A Data Engineer's Journey - Background Challenges
- Challenge #1 - platform architectures change rapidly
- Platform architectures in the 21st century
- Impacts on business strategy
- A flexible software development life cycle to manage platform risk
- Challenge #2 - Total cost of ownership (TCO) is high
- ETL architecture costs are high!
- Buy versus build choices impact a solution's longevity
- Challenge #3 - Evolving data repository patterns - identifying big rocks for data engineers
- Intake, integration, and storage challenges in data engineering
- Identifying the big rocks to be placed first into your design
- Being able to handle technology hype
- Summary
- Chapter 3: A Data Engineer's Journey - IT's Vision and Mission
- The vision
- Develop the IT engineering vision
- Vision summary
- The mission and the IT strategy
- IT's vision
- IT's mission
- IT mission summary
- Principles, frameworks, and best practices
- The architecture reflects the vision
- Principles summary
- Data engineering patterns for IT operability
- What patterns are required and how are they specified?
- Pattern summary
- Summary
- Chapter 4: Architecture Principles
- Architecture principles overview
- Architecture foundation
- Data lake, mesh, and fabric
- Data immutability
- Third party tool, cloud platform-as-a-service (PaaS), and framework integrations
- Data mesh principles
- Data mesh metadata
- Data semantics in the data mesh
- Data mesh, security, and tech stack considerations
- What are the key foundational takeaways?
- Architecture principles in depth
- Principle #1 - Data lake as a centerpiece? No, implement the data journey!
- Principle #2 - A data lake's immutable data is to remain explorable
- Principle #3 - A data lake's immutable data remains available for analytics
- Principle #4 - A data lake's sources are discoverable
- Principle #5 - A data lake's tooling should be consistent with the architecture
- Principle #6 - A data mesh defines data to be governed by domain-driven ownership
- Principle #7 - A data mesh defines the data and derives insights as a product
- Principle #8 - A data mesh defines data, information, and insights to be self-service
- Principle #9 - A data mesh implements a federated governance processing system
- Principle #10 - Metadata is associated with datasets and is relevant to the business
- Principle #11 - Dataset lineage and at-rest metadata is subject to life cycle governance
- Principle #12 - Datasets and metadata require cataloging and discovery services
- Principle #13 - Semantic metadata guarantees correct business understanding at all stages in the data journey
- Principle #14 - Data big rock architecture choices (time series, correction processing, security, privacy, and so on) are to be handled in the design early
- Principle #15 - Implement foundational capabilities in the architecture framework first
- Summary
- Chapter 5: Architecture Framework - Conceptual Architecture Best Practices
- Conceptual architecture overview
- Best practice organization
- How does the conceptual architecture align with the logical architecture and physical architecture?
- Conceptual architecture best practices
- Conceptual architecture description
- Conceptual architecture glossary
- What are the data architecture's key issues identified in the conceptual architecture?
- Best practice composition of the conceptual architecture
- Conceptual to logical architecture mapping
- Summary
- Chapter 6: Architecture Framework - Logical Architecture Best Practices
- Logical architecture overview
- Organizing best practices
- How does the logical architecture align with the conceptual and physical architecture?
- Detailed capabilities of the ingestion zones
- ETL data pipelines
- Bronze standard datasets
- Detailed capabilities of the transformation zones
- Data quality features
- Data lake house and warehouse
- Gold and silver standard datasets
- Detailed capabilities of the consumption zones
- Data analytics
- Accessing silver standard datasets from the consumption zone
- Trade-offs between public cloud, on-premises, and multi-cloud
- Cost of ingest or egress for cloud data
- Cost of a dedicated network line to the point of service
- Cost of provisioning
- Cost of monitoring and observability
- Hybrid or multi-cloud choices!
- The benefits of a multi-cloud strategy
- Summary
- Chapter 7: Architecture Framework - Physical Architecture Best Practices
- Physical architecture overview
- Best practice organization
- How does the physical architecture align with the logical and conceptual architecture?
- How should the physical architecture align with the operational processes/capabilities of the solution?
- Examples of physical reference architectures
- Summary
- Chapter 8: Software Engineering Best Practice Considerations
- SBP 1 - follow the architecture!
- The core value of architectural integrity
- The downstream impact of deviating
- Ensuring adherence in your data engineering team
- Continuous evolution and architecture
- Conclusion
- SBP 2 - implement Agile methodology for your organization!
- Introduction to Agile methodology
- Agile principles and their significance in data engineering
- Benefits of implementing Agile in data engineering
- Challenges and considerations in Agile data engineering
- Steps to implement Agile in data engineering
- Tools and Agile practices tailored for data engineering
- Conclusion
- SBP 3 - generate objectives and key results (OKRs)!
- Introduction and deep dive into OKRs
- Crafting data-centric OKRs
- Potential challenges with OKRs in data engineering
- Reviewing and iterating on OKRs in a data context
- SBP 4 - implement data as a product!
- SBP 5 - implement shift left testing (SLT) processes!
- Understanding SLT
- Benefits of SLT in data engineering
- Implementing shift left testing
- Specific shift left testing strategies for data engineering
- Challenges in shift left testing for data engineering
- Tools and technologies to facilitate shift left in data engineering
- Synergy with other data best practices
- SBP 6 - implement the difficult first!
- The philosophy of tackling the hard tasks first
- How data engineers can prioritize difficult tasks
- Implementing difficult data tasks
- Synergy with other data best practices
- Conclusion
- SBP 7 - avoid premature optimization
- The true cost of premature optimization
- Recognizing and avoiding the trap in data engineering
- Balancing performance needs and over-optimization in data engineering
- Synergy with other data best practices
- SBP 8 - automate cloud code snippet deployments with standard deployment scripted wrappers
- The importance of deployment automation
- The deployment model choices
- Benefits of using scripted deployment wrappers
- Version control - ensuring consistency and traceability
- Relevance to data engineering in cloud environments
- Practical implementation steps
- Challenges and precautions
- Synergy with other software and data best practices
- SBP 9 - define and implement NFRs first
- Distinguishing functional (FRs) from non-functional requirements (NFRs)
- Relevance to data engineering
- Key NFRs in cloud data engineering
- Defining and implementing NFRs
- Risks of neglecting early implementation of NFRs
- SBP 10 - implement data journey journaling to facilitate future problem resolution
- Relevance to data engineering
- Challenges and considerations
- SBP 11 - implement data journey pipelines that are experimental first!
- Enabling data pipeline experimentation as datasets are readied
- Releasing data like code
- Challenges and considerations
- SBP 12 - choose languages with solid reasoning
- Key languages in data engineering and their roles
- The pressures and limitations imposed by PaaS offerings
- Pitfalls to avoid
- SBP 13 - drive scripting and PaaS code with parameterization using a secure configuration management repository tool
- The power of parameterization and configuration management
- The growth of configuration complexity
- Why parameterize?
- Configuration management repositories and configuration management databases (CMDBs)
- Best practices for secure configuration management
- SBP 14 - be prepared to prune dead code over time
- The accumulation of dead code in software and PaaS systems
- The unique challenge of PaaS service configurations
- Pruning dead code
- SBP 15 - if it doesn't fit, don't force it - use a microservice
- PaaS and its boundaries
- Microservices as a contingency strategy
- Challenges and considerations of this dual approach
- Pitfalls to avoid
- Summary
- Chapter 9: Key Considerations for Agile SDLC Best Practices
- Prevent Agile from being fragile
- Agile methodology
- Core principles of Agile from the Agile Manifesto
- Agile and data engineering
- Why Agile?
- The impact of a unified Agile approach on team performance and system quality
- Software Development Lifecycle (SDLC) processes
- Objectives and Key Results (OKRs)
- Agile methodology (tuned to the organization)
- Business development strategy
- Test/quality strategy
- Operational strategy
- Data security strategy
- Summary
- Chapter 10: Key Considerations for Quality Testing Best Practices
- Quality testing overview
- How not to test!
- Evolving the test discipline
- Test terminology
- Key test definitions
- Test driven development (TDD)
- Acceptance test driven development (ATDD) versus developer test driven development (DTDD)
- Behavioral driven development (BDD)
- Shift left testing
- Test framework example
- Data wrangling and profiling
- Deterministic data profiling
- Machine learning driven data QA
- Summary
- Chapter 11: Key Considerations for IT Operational Service Best Practices
- IT operational best practices overview
- IT operational best practices - introduction
- Service Level Agreements (SLAs)
- Data contract service level agreements/data contract management
- Continuous integration/continuous deployment (CI/CD)
- Observability with proactive alerting
- Automated data and system anomaly detection and remediation
- Data system anomaly detection
- Application Performance Monitoring
- Blue/green versus continuous deployment trade-offs
- Key takeaways
- Summary
- Chapter 12: Key Considerations for Data Service Best Practices
- Data service best practices overview
- Software and data engineering drivers for best practices
- Zero trust versus defense in depth
- National data localization
- Privacy protection
- Personally Identifiable Information (PII)
- Data service engineering best practices
- Data engineering best practice 1 - implement a data mesh, not just a data fabric
- Data engineering best practice 2 - implement data pipelines (for analytics)
- Data engineering best practice 3 - implement data pipelines (for machine learning)
- Data engineering best practice 4 - use equivalent production and staging environments
- Data engineering best practice 5 - a pipeline's concurrent threads should run and scale in a distributed manner
- Data engineering best practice 6 - data pipelines should run as streams making use of PaaS services where possible
- Data engineering best practice 7 - create DataOps standards for data pipeline development
- Data engineering best practice 8 - implement tool selection criteria with weighted selection toward built-in integration with core components of the architecture
- Data engineering best practice 9 - implement Pub/Sub models for economy of scale when supporting customers with the same dataset subscription
- Data engineering best practice 10 - data wrangling tool selection to create a clean gold zone copy of datasets
- Data engineering best practice 11 - a data catalog is part of an essential metadata implementation
- Data engineering best practice 12 - define data owners, security, rights, and access for consumers upfront
- Data engineering best practice 13 - train your experts and let them train and retrain others
- Data engineering best practice 14 - handle errors gracefully
- Data engineering best practice 15 - run multiple data pipeline tasks as a directed acyclic graph (DAG)
- Summary
- Chapter 13: Key Considerations for Management Best Practices
- Niche focus areas - best practices overview
- Data profiling
- Heuristic data analysis
- Gaps
- Calendar trend deviations
- Metadata strategy
- Azure cloud provider perspective on metadata
- A metadata tool provider's perspective on metadata
- A knowledge engineering provider's perspective on metadata
- Metadata perspectives summary
- Summary
- Chapter 14: Key Considerations for Data Delivery Best Practices
- Data delivery best practices overview
- Streaming data delivery
- Data delivery with publishing/subscribing (pub/sub) methods
- Data delivery streaming with examples
- Consumable data delivery as a repository
- Custom-built analytics sandbox with Confidential Compute
- Using a third-party aggregator for analytics
- Data delivery into cloud service provided areas
- Using cloud provider's offerings
- Bulk data delivery
- Using various cloud provider offerings
- Summary
- Chapter 15: Other Considerations - Measures, Calculations, Restatements, and Data Science Best Practices
- Overview of other data engineering problems for consideration
- Data engineer statement of value
- Modern data science/analysis workbench
- What are the capabilities of a modern data science/analyst's workbench?
- Using notebooks in production
- Calculations and measures
- Difficult analytics features
- Key capabilities in the analytics workbench
- Historical data correction processing
- Notebooks
- Notebook technologies
- Summary
- Chapter 16: Machine Learning Pipeline Best Practices and Processes
- Machine learning (ML)/artificial intelligence (AI) overview
- The current and future state of AI
- Machine learning pipeline
- Model governance and compliance
- Why are models rarely deployed?
- What is the MLOps model life cycle?
- Technology necessities
- Data annotation
- Sampling the data lake
- Managing data annotation
- Real data versus created (synthetic) data
- Model training
- Model testing and evaluation
- Smoke tests
- ML asset deployment
- A/B testing
- Handling regressions
- MLOps frameworks
- Summary
- Chapter 17: Takeaway Summary - Putting It All Together
- People, technology, and processes
- Working with other people
- Develop your solutions' technology with clear processes
- Focus on the process of information value creation
- Other important challenges
- Business realities
- Security and privacy
- Knowledge engineering
- Artificial intelligence
- Summary
- Chapter 18: Appendix and Use Cases
- Use cases overview
- Technology background deep-dive
- Prompt engineering frameworks with examples
- Learn about new knowledge base generation tools with examples
- High level use cases
- Health Sciences Knowledge Graph (KG)
- Life Sciences Semantic Information Engine (ELSSIE)
- Microblog message analysis (Dataminr)
- Summary
- Index
- About Packt
- Other Books You May Enjoy
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePUB works well for novels and non-fiction books – i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our eBook Help page.
File format: ePUB
Copy protection: without DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use a reader that can handle the file format ePUB, such as Adobe Digital Editions or FBReader – both free (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePUB works well for novels and non-fiction books – i.e., 'flowing' text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook does not use copy protection or Digital Rights Management.
For more information, see our eBook Help page.