
97 Things Every Data Engineer Should Know
Description
All about eBooks | Answers to questions about eBooks, copy protection, and file formats can be found in our Info & Help section.

Content
- Intro
- Copyright
- Table of Contents
- Preface
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Chapter 1. A (Book) Case for Eventual Consistency
- Denise Koessler Gosnell, PhD
- Chapter 2. A/B and How to Be
- Sonia Mehta
- Chapter 3. About the Storage Layer
- Julien Le Dem
- Chapter 4. Analytics as the Secret Glue for Microservice Architectures
- Elias Nema
- Chapter 5. Automate Your Infrastructure
- Christiano Anderson
- Chapter 6. Automate Your Pipeline Tests
- Tom White
- Build an End-to-End Test of the Whole Pipeline
- Use a Small Amount of Representative Data
- Prefer Textual Data Formats over Binary
- Ensure That Tests Can Be Run Locally
- Make Tests Deterministic
- Make It Easy to Add More Tests
- Chapter 7. Be Intentional About the Batching Model in Your Data Pipelines
- Raghotham Murthy
- Data Time Window Batching Model
- Arrival Time Window Batching Model
- ATW and DTW Batching in the Same Pipeline
- Chapter 8. Beware of Silver-Bullet Syndrome
- Thomas Nield
- Chapter 9. Building a Career as a Data Engineer
- Vijay Kiran
- Chapter 10. Business Dashboards for Data Pipelines
- Valliappa (Lak) Lakshmanan
- Chapter 11. Caution: Data Science Projects Can Turn into the Emperor's New Clothes
- Shweta Katre
- Chapter 12. Change Data Capture
- Raghotham Murthy
- Chapter 13. Column Names as Contracts
- Emily Riederer
- Chapter 14. Consensual, Privacy-Aware Data Collection
- Katharine Jarmul
- Attach Consent Metadata
- Track Data Provenance
- Drop or Encrypt Sensitive Fields
- Chapter 15. Cultivate Good Working Relationships with Data Consumers
- Ido Shlomo
- Don't Let Consumers Solve Engineering Problems
- Adapt Your Expectations
- Understand Consumers' Jobs
- Chapter 16. Data Engineering != Spark
- Jesse Anderson
- Batch and Real-Time Systems
- Computation Component
- Storage Component
- NoSQL Databases
- Messaging Component
- Chapter 17. Data Engineering for Autonomy and Rapid Innovation
- Jeff Magnusson
- Implement Reusable Patterns in the ETL Framework
- Choose a Framework and Tool Set Accessible Within the Organization
- Move the Logic to the Edges of the Pipelines
- Create and Support Staging Tables
- Bake Data-Flow Logic into Tooling and Infrastructure
- Chapter 18. Data Engineering from a Data Scientist's Perspective
- Bill Franks
- Database Administration, ETL, and Such
- Why the Need for Data Engineers?
- What's the Future?
- Chapter 19. Data Pipeline Design Patterns for Reusability and Extensibility
- Mukul Sood
- Chapter 20. Data Quality for Data Engineers
- Katharine Jarmul
- Chapter 21. Data Security for Data Engineers
- Katharine Jarmul
- Learn About Security
- Monitor, Log, and Test Access
- Encrypt Data
- Automate Security Tests
- Ask for Help
- Chapter 22. Data Validation Is More Than Summary Statistics
- Emily Riederer
- Chapter 23. Data Warehouses Are the Past, Present, and Future
- James Densmore
- Chapter 24. Defining and Managing Messages in Log-Centric Architectures
- Boris Lublinsky
- Chapter 25. Demystify the Source and Illuminate the Data Pipeline
- Meghan Kwartler
- Chapter 26. Develop Communities, Not Just Code
- Emily Riederer
- Chapter 27. Effective Data Engineering in the Cloud World
- Dipti Borkar
- Disaggregated Data Stack
- Orchestrate, Orchestrate, Orchestrate
- Copying Data Creates Problems
- S3 Compatibility
- SQL and Structured Data Are Still In
- Chapter 28. Embrace the Data Lake Architecture
- Vinoth Chandar
- Common Pitfalls
- Data Lakes
- Advantages
- Implementation
- Chapter 29. Embracing Data Silos
- Bin Fan and Amelia Wong
- Why Data Silos Exist
- Embracing Data Silos
- Chapter 30. Engineering Reproducible Data Science Projects
- Dr. Tianhui Michael Li
- Chapter 31. Five Best Practices for Stable Data Processing
- Christian Lauer
- Prevent Errors
- Set Fair Processing Times
- Use Data-Quality Measurement Jobs
- Ensure Transaction Security
- Consider Dependency on Other Systems
- Conclusion
- Chapter 32. Focus on Maintainability and Break Up Those ETL Tasks
- Chris Moradi
- Chapter 33. Friends Don't Let Friends Do Dual-Writes
- Gunnar Morling
- Chapter 34. Fundamental Knowledge
- Pedro Marcelino
- Chapter 35. Getting the "Structured" Back into SQL
- Elias Nema
- Chapter 36. Give Data Products a Frontend with Latent Documentation
- Emily Riederer
- Chapter 37. How Data Pipelines Evolve
- Chris Heinzmann
- Chapter 38. How to Build Your Data Platform like a Product
- Barr Moses and Atul Gupte
- Align Your Product's Goals with the Goals of the Business
- Gain Feedback and Buy-in from the Right Stakeholders
- Prioritize Long-Term Growth and Sustainability over Short-Term Gains
- Sign Off on Baseline Metrics for Your Data and How You Measure It
- Chapter 39. How to Prevent a Data Mutiny
- Sean Knapp
- Chapter 40. Know the Value per Byte of Your Data
- Dhruba Borthakur
- Chapter 41. Know Your Latencies
- Dhruba Borthakur
- Chapter 42. Learn to Use a NoSQL Database, but Not like an RDBMS
- Kirk Kirkconnell
- Chapter 43. Let the Robots Enforce the Rules
- Anthony Burdi
- Chapter 44. Listen to Your Users-but Not Too Much
- Amanda Tomlinson
- Chapter 45. Low-Cost Sensors and the Quality of Data
- Dr. Shivanand Prabhoolall Guness
- Chapter 46. Maintain Your Mechanical Sympathy
- Tobias Macey
- Chapter 47. Metadata = Data
- Jonathan Seidman
- Chapter 48. Metadata Services as a Core Component of the Data Platform
- Lohit VijayaRenu
- Discoverability
- Security Control
- Schema Management
- Application Interface and Service Guarantee
- Chapter 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
- Einat Orr
- Chapter 50. Modern Metadata for the Modern Data Stack
- Prukalpa Sankar
- Data Assets > Tables
- Complete Data Visibility, Not Piecemeal Solutions
- Built for Metadata That Itself Is Big Data
- Embedded Collaboration at Its Heart
- Chapter 51. Most Data Problems Are Not Big Data Problems
- Thomas Nield
- Chapter 52. Moving from Software Engineering to Data Engineering
- John Salinas
- Chapter 53. Observability for Data Engineers
- Barr Moses
- How Good Data Turns Bad
- Introducing Data Observability
- Chapter 54. Perfect Is the Enemy of Good
- Bob Haffner
- Chapter 55. Pipe Dreams
- Scott Haines
- Chapter 56. Preventing the Data Lake Abyss
- Scott Haines
- Establishing Data Contracts
- From Generic Data Lake to Data Structure Store
- Chapter 57. Prioritizing User Experience in Messaging Systems
- Jowanza Joseph
- Chapter 58. Privacy Is Your Problem
- Stephen Bailey, PhD
- Chapter 59. QA and All Its Sexiness
- Sonia Mehta
- Chapter 60. Seven Things Data Engineers Need to Watch Out for in ML Projects
- Dr. Sandeep Uttamchandani
- Chapter 61. Six Dimensions for Picking an Analytical Data Warehouse
- Gleb Mezhanskiy
- Scalability
- Price Elasticity
- Interoperability
- Querying and Transformation Features
- Speed
- Zero Maintenance
- Chapter 62. Small Files in a Big Data World
- Adi Polak
- What Are Small Files, and Why Are They a Problem?
- Why Does It Happen?
- Detect and Mitigate
- Conclusion
- References
- Chapter 63. Streaming Is Different from Batch
- Dean Wampler, PhD
- Chapter 64. Tardy Data
- Ariel Shaqed
- Chapter 65. Tech Should Take a Back Seat for Data Project Success
- Andrew Stevenson
- Chapter 66. Ten Must-Ask Questions for Data-Engineering Projects
- Haidar Hadi
- Question 1: What Are the Touch Points?
- Question 2: What Are the Granularities?
- Question 3: What Are the Input and Output Schemas?
- Question 4: What Is the Algorithm?
- Question 5: Do You Need Backfill Data?
- Question 6: When Is the Project Due Date?
- Question 7: Why Was That Due Date Set?
- Question 8: Which Hosting Environment?
- Question 9: What Is the SLA?
- Question 10: Who Will Be Taking Over This Project?
- Chapter 67. The Data Pipeline Is Not About Speed
- Rustem Feyzkhanov
- Chapter 68. The Dos and Don'ts of Data Engineering
- Christopher Bergh
- Don't Be a Hero
- Don't Rely on Hope
- Don't Rely on Caution
- Do DataOps
- Chapter 69. The End of ETL as We Know It
- Paul Singman
- Replacing ETL with Intentional Data Transfer
- Agreeing on a Data Model Contract
- Removing Data Processing Latencies
- Taking the First Steps
- Chapter 70. The Haiku Approach to Writing Software
- Mitch Seymour
- Understand the Constraints Up Front
- Start Strong Since Early Decisions Can Impact the Final Product
- Keep It as Simple as Possible
- Engage the Creative Side of Your Brain
- Chapter 71. The Hidden Cost of Data Input/Output
- Lohit VijayaRenu
- Data Compression
- Data Format
- Data Serialization
- Chapter 72. The Holy War Between Proprietary and Open Source Is a Lie
- Paige Roberts
- Chapter 73. The Implications of the CAP Theorem
- Paul Doran
- Chapter 74. The Importance of Data Lineage
- Julien Le Dem
- Chapter 75. The Many Meanings of Missingness
- Emily Riederer
- Chapter 76. The Six Words That Will Destroy Your Career
- Bartosz Mikulski
- Chapter 77. The Three Invaluable Benefits of Open Source for Testing Data Quality
- Tom Baeyens
- Chapter 78. The Three Rs of Data Engineering
- Tobias Macey
- Reliability
- Reproducibility
- Repeatability
- Conclusion
- Chapter 79. The Two Types of Data Engineering and Data Engineers
- Jesse Anderson
- Types of Data Engineering
- Types of Data Engineers
- Why These Differences Matter to You
- Chapter 80. The Yin and Yang of Big Data Scalability
- Paul Brebner
- Chapter 81. Threading and Concurrency in Data Processing
- Matthew Housley, PhD
- Operating System Threading
- Threading Overhead
- Solving the C10K Problem
- Scaling Is Not a Magic Bullet
- Further Reading
- Chapter 82. Three Important Distributed Programming Concepts
- Adi Polak
- MapReduce Algorithm
- Distributed Shared Memory Model
- Message Passing/Actors Model
- Conclusions
- Chapter 83. Time (Semantics) Won't Wait
- Marta Paes Moreira and Fabian Hueske
- Chapter 84. Tools Don't Matter, Patterns and Practices Do
- Bas Geerdink
- Chapter 85. Total Opportunity Cost of Ownership
- Joe Reis
- Chapter 86. Understanding the Ways Different Data Domains Solve Problems
- Matthew Seal
- Chapter 87. What Is a Data Engineer? Clue: We're Data Science Enablers
- Lewis Gavin
- AI and Machine Learning Models Require Data
- Clean Data == Better Model
- Finally Building a Model
- A Model Is Useful Only If Someone Will Use It
- So What Am I Getting At?
- Chapter 88. What Is a Data Mesh, and How Not to Mesh It Up
- Barr Moses and Lior Gavish
- Why Use a Data Mesh?
- The Final Link: Observability
- Chapter 89. What Is Big Data?
- Ami Levin
- Chapter 90. What to Do When You Don't Get Any Credit
- Jesse Anderson
- Chapter 91. When Our Data Science Team Didn't Produce Value
- Joel Nantais
- Chapter 92. When to Avoid the Naive Approach
- Nimrod Parasol
- Chapter 93. When to Be Cautious About Sharing Data
- Thomas Nield
- Chapter 94. When to Talk and When to Listen
- Steven Finkelstein
- Chapter 95. Why Data Science Teams Need Generalists, Not Specialists
- Eric Colson
- Chapter 96. With Great Data Comes Great Responsibility
- Lohit VijayaRenu
- Put Yourself in the User's Shoes
- Ensure Ethical Use of User Information
- Watch Your Data Footprint
- Chapter 97. Your Data Tests Failed! Now What?
- Sam Bail, PhD
- System Response
- Logging and Alerting
- Alert Response
- Stakeholder Communication
- Root Cause Identification
- Issue Resolution
- Contributors
- Adi Polak
- Amanda Tomlinson
- Amelia Wong
- Ami Levin
- Andrew Stevenson
- Anthony Burdi
- Ariel Shaqed (Scolnicov)
- Atul Gupte
- Barr Moses
- Bartosz Mikulski
- Bas Geerdink
- Bill Franks
- Bin Fan
- Bob Haffner
- Boris Lublinsky
- Chris Moradi
- Christian Heinzmann
- Christian Lauer
- Christiano Anderson
- Christopher Bergh
- Dean Wampler
- Denise Koessler Gosnell, PhD
- Dipti Borkar
- Dhruba Borthakur
- Einat Orr
- Elias Nema
- Emily Riederer
- Eric Colson
- Fabian Hueske
- Gleb Mezhanskiy
- Gunnar Morling
- Haidar Hadi
- Ido Shlomo
- James Densmore
- Jeff Magnusson
- Jesse Anderson
- Joe Reis
- Joel Nantais
- John Salinas
- Jonathan Seidman
- Jowanza Joseph
- Julien Le Dem
- Katharine Jarmul
- Kirk Kirkconnell
- Valliappa (Lak) Lakshmanan
- Lewis Gavin
- Lior Gavish
- Lohit VijayaRenu
- Marta Paes Moreira
- Matthew Housley, PhD
- Matthew Seal
- Meghan Kwartler
- Dr. Tianhui Michael Li
- Mitch Seymour
- Mukul Sood
- Nimrod Parasol
- Paige Roberts
- Paul Brebner
- Paul Doran
- Paul Singman
- Pedro Marcelino
- Dr. Shivanand Prabhoolall Guness
- Prukalpa Sankar
- Raghotham Murthy
- Rustem Feyzkhanov
- Sam Bail
- Sandeep Uttamchandani
- Scott Haines
- Sean Knapp
- Shweta Katre
- Sonia Mehta
- Stephen Bailey, PhD
- Steven Finkelstein
- Thomas Nield
- Tobias Macey
- Tom Baeyens
- Tom White
- Vijay Kiran
- Vinoth Chandar
- Index
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; macOS; Linux): Install the free reader Adobe Digital Editions before downloading (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the PocketBook app before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, PocketBook, Sony, Tolino and many more (not Kindle).
The ePUB file format works well for novels and non-fiction books, i.e., "flowing" text without a complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small display.
This eBook uses Adobe-DRM, a "hard" copy protection. If the requirements listed above are not met, you will not be able to open the eBook, so prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.
For more information, see our eBook Help page.