Discover the power of open source observability for your enterprise environment
In Mastering Observability and OpenTelemetry: Enhancing Application and Infrastructure Performance and Avoiding Outages, accomplished engineering leader and open source contributor Steve Flanders unlocks the secrets of enterprise application observability with a comprehensive guide to OpenTelemetry (OTel). Explore how OTel transforms observability, providing a robust toolkit for capturing and analyzing telemetry data across your environment.
You will learn how OTel delivers unmatched flexibility, extensibility, and vendor neutrality, freeing you from vendor lock-in and enabling data sovereignty and portability. You will also discover:
Whether you are a novice or a seasoned professional, Mastering Observability and OpenTelemetry is your roadmap to troubleshooting availability and performance problems by learning to detect anomalies, interpret data, and proactively optimize performance in your enterprise environment. Embark on your journey to observability mastery today!
STEVE FLANDERS is a Senior Director of Engineering at Splunk, a Cisco company. Steve is one of the founding members of the OpenTelemetry project.
Foreword xiii
Introduction xiv
The Mastering Series xvi
Chapter 1 What Is Observability? 1
Definition 1
Background 4
Cloud Native Era 4
Monitoring Compared to Observability 5
Metadata 8
Dimensionality 9
Cardinality 9
Semantic Conventions 10
Data Sensitivity 10
Signals 10
Metrics 10
Logs 13
Traces 14
Other Signals 20
Collecting Signals 20
Instrumentation 21
Push Versus Pull Collection 22
Data Collection 23
Sampling Signals 26
Observability 27
Platforms 27
Application Performance Monitoring 28
The Bottom Line 28
Notes 30
Chapter 2 Introducing OpenTelemetry! 31
Background 31
Observability Pain Points 31
The Rise of Open Source Software 34
Introducing OpenTelemetry 35
OpenTelemetry Components 37
OpenTelemetry Concepts 48
Roadmap 50
The Bottom Line 50
Notes 51
Chapter 3 Getting Started with the Astronomy Shop 53
Background 53
Architecture 54
Prerequisites 54
Getting Started 55
Accessing the Astronomy Shop 57
Accessing Telemetry Data 57
Beyond the Basics 58
Configuring Load Generation 58
Configuring Feature Flags 59
Configuring Tests Built from Traces 60
Configuring the OTel Collector 60
Configuring OTel Instrumentation 62
Troubleshooting Astronomy Shop 62
Astronomy Shop Scenarios 63
Troubleshooting Errors 63
Troubleshooting Availability 69
Troubleshooting Performance 70
Troubleshooting Telemetry 74
The Bottom Line 75
Notes 76
Chapter 4 Understanding the OpenTelemetry Specification 77
Background 77
API Specification 79
API Definition 80
API Context 80
API Signals 81
API Implementation 82
SDK Specification 82
SDK Definition 83
SDK Signals 83
SDK Implementation 84
Data Specification 84
Data Models 86
Data Protocols 88
Data Semantic Conventions 88
Data Compatibility 89
General Specification 90
The Bottom Line 91
Notes 92
Chapter 5 Managing the OpenTelemetry Collector 93
Background 94
Deployment Modes 95
Agent Mode 96
Gateway Mode 98
Reference Architectures 100
The Basics 101
The Binary 103
Sizing 103
Components 104
Configuration 106
Receivers and Exporters 115
Processors 116
Extensions 126
Connectors 127
Observing 128
Relevant Metrics 128
Health Check Extension 131
zPages Extension 131
Troubleshooting 134
Out of Memory Crashes 134
Data Not Being Received or Exported 134
Performance Issues 135
Beyond the Basics 135
Distributions 135
Securing 137
Management 138
The Bottom Line 140
Notes 141
Chapter 6 Leveraging OpenTelemetry Instrumentation 143
Environment Setup 144
Python Trace Instrumentation 149
Automatic Instrumentation 150
Manual Instrumentation 157
Programmatic Instrumentation 163
Mixing Automatic and Manual Trace Instrumentation 166
Python Metrics Instrumentation 167
Automatic Instrumentation 168
Manual Instrumentation 169
Programmatic Instrumentation 174
Mixing Automatic and Manual Metric Instrumentation 176
Python Log Instrumentation 178
Manual Metadata Enrichment 179
Trace Correlation 181
Language Considerations 183
.NET 184
Java 184
Go 184
Node.js 185
Deployment Models 185
Distributions 185
The Bottom Line 186
Notes 187
Chapter 7 Adopting OpenTelemetry 189
The Basics 189
Why OTel and Why Now? 190
Where to Start? 191
General Process 192
Data Collection 193
Instrumentation 195
Production Readiness 196
Maturity Framework 197
Brownfield Deployment 198
Data Collection 198
Instrumentation 200
Dashboards and Alerts 202
Greenfield Deployment 204
Data Collection 204
Instrumentation 208
Other Considerations 208
Administration and Maintenance 208
Environments 211
Semantic Conventions 212
The Future 213
The Bottom Line 213
Notes 214
Chapter 8 The Power of Context and Correlation 215
Background 215
Context 217
OTel Context 219
Trace Context 221
Resource Context 223
Logic Context 224
Correlation 225
Time Correlation 225
Context Correlation 226
Trace Correlation 228
Metric Correlation 230
The Bottom Line 230
Notes 231
Chapter 9 Choosing an Observability Platform 233
Primary Considerations 233
Platform Capabilities 235
Marketing Versus Reality 237
Price, Cost, and Value 238
Observability Fragmentation 241
Primary Factors 242
Build, Buy, or Manage 242
Licensing, Operations, and Deployment 244
OTel Compatibility and Vendor Lock-In 244
Stakeholders and Company Culture 245
Implementation Basics 246
Administration 247
Usage 248
Maturity Framework 248
The Bottom Line 250
Notes 250
Chapter 10 Observability Antipatterns and Pitfalls 251
Telemetry Data Missteps 251
Mixing Instrumentation Libraries Scenario 253
Automatic Instrumentation Scenario 253
Custom Instrumentation Scenario 254
Component Configuration Scenario 255
Performance Overhead Scenario 255
Resource Allocation Scenario 256
Security Considerations Scenario 256
Monitoring and Maintenance Scenario 257
Observability Platform Missteps 258
Vendor Lock-in Scenario 260
Fragmented Tooling Scenario 260
Tool Fatigue Scenario 261
Inadequate Scalability Scenario 261
Data Overload Scenario 262
Company Culture Implications 264
Lack of Leadership Support Scenario 265
Resistance to Change Scenario 266
Collaboration and Alignment Scenario 266
Goals and Success Criteria Scenario 267
Standardization and Consistency Scenario 268
Incentives and Recognition Scenario 268
Feedback and Improvement Scenario 269
Prioritization Framework 270
The Bottom Line 272
Notes 273
Chapter 11 Observability at Scale 275
Understanding the Challenges 275
Volume and Velocity of Telemetry Data 276
Distributed System Complexity 278
Observability Platform Complexity 281
Infrastructure and Resource Constraints 281
Strategies for Scaling Observability 282
Elasticity, Elasticity, Elasticity! 282
Leverage Cloud Native Technologies 284
Filter, Sample, and Aggregate 286
Anomaly Detection and Predictive Analytics 290
Emerging Technologies and Methodologies 291
Best Practices for Managing Scale 292
General Recommendations 292
Instrumentation and Data Collection 293
Observability Platform 293
The Bottom Line 294
Notes 295
Chapter 12 The Future of Observability 297
Challenges and Opportunities 297
Cost 297
Complexity 299
Compliance 300
Code 301
Emerging Trends and Innovations 302
Artificial Intelligence 303
Observability as Code 304
Service Mesh 305
eBPF 306
The Future of OpenTelemetry 307
Stabilization and Expansion 308
Expanded Signal Support 308
Unified Query Language 310
Community-driven Innovation 310
The Bottom Line 311
Notes 311
Appendix A The Bottom Line 313
Chapter 1: What Is Observability? 313
Chapter 2: Introducing OpenTelemetry! 315
Chapter 3: Getting Started with the Astronomy Shop 316
Chapter 4: Understanding the OpenTelemetry Specification 317
Chapter 5: Managing the OpenTelemetry Collector 318
Chapter 6: Leveraging OpenTelemetry Instrumentation 320
Chapter 7: Adopting OpenTelemetry 321
Chapter 8: The Power of Context and Correlation 323
Chapter 9: Choosing an Observability Platform 324
Chapter 10: Observability Antipatterns and Pitfalls 326
Chapter 11: Observability at Scale 327
Chapter 12: The Future of Observability 328
Appendix B Introduction 329
Chapter 2: Introducing OpenTelemetry! 330
Roadmap 330
Chapter 3: Getting Started with the Astronomy Shop 330
Architecture 330
Chapter 5: Managing the OpenTelemetry Collector 332
Background 332
Components 332
Chapter 12: The Future of Observability 340
Code 340
Notes 341
Index 343
In modern software development and operations, observability has emerged as a fundamental concept essential for maintaining and improving the performance, reliability, and scalability of complex systems. But what exactly is observability? At its core, observability is the practice of gaining insights into the internal states and behaviors of systems through the collection, analysis, and visualization of telemetry data. Unlike traditional monitoring, which primarily focuses on predefined metrics and thresholds, observability offers a more comprehensive and dynamic approach, enabling teams to proactively detect, diagnose, and resolve issues.
This chapter will explore the principles and components of observability, highlighting its significance in today's distributed and microservices-based architectures. Through a deep dive into the three pillars of observability (metrics, logs, and traces), you will build the groundwork for understanding how observability can transform the way resilient systems are built and managed.
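To make the three pillars concrete before diving in, here is a minimal sketch using the OpenTelemetry Python SDK that emits one example of each signal: a span, a counter metric, and a log line. The checkout scope, the orders_processed counter, and the console exporters are illustrative choices for this sketch, not examples taken from the book.

```python
import logging

from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Traces: a tracer provider that prints finished spans to the console.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("checkout")  # illustrative instrumentation scope

# Metrics: a meter provider that periodically flushes to the console.
metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]
    )
)
orders = metrics.get_meter("checkout").create_counter(
    "orders_processed", description="Orders handled by the checkout service"
)

# Logs: plain stdlib logging stands in for a log pipeline in this sketch.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "ORD-1")    # trace: one unit of work
    orders.add(1, {"payment.method": "card"})  # metric: an aggregate count
    log.info("processed order ORD-1")          # log: a discrete event
```

In a real deployment, the console exporters would be swapped for exporters that ship data to a Collector or an observability backend.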
IN THIS CHAPTER, YOU WILL LEARN TO:
So, what is observability in the realm of modern software development and operations? While many definitions exist, they generally describe observability as the ability to quickly identify availability and performance problems, regardless of whether they have been experienced before, and to support problem isolation, root cause analysis, and remediation. Because observability is about making it easier to understand complex systems and address unperceived issues, often referred to in the software industry as unknown unknowns,1 the data collected must be correlated across different telemetry types, rich enough to answer questions during a live incident, and immediately accessible.
The Cloud Native Computing Foundation (CNCF), described more fully later in this chapter, provides a definition for the term observability:2
Observability is a system property that defines the degree to which the system can generate actionable insights. It allows users to understand a system's state from these external outputs and take (corrective) action.
Computer systems are measured by observing low-level signals such as CPU time, memory, disk space, and higher-level and business signals, including API response times, errors, transactions per second, etc. These observable systems are observed (or monitored) through specialized tools, so-called observability tools. A list of these tools can be viewed in the Cloud Native Landscape's observability section.3
Observable systems yield meaningful, actionable data to their operators, allowing them to achieve favorable outcomes (faster incident response, increased developer productivity) and less toil and downtime.
Consequently, the observability of a system will significantly impact its operating and development costs.
While the CNCF's definition is good, it is missing a few critical aspects:
The OpenTelemetry project, which will be introduced in Chapter 2, "Introducing OpenTelemetry!," provides a definition of observability that is worth highlighting:
Observability lets you understand a system from the outside by letting you ask questions about that system without knowing its inner workings. Furthermore, it allows you to easily troubleshoot and handle novel problems, that is, "unknown unknowns." It also helps you answer the question, "Why is this happening?"
To ask those questions about your system, your application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don't need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.4
In short, observability is about collecting critical telemetry data with relevant context and using that data to quickly determine your systems' behavior and health. Observability goes beyond mere monitoring by enabling a proactive and comprehensive understanding of system behavior, facilitating quicker detection, diagnosis, and resolution of issues. This capability is crucial in today's fast-paced, microservices-driven, distributed environments, where the complexity and dynamic nature of systems demand robust and flexible observability solutions. Through the lens of the CNCF and OpenTelemetry, you can see observability is not just defined as a set of tools and practices but as a fundamental shift toward more resilient, reliable, and efficient system management.
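As one concrete illustration of that correlation requirement, the following sketch stamps every log record with the active trace and span IDs using the OpenTelemetry Python API, so logs and traces from the same request can be joined later in a backend. The filter class and field names are illustrative, not part of the book's examples.

```python
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace/span IDs to each log record (zeros when no span)."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True  # never drop the record; we only enrich it

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Any log emitted inside a span now carries that span's identifiers.
with trace.get_tracer("checkout").start_as_current_span("charge_card"):
    log.info("payment authorized")
```

With a configured TracerProvider (as in the earlier sketch), real identifiers appear on every record; without one, the fields are zero-filled.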
Riley (she/her) is an experienced site reliability engineer (SRE) with deep observability and operations experience. She recently joined Jupiterian to address their observability problems and work with a new vendor. Riley joined Jupiterian from a large private equity (PE) advertising company, where she was the technical lead of the SRE team and was responsible for a large-scale, globally distributed, cloud native architecture. Before that, she was the founding member of a growth startup where she developed observability practices and culture while helping scale the business to over three million dollars in annual recurring revenue (ARR). Riley was excited about the challenge and opportunity of building observability practices from the ground up at a public enterprise company transitioning to the cloud.
Jupiterian is an e-commerce company that has been around for more than two decades. Over the last five years, the company has seen a massive influx of customers and has been on a journey to modernize its tech stack to keep up with demand and the competition. As part of these changes, it has been migrating from its on-premises monolithic application to a microservices-based architecture running on Kubernetes (K8s) and deployed in the cloud. Recently, outages have been plaguing the new architecture, a problem threatening the company and one that needed to be resolved before the annual peak traffic expected during the upcoming holiday season.
For the original architecture, the company had been using Zabbix, an open source monitoring solution, to monitor the environment. The IT team was beginning to learn about DevOps practices and had set up Prometheus for the new architecture. Given organizational constraints and priorities, however, the team did not have time to develop the skill set needed to manage it or the ever-increasing number of collected metrics. In short, a critical piece of the new architecture was without ownership. On top of this, engineering teams continued to add data, dashboards, and alerts without defined standards or processes. Not surprisingly, the company had difficulty proactively identifying availability and performance issues, and it suffered from various observability problems, including Prometheus availability, blind spots, and alert storms. The company also frequently experienced infrastructure issues and could not tell whether they stemmed from an architectural limitation or improper use of the new infrastructure. As a result, engineers feared going on-call, and innovation velocity was significantly below average.
The Jupiterian engineering team had been pushing management to invest more in observability and SRE. Instead, head count remained flat, and the product roadmaps, driven primarily by the sales team, continued to take priority. With the service missing its service-level agreement (SLA) target for the last three months, leadership demanded a focus on resiliency. To address the problem, the Chief Technology Officer (CTO) signed a three-year deal with Watchwhale, an observability vendor, so the company could focus on its core intellectual property (IP) instead of managing third-party software. An architect in the office of the CTO vetted the vendor and its technology; given other organizational priorities, the engineering team was largely uninvolved in the proof of concept (PoC). The Vice President (VP) of Engineering was tasked with ensuring the service's SLA was consistently met ahead of the holiday period, as well as with the adoption and success of the Watchwhale product. He allocated one of his budget IDs (BIDs) for a senior SRE position, which led to Riley being hired.
The term observability has been around since at least the mid-20th century and is mainly...