Discover the power of open source observability for your enterprise environment
In Mastering Observability and OpenTelemetry: Enhancing Application and Infrastructure Performance and Avoiding Outages, accomplished engineering leader and open source contributor Steve Flanders unlocks the secrets of enterprise application observability with a comprehensive guide to OpenTelemetry (OTel). Explore how OTel transforms observability, providing a robust toolkit for capturing and analyzing telemetry data across your environment.
You will learn how OTel delivers unmatched flexibility, extensibility, and vendor neutrality, freeing you from vendor lock-in and enabling data sovereignty and portability. You will also discover:
Whether you are a novice or a seasoned professional, Mastering Observability and OpenTelemetry is your roadmap to troubleshooting availability and performance problems by learning to detect anomalies, interpret data, and proactively optimize performance in your enterprise environment. Embark on your journey to observability mastery today!
STEVE FLANDERS is a Senior Director of Engineering at Splunk, a Cisco company. Steve is one of the founding members of the OpenTelemetry project.
Foreword xiii
Introduction xiv
The Mastering Series xvi
Chapter 1 What Is Observability? 1
Definition 1
Background 4
Cloud Native Era 4
Monitoring Compared to Observability 5
Metadata 8
Dimensionality 9
Cardinality 9
Semantic Conventions 10
Data Sensitivity 10
Signals 10
Metrics 10
Logs 13
Traces 14
Other Signals 20
Collecting Signals 20
Instrumentation 21
Push Versus Pull Collection 22
Data Collection 23
Sampling Signals 26
Observability 27
Platforms 27
Application Performance Monitoring 28
The Bottom Line 28
Notes 30
Chapter 2 Introducing OpenTelemetry! 31
Background 31
Observability Pain Points 31
The Rise of Open Source Software 34
Introducing OpenTelemetry 35
OpenTelemetry Components 37
OpenTelemetry Concepts 48
Roadmap 50
The Bottom Line 50
Notes 51
Chapter 3 Getting Started with the Astronomy Shop 53
Background 53
Architecture 54
Prerequisites 54
Getting Started 55
Accessing the Astronomy Shop 57
Accessing Telemetry Data 57
Beyond the Basics 58
Configuring Load Generation 58
Configuring Feature Flags 59
Configuring Tests Built from Traces 60
Configuring the OTel Collector 60
Configuring OTel Instrumentation 62
Troubleshooting Astronomy Shop 62
Astronomy Shop Scenarios 63
Troubleshooting Errors 63
Troubleshooting Availability 69
Troubleshooting Performance 70
Troubleshooting Telemetry 74
The Bottom Line 75
Notes 76
Chapter 4 Understanding the OpenTelemetry Specification 77
Background 77
API Specification 79
API Definition 80
API Context 80
API Signals 81
API Implementation 82
SDK Specification 82
SDK Definition 83
SDK Signals 83
SDK Implementation 84
Data Specification 84
Data Models 86
Data Protocols 88
Data Semantic Conventions 88
Data Compatibility 89
General Specification 90
The Bottom Line 91
Notes 92
Chapter 5 Managing the OpenTelemetry Collector 93
Background 94
Deployment Modes 95
Agent Mode 96
Gateway Mode 98
Reference Architectures 100
The Basics 101
The Binary 103
Sizing 103
Components 104
Configuration 106
Receivers and Exporters 115
Processors 116
Extensions 126
Connectors 127
Observing 128
Relevant Metrics 128
Health Check Extension 131
zPages Extension 131
Troubleshooting 134
Out of Memory Crashes 134
Data Not Being Received or Exported 134
Performance Issues 135
Beyond the Basics 135
Distributions 135
Securing 137
Management 138
The Bottom Line 140
Notes 141
Chapter 6 Leveraging OpenTelemetry Instrumentation 143
Environment Setup 144
Python Trace Instrumentation 149
Automatic Instrumentation 150
Manual Instrumentation 157
Programmatic Instrumentation 163
Mixing Automatic and Manual Trace Instrumentation 166
Python Metrics Instrumentation 167
Automatic Instrumentation 168
Manual Instrumentation 169
Programmatic Instrumentation 174
Mixing Automatic and Manual Metric Instrumentation 176
Python Log Instrumentation 178
Manual Metadata Enrichment 179
Trace Correlation 181
Language Considerations 183
.NET 184
Java 184
Go 184
Node.js 185
Deployment Models 185
Distributions 185
The Bottom Line 186
Notes 187
Chapter 7 Adopting OpenTelemetry 189
The Basics 189
Why OTel and Why Now? 190
Where to Start? 191
General Process 192
Data Collection 193
Instrumentation 195
Production Readiness 196
Maturity Framework 197
Brownfield Deployment 198
Data Collection 198
Instrumentation 200
Dashboards and Alerts 202
Greenfield Deployment 204
Data Collection 204
Instrumentation 208
Other Considerations 208
Administration and Maintenance 208
Environments 211
Semantic Conventions 212
The Future 213
The Bottom Line 213
Notes 214
Chapter 8 The Power of Context and Correlation 215
Background 215
Context 217
OTel Context 219
Trace Context 221
Resource Context 223
Logic Context 224
Correlation 225
Time Correlation 225
Context Correlation 226
Trace Correlation 228
Metric Correlation 230
The Bottom Line 230
Notes 231
Chapter 9 Choosing an Observability Platform 233
Primary Considerations 233
Platform Capabilities 235
Marketing Versus Reality 237
Price, Cost, and Value 238
Observability Fragmentation 241
Primary Factors 242
Build, Buy, or Manage 242
Licensing, Operations, and Deployment 244
OTel Compatibility and Vendor Lock-In 244
Stakeholders and Company Culture 245
Implementation Basics 246
Administration 247
Usage 248
Maturity Framework 248
The Bottom Line 250
Notes 250
Chapter 10 Observability Antipatterns and Pitfalls 251
Telemetry Data Missteps 251
Mixing Instrumentation Libraries Scenario 253
Automatic Instrumentation Scenario 253
Custom Instrumentation Scenario 254
Component Configuration Scenario 255
Performance Overhead Scenario 255
Resource Allocation Scenario 256
Security Considerations Scenario 256
Monitoring and Maintenance Scenario 257
Observability Platform Missteps 258
Vendor Lock-in Scenario 260
Fragmented Tooling Scenario 260
Tool Fatigue Scenario 261
Inadequate Scalability Scenario 261
Data Overload Scenario 262
Company Culture Implications 264
Lack of Leadership Support Scenario 265
Resistance to Change Scenario 266
Collaboration and Alignment Scenario 266
Goals and Success Criteria Scenario 267
Standardization and Consistency Scenario 268
Incentives and Recognition Scenario 268
Feedback and Improvement Scenario 269
Prioritization Framework 270
The Bottom Line 272
Notes 273
Chapter 11 Observability at Scale 275
Understanding the Challenges 275
Volume and Velocity of Telemetry Data 276
Distributed System Complexity 278
Observability Platform Complexity 281
Infrastructure and Resource Constraints 281
Strategies for Scaling Observability 282
Elasticity, Elasticity, Elasticity! 282
Leverage Cloud Native Technologies 284
Filter, Sample, and Aggregate 286
Anomaly Detection and Predictive Analytics 290
Emerging Technologies and Methodologies 291
Best Practices for Managing Scale 292
General Recommendations 292
Instrumentation and Data Collection 293
Observability Platform 293
The Bottom Line 294
Notes 295
Chapter 12 The Future of Observability 297
Challenges and Opportunities 297
Cost 297
Complexity 299
Compliance 300
Code 301
Emerging Trends and Innovations 302
Artificial Intelligence 303
Observability as Code 304
Service Mesh 305
eBPF 306
The Future of OpenTelemetry 307
Stabilization and Expansion 308
Expanded Signal Support 308
Unified Query Language 310
Community-driven Innovation 310
The Bottom Line 311
Notes 311
Appendix A The Bottom Line 313
Chapter 1: What Is Observability? 313
Chapter 2: Introducing OpenTelemetry! 315
Chapter 3: Getting Started with the Astronomy Shop 316
Chapter 4: Understanding the OpenTelemetry Specification 317
Chapter 5: Managing the OpenTelemetry Collector 318
Chapter 6: Leveraging OpenTelemetry Instrumentation 320
Chapter 7: Adopting OpenTelemetry 321
Chapter 8: The Power of Context and Correlation 323
Chapter 9: Choosing an Observability Platform 324
Chapter 10: Observability Antipatterns and Pitfalls 326
Chapter 11: Observability at Scale 327
Chapter 12: The Future of Observability 328
Appendix B Introduction 329
Chapter 2: Introducing OpenTelemetry! 330
Roadmap 330
Chapter 3: Getting Started with the Astronomy Shop 330
Architecture 330
Chapter 5: Managing the OpenTelemetry Collector 332
Background 332
Components 332
Chapter 12: The Future of Observability 340
Code 340
Notes 341
Index 343
In modern software development and operations, observability has emerged as a fundamental concept essential for maintaining and improving the performance, reliability, and scalability of complex systems. But what exactly is observability? At its core, observability is the practice of gaining insights into the internal states and behaviors of systems through the collection, analysis, and visualization of telemetry data. Unlike traditional monitoring, which primarily focuses on predefined metrics and thresholds, observability offers a more comprehensive and dynamic approach, enabling teams to proactively detect, diagnose, and resolve issues.
This chapter will explore the principles and components of observability, highlighting its significance in today's distributed and microservices-based architectures. Through a deep dive into the three pillars of observability (metrics, logs, and traces), you will build the groundwork for understanding how observability can transform the way resilient systems are built and managed.
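To make the three pillars concrete before diving in, here is a minimal sketch using the OpenTelemetry Python SDK that emits one example of each signal: a span, a counter metric, and a log line. The checkout scope, the orders_processed counter, and the console exporters are illustrative choices for this sketch, not examples taken from the book.

```python
import logging

from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Traces: a tracer provider that prints finished spans to the console.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("checkout")  # illustrative instrumentation scope

# Metrics: a meter provider that periodically flushes to the console.
metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]
    )
)
orders = metrics.get_meter("checkout").create_counter(
    "orders_processed", description="Orders handled by the checkout service"
)

# Logs: plain stdlib logging stands in for a log pipeline in this sketch.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "ORD-1")    # trace: one unit of work
    orders.add(1, {"payment.method": "card"})  # metric: an aggregate count
    log.info("processed order ORD-1")          # log: a discrete event
```

In a real deployment, the console exporters would be swapped for exporters that ship data to a Collector or an observability backend.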
IN THIS CHAPTER, YOU WILL LEARN TO:
So, what is observability in the realm of modern software development and operations? While many definitions exist, they generally describe observability as the ability to quickly identify availability and performance problems, regardless of whether they have been experienced before, and to support problem isolation, root cause analysis, and remediation. Because observability is about making it easier to understand complex systems and address unperceived issues, often referred to in the software industry as unknown unknowns,1 the data collected must be correlated across different telemetry types, rich enough to answer questions during a live incident, and immediately accessible.
The Cloud Native Computing Foundation (CNCF), described more fully later in this chapter, provides a definition for the term observability:2
Observability is a system property that defines the degree to which the system can generate actionable insights. It allows users to understand a system's state from these external outputs and take (corrective) action.
Computer systems are measured by observing low-level signals such as CPU time, memory, disk space, and higher-level and business signals, including API response times, errors, transactions per second, etc. These observable systems are observed (or monitored) through specialized tools, so-called observability tools. A list of these tools can be viewed in the Cloud Native Landscape's observability section.3
Observable systems yield meaningful, actionable data to their operators, allowing them to achieve favorable outcomes (faster incident response, increased developer productivity) and less toil and downtime.
Consequently, the observability of a system will significantly impact its operating and development costs.
While the CNCF's definition is good, it is missing a few critical aspects:
The OpenTelemetry project, which will be introduced in Chapter 2, "Introducing OpenTelemetry!," provides a definition of observability that is worth highlighting:
Observability lets you understand a system from the outside by letting you ask questions about that system without knowing its inner workings. Furthermore, it allows you to easily troubleshoot and handle novel problems, that is, "unknown unknowns." It also helps you answer the question, "Why is this happening?"
To ask those questions about your system, your application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don't need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.4
In short, observability is about collecting critical telemetry data with relevant context and using that data to quickly determine your systems' behavior and health. Observability goes beyond mere monitoring by enabling a proactive and comprehensive understanding of system behavior, facilitating quicker detection, diagnosis, and resolution of issues. This capability is crucial in today's fast-paced, microservices-driven, distributed environments, where the complexity and dynamic nature of systems demand robust and flexible observability solutions. Through the lens of the CNCF and OpenTelemetry, you can see observability is not just defined as a set of tools and practices but as a fundamental shift toward more resilient, reliable, and efficient system management.
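As one concrete illustration of that correlation requirement, the following sketch stamps every log record with the active trace and span IDs using the OpenTelemetry Python API, so logs and traces from the same request can be joined later in a backend. The filter class and field names are illustrative, not part of the book's examples.

```python
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace/span IDs to each log record (zeros when no span)."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True  # never drop the record; we only enrich it

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Any log emitted inside a span now carries that span's identifiers.
with trace.get_tracer("checkout").start_as_current_span("charge_card"):
    log.info("payment authorized")
```

With a configured TracerProvider (as in the earlier sketch), real identifiers appear on every record; without one, the fields are zero-filled.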
Riley (she/her) is an experienced site reliability engineer (SRE) with deep observability and operations experience. She recently joined Jupiterian to address their observability problems and work with a new vendor. Riley joined Jupiterian from a large private equity (PE) advertising company, where she was the technical lead of the SRE team and was responsible for a large-scale, globally distributed, cloud native architecture. Before that, she was the founding member of a growth startup where she developed observability practices and culture while helping scale the business to over three million dollars in annual recurring revenue (ARR). Riley was excited about the challenge and opportunity of building observability practices from the ground up at a public enterprise company transitioning to the cloud.
Jupiterian is an e-commerce company that has been around for more than two decades. Over the last five years, the company has seen a massive influx of customers and has been on a journey to modernize its tech stack to keep up with demand and the competition. As part of these changes, it has been migrating from its on-premises monolithic application to a microservices-based architecture running on Kubernetes (K8s) and deployed in the cloud. Recently, outages have been plaguing the new architecture, a problem threatening the company and one that needed to be resolved before the annual peak traffic expected during the upcoming holiday season.
For the original architecture, the company had been using Zabbix, an open source monitoring solution, to monitor the environment. The IT team was beginning to learn about DevOps practices and had set up Prometheus for the new architecture. Given organizational constraints and priorities, however, the team did not have time to develop the skill set needed to manage it or the ever-increasing number of collected metrics. In short, a critical piece of the new architecture was without ownership. On top of this, engineering teams continued to add data, dashboards, and alerts without defined standards or processes. Not surprisingly, the company had difficulty proactively identifying availability and performance issues, and it suffered from various observability problems, including Prometheus availability, blind spots, and alert storms. The company also frequently experienced infrastructure issues and could not tell whether they stemmed from an architectural limitation or improper use of the new infrastructure. As a result, engineers feared going on-call, and innovation velocity was significantly below average.
The Jupiterian engineering team had been pushing management to invest more in observability and SRE. Instead, head count remained flat, and the product roadmaps, driven primarily by the sales team, continued to take priority. With the service missing its service-level agreement (SLA) target for the last three months, leadership demanded a focus on resiliency. To address the problem, the Chief Technology Officer (CTO) signed a three-year deal with Watchwhale, an observability vendor, so the company could focus on its core intellectual property (IP) instead of managing third-party software. An architect in the office of the CTO vetted the vendor and its technology; given other organizational priorities, the engineering team was largely uninvolved in the proof of concept (PoC). The Vice President (VP) of Engineering was tasked with ensuring the service's SLA was consistently met ahead of the holiday period, as well as with the adoption and success of the Watchwhale product. He allocated one of his budget IDs (BIDs) for a senior SRE position, which led to Riley being hired.
The term observability has been around since at least the mid-20th century and is mainly...