
Observability in the AI-Native Era
Description
Observability is mandatory for building and operating cloud-native distributed systems. Tools like OpenTelemetry have standardized how observability data is sourced, and AI now transforms how we extract value from the vast amounts of observability data generated by modern systems. This book guides you in implementing scalable observability, improving engineering efficiency with AI, and integrating observability throughout the Software Development Lifecycle (SDLC) via modern self-service internal developer platforms. You'll start with observability basics and learn how AIOps enhances signal correlation, anomaly detection, and root-cause analysis. Using real-world examples, the book demonstrates how to implement AIOps, build proactive detection pipelines, and automate diagnostics and remediation. You'll explore best practices for expanding observability using OpenTelemetry, Prometheus, Grafana, Dynatrace, Datadog, and New Relic alongside machine learning models, ensuring your systems are accurate, efficient, and secure. You'll also learn how to benchmark, measure, and secure your AIOps implementation, and gain a practical understanding of software compliance and how it applies to your systems. By the end of this book, you'll be ready to design and deliver AIOps-enabled observability solutions that make cloud-native systems more resilient, efficient, and secure.
Contents
- Intro
- Observability in the AI-Native Era
- Leveraging AIOps to build, observe, and operate resilient systems
- Foreword
- Contributors
- About the authors
- About the reviewers
- Table of Contents
- Preface
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Free benefits with your book
- How to Unlock
- Stay Sharp in Cloud and DevOps - Join 44,000+ Subscribers of CloudPro
- Share your thoughts
- Part 1
- From Monitoring via Observability to AIOps
- 1
- Observability: The Art of Turning Data into Insights
- What is observability?
- What is observability and how does it differ from monitoring?
- Let's ask ChatGPT!
- The early days: monitoring static systems
- The dawn of more complex and dynamic systems
- Cloud-native monitoring doesn't scale the way we need it to
- AIOps 2.0: observability ready for the cloud-native AI era
- Three pillars and beyond: use cases for logs, metrics, and traces
- What are metrics?
- Challenge: being careful with high cardinality and privacy in dimensional data
- What are logs?
- Challenge: tackling high-volume, excessive, and unstructured logs
- What are traces?
- Challenge: over-instrumentation, duplicated data, and sampling as challenges
- Beyond the three pillars: use cases based on events, profiling, and real users
- Use case: track your software development life cycle through events
- Use case: provide real-time business insights by extracting business events
- More use cases on existing observability signals
- Use cases: more to come as observability is evolving
- Emerging standards over the years
- OpenTelemetry
- Prometheus
- Visualization standards: Grafana and Perses
- Observability and distributed systems
- Highly distributed systems and the increased complexity
- Understanding distributed systems through observability
- Inventory: which components are part of the system we are responsible for?
- Dependencies: how are components connected and dependent on each other?
- Interfaces/APIs: what are the boundaries of our distributed system to the consumers?
- Health: are all components working as expected or is there abnormal behavior?
- SLA and root cause: are end-to-end critical transactions experiencing issues, and why?
- Shared infrastructure and how it impacts components
- Use case: identifying a noisy neighbor
- Use case: right-sizing infrastructure based on real needs
- Synchronous and asynchronous communication
- Network: metrics or eBPF
- Connections: pools on each side of the call
- Queues: the distributors of messages
- VMs, containers, and databases - oh my!
- Observing hypervisors and VMs
- Observing web and application servers
- Observing databases
- Observing containers
- Observing serverless
- Full stack: from networking to cloud to application observability
- What is full stack observability?
- The observability goal is 100% coverage: start with production infrastructure, then expand up and left
- Defining the focus of this book
- You will be able to prove the value of observability and AI!
- You will use exponential data growth as an opportunity!
- You will expand observability to the left!
- You will provide observability as a self-service!
- You will unleash the power of AIOps through AI-driven automation!
- You will see the journey of Financial One ACME
- Summary
- Further reading
- Get this book's PDF version and more
- 2
- The Elephant in the Room: Artificial Intelligence
- Technical requirements
- Why the hype around AI? What is AI good for?
- AI versus automation
- What is AI's unique value proposition?
- What is AI good for (right now)
- A value-adding abstraction layer
- Model Context Protocol
- MCP server components and features
- RAG versus CAG and how they relate to LLMs
- RAG versus CAG
- Choosing a language model
- What can and will go wrong (and what you can do about it)
- Incorrect user expectations
- Hallucination and errors
- Data poisoning
- Catastrophic forgetting
- Infinite loops
- Prompt engineering helps
- Why do AI projects fail, and how can you succeed?
- Summary
- Further reading
- Join us on Discord
- 3
- From Observability to AIOps and the Use Cases it Solves Today
- When data on glass and static alerts fail
- Alternatives to static alerts on infrastructure metrics
- Static thresholds: where they make sense
- Baselining: where static thresholds are impractical
- Beyond CPU and memory: cloud-native golden signals
- Resource layer
- Orchestration layer
- Workload layer
- Platform service layer
- Service layer
- Application layer
- Observability layer
- Choosing the right metrics through proper load testing
- Step 1: setting up a test environment
- Step 2: defining realistic scenarios
- Step 3: expand-left observability
- Step 4: running tests
- Step 5: identifying critical indicators
- Use case: baseline alerting on Kubernetes health
- Observability-driven development
- Step 1: defining internal and external system health indicators
- Step 2: defining how to measure health indicators
- Step 3: providing easy access to this data for engineering
- Where was the data captured, and who needs it?
- Where and when does this data need to be made available, and to whom?
- Self-service use case 1: providing the top three database queries for team standup
- Step 4: refining, enriching, and automating toward production
- Self-service use case 2: right-sizing container recommendations as a Git pull request
- Context is king: the quality of observability data
- From pets to cattle: semantics for observability
- Enriching observability data across the stack
- Enriching with tags from your infrastructure
- Use case: logs only accessible for sales
- Use case: traces from development only kept for two sprints
- Enriching with tags, labels, and annotations from your deployment
- Use case: is version 4.0.3 good enough to keep, or do we need to roll it back?
- Use case: is the quality of customer-service-portal in preproduction good enough for production?
- Enriching observability data across the SDLC
- Tracking the SDLC
- Use case: automated real-time artifact inventory and software catalog
- Use case: tracking DORA and other DevEx efficiency metrics
- Use case: automated correlation of deployment changes with problems
- Observing your DevOps tools
- Quality of data: where to enrich it and what to sample
- How and where to enrich observability data with context
- Do we need all the data? Sampling strategies
- AIOps: reducing the noise with anomaly and root cause detection
- Step 1: detecting abnormal events
- Scope
- Dependencies
- Shared resources
- External events
- Step 2: connecting the dots to find the root cause
- Horizontal stack, or call chain
- Vertical stack, or runs on another component
- Network connectivity
- Cross-application or cross-stack
- Step 3: explaining all the evidence
- From ops to business: SLO-based impact analysis
- The right questions to ask!
- Moving from technical to business objectives
- Why we still drown in incidents
- Expanded scope
- Tool consolidation
- Gaps in end-to-end observability
- Missing ownership
- Lack of criticality and business impact
- A primer on SLOs: learnings from Google's SRE handbook
- Service-level indicator
- Service-level objective
- Error budget and burndown rate
- Service-level agreement
- From SLOs to business objectives
- Asking business impact questions
- Connecting business objectives with technical objectives
- Business impact analysis as part of incident response
- Incident analysis without SLO context
- Connecting the SLO with the incident
- How to start this journey
- How would Financial One ACME make this transition?
- Summary
- Further reading
- Get this book's PDF version and more
- 4
- ACME Financial Services: Implementing AIOps
- Technical requirements
- Our fictitious company
- After the great cloud migration
- ACME Financial Services' current state of observability
- How their observability practices became unmanageable
- An explosion of operating costs
- What about regulations?
- Controlled deployments
- The old deployment process
- From continuous delivery to continuous incident
- The strain from adding features
- The technology stack
- Deciding what to improve from a sea of options
- Tackling the issues
- Issues determined to be medium effort
- An eye toward aiding fraud prevention
- Tracking feature usage
- Plans to address alert fatigue
- New tools, new possibilities
- The tool selection game
- Build versus buy
- Time to investigate
- The traceability group
- The feature usage group
- The fraud group
- The alerts group
- The SLO mess
- Static alerts
- Siloed data
- Alerts from logs
- Bringing it all together
- Plan of action
- Having the "what" and determining the "how"
- It's finally done. What did we gain?
- Summary
- Join us on Discord
- Part 2
- Expanding Left: Moving AIOps into Platform Engineering
- 5
- Democratizing Observability: A Primer to Self-Service Platforms
- Technical requirements
- What is a self-service platform?
- Kubernetes standardization
- Kustomize
- Boilerplate code and templates
- Prometheus templates
- Helm templates
- Templates in non-Kubernetes architectures
- Terraform
- OpenTofu
- Ansible
- Choosing and using your templating technology
- How to life cycle templates
- Change tracking
- Usage tracking
- What is the role of the platform?
- Enforcement of the maturity models
- The platform is the unifying layer
- Use cases for the integration of AI into the IDP
- Instruction files and giving your AI direction
- A reliable AI is an observable AI
- Strategies for measuring response accuracy
- Natural language processing scoring
- AI is a capability of the IDP
- Beyond IDP: Other places to push observability data
- Summary
- Further reading
- Get this book's PDF version and more
- 6
- The Observability Agent: Real-Life Use Cases
- Agentic AI and MCP: How it can revolutionize modern cloud-native organizations
- Recap on agentic AI, LLMs, and MCP
- The four phases of agentic AI
- Perceive phase
- Reason phase
- Act phase
- Learn phase
- Beyond observability: The other systems agents connect to
- Change management
- Observability
- Bug tracking
- Status pages
- CRM
- Git code and code history
- Software catalog
- Connecting observability data to your agentic AI
- Defining tools: Connecting your observability platform through MCP
- Start with the questions you need to answer
- Tip: Don't mix read and write
- Tip: Generic versus use case-specific
- Tip: Local versus remote validation
- Providing guidance: Instruction files for your AI agents
- Tracking and measuring usage and impact
- Instrumenting the MCP server and beyond
- Analyzing the backend observability API calls
- Agentic AI observability
- Defining personas: Who is using the AI and where?
- Who are your internal end users and their use cases?
- Planning: What features to build, improve, or remove?
- Building: How to improve the resiliency of our architecture
- Testing: What use cases are we not covering right now?
- Releasing: When is a good time to release a new feature and to whom?
- Operating: What's the best way to mitigate production issues?
- From which tools do engineers interact with the agent?
- Project management and issue tracking tools
- Collaboration and chat tools
- Customer service and support tools
- Business reporting tools
- From manual pull prompts to automated push results
- Use case: Push daily standup insights into Slack/MS Teams
- Use case: Push observability insights to pull request
- Use case: Push daily business insights via emails
- Using observability agents to get from reactive to proactive operations
- Goal-driven engineering
- Defining goals for engineering
- Achieving goals: Prompting Center of Excellence
- Scaling: From prompt to autonomous agents
- Use cases to get from reactive to proactive
- Use case - FinOps: redistributing workloads
- Use case - Observability data quality checks
- Use case - Optimizing user experience
- Summary
- Further reading
- Join us on Discord
- 7
- ACME Financial Services: How to Move from AIOps to Agentic Platforms
- Technical requirements
- Once more into the breach
- The cost drivers
- A new challenge: Business process observability
- A new round of discussions
- Feedback from the application developers
- Feedback from the SREs
- Compliance: The great defense
- How observability can help with compliance
- Expanding the utility of the observability tools
- Tackling the costs
- Assessing sampling
- The new changeset
- A platform team?
- A big ask, a big undertaking
- A platform team does what?
- That's an awful lot of responsibility
- How do we do that?
- In search of more answers
- Connecting the systems
- Discovering how applications were deployed and managed
- Discovering how Kubernetes clusters were managed
- Discovering the concept of self-service
- What does a platform team deliverable look like?
- Who is a fit for a platform team?
- Lessons from the SREs
- Reactive versus proactive scaling
- Self-service platform details
- Assistance with RCA
- Automating regulation compliance
- Expanding left: Meeting developers where they work
- Observability-driven development
- Bringing it all together: The new proposal to senior leadership
- The presentation
- Phase 1 - Crawl: Setting the standards
- Phase 2 - Walk: IDP with its first golden paths
- Phase 3 - Run: From AI and platform to agentic platform
- The results of the presentation
- Time jump: What was achieved
- Summary
- Get this book's PDF version and more
- Part 3
- From AI Assistants to Self-Driving Architectures
- 8
- Evolving Operations: Proactive & Preventive & Self-Driven Architecture
- Technical requirements
- Defining our terms
- Differentiating proactive versus preventative versus self-driven
- Enabling AI-driven decision-making through event-driven architectures
- Feeding events into our observability platform
- Scenario: tying events and AI operations together
- How AI can help us analyze the situation
- How to let the AI also take automated actions
- Use case: AI and GitOps
- Integrating agentic AI for Kubernetes
- Use case: AI-driven scaling for performance and cost
- Types of scaling
- Scaling for performance
- Scaling for cost
- Scaling for carbon footprint
- How to approach auto scaling
- Enemy at the gates: let your AI be your defender
- Using AI to parse audit logs and system access
- AI for reducing risk surface area
- Risk surface 1: data
- Risk surface 2: users and permissions
- Use case: ensuring proper chain of custody for workloads
- Chain of custody explained
- Commit signing and signature verification
- Validating package contents
- Use case: repository maintenance - let AI take on the boring tasks
- Summary
- Join us on Discord
- 9
- No Future Without Challenges
- Technical requirements
- Building trust in the AI
- Regular dry runs
- Leveraging vendors and tools
- Implementing and observing governance
- Guardrails in the cloud
- Guardrails as middleware
- Demonstrating success
- Reliability
- The hidden costs of implementing AI
- Tokens and AI costs
- Tokens and spending
- Token management
- Prompt engineering and context engineering
- Fine-tuning
- Supervised fine-tuning
- Few-shot learning
- Transfer learning
- Domain-specific fine-tuning
- Parameter-efficient fine-tuning (PEFT)
- Continuous learning
- Breaking down your options
- GPUs are expensive
- Cost-benefit analysis
- Securing the AI
- What's different about security in an AI environment?
- Zero trust
- Code scanning and SBOMS
- Other best practices for zero trust
- Data security posture management
- Threat models
- How to keep up with industry trends, best practices, and threats
- Mind the gap: How to maintain compliance when implementing new AI technologies
- What is compliance?
- Compliance risk 1: Data sources unknown
- Risk 2: Skills and MCP servers may not pass muster
- Risk 3: Data loss
- Risk 4: Bias in the AI
- Mitigations
- Summary
- Further reading
- Get this book's PDF version and more
- 10
- ACME Financial Services: How Will the AI Future Shape Our Company?
- Technical requirements
- Our story thus far
- Evolving trust in the observability platform
- The SREs' path to trust
- The developers' path to trust
- A trust framework was born
- The experimentation and dry-run phase
- The trust phase
- The verify phase
- How trust changed the company
- How SREs were impacted by trusting the tool
- How developers were impacted by trusting the tool
- Business process observability bears fruit
- The importance of securing the observability platform
- Who can access the data
- Managing the system
- Watch the drift
- The continual efforts to stay current
- Monitoring industry trends
- Expansion increases the effort
- The cost of early adoption
- Always monitoring the cost
- Incorporating changes into roadmaps
- So, what's next?
- Summary
- Join us on Discord
- 11
- Unlock Your Exclusive Benefits
- Unlock this Book's Free Benefits in three Easy Steps
- Step 1
- Step 2
- Step 3
- Need Help
- Why subscribe?
- Other Books You May Enjoy
- Packt is searching for authors like you
- Share your thoughts
- Index
System requirements
File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see E-Book Help).
- Tablet/Smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see E-Book Help).
- E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)
The ePUB file format is well suited to novels and non-fiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately be unable to open the e-book. You therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID after installing it!
Further information is available in our E-Book Help.
File format: ePUB
Copy protection: no DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use reading software that can handle the ePUB file format, e.g. Adobe Digital Editions or FBReader, both free (see E-Book Help).
- Tablet/Smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see E-Book Help).
- E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others
The ePUB file format is well suited to novels and non-fiction, that is, to "plain" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays.
This e-book does not use copy protection or Digital Rights Management.
Further information is available in our E-Book Help.