Observability in the AI-Native Era

Name: Observability in the AI-Native Era | Leveraging AIOps to build, observe, and operate resilient systems
Brand: Packt Publishing
Price: 38.39 EUR
Availability: OnlineOnly

Leveraging AIOps to build, observe, and operate resilient systems

Hilliary Lipsig, Andreas Grabner, Robert Rati (Authors)

Packt Publishing

1st edition

Published on 13 March 2026

420 pages

E-book

ePUB with Adobe DRM

E-book

ePUB without DRM

978-1-80638-958-2 (ISBN)

from €38.39

Available for download

Contents

Intro
Observability in the AI-Native Era
Leveraging AIOps to build, observe, and operate resilient systems
Observability in the AI-Native Era
Foreword
Contributors
About the authors
About the reviewers
Table of Contents
Preface
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Free benefits with your book
How to Unlock
Stay Sharp in Cloud and DevOps - Join 44,000+ Subscribers of CloudPro
Share your thoughts
Part 1: From Monitoring via Observability to AIOps
Chapter 1: Observability: The Art of Turning Data into Insights
What is observability?
What is observability and how does it differ from monitoring?
Let's ask ChatGPT!
The early days: monitoring static systems
The dawn of more complex and dynamic systems
Cloud-native monitoring doesn't scale the way we need it to
AIOps 2.0: observability ready for the cloud-native AI era
Three pillars and beyond: use cases for logs, metrics, and traces
What are metrics?
Challenge: being careful with high cardinality and privacy in dimensional data
What are logs?
Challenge: tackling high-volume, excessive, and unstructured logs
What are traces?
Challenge: over-instrumentation, duplicated data, and sampling as challenges
Beyond the three pillars: use cases based on events, profiling, and real users
Use case: track your software development life cycle through events
Use case: provide real-time business insights by extracting business events
More use cases on existing observability signals
Use cases: more to come as observability is evolving
Emerging standards over the years
OpenTelemetry
Prometheus
Visualization standards: Grafana and Perses
Observability and distributed systems
Highly distributed systems and the increased complexity
Understanding distributed systems through observability
Inventory: which components are part of the system we are responsible for?
Dependencies: how are components connected and dependent on each other?
Interfaces/APIs: what are the boundaries of our distributed system to the consumers?
Health: are all components working as expected or is there abnormal behavior?
SLA and root cause: are end-to-end critical transactions experiencing issues, and why?
Shared infrastructure and how it impacts components
Use case: identifying a noisy neighbor
Use case: right-sizing infrastructure based on real needs
Synchronous and asynchronous communication
Network: metrics or eBPF
Connections: pools on each side of the call
Queues: the distributors of messages
VMs, containers, and databases - oh my!
Observing hypervisors and VMs
Observing web and application servers
Observing databases
Observing containers
Observing serverless
Full stack: from networking to cloud to application observability
What is full stack observability?
The observability goal is 100% coverage: start with production infrastructure, then expand up and left
Defining the focus of this book
You will be able to prove the value of observability and AI!
You will use exponential data growth as an opportunity!
You will expand observability to the left!
You will provide observability as a self-service!
You will unleash the power of AIOps through AI-driven automation!
You will see the journey of Financial One ACME
Summary
Further reading
Get this book's PDF version and more
Chapter 2: The Elephant in the Room: Artificial Intelligence
Technical requirements
Why the hype around AI? What is AI good for?
AI versus automation
What is AI's unique value proposition?
What is AI good for (right now)
A value-adding abstraction layer
Model Context Protocol
MCP server components and features
RAG versus CAG and how they relate to LLMs
RAG versus CAG
Choosing a language model
What can and will go wrong (and what you can do about it)
Incorrect user expectations
Hallucination and errors
Data poisoning
Catastrophic forgetting
Infinite loops
Prompt engineering helps
Why do AI projects fail, and how can you succeed?
Summary
Further reading
Join us on Discord
Chapter 3: From Observability to AIOps and the Use Cases it Solves Today
When data on glass and static alerts fail
Alternatives to static alerts on infrastructure metrics
Static thresholds: where they make sense
Baselining: where static thresholds are impractical
Beyond CPU and memory: cloud-native golden signals
Resource layer
Orchestration layer
Workload layer
Platform service layer
Service layer
Application layer
Observability layer
Choosing the right metrics through proper load testing
Step 1: setting up a test environment
Step 2: defining realistic scenarios
Step 3: expand-left observability
Step 4: running tests
Step 5: identifying critical indicators
Use case: baseline alerting on Kubernetes health
Observability-driven development
Step 1: defining internal and external system health indicators
Step 2: defining how to measure health indicators
Step 3: providing easy access to this data for engineering
Where was the data captured, and who needs it?
Where and when does this data need to be made available, and to whom?
Self-service use case 1: providing the top three database queries for team standup
Step 4: refining, enriching, and automating toward production
Self-service use case 2: right-sizing container recommendations as a Git pull request
Context is king: the quality of observability data
From pets to cattle: semantics for observability
Enriching observability data across the stack
Enriching with tags from your infrastructure
Use case: logs only accessible for sales
Use case: traces from development only kept for two sprints
Enriching with tags, labels, and annotations from your deployment
Use case: is version 4.0.3 good enough to keep, or do we need to roll it back?
Use case: is the quality of customer-service-portal in preproduction good enough for production?
Enriching observability data across the SDLC
Tracking the SDLC
Use case: automated real-time artifact inventory and software catalog
Use case: tracking DORA and other DevEx efficiency metrics
Use case: automated correlation of deployment changes with problems
Observing your DevOps tools
Quality of data: where to enrich it and what to sample
How and where to enrich observability data with context
Do we need all the data? Sampling strategies
AIOps: reducing the noise with anomaly and root cause detection
Step 1: detecting abnormal events
Scope
Dependencies
Shared resources
External events
Step 2: connecting the dots to find the root cause
Horizontal stack, or call chain
Vertical stack, or runs on another component
Network connectivity
Cross-application or cross-stack
Step 3: explaining all the evidence
From ops to business: SLO-based impact analysis
The right questions to ask!
Moving from technical to business objectives
Why we still drown in incidents
Expanded scope
Tool consolidation
Gaps in end-to-end observability
Missing ownership
Lack of criticality and business impact
A primer on SLOs: learnings from Google's SRE handbook
Service-level indicator
Service-level objective
Error budget and burndown rate
Service-level agreement
From SLOs to business objectives
Asking business impact questions
Connecting business objectives with technical objectives
Business impact analysis as part of incident response
Incident analysis without SLO context
Connecting the SLO with the incident
How to start this journey
How would Financial One ACME make this transition?
Summary
Further reading
Get this book's PDF version and more
Chapter 4: ACME Financial Services: Implementing AIOps
Technical requirements
Our fictitious company
After the great cloud migration
ACME Financial Services' current state of observability
How their observability practices became unmanageable
An explosion of operating costs
What about regulations?
Controlled deployments
The old deployment process
From continuous delivery to continuous incident
The strain from adding features
The technology stack
Deciding what to improve from a sea of options
Tackling the issues
Issues determined to be medium effort
An eye toward aiding fraud prevention
Tracking feature usage
Plans to address alert fatigue
New tools, new possibilities
The tool selection game
Build versus buy
Time to investigate
The traceability group
The feature usage group
The fraud group
The alerts group
The SLO mess
Static alerts
Siloed data
Alerts from logs
Bringing it all together
Plan of action
Having the "what" and determining the "how"
It's finally done. What did we gain?
Summary
Join us on Discord
Part 2: Expanding Left: Moving AIOps into Platform Engineering
Chapter 5: Democratizing Observability: A Primer to Self-Service Platforms
Technical requirements
What is a self-service platform?
Kubernetes standardization
Kustomize
Boilerplate code and templates
Prometheus templates
Helm templates
Templates in non-Kubernetes architectures
Terraform
OpenTofu
Ansible
Choosing and using your templating technology
How to life cycle templates
Change tracking
Usage tracking
What is the role of the platform?
Enforcement of the maturity models
The platform is the unifying layer
Use cases for the integration of AI into the IDP
Instruction files and giving your AI direction
A reliable AI is an observable AI
Strategies for measuring response accuracy
Natural language processing scoring
AI is a capability of the IDP
Beyond IDP: Other places to push observability data
Summary
Further reading
Get this book's PDF version and more
Chapter 6: The Observability Agent: Real-Life Use Cases
Agentic AI and MCP: How it can revolutionize modern cloud-native organizations
Recap on agentic AI, LLMs, and MCP
The four phases of agentic AI
Perceive phase
Reason phase
Act phase
Learn phase
Beyond observability: The other systems agents connect to
Change management
Observability
Bug tracking
Status pages
CRM
Git code and code history
Software catalog
Connecting observability data to your agentic AI
Defining tools: Connecting your observability platform through MCP
Start with the questions you need to answer
Tip: Don't mix read and write
Tip: Generic versus use case-specific
Tip: Local versus remote validation
Providing guidance: Instruction files for your AI agents
Tracking and measuring usage and impact
Instrumenting the MCP server and beyond
Analyzing the backend observability API calls
Agentic AI observability
Defining personas: Who is using the AI and where?
Who are your internal end users and their use cases?
Planning: What features to build, improve, or remove?
Building: How to improve the resiliency of our architecture
Testing: What use cases are we not covering right now?
Releasing: When is a good time to release a new feature and to whom?
Operating: What's the best way to mitigate production issues?
From which tools do engineers interact with the agent?
Project management and issue tracking tools
Collaboration and chat tools
Customer service and support tools
Business reporting tools
From manual pull prompts to automated push results
Use case: Push daily standup insights into Slack/MS Teams
Use case: Push observability insights to pull request
Use case: Push daily business insights via emails
Using observability agents to get from reactive to proactive operations
Goal-driven engineering
Defining goals for engineering
Achieving goals: Prompting Center of Excellence
Scaling: From prompt to autonomous agents
Use cases to get from reactive to proactive
Use case - FinOps: redistributing workloads
Use case - Observability data quality checks
Use case - Optimizing user experience
Summary
Further reading
Join us on Discord
Chapter 7: ACME Financial Services: How to Move from AIOps to Agentic Platforms
Technical requirements
Once more into the breach
The cost drivers
A new challenge: Business process observability
A new round of discussions
Feedback from the application developers
Feedback from the SREs
Compliance: The great defense
How observability can help with compliance
Expanding the utility of the observability tools
Tackling the costs
Assessing sampling
The new changeset
A platform team?
A big ask, a big undertaking
A platform team does what?
That's an awful lot of responsibility
How do we do that?
In search of more answers
Connecting the systems
Discovering how applications were deployed and managed
Discovering how Kubernetes clusters were managed
Discovering the concept of self-service
What does a platform team deliverable look like?
Who is a fit for a platform team?
Lessons from the SREs
Reactive versus proactive scaling
Self-service platform details
Assistance with RCA
Automating regulation compliance
Expanding left: Meeting developers where they work
Observability-driven development
Bringing it all together: The new proposal to senior leadership
The presentation
Phase 1 - Crawl: Setting the standards
Phase 2 - Walk: IDP with its first golden paths
Phase 3 - Run: From AI and platform to agentic platform
The results of the presentation
Time jump: What was achieved
Summary
Get this book's PDF version and more
Part 3: From AI Assistants to Self-Driving Architectures
Chapter 8: Evolving Operations: Proactive & Preventive & Self-Driven Architecture
Technical requirements
Defining our terms
Differentiating proactive versus preventative versus self-driven
Enabling AI-driven decision-making through event-driven architectures
Feeding events into our observability platform
Scenario: tying events and AI operations together
How AI can help us analyze the situation
How to let the AI also take automated actions
Use case: AI and GitOps
Integrating agentic AI for Kubernetes
Use case: AI-driven scaling for performance and cost
Types of scaling
Scaling for performance
Scaling for cost
Scaling for carbon footprint
How to approach auto scaling
Enemy at the gates: let your AI be your defender
Using AI to parse audit logs and system access
AI for reducing risk surface area
Risk surface 1: data
Risk surface 2: users and permissions
Use case: ensuring proper chain of custody for workloads
Chain of custody explained
Commit signing and signature verification
Validating package contents
Use case: repository maintenance - let AI take on the boring tasks
Summary
Join us on Discord
Chapter 9: No Future Without Challenges
Technical requirements
Building trust in the AI
Regular dry runs
Leveraging vendors and tools
Implementing and observing governance
Guardrails in the cloud
Guardrails as middleware
Demonstrating success
Reliability
The hidden costs of implementing AI
Tokens and AI costs
Tokens and spending
Token management
Prompt engineering and context engineering
Fine-tuning
Supervised fine-tuning
Few-shot learning
Transfer learning
Domain-specific fine-tuning
Parameter-efficient fine-tuning (PEFT)
Continuous learning
Breaking down your options
GPUs are expensive
Cost-benefit analysis
Securing the AI
What's different about security in an AI environment?
Zero trust
Code scanning and SBOMs
Other best practices for zero trust
Data security posture management
Threat models
How to keep up with industry trends, best practices, and threats
Mind the gap: How to maintain compliance when implementing new AI technologies
What is compliance?
Compliance risk 1: Data sources unknown
Risk 2: Skills and MCP servers may not pass muster
Risk 3: Data loss
Risk 4: Bias in the AI
Mitigations
Summary
Further reading
Get this book's PDF version and more
Chapter 10: ACME Financial Services: How Will the AI Future Shape Our Company?
Technical requirements
Our story thus far
Evolving trust in the observability platform
The SREs' path to trust
The developers' path to trust
A trust framework was born
The experimentation and dry-run phase
The trust phase
The verify phase
How trust changed the company
How SREs were impacted by trusting the tool
How developers were impacted by trusting the tool
Business process observability bears fruit
The importance of securing the observability platform
Who can access the data
Managing the system
Watch the drift
The continual efforts to stay current
Monitoring industry trends
Expansion increases the effort
The cost of early adoption
Always monitoring the cost
Incorporating changes into roadmaps
So, what's next?
Summary
Join us on Discord
Chapter 11: Unlock Your Exclusive Benefits
Unlock this Book's Free Benefits in three Easy Steps
Step 1
Step 2
Step 3
Need Help
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share your thoughts
Index

System requirements

File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free software Adobe Digital Editions before downloading (see E-Book Help).
Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see E-Book Help).
E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)

The ePUB file format is well suited to novels and non-fiction, that is, to "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so you need to prepare your reading hardware before downloading.

Please note: We strongly recommend authorizing the reading software with your personal Adobe ID once it is installed.

Further information is available in our E-Book Help.

File format: ePUB
Copy protection: none (no Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Use reading software that can handle the ePUB file format, for example Adobe Digital Editions or FBReader, both of which are free (see E-Book Help).
Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see E-Book Help).
E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others

The ePUB file format is well suited to novels and non-fiction, that is, to "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display.
No copy protection or Digital Rights Management is used for this e-book.

Further information is available in our E-Book Help.
