
Agentic Reliability Engineering
Building Agentic Systems That Think, Adapt, and Recover
David Jambor (Author)
O'Reilly (Publisher)
Expected publication date: approx. 30 November 2026
Book
Paperback/Softback
400 pages
979-8-3416-7313-7 (ISBN)
Description
As modern systems grow in scale, speed, and complexity, traditional reliability practices are reaching their limits. Site Reliability Engineering (SRE) excels at automating repeatable and predictable operational tasks, enabling engineers to respond faster and operate at scale. But as systems become more dynamic and interconnected, reliability increasingly depends on decisions made in real time, under uncertainty, and across competing priorities.
Agentic Reliability Engineering represents the next evolution of reliability. Instead of encoding every operational decision into runbooks and automation, engineers define intent, constraints, and principles, allowing systems to observe context, reason about trade-offs, and act autonomously within clear guardrails. Reliability shifts from human-driven reaction to system-driven decision-making, while remaining governable and accountable.
Written for experienced SREs, platform engineers, and engineering leaders, this book presents a practical framework for designing systems that can learn, adapt, and operate safely at machine speed.
By the end of this book, you'll be able to:
Understand how reliability evolves from automation to autonomy
Design intent-driven agentic reliability boundaries
Implement agent-driven incident response and learning loops
Build observability and decision feedback that enable trust-based autonomy
Lead technical and cultural change toward scalable, trust-based autonomy
More details
Language
English
Place of publication
Sebastopol
United States
Product notice
Paperback (trade)
Unsewn / adhesive bound
Dimensions
Height: 232 mm
Width: 178 mm
ISBN-13
979-8-3416-7313-7 (9798341673137)
Person
David is a senior technology leader, recognised thought partner, and practitioner in modern engineering, with more than seventeen years of experience advancing DevOps, SRE, and AI-enabled reliability at global scale. At the time of publishing, he is acting as Senior Director of Technology and has led transformative initiatives across cloud platforms, secure infrastructure, observability, identity, automation, and next-generation operations. That experience directly informs the practical frameworks in this book. Originally from Hungary and based in the United Kingdom for the past fourteen years, he has worked with global teams and large multinational organisations, bringing an international perspective to building secure, autonomous, and high-performing engineering ecosystems.
For the past seven years, David has served as the Chair Judge of the UK's National DevOps Awards, helping shape industry standards and championing the evolution of high-performing engineering cultures. He is the author of DevOps for Databases, a frequent keynote speaker, and an active contributor to the global DevOps, SRE, and AI-engineering community. David's work focuses on redefining engineering excellence for the era of Agentic Reliability Engineering (ARE), blending strategic vision with hands-on expertise to help organisations build autonomous, secure, and human-centred systems with confidence.