
Site Reliability Engineering
How Google Runs Production Systems
O'Reilly (Publisher)
2nd Edition
Will be published approx. on 31. October 2026
Book
Paperback/Softback
600 pages
979-8-3416-0768-2 (ISBN)
Description
Google pioneered the discipline of Site Reliability Engineering, applying reliability to the entire user journey for consumer, enterprise, and infrastructure systems. In the years since, many organizations have followed suit, guided by the tenets laid out in this practical book. This fully revised edition brings Site Reliability Engineering up to date with fresh insights on engineering techniques, organizational processes, and case studies that will help you promote and implement greater reliability throughout the engineering lifecycle.
In this collection of essays and articles, key members of Google's Site Reliability Engineering team explore the company's current SRE practices and explain how they've evolved in the decade since the initial publication. New updates cover the value of reliability, cloud reliability, and the impact of AI. You'll learn the principles and practices that enable Google engineers to make some of the world's largest systems scalable, reliable, and efficient: lessons directly applicable to your organization.
Train new Site Reliability Engineers based on the latest practices in the field
Develop engineering organizations that support reliability as a feature
Build online services that incorporate reliability principles
Use AI to improve SRE across the organization and optimize critical areas such as automation and incident detection
More details
Edition
2nd Revised edition
Language
English
Place of publication
Sebastopol
United States
Edition type
Revised edition
Product notice
Paperback (trade)
Unsewn / adhesive bound
Dimensions
Height: 232 mm
Width: 178 mm
ISBN-13
979-8-3416-0768-2 (9798341607682)
Copyright in bibliographic data and cover images is held by Nielsen Book Services Limited or by the publishers or by their respective licensors: all rights reserved.
Persons
Betsy Beyer has worked at Google for almost 20 years, most recently in the role of Program Manager for Site Reliability Engineering. She's the editor of three books in this space, including the best-selling Site Reliability Engineering: How Google Runs Production Systems. She previously worked on Google's datacenter and hardware operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.

Chris Jones is a Site Reliability Engineer for Google Maps. Based in San Francisco, he has previously been responsible for the care and feeding of Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day, and a number of other services. In other lives, Chris has worked in privacy engineering and academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.

Christof Leng has been working as a Site Reliability Engineer for Google for more than 11 years in the Dublin and Munich offices. He has worked on systems in Google's ads, cloud, and internal developer infrastructure and has been building and managing various teams in these areas over the years. He has been responsible for central Google SRE programs like the SRE engagement model, production excellence (ProdEx), and production launch reviews. Christof holds a PhD in computer science from TU Darmstadt and has been a postdoc at ICSI and UC Berkeley. He has been a vice president for the German Informatics Society (GI). Christof lives in Darmstadt with his wife, three children, and two cats.

David Huska is a Site Reliability Engineer on Google's Cloud Incident Response Team, the cross-service escalation point for major incidents on the Google Cloud Platform. Previously he worked directly with GCP's largest customers to incorporate SRE practices into their services and operations on the platform, and as an SRE and a Software Engineer on Google Maps. He's a coauthor and contributing editor of The Site Reliability Workbook.

Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is the lead author of Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program. Known as [The Reliable PgM](https://www.reliablepgm.com/), she is an advocate for applying SRE principles to a wide range of program management situations. Jennifer is an avid public speaker and has done talks, panel discussions, and keynote presentations at DevOps, SRE, and other industry conferences in 16 countries (and counting) around the world. Jennifer joined Google in 2007 after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester in the United States. Jennifer and her husband Scott are avid travelers and have lived in the U.S., Ireland, and now Portugal. Jennifer loves travel writing and door photography, both of which you can find on Sidewalk Safari.

Niall Richard Murphy is the CEO and founder of Stanza Systems, a small startup in the ML and AI reliability space. He has worked in computing infrastructure since the mid-1990s, and has been employed by every major cloud provider (Amazon, Google, and Microsoft) in a variety of roles from IC to director. He's also the instigator, coauthor, and editor of multiple award-winning books on networking, reliability, and machine learning. He holds degrees in computer science, mathematics, and poetry studies and lives in Dublin with his wife and two children.