Site Reliability Engineering: How Google Runs Production Systems

ctvo · 2022-05-04 · Original thread

Google is the only company I'm aware of where the engineers constantly publish popular books about their engineering practices.

Apple, Amazon / AWS, MSFT, etc. have all done impressive things in their space at various points, but seem to lack the mixture of personalities / culture / reputation that would make "Engineering at Apple" the kind of hit that SRE at Google [1] or this book may be.

1 - https://www.amazon.com/Site-Reliability-Engineering-Producti...

Edit: If you're at Apple and happen to work in hardware, I would pay good money to read about the process and war stories.

ajmarsh · 2016-07-26 · Original thread

I'd start with this.

https://www.amazon.com/Site-Reliability-Engineering-Producti...

jaytaylor · 2016-07-20 · Original thread

This is a nice quick overview of what is covered in-depth in the Google SRE book [0].

[0] https://www.amazon.com/Site-Reliability-Engineering-Producti...

pjungwir · 2016-04-21 · Original thread

I have similar thoughts. In the new SRE book[0], there is a history of Google's infrastructure automation, and in a way it started with tests:

First they had Python scripts to do things on various machines. (It actually sounds a lot like Ansible: lots of little scripts that all ran over ssh.) But because of high configurability, these didn't always work right.

So they wrote a bunch of tests, e.g. ClusterExistsInMachineDatabase, DNSTestHasBeenAssignedMachines, so they could find out what wasn't right when a new machine had been provisioned.

Then they realized that fixing whatever made a test fail could usually be automated, so they wrote code for each test to correct the issue if it was failing.

It seems like they sort of backed into a declarative idempotent configuration management solution like Chef or Puppet, where you say what you want the machine to look like, and the config management is responsible for getting you there.
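To make the test-plus-fix idea concrete, here is a rough sketch of how I picture it, not how Google actually built it. The two check names come from the book as quoted above; everything else (the `Check` class, `converge`, the dict standing in for real infrastructure, the state keys) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for real infrastructure state.
State = Dict[str, bool]


@dataclass
class Check:
    name: str
    test: Callable[[State], bool]  # True when the state is already as desired
    fix: Callable[[State], None]   # moves the state toward the desired value


def converge(state: State, checks: List[Check]) -> List[str]:
    """Run every test and apply its fix only where the test fails.

    Returns the names of the checks that needed fixing. Running it again
    on an already-correct state changes nothing and returns an empty list.
    """
    fixed = []
    for check in checks:
        if not check.test(state):
            check.fix(state)
            if not check.test(state):
                raise RuntimeError(f"fix for {check.name} did not converge")
            fixed.append(check.name)
    return fixed


def set_flag(key: str) -> Callable[[State], None]:
    """Build a fix that simply marks the (hypothetical) key as done."""
    def fix(state: State) -> None:
        state[key] = True
    return fix


checks = [
    Check("ClusterExistsInMachineDatabase",
          lambda s: s.get("cluster_in_machine_db", False),
          set_flag("cluster_in_machine_db")),
    Check("DNSTestHasBeenAssignedMachines",
          lambda s: s.get("dns_machines_assigned", False),
          set_flag("dns_machines_assigned")),
]

state = {"cluster_in_machine_db": True}  # one check already passes
first_run = converge(state, checks)      # fixes only the failing check
second_run = converge(state, checks)     # nothing left to fix
```

The point is that each test doubles as the description of the desired state, which is roughly what a Chef or Puppet resource does for you.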

As I think you are feeling, in config management, the redundancy of tests and automation code is a bit more ... redundant ... than with automated tests for development.

I think monitoring/alerting is another kind of test: Is the database up? Is the web site responding?

Another good story from that book is how one internal database never went down, so teams became lax about designing systems that would still work without that component. So Google decided they'd just take the database down for a bit. :-) It sounds a bit like their version of Chaos Monkey.

[0] http://www.amazon.com/Site-Reliability-Engineering-Productio...

andrewstuart2 · 2016-04-11 · Original thread

If you're really interested, I highly recommend reading the SRE book Google has released [1]. It's got a ton of insight into both their business practices and technical infrastructure.

[1] http://www.amazon.com/Site-Reliability-Engineering-Productio...

dekhn · 2016-04-06 · Original thread

It's the one mentioned in the interview, http://www.amazon.com/Site-Reliability-Engineering-Productio...

ISBN: 149192912X