Found in 11 comments on Hacker News
quartz · 2021-08-09 · Original thread
Nice to see articles like this describing a company's incident response process and the positive approach to incident culture via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an incident management startup).

Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned to the idea that incidents and outages aren't 100% preventable.

It's a mistake to think of the incident management muscle as one you'd like exercised as little as possible when in reality it's something that should be in top form because doing so comes with all kinds of downstream values for the company (a positive culture towards resiliency, openness, team building, honesty about technical risk, etc).

Sadly this can be a difficult mindset to break out of especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."

Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve to have the same learning process (postmortem, followup action items, etc) associated with them as the outcomes of major incidents and game days.

Hopefully this outdated attitude continues to die off.

If you're just getting started with incident response or are interested in the space, I highly recommend:

- For basic practices: Google's SRE chapters on incident management [2]

- For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]




wpietri · 2021-06-29 · Original thread
What would you estimate as the total word count of the Docker documentation? The page you linked to alone is about 10k words, and there are a lot of pages. At a guess, we're looking at something the length of notoriously long novels like War and Peace. What percentage of the documentation had you read and internalized before you first put Docker into production?

It's very easy to come along after something blows up, find something a person could have done differently, and blame that person, treating them as stupid or negligent. It's easy, satisfying, and often status-enhancing. It's also going to make the world less safe, because it prevents us from solving the actual problems.

Those serious about reducing failures should read Dekker's "Field Guide to Understanding 'Human Error'":

It comes out of the world of airplane accident investigations. But so much of it is applicable to software. The first chapter alone, the one contrasting old and new views, can be enough for a lot of people. It's available for free via "look inside this book" or the Kindle sample.

This book is a great short manifesto on exactly that point:

It's written by someone that does airliner crash investigations. His central point is that "human error" as a term functions to redirect blame away from the people who establish systems and procedures. It blames the last domino vs the people who stacked them.

It's a quick breezy read, and you'll get the main points within the first 30 min or so of reading. I've found it useful for getting these ideas across to people though, especially more generic business types where "no blame post mortem" strikes them as some care bear nonsense rather than being an absolutely essential tool to reduce future incidents.

quartz · 2020-03-25 · Original thread
I'm a cofounder at Kintaba, where we spend a lot of time with companies that are implementing postmortems as part of their larger incident management process and it has been fascinating to see how varied the adoption of the practice is even in SV despite the value being well accepted for over a decade in tech (longer in other research circles).

I often recommend that anyone interested in the topic check out Sidney Dekker's Field Guide to Understanding Human Error [1].

It's a very approachable read and goes into great detail about the underlying theories of safety research that support the value of blame-free cultures and postmortems and addresses common counter-arguments, particularly around the idea that lack of blame = lack of accountability.

Also worth checking out the (free) Google SRE Book chapters on Incident Management [2] and Postmortem Culture [3].




rdoherty · 2019-05-16 · Original thread
This is a great overview. I would also recommend Dekker's book The Field Guide to Understanding Human Error [1]. It's a bit easier to read than Drift Into Failure, which I found to be very dense.


wpietri · 2018-03-18 · Original thread
Ooh, that reminds me of another excellent book on failure, Sidney Dekker's "Field Guide to Understanding Human Error":

It's about investigating airplane crashes, and in particular two different paradigms for understanding failure. It deeply changed how I think and talk about software bugs, and especially how I do retrospectives. I strongly recommend it.

And the article made me think of Stewart Brand's "How Buildings Learn":

It changed my view of a building from a static thing to a dynamic system, changing over time.

The BBC later turned it into a 6-part series, which I haven't seen, but which the author put up on YouTube, starting here:

I especially like that in the comments he writes: "Anybody is welcome to use anything from this series in any way they like. Please don’t bug me with requests for permission. Hack away. Do credit the BBC, who put considerable time and talent into the project."

wpietri · 2017-11-05 · Original thread
Really? I find victim-blaming intellectually sterile. It can be done pretty much any time something bad happens, and it's not challenging to do. You just find the person who's most fucked and say it's all their fault.

I think it's much more interesting to understand the subtle dynamics that result in bad outcomes. As an example, Sidney Dekker's book, "The Field Guide to Understanding Human Error" [1] makes an excellent case that if you're going to do useful aviation accident investigation, you have to decline the simple-minded approach of blame, and instead look at the web of causes and experiences that lead to failure.


csours · 2017-08-13 · Original thread
If this is interesting to you, I highly recommend "The Field Guide to Understanding Human Error" by Sidney Dekker - it covers these points with examples. [0]

Another note, I wondered what the root cause of the financial meltdown was for a number of years, but looking at it from this point of view, it's obvious that a number of things have to go wrong simultaneously; but it is not obvious beforehand which failed elements, broken processes, and bypassed limits lead to catastrophe.

For your own business/life, think about things that you live with that you know are not in a good place. Add one more problem and who knows what gives.

This is not intended to scare or depress, but maybe have some compassion when you hear about someone else's failure.


Link for those interested:

wpietri · 2016-08-15 · Original thread
This attitude is not just wrong, it's dangerously wrong.

It's this sort of blame-driven, individual-focused, ask-the-unachievable answer that makes it completely impossible for organizations to move beyond a relatively low level of quality/competence. It's satisfying to say, because it can always be applied and always makes the speaker feel smart/superior. But its universal applicability is a hint that it's not going to actually solve many problems.

If you'd like to learn why and what the alternative is, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error":

His field of study is commercial airline accident review, so all the examples are about airplane crashes. But the important lessons are mostly about how to think about error and what sort of culture creates actual safety. The lessons are very much applicable in software. And given our perennially terrible bug rates, I'd love to see our thinking change on this.

mwsherman · 2015-10-28 · Original thread
Yes, that was a very old-school way of dealing with the problem. It’s mostly symbolic.

Most problems are systemic, which is a nice way of saying “ultimately management’s fault”.

Most things that most people do, most of the time, are reasonable in the circumstances. Management creates the circumstances. “Human error” is a non-explanation.

Here’s a book on the topic, often called systems thinking:

Getting even more bookish: firing “bad apples” for “human error” is a form of substituting an easier question when presented with a harder one, as Kahneman describes in Thinking, Fast and Slow.
