Found in 10 comments on Hacker News
wpietri · 2020-02-23 · Original thread
I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term, even at preventing bugs. For many reasons, but chief among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.
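(For readers who haven't met the acronyms, here's a minimal sketch of how the two metrics fall out of an incident log. This is my illustration, not something from the comment; the timestamps and numbers are made up.)

    # Each incident is a hypothetical (failed_at, repaired_at) pair,
    # in hours since the start of an observation window.
    incidents = [
        (100.0, 100.5),   # failed at t=100h, repaired 30 minutes later
        (250.0, 251.0),
        (400.0, 400.25),
    ]
    observation_hours = 500.0

    # MTTR: average time from failure to repair.
    mttr = sum(end - start for start, end in incidents) / len(incidents)

    # MTBF: average operating time between failures.
    uptime = observation_hours - sum(end - start for start, end in incidents)
    mtbf = uptime / len(incidents)

    print(f"MTTR: {mttr:.2f} h, MTBF: {mtbf:.2f} h")
    # Quality gates try to push MTBF up; fast deploys and a small blast
    # radius try to push MTTR (and the impact per incident) down.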

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "The Field Guide to Understanding Human Error":

John Allspaw applied concepts from The Field Guide to Understanding Human Error to software post mortems. When I was at Etsy, he taught a class explaining this whole concept. We read the book and discussed concepts like the Fundamental Attribution Error.

I've found it very beneficial, and the concepts we learned have helped me in almost every aspect of understanding the complicated world we live in. I've taken these concepts to two other companies now, to great effect.

wpietri · 2018-05-04 · Original thread
One of the things I think about when analyzing organizational behavior is where something falls on the supportive vs controlling spectrum. It's really impressive how much they're on the supportive end here.

When organizations scale up, and especially when they're dealing with risks, it's easy for them to shift toward the controlling end of things. This is especially true when internally people can score points by assigning or shifting blame.

Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error" [1], a great book on how to investigate airplane accidents, and how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.


wpietri · 2018-02-17 · Original thread
I think that's ridiculous. Pilots are correctly very reluctant to hit things. Historically, we have wanted them to do their best to avoid that.

You could argue that we should now train pilots to carefully pause and consider whether the thing they are about to hit is safe to hit. But for that, you'd have to show that the additional reaction time in avoiding collisions is really net safer. And if you did argue that, you couldn't judge the current pilots by your proposed new standard.

For those interested, by the way, in really thinking through accident retrospectives, I strongly recommend Sidney Dekker's "The Field Guide to Understanding Human Error":

I read it just out of curiosity, but it turned out to be very applicable to software development.

csours · 2017-11-02 · Original thread
Also the classic Field Guide to Understanding Human Error.


Older PDF (paperback is well worth it, in my opinion):

csours · 2017-11-02 · Original thread
The path to a disaster has been compared to a tunnel [0]. You can escape from the tunnel at many points, but you may not realize it.

Trying to find the 'real cause' is a fool's errand, because there are many places and ways to avoid the outcome.

I do take your meaning; reducing speed and following well-established rules would almost certainly have saved them.

0. PDF:


wpietri · 2015-04-30 · Original thread
I am in favor of commentary, but his comment only makes sense in hindsight. Had he posted it beforehand, he would mostly have been wrong, because AirBnB mostly works. Had he posted it after any of the many successful outcomes, he would have looked dumb.

There is no reason to say that these people "got it wrong". They were unlucky. Suppose the same shitheels broke a window, climbed in, unlocked the door, and had a big party on a weekend when the owners were away. One inclined to superiority-by-hindsight could say, "Well duh, why didn't they have bars on their windows?"

After a rare negative occurrence, one can always look back with hindsight, find some way the bad outcome could theoretically have been averted, and then say, "Well duh." Always. It is a great way to sound and feel smart. But it never actually fixes anything. Indeed, it can prevent the fixing of things because, having blamed someone, we mostly stop looking for useful lessons to learn.

If you want the book-length version of this, Sidney Dekker's "Field Guide to Understanding Human Error" has a great explanation of why retrospective blame ends up being immensely harmful:

wpietri · 2015-04-09 · Original thread
I'm almost done with The Field Guide to Understanding Human Error:

It's a brilliant book written by Sidney Dekker, a "Professor of Human Factors and Flight Safety". The basic point is that the default way of understanding bad outcomes is what he calls "the Old View or the Bad Apple Theory". He instead argues for the New View, where "human error is a symptom of trouble deeper inside a system".

Normally with a book like this, I read the first couple of chapters, say, "Ok, I get the idea," and can ignore the rest. After all, I both agree with and understand the basic thesis. But so far every chapter has been surprisingly useful; I keep discovering that I have Old View notions hidden away. E.g., when I discover a systemic flaw, I'm inclined to blame "bad design". But he points out that's a fancy way of calling the problem human error, just a different human and a different error than normal.

Even the driest parts are helped by his frequent use of examples, often taken from real-world aviation accident reports. There are also fascinating bits like a system for high-resolution markup of dialog transcripts to indicate timing (down to 1/10th second), speech inflection, and emphasis. I'll never use it myself, but I will definitely use the mindset that it requires.

Given how much time software projects spend dealing with bugs, I believe we need a new way to think about them, and for me this book describes a big piece of that.

benihana · 2015-01-16 · Original thread
Blameless postmortems work phenomenally at Etsy, which is a pretty low-risk setting (after reading Sidney Dekker's book The Field Guide to Understanding Human Error [highly recommended], I would say that blameless postmortems are even more important in a high-risk setting). Except failure isn't the correct word: the book makes the case that these are natural artefacts of complex systems.



benihana · 2014-10-14 · Original thread
We can't have a discussion about the human factors in automated systems without talking about Sidney Dekker's book The Field Guide to Understanding Human Error:

Fantastic read about the futility of placing blame on a single human in a catastrophe like this. It makes a strong case for why more automation often causes more work. Definitely worth checking out, Etsy has applied it to their engineering work by using it to facilitate blameless post mortems:
