Found in 16 comments on Hacker News
wpietri · 2023-07-14 · Original thread
For anybody who liked the style of this sort of analysis, let me strongly recommend Dekker's "Field Guide to Understanding 'Human Error'": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

It focuses on air crash investigations. But it's very useful to tech people in understanding the right way to approach incident investigations. It can be very easy to blame individuals ("stupid pilot shouldn't have dropped his iPad", etc), but that focus prevents improving safety over the long term. Dekker's book is a great argument for, as here, thinking about what actually happened and why as a systemic thing, which provides much more fertile ground for making sure it doesn't happen again.

js2 · 2023-07-10 · Original thread
Accident investigations intentionally do not apply blame. Humans are fallible. We make mistakes. If one human makes a mistake, others will too.

The author of the httpie blog post does not apply blame. In fact, he goes out of his way to explain his error, then suggests changes that would have prevented it, hopefully saving others from the same mistake.

So we can blame humans as you seem to want to do, or we can accept human behavior and design our systems to be more forgiving.

cdelsolar at the top of this thread wrote, "I get emails, every single day" and "It literally drives me crazy."

So what should cdelsolar do? Blame all the humans accidentally deleting their accounts, or find a way to design around it?

I know which approach I would take.
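
For concreteness, here's a minimal sketch of what "design around it" could look like: a soft delete with an undo window rather than an immediate hard delete, so an accidental click is recoverable. The class, field names, and 14-day grace period are illustrative assumptions, not anything from the original thread.

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone
    from typing import Optional

    GRACE_PERIOD = timedelta(days=14)  # hypothetical undo window

    @dataclass
    class Account:
        email: str
        deleted_at: Optional[datetime] = None  # None means the account is active

        def request_deletion(self) -> None:
            # Mark the account for deletion instead of removing data immediately,
            # then tell the user how to undo it.
            self.deleted_at = datetime.now(timezone.utc)

        def undo_deletion(self) -> None:
            # An accidental deletion is reversible during the grace period.
            self.deleted_at = None

        def is_purgeable(self) -> bool:
            # Only a background job hard-deletes, and only after the window has passed.
            return (
                self.deleted_at is not None
                and datetime.now(timezone.utc) - self.deleted_at > GRACE_PERIOD
            )

The point isn't this particular design; it's that the system absorbs the predictable human mistake instead of making it irreversible.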

Let me suggest this book btw:

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

MPSimmons · 2023-02-08 · Original thread
Please consider reading Sidney Dekker's books, The Field Guide to Understanding Human Error[1] and Drift Into Failure[2].

[1] - https://www.amazon.com/Field-Guide-Understanding-Human-Error...

[2] - https://www.abebooks.com/servlet/BookDetailsPL?bi=3123581163...

quartz · 2022-06-06 · Original thread
The most reliable companies out there have aggressively adopted practices that treat incidents as expected and "human error" as a symptom of a (correctable) systems failure.

Somewhat counterintuitively, these high-reliability companies record MORE incidents than less reliable organizations that haven't removed the stigma around incident reporting, because their goal is to learn from incidents and catch them early rather than punish the unlucky individual who happened to step on whatever systemic land mine exploded that day.

Looks like one of Dekker's books is already listed in this post, but another one worth checking out is his "Field Guide to Understanding Human Error"[1], a very approachable book focused on the aviation industry and the lessons (especially post-WWII) that have made that industry so safe.

If you're working on this at your own company (especially if you're in a supervisor / executive position), it's incredibly powerful and impactful to be the incident champion who works to make incident response open and accessible across the org. So much catastrophic failure comes as a result of hiding the early signs out of fear of retaliation or embarrassment.

Also worth checking out: we[2] hosted a mini conference on Incident Response earlier this year with lots of great videos from folks who have worked in this space for decades about everything from culture to practices: https://www.irconf.io/

[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

[2] shameless plug for https://kintaba.com, my startup in this space

wpietri · 2022-04-14 · Original thread
I suggest you read "The Field Guide to Understanding 'Human Error'". You'd learn a lot.

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

My view is that expecting humans to stop making mistakes is much less effective than fixing the systems that amplify those mistakes into large, irreversible impacts.

quartz · 2021-08-09 · Original thread
Nice to see articles like this describing a company's incident response process and the positive approach to incident culture via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an incident management startup).

Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they haven't really accepted that incidents and outages aren't 100% preventable.

It's a mistake to think of the incident management muscle as one you'd like exercised as little as possible; in reality it should be kept in top form, because doing so comes with all kinds of downstream value for the company (a positive culture around resiliency, openness, team building, honesty about technical risk, etc.).

Sadly this can be a difficult mindset to break out of especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."

Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve the same learning process (postmortem, follow-up action items, etc.) as major incidents and game days.

Hopefully this outdated attitude continues to die off.

If you're just getting started with incident response or are interested in the space, I highly recommend:

- For basic practices: Google's SRE chapters on incident management [2]

- For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]

[1] https://kintaba.com

[2] https://sre.google/sre-book/managing-incidents/

[3] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

wpietri · 2021-06-29 · Original thread
What would you estimate as the total word count of the Docker documentation? The page you linked to alone is about 10k words, and there are a lot of pages. At a guess, we're looking at something the length of notoriously long novels like War and Peace. What percentage of the documentation had you read and internalized before you first put Docker into production?

It's very easy to come along after something blows up, find something a person could have done differently, and blame that person, treating them as stupid or negligent. It's easy, satisfying, and often status-enhancing. It's also going to make the world less safe, because it prevents us from solving the actual problems.

Those serious about reducing failures should read Dekker's "Field Guide to Understanding 'Human Error'": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

It comes out of the world of airplane accident investigations. But so much of it is applicable to software. The first chapter alone, the one contrasting old and new views, can be enough for a lot of people. It's available for free via "look inside this book" or the Kindle sample.

This book is a great short manifesto on exactly that point: https://www.amazon.com/Field-Guide-Understanding-Human-Error...

It's written by someone who does airliner crash investigations. His central point is that "human error" as a term functions to redirect blame away from the people who establish systems and procedures. It blames the last domino rather than the people who stacked them.

It's a quick, breezy read, and you'll get the main points within the first 30 minutes or so. I've found it useful for getting these ideas across to people, though, especially more generic business types to whom a "no blame post mortem" sounds like some care bear nonsense rather than an absolutely essential tool for reducing future incidents.

quartz · 2020-03-25 · Original thread
I'm a cofounder at Kintaba (https://kintaba.com), where we spend a lot of time with companies that are implementing postmortems as part of their larger incident management process, and it has been fascinating to see how varied adoption of the practice is, even in SV, despite the value being well accepted in tech for over a decade (longer in other research circles).

I often recommend that anyone interested in the topic check out Sidney Dekker's Field Guide to Understanding Human Error [1].

It's a very approachable read that goes into great detail about the underlying theories of safety research supporting the value of blame-free cultures and postmortems, and it addresses common counter-arguments, particularly the idea that lack of blame = lack of accountability.

Also worth checking out the (free) google SRE Book chapters on Incident Management [2] and Postmortem Culture [3].

[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

[2] https://landing.google.com/sre/sre-book/chapters/managing-in...

[3] https://landing.google.com/sre/sre-book/chapters/postmortem-...

rdoherty · 2019-05-16 · Original thread
This is a great overview. I would also recommend Dekker's book The Field Guide to Understanding Human Error [1]. It's a bit easier to read than Drift Into Failure, which I found to be very dense.

1: https://www.amazon.com/Field-Guide-Understanding-Human-Error...

wpietri · 2018-03-18 · Original thread
Ooh, that reminds me of another excellent book on failure, Sidney Dekker's "Field Guide to Understanding Human Error": https://www.amazon.com/dp/1472439058

It's about investigating airplane crashes, and in particular two different paradigms for understanding failure. It deeply changed how I think and talk about software bugs, and especially how I do retrospectives. I strongly recommend it.

And the article made me think of Stewart Brand's "How Buildings Learn": https://www.amazon.com/dp/0140139966

It changed my view of a building from a static thing to a dynamic system changing over time.

The BBC later turned it into a 6-part series, which I haven't seen, but which the author put up on YouTube, starting here: https://www.youtube.com/watch?v=AvEqfg2sIH0

I especially like that in the comments he writes: "Anybody is welcome to use anything from this series in any way they like. Please don’t bug me with requests for permission. Hack away. Do credit the BBC, who put considerable time and talent into the project."

wpietri · 2017-11-05 · Original thread
Really? I find victim-blaming intellectually sterile. It can be done pretty much any time something bad happens, and it's not challenging to do. You just find the person who's most fucked and say it's all their fault.

I think it's much more interesting to understand the subtle dynamics that result in bad outcomes. As an example, Sidney Dekker's book, "The Field Guide to Understanding Human Error" [1] makes an excellent case that if you're going to do useful aviation accident investigation, you have to decline the simple-minded approach of blame, and instead look at the web of causes and experiences that lead to failure.

[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

csours · 2017-08-13 · Original thread
If this is interesting to you, I highly recommend "The Field Guide to Understanding Human Error" by Sidney Dekker - it covers these points with examples. [0]

On another note: for a number of years I wondered what the root cause of the financial meltdown was, but looking at it from this point of view, it's obvious that a number of things have to go wrong simultaneously; it's just not obvious beforehand which failed elements, broken processes, and bypassed limits will lead to catastrophe.

For your own business/life, think about things that you live with that you know are not in a good place. Add one more problem and who knows what gives.

This is not intended to scare or depress, but maybe have some compassion when you hear about someone else's failure.

[0] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

Link for those interested:

https://www.amazon.com/dp/1472439058

wpietri · 2016-08-15 · Original thread
This attitude is not just wrong, it's dangerously wrong.

It's this sort of blame-driven, individual-focused, ask-the-unachievable answer that makes it completely impossible for organizations to move beyond a relatively low level of quality/competence. It's satisfying to say, because it can always be applied and always makes the speaker feel smart/superior. But its universal applicability is a hint that it's not going to actually solve many problems.

If you'd like to learn why and what the alternative is, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error":

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

His field of study is commercial airline accident review, so all the examples are about airplane crashes. But the important lessons are mostly about how to think about error and what sort of culture creates actual safety. The lessons are very much applicable in software. And given our perennially terrible bug rates, I'd love to see our thinking change on this.

mwsherman · 2015-10-28 · Original thread
Yes, that was a very old-school way of dealing with the problem. It’s mostly symbolic.

Most problems are systemic, which is a nice way of saying “ultimately management’s fault”.

Most things that most people do, most of the time, are reasonable in the circumstances. Management creates the circumstances. “Human error” is a non-explanation.

Here’s a book on the topic, often called systems thinking: http://www.amazon.com/Field-Guide-Understanding-Human-Error/...

Getting even more bookish: firing "bad apples" for "human error" is a form of substituting an easier question when presented with a harder one, as Kahneman describes in Thinking, Fast and Slow.
