It focuses on air crash investigations, but it's very useful to tech people in understanding the right way to approach incident investigations. It can be very easy to blame individuals ("stupid pilot shouldn't have dropped his iPad", etc), but that focus prevents improving safety over the long term. Dekker's book is a great argument for, as here, thinking about what actually happened and why as a systemic question, which provides much more fertile ground for making sure it doesn't happen again.
The author of the httpie blog post does not apply blame. In fact, he goes out of his way to explain his error, then suggests changes that would have prevented it, hopefully saving others from the same mistake.
So we can blame humans as you seem to want to do, or we can accept human behavior and design our systems to be more forgiving.
cdelsolar at the top of this thread wrote that "I get emails, every single day" and that "It literally drives me crazy."
So what should cdelsolar do? Blame all the humans accidentally deleting their accounts, or find a way to design around it?
I know which approach I would take.
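As a rough sketch of what "designing around it" could look like for accidental account deletion: a soft-delete with a grace period makes the mistake recoverable instead of final. Everything below (the Account class, the 30-day GRACE_PERIOD) is hypothetical and purely illustrative, not anything from the thread:

    # Hypothetical sketch: soft-delete with a grace period, so an accidental
    # deletion is recoverable instead of immediately permanent.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone
    from typing import Optional

    GRACE_PERIOD = timedelta(days=30)  # assumption: 30 days to change your mind

    @dataclass
    class Account:
        email: str
        deleted_at: Optional[datetime] = None  # None means the account is active

    def request_deletion(account: Account) -> None:
        # Mark the account for deletion instead of destroying it right away.
        account.deleted_at = datetime.now(timezone.utc)

    def undo_deletion(account: Account) -> bool:
        # Let the user (or support) reverse the mistake within the grace period.
        if account.deleted_at and datetime.now(timezone.utc) - account.deleted_at < GRACE_PERIOD:
            account.deleted_at = None
            return True
        return False

    def purge_expired(accounts: list[Account]) -> list[Account]:
        # Only a background job, after the grace period, makes deletion permanent.
        now = datetime.now(timezone.utc)
        return [a for a in accounts
                if a.deleted_at is None or now - a.deleted_at < GRACE_PERIOD]

The point isn't this exact code; it's that the system absorbs the error instead of the user absorbing the consequences.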
Let me suggest this book btw:
https://www.amazon.com/Field-Guide-Understanding-Human-Error...
Somewhat counterintuitively, these high-reliability companies have MORE recorded incidents than less reliable organizations that haven't removed the stigma around incident reporting, because their goal is to learn from incidents and catch them early rather than to punish the unlucky individual who happened to step on whatever systemic land-mine exploded that day.
Looks like one of Dekker's books is already listed in this post, but another one worth checking out is his "Field Guide to Understanding Human Error"[1], a very approachable book focused on the aviation industry and the lessons (especially post-WWII) that have made that industry so safe.
If you're working on this at your own company (especially if you're in a supervisor or executive position), it's incredibly powerful and impactful to be the incident champion who works to make incident response open and accessible across the org. So much catastrophic failure comes as a result of hiding the early signs out of fear of retaliation or embarrassment.
Also worth checking out: we[2] hosted a mini conference on Incident Response earlier this year, with lots of great videos from folks who have worked in this space for decades, covering everything from culture to practices: https://www.irconf.io/
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
[2] shameless plug for https://kintaba.com, my startup in this space
https://www.amazon.com/Field-Guide-Understanding-Human-Error...
My view is that expecting humans to stop making mistakes is much less effective than fixing the systems that amplify those mistakes into large, irreversible impacts.
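As a toy illustration of that idea (not anything from the post; all the names below are made up): a destructive operation that defaults to a dry run and demands a typed confirmation turns a one-keystroke slip into something the system can catch.

    # Hypothetical sketch: add friction to the destructive path so a slip
    # can't silently become an irreversible loss.
    def confirm_destructive_action(resource_name: str, prompt=input) -> bool:
        # Require the operator to retype the resource name before proceeding.
        typed = prompt(f"Type '{resource_name}' to confirm irreversible deletion: ")
        return typed.strip() == resource_name

    def delete_project(resource_name: str, dry_run: bool = True) -> None:
        if dry_run:
            # Safe by default: show what would happen, do nothing.
            print(f"[dry-run] would delete {resource_name}; pass dry_run=False to proceed")
            return
        if not confirm_destructive_action(resource_name):
            print("Confirmation did not match; nothing was deleted.")
            return
        print(f"{resource_name} deleted.")  # the real (ideally reversible) deletion goes here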
Regarding gamedays specifically: I've found that many company leaders don't embrace them because, culturally, they haven't really accepted that incidents and outages aren't 100% preventable.
It's a mistake to think of the incident management muscle as one you'd like to exercise as little as possible. In reality it's something that should be kept in top form, because doing so comes with all kinds of downstream value for the company (a positive culture towards resiliency, openness, team building, honesty about technical risk, etc).
Sadly this can be a difficult mindset to break out of especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."
Relatedly, the desire to drive the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve the same learning process (postmortem, follow-up action items, etc) that major incidents and game days get.
Hopefully this outdated attitude continues to die off.
If you're just getting started with incident response or are interested in the space, I highly recommend:
- For basic practices: Google's SRE chapters on incident management [2]
- For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]
[2] https://sre.google/sre-book/managing-incidents/
[3] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
It's very easy to come along after something blows up, find something a person could have done differently, and blame that person, treating them as stupid or negligent. It's easy, satisfying, and often status-enhancing. It's also going to make the world less safe, because it prevents us from solving the actual problems.
Those serious about reducing failures should read Dekker's "Field Guide to Understanding 'Human Error'": https://www.amazon.com/Field-Guide-Understanding-Human-Error...
It comes out of the world of airplane accident investigations. But so much of it is applicable to software. The first chapter alone, the one contrasting old and new views, can be enough for a lot of people. It's available for free via "look inside this book" or the Kindle sample.
It's written by someone who does airliner crash investigations. His central point is that "human error" as a term functions to redirect blame away from the people who establish systems and procedures. It blames the last domino vs the people who stacked them.
It's a quick, breezy read, and you'll get the main points within the first 30 minutes or so. I've found it useful for getting these ideas across to people, though, especially more generic business types to whom "no-blame postmortem" sounds like care bear nonsense rather than an absolutely essential tool for reducing future incidents.
I often recommend that anyone interested in the topic check out Sidney Dekker's Field Guide to Understanding Human Error [1].
It's a very approachable read that goes into great detail about the underlying theories of safety research that support the value of blame-free cultures and postmortems, and it addresses common counter-arguments, particularly the idea that lack of blame = lack of accountability.
Also worth checking out the (free) google SRE Book chapters on Incident Management [2] and Postmortem Culture [3].
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
[2] https://landing.google.com/sre/sre-book/chapters/managing-in...
[3] https://landing.google.com/sre/sre-book/chapters/postmortem-...
1: https://www.amazon.com/Field-Guide-Understanding-Human-Error...
It's about investigating airplane crashes, and in particular two different paradigms for understanding failure. It deeply changed how I think and talk about software bugs, and especially how I do retrospectives. I strongly recommend it.
And the article made me think of Stewart Brand's "How Buildings Learn": https://www.amazon.com/dp/0140139966
It changed my view of a building from a static thing to a dynamic system, changing over time.
The BBC later turned it into a 6-part series, which I haven't seen, but which the author put up on YouTube, starting here: https://www.youtube.com/watch?v=AvEqfg2sIH0
I especially like that in the comments he writes: "Anybody is welcome to use anything from this series in any way they like. Please don’t bug me with requests for permission. Hack away. Do credit the BBC, who put considerable time and talent into the project."
I think it's much more interesting to understand the subtle dynamics that result in bad outcomes. As an example, Sidney Dekker's book, "The Field Guide to Understanding Human Error" [1] makes an excellent case that if you're going to do useful aviation accident investigation, you have to decline the simple-minded approach of blame, and instead look at the web of causes and experiences that lead to failure.
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
Another note: for a number of years I wondered what the root cause of the financial meltdown was, but looking at it from this point of view, it's obvious that a number of things have to go wrong simultaneously; but it is not obvious beforehand which failed elements, broken processes, and bypassed limits lead to catastrophe.
For your own business/life, think about things that you live with that you know are not in a good place. Add one more problem and who knows what gives.
This is not intended to scare or depress, but maybe have some compassion when you hear about someone else's failure.
It's this sort of blame-driven, individual-focused, ask-the-unachievable answer that makes it completely impossible for organizations to move beyond a relatively low level of quality/competence. It's satisfying to say, because it can always be applied and always makes the speaker feel smart/superior. But its universal applicability is a hint that it's not going to actually solve many problems.
If you'd like to learn why and what the alternative is, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error":
https://www.amazon.com/Field-Guide-Understanding-Human-Error...
His field of study is commercial airline accident review, so all the examples are about airplane crashes. But the important lessons are mostly about how to think about error and what sort of culture creates actual safety. The lessons are very much applicable in software. And given our perennially terrible bug rates, I'd love to see our thinking change on this.
Most problems are systemic, which is a nice way of saying “ultimately management’s fault”.
Most things that most people do, most of the time, are reasonable in the circumstances. Management creates the circumstances. “Human error” is a non-explanation.
Here’s a book on the topic, often called systems thinking: http://www.amazon.com/Field-Guide-Understanding-Human-Error/...
Getting even more bookish: firing "bad apples" for "human error" is a form of substituting an easier question for a harder one, as Kahneman describes in Thinking, Fast and Slow.
There is a great book which I think should be on the desk of every single person (especially in leadership) working in any place where humans interact with machines:
https://www.amazon.com/Field-Guide-Understanding-Human-Error...