Found 2 comments on HN
bretthopper · 2011-04-29 · Original thread
I've been noticing a trend recently when reading about large scale failures of any system: it's never just one thing.

AWS EBS outage, Fukushima, Chernobyl, even the Great Chicago Fire (forgive me for comparing AWS to those events).

Sure, there's always a "root" cause, but more importantly, it's the related events that keep adding up to make the failure even worse. I can only imagine how many minor failures happen worldwide on a daily basis where there's only a root cause and no further chain of events.

Once a system is sufficiently complex, I'm not sure it's possible to make it completely fault-tolerant. I'm starting to believe that there's always some chain of events which would lead to a massive failure. And the more complex a system is, the more "chains of failure" exist. It would also become increasingly difficult to plan around failures.

edit: The Logic of Failure is recommended to anyone wanting to know more about this subject: http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...

I believe the scope of that answer is greater than an HN thread, but I might just be wussing out. Hopefully others will engage you. If not, happy to take it offline.

I will say this: be careful of selection bias! Looking back, sure, if I show you a thousand examples that ended poorly your response will be something like "But they weren't really smart. Look how poorly it all turned out!" This is, at best, circular reasoning. The important thing is that, at the time, these folks were the best and brightest and put in charge for that very reason.

Good starting point: http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...
