The most reliable companies out there have aggressively adopted practices that treat incidents as expected and "human error" as a symptom of a (correctable) systems failure.
Somewhat counterintuitively, these high-reliability companies report MORE incidents than less reliable organizations that haven't removed the stigma around incident reporting, because their goal is to learn from incidents and catch them early rather than to punish the unlucky individual who happened to step on whatever systemic land mine exploded that day.
Looks like one of Dekker's books is already listed in this post, but another one worth checking out is his "Field Guide to Understanding Human Error" [1], a very approachable book focused on the aviation industry and the learnings (especially post-WWII) that have made that industry so safe.
If you're working on this at your own company (especially if you're in a supervisor or executive position), it's incredibly powerful and impactful to be the incident champion who works to make incident response open and accessible across the org. So much catastrophic failure comes from hiding the early signs out of fear of retaliation or embarrassment.
Also worth checking out: we[2] hosted a mini conference on Incident Response earlier this year, with lots of great videos from folks who have worked in this space for decades, covering everything from culture to practices.
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
[2] shameless plug for https://kintaba.com, my startup in this space
My view is that expecting humans to stop making mistakes is much less effective than fixing the systems that amplify those mistakes into large, irreversible impacts.
This book is a great short manifesto on exactly that point: https://www.amazon.com/Field-Guide-Understanding-Human-Error...
It's written by someone who does airliner crash investigations. His central point is that "human error" as a term functions to redirect blame away from the people who establish systems and procedures: it blames the last domino rather than the people who stacked them.
It's a quick, breezy read, and you'll get the main points within the first 30 minutes or so. I've found it useful for getting these ideas across, though, especially to more generic business types to whom a "no-blame postmortem" sounds like care-bear nonsense rather than an absolutely essential tool for reducing future incidents.
This is a great overview. I would also recommend Dekker's book The Field Guide to Understanding Human Error [1]. It's a bit easier to read than Drift Into Failure, which I found to be very dense.
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
Ooh, that reminds me of another excellent book on failure, Sidney Dekker's "Field Guide to Understanding Human Error": https://www.amazon.com/dp/1472439058
It's about investigating airplane crashes, and in particular two different paradigms for understanding failure. It deeply changed how I think and talk about software bugs, and especially how I do retrospectives. I strongly recommend it.
And the article made me think of Stewart Brand's "How Buildings Learn": https://www.amazon.com/dp/0140139966
It changed my view of a building from a static thing to a dynamic system that changes over time.
The BBC later turned it into a 6-part series, which I haven't seen, but which the author put up on YouTube, starting here: https://www.youtube.com/watch?v=AvEqfg2sIH0
I especially like that in the comments he writes: "Anybody is welcome to use anything from this series in any way they like. Please don’t bug me with requests for permission. Hack away. Do credit the BBC, who put considerable time and talent into the project."
Really? I find victim-blaming intellectually sterile. It can be done pretty much any time something bad happens, and it's not challenging to do. You just find the person who's most fucked and say it's all their fault.
I think it's much more interesting to understand the subtle dynamics that result in bad outcomes. As an example, Sidney Dekker's book, "The Field Guide to Understanding Human Error" [1], makes an excellent case that if you're going to do useful aviation accident investigation, you have to decline the simple-minded approach of blame and instead look at the web of causes and experiences that leads to failure.
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
If this is interesting to you, I highly recommend "The Field Guide to Understanding Human Error" by Sidney Dekker - it covers these points with examples. [0]
On another note: I wondered for years what the root cause of the financial meltdown was, but looking at it from this point of view, it's obvious that a number of things have to go wrong simultaneously; it's just not obvious beforehand which failed elements, broken processes, and bypassed limits will lead to catastrophe.
For your own business/life, think about things that you live with that you know are not in a good place. Add one more problem and who knows what gives.
This is not intended to scare or depress, but maybe have some compassion when you hear about someone else's failure.
[0] https://www.amazon.com/Field-Guide-Understanding-Human-Error...