The Field Guide to Understanding Human Error

Author: Sidney Dekker


by droopyEyelids   2022-03-06
If anyone is interested in this topic, I recommend the book "Field Guide to Understanding Human Error" by Sidney Dekker.

It's an incredibly thorough treatment of the incentives and psychology that lead to people labeling process failures as 'human error'.

Most of the book deals with manufacturing, aviation, and air traffic control failures, but the principles generalize so easily to software development that it's a treat to read. One thing that makes it so good is that I was vaguely aware of most of what he covers before reading it, but seeing him stitch it all together brought me to the point of intuitively understanding the concepts that had been floating in the back of my mind, and being able to see them all around me at work. He puts it together so smoothly that after reading it, I felt like I had always known what I had just learned.

It's super expensive on Amazon but available on all the online library sites that aren't for linking in polite company. It's also on Audible.

by wpietri   2020-02-26
I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term even in preventing bugs. For many reasons, but big among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs that you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "Field Guide to Understanding Human Error":

by hoorayimhelping   2019-07-12
John Allspaw applied concepts from The Field Guide to Understanding Human Error to software post mortems. When I was at Etsy, he taught a class explaining this whole concept. We read the book and discussed concepts like the Fundamental Attribution Error.

I've found it very beneficial, and the concepts we learned have helped me in almost every aspect of understanding the complicated world we live in. I've taken these concepts to two other companies now to great effect.

by wpietri   2018-11-10
One of the things I think about when analyzing organizational behavior is where something falls on the supportive vs controlling spectrum. It's really impressive how much they're on the supportive end here.

When organizations scale up, and especially when they're dealing with risks, it's easy for them to shift toward the controlling end of things. This is especially true when internally people can score points by assigning or shifting blame.

Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error" [1], a great book on how to investigate airplane accidents, and how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.


by wpietri   2018-02-20
I think that's ridiculous. Pilots are correctly very reluctant to hit things. Historically, we have wanted them to do their best to avoid that.

You could argue that we should now train pilots to carefully pause and consider whether the thing they are about to hit is safe to hit. But for that, you'd have to show that the additional reaction time in avoiding collisions is really net safer. And if you did argue that, you couldn't judge the current pilots by your proposed new standard.

For those interested, by the way, in really thinking through accident retrospectives, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error":

I read it just out of curiosity, but it turned out to be very applicable to software development.

by csours   2017-11-05
The path to a disaster has been compared to a tunnel [0]. You can escape from the tunnel at many points, but you may not realize it.

Trying to find the 'real cause' is a fool's errand, because there are many places and ways to avoid the outcome.

I do take your meaning: reducing speed and following well-established rules would almost certainly have saved them.

0. PDF: