The Field Guide to Understanding Human Error

Author: Sidney Dekker
4.2

Comments

by wpietri   2020-02-26
I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term, even at preventing bugs. There are many reasons, but a big one is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs that you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.
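To put rough numbers on the MTBF/MTTR trade-off, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The formula is not part of the comment above, the hour figures are made up for illustration, and MTBF is treated here as mean uptime between failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean uptime between
    failures (MTBF) and mean time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to illustrate the arithmetic.
baseline = availability(mtbf_hours=500.0, mttr_hours=4.0)
doubled_mtbf = availability(mtbf_hours=1000.0, mttr_hours=4.0)
halved_mttr = availability(mtbf_hours=500.0, mttr_hours=2.0)

print(f"baseline:     {baseline:.5f}")
print(f"doubled MTBF: {doubled_mtbf:.5f}")
print(f"halved MTTR:  {halved_mttr:.5f}")
```

Under this formula, halving MTTR buys the same availability as doubling MTBF, because only the ratio of the two matters. It says nothing about which is cheaper to achieve, or about the learning benefits of fast turnaround.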

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds adaptive in a blame-avoidance environment, but not in one focused on actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "Field Guide to Understanding Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

by hoorayimhelping   2019-07-12
John Allspaw applied concepts from The Field Guide to Understanding Human Error to software post mortems. When I was at Etsy, he taught a class explaining this whole concept. We read the book and discussed concepts like the Fundamental Attribution Error.

I've found it very beneficial, and the concepts we learned have helped me in almost every aspect of understanding the complicated world we live in. I've taken these concepts to two other companies now to great effect.

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

https://codeascraft.com/2012/05/22/blameless-postmortems/

https://codeascraft.com/2016/11/17/debriefing-facilitation-g...

https://www.oreilly.com/library/view/velocity-conference-201...

by wpietri   2018-11-10
One of the things I think about when analyzing organizational behavior is where something falls on the supportive vs controlling spectrum. It's really impressive how much they're on the supportive end here.

When organizations scale up, and especially when they're dealing with risks, it's easy for them to shift toward the controlling end of things. This is especially true when internally people can score points by assigning or shifting blame.

Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error" [1], a great book on how to investigate airplane accidents, and how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.

[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...

by wpietri   2018-02-20
I think that's ridiculous. Pilots are correctly very reluctant to hit things. Historically, we have wanted them to do their best to avoid that.

You could argue that we should now train pilots to carefully pause and consider whether the thing they are about to hit is safe to hit. But for that, you'd have to show that the additional reaction time in avoiding collisions is really net safer. And if you did argue that, you couldn't judge the current pilots by your proposed new standard.

For those interested, by the way, in really thinking through accident retrospectives, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...

I read it just out of curiosity, but it turned out to be very applicable to software development.

by csours   2017-11-05
The path to a disaster has been compared to a tunnel [0]. You can escape from the tunnel at many points, but you may not realize it.

Trying to find the 'real cause' is a fool's errand, because there are many places and ways to avoid the outcome.

I do take your meaning: reducing speed and following well-established rules would almost certainly have saved them.

[0] PDF: https://www.amazon.com/Field-Guide-Understanding-Human-Error...