Think about our previous example; it was both trivial and profound. It was trivial because there are many ways of achieving some kind of fault tolerance with a library that returns successive numbers. But it was profound because it is a concrete representation of the idea of building rings of confidence in our code. The outer ring, where our code interacts with the world, should be as reliable as we can make it. But within that ring there are other, nested rings. And in those rings, things can be less than perfect. The trick is to ensure that the code in each ring knows how to deal with failures of the code in the next ring down.
And that’s where supervisors come into play. In this chapter we’ve seen only a small fraction of supervisors’ capabilities. They have different strategies for dealing with the termination of a child, different ways of terminating children, and different ways of restarting them. There’s plenty of information online about using OTP supervisors.
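To make those three kinds of choice a little more concrete, here is a minimal sketch of where each one is set when a supervisor is started. It assumes a recent Elixir with map-based child specifications, and it uses an `Agent` holding a counter as a stand-in worker; the `:counter` id and the particular limits are illustrative assumptions, not values taken from this chapter's example.

```elixir
# A stand-in worker: an Agent that holds a counter. Any module with a
# start_link function could be supervised the same way.
children = [
  %{
    id: :counter,                                # hypothetical id for this sketch
    start: {Agent, :start_link, [fn -> 0 end]},  # {module, function, args}
    restart: :permanent,  # how to restart: :permanent, :transient, or :temporary
    shutdown: 5_000       # how to terminate: ms to wait, :brutal_kill, or :infinity
  }
]

{:ok, _supervisor} =
  Supervisor.start_link(children,
    strategy: :one_for_one,  # on a child's exit: :one_for_one, :one_for_all, or :rest_for_one
    max_restarts: 3,         # give up if more than 3 restarts happen...
    max_seconds: 5           # ...within any 5-second window
  )
```

The `strategy` option decides which siblings are restarted when one child dies, `restart` decides whether a given child is restarted at all, and `shutdown` decides how long the supervisor waits before killing a child outright.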
But the real power of supervisors is that they exist. The fact that you use them to manage your workers means you are forced to think about reliability and state as you design your application. And that discipline leads to applications with very high availability—in Programming Erlang (2nd edition) [Arm13], Joe Armstrong says OTP has been used to build systems with 99.9999999% reliability. That’s nine nines. And that ain’t bad.
(In case you were wondering, that equates to a complete application outage of roughly 1 second every 30 years. I don’t know how you’d even measure that, which makes me a little suspicious….)
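The arithmetic is easy to check, assuming 365.25-day years and reading "reliability" as the fraction of time the system is up:

```latex
\[
(1 - 0.999999999) \times 30 \times 365.25 \times 86\,400\ \text{s}
  = 10^{-9} \times 946\,728\,000\ \text{s}
  \approx 0.95\ \text{s}
\]
```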
There’s one more level in our lightning tour of OTP: the application. But before we look at that, let’s use what we’ve learned so far and build some real-world code.