Recovering from errors

Distributed systems are inherently complex. While executing a job in a master/worker setup, numerous things can go wrong, for instance, processes can run out of memory and crash or simply become non-responsive, network packets might be dropped, or network devices might fail and hence lead to network splits. When building distributed systems, we must not only anticipate the presence of errors but we should also devise strategies for dealing with them once they occur.

In this section, we will discuss the following approaches for recovering from errors in a master/worker system: