Recovering from errors

Distributed systems are inherently complex. While executing a job in a master/worker setup, numerous things can go wrong, for instance, processes can run out of memory and crash or simply become non-responsive, network packets might be dropped, or network devices might fail and hence lead to network splits. When building distributed systems, we must not only anticipate the presence of errors but we should also devise strategies for dealing with them once they occur.

In this section, we will discuss the following approaches for recovering from errors in a master/worker system:

Restart on error: This kind of strategy is better suited for workloads whose calculations are idempotent. Once a fatal error is detected, the master asks all workers to abort the current job and restart the workload from scratch.
Re-distribute the workload to healthy workers: This strategy is quite effective for systems that can dynamically change the assigned workloads while a job is executing. If any of the workers goes offline, the master can re-distribute its assigned workload to the remaining workers.
Use a checkpoint mechanism: This strategy is best suited for long-running workloads that involve non-idempotent calculations. While the job is executing, the master periodically asks the workers to create a checkpoint, a snapshot of their current internal state. If an error occurs, instead of restarting the job from scratch, the master asks the workers to restore their state from a particular checkpoint and resume the execution of the job.