Fault Tolerance

description

Currently, if a worker disconnects before the experiment has finished, the master crashes. The master should handle this gracefully instead: first alert the master's GUI that a worker has been lost, then shift that worker's incomplete work onto the remaining worker nodes.
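One possible shape for the requested behaviour is sketched below in Python. This is only an illustration under assumptions, not the project's actual code: the `Master` class, the `notify_gui` callback, the `pending` queue, and the least-loaded reassignment policy are all hypothetical names and choices made up for this example.

```python
from typing import Callable, Dict, List, Set


class Master:
    """Hypothetical master that tracks which worker holds which unfinished task."""

    def __init__(self, notify_gui: Callable[[str], None]) -> None:
        # notify_gui is an assumed hook that displays a message in the master's GUI.
        self.notify_gui = notify_gui
        # worker id -> ids of tasks assigned to it that are not yet complete
        self.assignments: Dict[str, Set[str]] = {}
        # tasks that could not be placed because no workers remain
        self.pending: List[str] = []

    def add_worker(self, worker_id: str) -> None:
        self.assignments.setdefault(worker_id, set())

    def assign(self, worker_id: str, task_id: str) -> None:
        self.assignments[worker_id].add(task_id)

    def complete(self, worker_id: str, task_id: str) -> None:
        self.assignments[worker_id].discard(task_id)

    def on_worker_disconnect(self, worker_id: str) -> List[str]:
        """Alert the GUI, then move the lost worker's unfinished tasks elsewhere."""
        orphaned = sorted(self.assignments.pop(worker_id, set()))
        self.notify_gui(
            f"Lost worker {worker_id}; {len(orphaned)} task(s) to reassign"
        )
        if not self.assignments:
            # Nobody left to take the work: hold it rather than crash.
            self.pending.extend(orphaned)
            self.notify_gui("No workers remain; incomplete work is on hold")
            return orphaned
        for task_id in orphaned:
            # Pick the worker with the fewest open tasks (ties broken by id).
            target = min(
                self.assignments,
                key=lambda w: (len(self.assignments[w]), w),
            )
            self.assignments[target].add(task_id)
        return orphaned
```

In this sketch the networking layer would call `on_worker_disconnect` when it detects a dropped connection, instead of letting the error propagate. Only tasks still recorded as incomplete are moved, and if the last worker drops the work is parked in `pending` so it can be handed out when a worker reconnects.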

comments