There are a lot of situations where the slave may fail to communicate with the master.
These range from hardware issues, such as the slave temporarily losing networking capability, to software issues in the master code that require a reboot.
To make the cluster robust against many different types of failure, it needs to resume operation on its own whenever there is an issue.
Allowing the slave to automatically reconnect to the master handles many situations where the master fails. Once the master comes back online, the slave will detect this and reconnect. Once reconnected, the slave will continue operations and the cluster should resume seamlessly.
This also means that when the master needs to be restarted deliberately, the slave will automatically reconnect once it is back online.
For this tutorial we will ignore slave failures, which will be handled in a later tutorial. It is assumed that when the master goes down, it will automatically come back online at some unspecified later time.
Modifying the slave to reconnect
Currently we catch the disconnection exception thrown when the master disconnects. The script then ends, which stops the slave from running. Today we are going to change it so that once the slave is disconnected, it attempts to reconnect. To do this we are going to encase this code in a while True loop.
The first piece of work is to refactor our socket code and to move the constant data outside of the new loop. Here we store our client number, a socket reference that we set to None to start with, and the server address we are going to use.
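As a rough sketch, the data kept outside the loop might look like the following. The variable names, host, port, and client number here are assumptions made for illustration, not the values used in the cluster code.

```python
import socket

# Hypothetical values: substitute the details of your own master.
client_number = 1                      # identifies this slave to the master
sock = None                            # replaced by a real socket on each connection attempt
server_address = ("localhost", 31415)  # (host, port) tuple in the form expected by sock.connect()
```

Keeping these outside the loop means they are set once and reused on every reconnection attempt.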
Next we create the while loop that our slave code will be encased in:
logger.info("Connecting to the master...")
connected = False
while connected is False:
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # server_address is the (host, port) tuple stored outside the loop (name assumed here)
        sock.connect(server_address)
        connected = True
    except socket.error as e:
        logger.info("Failed to connect to master, waiting 60 seconds and trying again")
        time.sleep(60)
logger.info("Successfully connected to the master")
Here we create the socket and attempt to connect to the master. If this succeeds we set connected to True and begin the slave code previously written.
If this fails, however, we catch the exception and log that the slave failed to connect. The slave will wait 60 seconds and then attempt to connect to the master again. This continues until it has successfully connected to the master.
Since this code is within a while True loop, the slave code will continue to run until the master disconnects. Once this happens, the loop starts over and the slave attempts to connect again, retrying until it successfully connects.
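To show how the two loops fit together, here is a minimal sketch of the overall structure. It is not the exact code from the repository: the function names connect_with_retry and run_forever, the handle_connection callback (standing in for the slave code written previously), and the retry_delay parameter are all hypothetical names introduced for this example.

```python
import logging
import socket
import time

logger = logging.getLogger("slave")


def connect_with_retry(server_address, retry_delay=60):
    """Keep trying to connect until the master accepts the connection."""
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect(server_address)
            return sock
        except socket.error:
            sock.close()  # discard the failed socket before retrying
            logger.info("Failed to connect to master, waiting %s seconds and trying again",
                        retry_delay)
            time.sleep(retry_delay)


def run_forever(server_address, handle_connection, retry_delay=60):
    """Outer loop: connect, run the slave code, and reconnect after a disconnect."""
    while True:
        logger.info("Connecting to the master...")
        sock = connect_with_retry(server_address, retry_delay)
        logger.info("Successfully connected to the master")
        try:
            # handle_connection is a placeholder for the existing slave code
            handle_connection(sock)
        except socket.error:
            logger.info("Lost connection to the master, reconnecting")
        finally:
            sock.close()
```

Splitting the connection attempt into its own function is only one way to organise this; the same behaviour can be written inline as two nested while loops, as described above.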
Summary of changes to the Rpi Cluster
Now our slave will automatically reconnect to the master no matter how many times the master has been restarted. This moves the cluster one step closer to a system that repairs itself automatically when there is an issue.
In the next tutorial we will look at how we can make the cluster code run on boot using rc.local.
The full code is available on GitHub. Any comments or questions can be raised there as issues or posted below.