Description
[I'm not sure whether this should be considered a bug or a PEBKAC. If the latter, it would be nice to document it as a pitfall to avoid.]
In my .ssh/config, I have set up automatic connection sharing via
Host *
ControlMaster auto
ControlPath ~/.ssh/%r@%h:%p
Thus, all ssh sessions to the same host go via the same connection. However, this means that the ssh session initiated by the rollback checker does not necessarily test whether ssh still works - it may just be multiplexed into an existing connection!
Consider the following (slightly contrived) scenario:
- You are about to do a deploy which will break ssh so that new connections are rejected but existing ones carry on working (e.g. by disabling the ssh daemon)
- Before doing so, you ssh <deploy-host> for some reason (to check disk usage, perhaps)
- You leave that ssh session open, and in another terminal you run the deploy
- The deploy's ssh sessions are multiplexed into the existing one. This doesn't matter for the copy and activate, but means that the rollback-check ssh erroneously succeeds (as it uses the existing connection, which is still up, and thus doesn't check whether a fresh connection will succeed)
- Thus the deploy is activated and seems successful
- You now close the manually-opened ssh connection ("disk usage checker" in this example)
- Access is now lost to the remote box! Any new ssh <deploy-host> will now fail, as the host won't accept the new connection.
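To tell whether a given ssh invocation would be multiplexed into an existing connection, as in the scenario above, one can ask ssh about the control master. This is a minimal sketch: `has_shared_connection` and the `SSH_BIN` override are hypothetical names introduced here for illustration, and it assumes a ControlPath is configured for the host as in the config above.

```shell
# Hypothetical helper: succeed if ssh would reuse a shared connection to the host.
# SSH_BIN is an illustrative override so the function can be exercised without
# a real host; by default the real ssh binary is used.
has_shared_connection() {
    # `ssh -O check <host>` exits 0 when a control master is running
    # on the ControlPath that applies to this host, non-zero otherwise.
    "${SSH_BIN:-ssh}" -O check "$1" >/dev/null 2>&1
}
```

Running `has_shared_connection <deploy-host> && echo "a master connection is up"` before a deploy would show whether the manually opened session is still acting as the master.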
I have seen this scenario happen when testing in a VM. I have not seen any issue without the manual ssh connection. (But since deploy-rs will ssh multiple times, I am slightly concerned there could be a race condition where the "activate" connection is not closed in time and thus is shared with the "rollback check" session resulting in the same issue. However, I have not checked the code - this concern may well be baseless)
The solution would be to add the -S none option to (the rollback-check's) ssh options, if we decide this is a bug we should fix, rather than user error.
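As a rough illustration of that option, the sketch below wraps ssh so that it never reuses a control socket. `fresh_ssh` and `SSH_BIN` are hypothetical names, and this is not how deploy-rs actually builds its ssh command (I have not checked the code); it only shows the flags involved.

```shell
# Hypothetical wrapper: always open a fresh connection, ignoring any
# ControlMaster/ControlPath settings from ~/.ssh/config.
# SSH_BIN is an illustrative override; by default the real ssh binary is used.
fresh_ssh() {
    # -S none            disables connection sharing for this invocation
    # -o ControlMaster=no additionally prevents it from becoming a master itself
    "${SSH_BIN:-ssh}" -S none -o ControlMaster=no "$@"
}
```

For example, `fresh_ssh <deploy-host> true` should fail in the scenario above even while the manually opened session is still up, which is the behaviour the rollback check needs.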