bbr-backup-director task fails with error "ssh: handshake failed" when connecting to BOSH director

Article ID: 297218

Products

Concourse for VMware Tanzu

Issue/Introduction

Customer has the following setup with multiple foundations:
 

  • foundation-A to be backed up:
    • BOSH, TAS, TKGI, ...
  • foundation-B running bbr-pcf-pipeline-tasks:

The bbr-backup-director task running in foundation-B uses the bbr CLI to back up foundation-A. Because the foundation-A BOSH director runs in a private CIDR and is not directly reachable from foundation-B, the bbr task uses the tunneling described in the BOSH documentation.

Specifically, the task sets up an SSH tunnel to the BOSH director (the tunnel address is stored in $BOSH_ALL_PROXY) and allows toggling $NO_PROXY to connect to either a local or a remote BOSH director. The expected behavior is as follows:
 

  • When $NO_PROXY is set and includes the BOSH director IP, the bbr task should connect to the director directly, bypassing $BOSH_ALL_PROXY;
  • When $NO_PROXY is not set, the bbr task should use the tunnel in $BOSH_ALL_PROXY, on the assumption that the targeted director is remote and not locally reachable.

At the time of writing (bbr v1.9.6), the bbr CLI does not respect $NO_PROXY; it respects only $no_proxy. As a result, it does not use the tunnel in $BOSH_ALL_PROXY if $no_proxy includes the BOSH director IP/CIDR.
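The following is a minimal sketch for checking whether a given director IP is covered by $no_proxy before running the task. The function names (ip_to_int, list_covers_ip) and the sample IP 10.0.0.14 are hypothetical and not part of bbr or bbr-pcf-pipeline-tasks. It assumes the list holds well-formed exact IPv4 addresses or IPv4 CIDR ranges; it is not a reimplementation of the matching done inside bbr.

```shell
#!/bin/sh
# Convert a dotted IPv4 address (e.g. 10.0.0.14) to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 if IPv4 address $1 is covered by the comma-separated list $2,
# either as an exact entry or by an IPv4 CIDR entry. Other entries
# (hostnames, domain suffixes) are ignored.
list_covers_ip() (
  ip=$1
  ip_int=$(ip_to_int "$ip")
  IFS=,
  for entry in $2; do
    case $entry in
      "$ip")
        exit 0 ;;
      [0-9]*.*.*.*/[0-9]*)
        net=${entry%/*}
        bits=${entry#*/}
        mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
        net_int=$(ip_to_int "$net")
        [ $(( ip_int & mask )) -eq $(( net_int & mask )) ] && exit 0
        ;;
    esac
  done
  exit 1
)

# Hypothetical sample value; replace with the remote BOSH director IP.
DIRECTOR_IP=10.0.0.14

if list_covers_ip "$DIRECTOR_IP" "${no_proxy:-}"; then
  echo "no_proxy covers $DIRECTOR_IP: bbr is expected to bypass BOSH_ALL_PROXY"
else
  echo "no_proxy does not cover $DIRECTOR_IP: bbr is expected to use BOSH_ALL_PROXY"
fi
```

Only the lowercase $no_proxy is checked here because, per the behavior described above, that is the variable bbr v1.9.6 honors.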

Below is an example case hitting the problem:
 

  • The BOSH directors in foundations A and B use the identical private CIDR 172.###.###.0/24 and occupy the identical private IP 172.###.###.14.
  • When Concourse is deployed in foundation-B via the Helm chart, $http_proxy, $https_proxy and $no_proxy are set, with $https_proxy containing the customer's enterprise proxy. The BOSH director private CIDR 172.###.###.0/24 is included in $no_proxy because, from the perspective of foundation-B, there is no need to use $https_proxy to reach the local BOSH director in foundation-B.

The setup is illustrated in the following diagram:



With this setup, the bbr-backup-director task running in foundation-B failed to back up the foundation-A BOSH director with the following error:

[bbr] 2021/03/22 08:10:07 INFO - Looking for scripts
1 error occurred:
error 1:
finding scripts failed on bosh/0: ssh.Run failed: ssh.Stream failed: ssh.Dial failed: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain


The underlying cause is in bosh-utils, the library the bbr tool uses to connect to the BOSH director. bosh-utils does not respect $NO_PROXY; it respects only $no_proxy. In the above example, because $no_proxy covers the director IP, bbr bypasses the tunnel and connects to the local foundation-B BOSH director using the SSH key meant for the foundation-A BOSH director. The local director rejects that key, which produces the SSH authentication failure. Please refer to the bosh-utils code.

Resolution

As a workaround, the customer can unset no_proxy or exclude the BOSH director IP/CIDR from no_proxy by modifying the export-director-metadata script. For example:

# https://github.com/pivotal-cf/bbr-pcf-pipeline-tasks/blob/master/scripts/export-director-metadata#L54-L58

# Set NO_PROXY for BOSH Director
if [ ! -z ${SET_NO_PROXY:+x} ] && [ $SET_NO_PROXY = true ]; then
    export NO_PROXY="${BOSH_ENVIRONMENT},${NO_PROXY:=${no_proxy:=}}"
    echo "exporting NO_PROXY=${NO_PROXY}"
fi

# unset no_proxy or export no_proxy with new value to exclude BOSH director
export no_proxy=<...>
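One possible way to build the new no_proxy value is sketched below. The helper name strip_entries and the sample values are hypothetical; they are not part of export-director-metadata. The sketch assumes the director appears in no_proxy as literal entries (an exact IP and/or a CIDR string) and removes only entries that match exactly.

```shell
#!/bin/sh
# Print comma-separated list $1 with every entry removed that exactly
# equals one of the remaining arguments. Empty entries are dropped.
strip_entries() (
  list=$1
  shift
  out=
  IFS=,
  for entry in $list; do
    [ -n "$entry" ] || continue
    keep=1
    for drop in "$@"; do
      [ "$entry" = "$drop" ] && keep=0
    done
    [ "$keep" -eq 1 ] && out="${out:+$out,}$entry"
  done
  printf '%s\n' "$out"
)

# Hypothetical sample values, for illustration only.
sample="localhost,10.0.0.14,10.0.0.0/24,.example.internal"
strip_entries "$sample" 10.0.0.14 10.0.0.0/24
# prints: localhost,.example.internal

# In the script, the same helper could be applied to the real variable, e.g.:
#   no_proxy=$(strip_entries "${no_proxy:-}" "${BOSH_ENVIRONMENT}" "<DIRECTOR_CIDR>")
#   export no_proxy
```

Removing only the director entries, rather than unsetting no_proxy entirely, keeps the other proxy exclusions configured for the Concourse deployment in place.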


Troubleshooting tips:
 

  • Test connectivity to the remote BOSH director. The test can be performed from the Concourse worker or from the pipeline job container.
    • Establish SSH tunnel to the remote OpsMan VM:
      ssh -4 -D 5000 -NC "ubuntu@<REMOTE_OPSMAN_FQDN>" -i <REMOTE_OPSMAN_SSH_KEY> -o ServerAliveInterval=60 -o StrictHostKeyChecking=no &
    • SSH into the remote BOSH director through the tunnel:
      ssh -o ProxyCommand='nc -X 5 -x localhost:5000 %h %p' -i <BBR_USR_SSH_KEY> bbr@<REMOTE_BOSH_DIRECTOR_IP>
    • Test connectivity with bbr CLI subcommand pre-backup-check:
      export BOSH_ALL_PROXY=socks5://localhost:5000
      echo $no_proxy
      unset no_proxy
      bbr director --host <REMOTE_BOSH_DIRECTOR_IP> --username bbr --private-key-path ./<BBR_USR_SSH_KEY> pre-backup-check
      
  • Check the sshd log on the remote BOSH director to see whether it has received an auth request, and whether the request came from the OpsMan VM local to that director (through the tunnel) or directly from the Concourse worker/container. If sshd receives no auth requests while the ssh handshake error is being reproduced, the bbr CLI has not connected to the correct director.
    • ssh to the remote BOSH director:
      tail -n 0 -f /var/log/auth.log | grep "Accepted publickey for bbr from"
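To tell tunneled logins from direct ones, the source IP in each matching log line can be compared with the OpsMan VM IP. The sketch below assumes the usual OpenSSH log line shape ("Accepted publickey for bbr from <IP> port ..."). The function names, the sample log line and the IPs are hypothetical, for illustration only.

```shell
#!/bin/sh
# Print the source IP of each accepted bbr login read from stdin.
bbr_login_sources() {
  sed -n 's/.*Accepted publickey for bbr from \([^ ]*\) port .*/\1/p'
}

# Label each accepted bbr login on stdin as coming through the tunnel
# (source IP equals the OpsMan IP given as $1) or as a direct connection.
classify_source() (
  opsman_ip=$1
  bbr_login_sources | while read -r src; do
    if [ "$src" = "$opsman_ip" ]; then
      echo "$src via-tunnel"
    else
      echo "$src direct"
    fi
  done
)

# Illustration with a made-up log line and made-up IPs:
printf '%s\n' \
  'Mar 22 08:10:07 bosh sshd[123]: Accepted publickey for bbr from 10.0.0.5 port 40000 ssh2: RSA SHA256:x' \
  | classify_source 10.0.0.5
# prints: 10.0.0.5 via-tunnel
```

On the director, the same function could be fed from the log file, for example `grep "Accepted publickey for bbr from" /var/log/auth.log | classify_source <REMOTE_OPSMAN_IP>`.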