vSphere Replication was converged from 9.0.2.2 to 9.0.4, after which replication failed with the following error message:
"A Replication error occurred at the vSphere Replication Server for replication 'VM'. Details: 'No connection to VR Server for virtual machine on host in cluster : Unknown'"
VLR 9.0.4
vSphere ESXi 9.0.1
The issue occurs because HMS had configured two HbrReplicationTargets with the same IP address but different certificates.
The hbr-agent.log on the source ESX host reported the following:
yyyy-mm-ddThh:mm:ss.671Z In(166) hbr-agent-bin[2882912]: [0x000000e613f97640] info: [ProxyConnection] Setting up secure tunnel to broker on Recovery VRMS:32032
yyyy-mm-ddThh:mm:ss.671Z In(166) hbr-agent-bin[2882912]: [0x000000e613f97640] info: [Proxy [Group: ] -> [Recovery VRMS:32032]] Bound to vmk: vmk3 for connection to Recovery VRMS:32032
yyyy-mm-ddThh:mm:ss.673Z In(166) hbr-agent-bin[2882912]: [0x000000e614018640] info: [Proxy [Group: ] -> [Recovery VRMS:32032]] TCP Connect latency was 1792µs
yyyy-mm-ddThh:mm:ss.681Z In(166) hbr-agent-bin[2882912]: [0x000000e613f16640] error: [Proxy [Group: ] -> [Recovery VRMS:32032]] SSL handshake failed: certificate verify failed (SSL routines)
yyyy-mm-ddThh:mm:ss.681Z In(166) hbr-agent-bin[2882912]: [0x000000e613f16640] error: [Proxy [Group: ] -> [Recovery VRMS:32032]] Failed to connect to broker on Recovery VRMS:32032: certificate verify failed (SSL routines)
yyyy-mm-ddThh:mm:ss.681Z In(166) hbr-agent-bin[2882912]: [0x000000e613f16640] error: [Proxy [Group: ] -> [Recovery VRMS:32032]] Failed to connect to broker: certificate verify failed (SSL routines)
yyyy-mm-ddThh:mm:ss.758Z In(166) hbr-agent-bin[2882912]: [0x000000e613f97640] info: [ProxyConnection] Setting up secure tunnel to broker on Recovery VRMS:32032
The `HbrServerInfo` entity stores Broker information at the Protected site.
It tells the Protected ESX host which certificate to use when connecting to the Recovery Broker.
Primary groups reference the broker they are bound to.
Examining the database (by running `# select name, hbrserverinfo_uuid from primarygroupentity;`) shows no record pointing to `xxxxxxx-xxxx-xxxx-xxx-xxxxxxxx`. No primary group references this record, so it can be safely removed.
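The same check can be done in one query rather than by comparing the two result sets by eye. The following is a minimal sketch, not an official procedure. It assumes that `primarygroupentity.hbrserverinfo_uuid` refers to `hbrserverinfoentity.uuid` and that the database accepts standard PostgreSQL join syntax. It uses only the table and column names that appear in this article.

```sql
-- Sketch: list broker records that no primary group references.
-- Assumption: primarygroupentity.hbrserverinfo_uuid points at hbrserverinfoentity.uuid.
select h.uuid, h.resolvedrepltrafficaddress
from hbrserverinfoentity h
left join primarygroupentity p
       on p.hbrserverinfo_uuid = h.uuid
where p.hbrserverinfo_uuid is null;  -- no matching primary group row
```

Any UUID returned here has no primary group bound to it and is a candidate for the stale record. Confirm it against the certificate output before deleting anything.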
To remove the stale certificate, take a snapshot of the vSphere Replication appliance and then follow the steps below:
# select uuid, resolvedrepltrafficaddress, certificate from hbrserverinfoentity;
The above command displays the UUID of every stored certificate.
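If the table holds many records, a grouped query can narrow the output to addresses that are stored more than once. This is a hedged sketch. It assumes that `resolvedrepltrafficaddress` holds the broker IP address described in the cause above; the article does not state this explicitly.

```sql
-- Sketch: show addresses that appear in more than one hbrserverinfoentity row.
-- Assumption: resolvedrepltrafficaddress is the broker IP that was duplicated.
select resolvedrepltrafficaddress, count(*) as entries
from hbrserverinfoentity
group by resolvedrepltrafficaddress
having count(*) > 1;
```

An address returned with more than one entry matches the condition described above, where the same IP is stored with different certificates.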
systemctl stop hms
delete from hbrserverinfoentity where uuid = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx';
systemctl start hms