This error could be caused by an underlying NSX-T issue.
Use the following steps to determine whether NSX-T is the cause:
1. Check the failing service instance using the command below and see if the VMs are distributed equally across all the AZs.
bosh -d <service-instance> vms
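The AZ check in step 1 can be tallied rather than read by eye. The helper below is a minimal sketch: count_by_az is a hypothetical name, and it assumes its input carries one AZ name per line (for example from the CLI's --column filter, if your bosh CLI version supports it).

```shell
# Hypothetical helper: count VMs per AZ.
# Reads one AZ name per line on stdin and prints "<az> <count>", sorted by AZ.
count_by_az() {
  awk 'NF { count[$1]++ } END { for (az in count) print az, count[az] }' | sort
}

# Example usage (assumes the --column filter is available in your bosh CLI):
#   bosh -d <service-instance> vms --column=az | count_by_az
```

An uneven count, or an AZ missing from the output, points at the instances to look for in the next step.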
2. Check the TAS CPI logs for any mention of ports that are being blocked or of a missing VM. For instance:
/var/vcap/data/packages/vsphere_cpi/50ca35ec1631bcfa54c171e78451b27a104b532d/lib/cloud/vsphere/nsxt_provider.rb:330:in `block in logical_ports'
/var/vcap/data/packages/vsphere_cpi/50ca35ec1631bcfa54c171e78451b27a104b532d/vendor/bundle/ruby/2.4.0/gems/bosh_common-1.3262.24.0/lib/common/retryable.rb:28:in `block in retryer'
/var/vcap/data/packages/vsphere_cpi/50ca35ec1631bcfa54c171e78451b27a104b532d/vendor/bundle/ruby/2.4.0/gems/bosh_common-1.3262.24.0/lib/common/retryable.rb:26:in `loop'
/var/vcap/data/packages/vsphere_cpi/50ca35ec1631bcfa54c171e78451b27a104b532d/vendor/bundle/ruby/2.4.0/gems/bosh_common-1.3262.24.0/lib/common/retryable.rb:26:in `retryer'
/var/vcap/data/packages/vsphere_cpi/50ca35ec1631bcfa54c171e78451b27a104b532d/lib/cloud/vsphere/nsxt_provider.rb:323:in `logical_ports'
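To find frames like the ones above in a long CPI log, a simple filter is usually enough. This is a sketch only: nsxt_cpi_lines is a hypothetical name, and the two patterns are taken from the sample trace, so adjust them if your trace differs.

```shell
# Hypothetical helper: keep only CPI log lines that mention the NSX-T provider
# or the logical_ports lookup, prefixed with their line number in the input.
nsxt_cpi_lines() {
  grep -n -E 'nsxt_provider\.rb|logical_ports'
}

# Example usage (the task id is a placeholder):
#   bosh task <task-id> --cpi | nsxt_cpi_lines
```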
3. Check the NSX-T Manager syslog and see if the logical port (mentioned in the step above) is available:
syslog.4.gz:<182>1 2020-04-29T17:59:12.171Z ensxtcntrl2 NSX 5553 FABRIC [nsx@6876 comp="nsx-manager" subcomp="manager"] VifMsgHandler.BEGIN: Received VifMsg [460d5dd4-42c5-4c9c-847e-3f987083431b:2985]: "operation: ATTACH_VIF_TO_PORT#012type: REQUEST#012vif_attachment {#012 vif_uuid: "f73a97d3-2d9b-45d9-82f3-e3f156fdf0f1"#012 logical_switch_uuid: "b13a385a-3777-417a-953e-0754c393b8bb"#012 logical_port_uuid: ""#012 host_id: "071fc450-7802-4e2e-8d0c-fa1f8eff952e"#012 vmx_path: "/vmfs/volumes/vsan:525615850d14dcb0-309dfeb7ce2f66bd/63c0a95e-8e66-7bf4-6cf1-e4434b183440/vm-b702f94c-7690-456a-891e-523f7ef588a8.vmx"#012 host_operation_id: "ac64cbc-01-01-01-01-c6-9c3a-42-43"#012}#012"
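The VifMsg entry is hard to read because syslog escapes each newline as #012. The sketch below expands those escapes and keeps only the identifier fields, which makes an empty logical_port_uuid (as in the sample above) easy to spot. vif_fields is a hypothetical name, and the helper assumes each message sits on a single line in the log file.

```shell
# Hypothetical helper: expand "#012" escapes and print only the
# vif_uuid, logical_switch_uuid and logical_port_uuid fields.
vif_fields() {
  awk '{ gsub(/#012/, "\n"); print }' \
    | grep -E '^ *(vif_uuid|logical_switch_uuid|logical_port_uuid):' \
    | sed 's/^ *//'
}

# Example usage on a rotated, compressed log:
#   zcat syslog.4.gz | grep ATTACH_VIF_TO_PORT | vif_fields
```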
4. At this point, we can safely say that this is an NSX-T issue. Once the NSX-T issue is identified, the following steps can be taken to rectify it:
- On the ESXi host, list the network connections on port 5671:
[root@rri1esxpl373:/vmfs/volumes/5d750453-7702cbca-06eb-e4434b181320/log] esxcli network ip connection list | grep 5671
tcp 0 0 10.221.28.26:21600 10.221.28.43:5671 ESTABLISHED 2103844 newreno mpa
tcp 0 0 10.221.28.26:21599 10.221.28.43:5671 ESTABLISHED 2103844 newreno mpa
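The connection listing can also be reduced to a single number. The helper below is a sketch under assumptions: established_to_5671 is a hypothetical name, the column positions (foreign address in column 5, state in column 6) are taken from the sample output above, and treating ESTABLISHED as the expected state is inferred from that sample rather than stated by it.

```shell
# Hypothetical helper: count connections whose foreign address ends in :5671
# and whose state is ESTABLISHED. Reads "esxcli network ip connection list"
# output on stdin and prints the count (0 if there are none).
established_to_5671() {
  awk '$5 ~ /:5671$/ && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Example usage on the ESXi host:
#   esxcli network ip connection list | established_to_5671
```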
- Now, verify the service instance again and see if the VMs are assigned to all the Availability Zones (AZs) equally.
bosh -d <service-instance> vms