VMs enter blocked status, causing loss of network connectivity for production workloads

Article ID: 411758

Products

VMware NSX

Issue/Introduction

  • Virtual machines lose network connectivity after being migrated with vMotion.
  • The VM ports enter a blocked state, which directly impacts production workloads.

  • Running net-dvs -l in an SSH session on the ESXi host shows the port as blocked:

    port ########-####-####-####-############:
                    com.vmware.common.port.volatile.status = inUse linkUp blocked portID=###### Port blocked by admin propType = RUNTIME
            
  • In /var/log/hostd.log, the VM is migrated to or powered on on the host just before the connectivity issue starts:
    <time stamp> In(166) Hostd[2102730]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/6###-d##-3##-3###/<vm-path>/<vm-name>.vmx opID=CdrsLoadBalancer-  sid=52c9df38 user=vpxuser:<no user>] State Transition (VM_STATE_OFF -> VM_STATE_IMMIGRATING)
  • In /var/log/nsx-syslog, an ATTACH_PORT call is followed by repeated SYNC_ATTACH_PORT calls for the impacted VM until the issue is remediated or the VM is moved off the host:

    <Time stamp> In(182) nsx-opsagent[3287033]: NSX 3287033 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="24216973" level="INFO"] [DoVifPortOperation] request=[opId:[CdrsLoadBalancer-] op:[HOSTD_ATTACH_PORT(1)] vif:[f###-d##-4##-8###-30ffafe6ad23] ls:[2####-3##-4###-9#####a] vmx:[/vmfs/volumes/6####-d#####-3###-3######//<vm-path>/<vm-name>.vmx] lp:

    <Time stamp>  In(182) nsx-opsagent[3287033]: NSX 3287033 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="3287483" level="INFO"] [DoVifPortOperation] request=[opId:[sync-attach-5] op:[SYNC_ATTACH_PORT(1001)] vif:[f###-d##-4##-8###-30ffafe6ad23] ls:[2####-3##-4###-9#####a] vmx:[/vmfs/volumes/6####-d#####-3###-3######//<vm-path>/<vm-name>.vmx] lp:

    <Time stamp>  In(182) nsx-opsagent[3287033]: NSX 3287033 - [nsx@6876 comp="nsx-esx" subcomp="mpa-client" tid="3287483" level="INFO"] [SwitchingVertical] SendRequest: To Master APH,  type (com.vmware.nsx.switching.VifMsg) correlationId () trackingIdStr (2###-b##-c##-8#####) Success.      <<<<<<<<<< Copy trackingIdStr

    <Time stamp>  In(182) nsx-proxy[3286484]: NSX 3286484 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="3286484" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://#.#.#.15:1234 client_id: 1####-f###-4##-9##-2#####      <<<<<<< #.#.#.15 is the manager IP to which this host is connected


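  • On the NSX Manager to which the host is connected, search /var/log/proton/nsxapi.log for the copied trackingIdStr (grep <trackingIdStr> /var/log/proton/nsxapi.log):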

    <time stamp>  INFO L2TaskExecutor10 RpcManager 77237 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Sending error response to call handle of incoming-request with id aad97####5-fc30-d##-d##application SwitchingVertical
    <time stamp> WARN GmleClientBlockingOpsThread-1 Lease 77237 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1####-7###-3###-b##-f#### and service POLICY_SVC_ROUTING
    <time stamp> ERROR GmleClientBlockingOpsThread-1 Lease 77237 - [nsx@6876 comp="nsx-manager" errorCode="GML206" level="ERROR" s2comp="lease" subcomp="manager"] Unable to get LeadershipLease for service POLICY_SVC_ROUTING on member 4####-1###-f#####-a6###### of group 1####-7###-3###-b##-f####.
    org.bouncycastle.crypto.fips.FipsOperationError: proportionate test failed        <<<<<<< This error is the cause
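
The host-side checks above can be sketched as a small script. This is a minimal illustration and not a VMware-provided tool: the helper names find_blocked_ports and extract_tracking_ids are hypothetical, and the parsing assumes that the net-dvs -l output and the nsx-syslog lines have the same layout as the excerpts shown in this article.

```python
import re

# Matches a port header line such as "port <id>:" in net-dvs -l output.
PORT_RE = re.compile(r"^\s*port\s+(\S+):\s*$")
STATUS_KEY = "com.vmware.common.port.volatile.status"
# Matches the trackingIdStr field of the opsagent SendRequest line.
TRACKING_RE = re.compile(r"trackingIdStr \(([^)]+)\)")


def find_blocked_ports(net_dvs_output):
    """Return the IDs of ports whose volatile status carries the 'blocked' flag."""
    blocked = []
    current_port = None
    for line in net_dvs_output.splitlines():
        match = PORT_RE.match(line)
        if match:
            current_port = match.group(1)
            continue
        if STATUS_KEY in line and "=" in line and current_port is not None:
            value = line.split("=", 1)[1]
            # Assumption: the status flags precede the "portID=" field.
            flags = value.split("portID=")[0].split()
            if "blocked" in flags:
                blocked.append(current_port)
    return blocked


def extract_tracking_ids(syslog_lines):
    """Return the unique trackingIdStr values from VifMsg SendRequest lines."""
    tracking_ids = []
    for line in syslog_lines:
        if "com.vmware.nsx.switching.VifMsg" not in line:
            continue
        match = TRACKING_RE.search(line)
        if match and match.group(1) not in tracking_ids:
            tracking_ids.append(match.group(1))
    return tracking_ids
```

A possible use is to save the output of net-dvs -l to a file, pass its contents to find_blocked_ports, and then pass the nsx-syslog lines to extract_tracking_ids to obtain the values to search for in nsxapi.log on the NSX Manager.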

Environment

VMware NSX 4.2.x and later

Cause

The modules running on NSX Manager are FIPS compliant and use the BouncyCastle FIPS (BCFIPS) module to maintain that compliance.

The error org.bouncycastle.crypto.fips.FipsOperationError: proportionate test failed indicates that BouncyCastle's FIPS-certified cryptographic module failed one of its continuous self-tests. This is a built-in safety mechanism in FIPS 140-2/140-3 certified cryptographic modules. When the self-test fails, the modules running on NSX Manager initialize but enter an error state.
Because all the modules on that NSX Manager are in an error state, when a VM is migrated to or powered on on a host connected to that manager, the port attach calls are not serviced, and the VM port goes into a blocked state.

Resolution

Workaround:
Reboot the NSX Manager node on which the error org.bouncycastle.crypto.fips.FipsOperationError is observed.
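
To decide which manager nodes need the reboot, the log check can be sketched as follows. This is only an illustrative sketch: the function names and the node names in the usage note are hypothetical, and it assumes that the nsxapi.log content of each NSX Manager node has already been collected as lines of text.

```python
# Error signature taken from the nsxapi.log excerpt in this article.
FIPS_SIGNATURE = "org.bouncycastle.crypto.fips.FipsOperationError"


def count_fips_errors(log_lines):
    """Count the log lines that contain the BCFIPS error signature."""
    return sum(1 for line in log_lines if FIPS_SIGNATURE in line)


def nodes_needing_reboot(logs_by_node):
    """Return, sorted by name, the nodes whose log shows the error.

    logs_by_node maps a node name (hypothetical labels chosen by the
    operator) to an iterable of nsxapi.log lines from that node.
    """
    return sorted(
        node
        for node, lines in logs_by_node.items()
        if count_fips_errors(lines) > 0
    )
```

For example, calling nodes_needing_reboot({"manager-a": lines_a, "manager-b": lines_b}) returns only the nodes whose lines contain the error, which are the candidates for the reboot described above.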

A permanent fix will be included in a future release.