"Not Responding" ESXi Hosts and OVC VM Failures During SimpliVity Migration from Standard to Distributed Virtual Switches

Article ID: 394265


Products

VMware vSphere ESXi

Issue/Introduction

When migrating a SimpliVity hyperconverged environment from standard virtual switches (vSS) to distributed virtual switches (vDS), ESXi hosts may become unresponsive or experience unexpected reboots. Symptoms include:

  • Hosts appear as "Not Responding" in vCenter Server
  • SSH access remains available, but the web UI returns 503 errors
  • Management commands like esxcli network nic list fail with "Connection failed"
  • Storage access becomes degraded or unavailable
  • OmniStack Virtual Controller (OVC) VMs enter a zombie state

This issue affects the entire hyperconverged infrastructure, potentially causing widespread VM inaccessibility and service disruption.

The logs from affected systems typically show a specific sequence of events:

  1. Initial PCI passthrough disruption related to network device reassignment:
     NetPort: disabled port (PORTID)
     Net: disconnected client from port (PORTID)
     PCIPassthru: Freeing intr cookies of device 0000:##:00.0 for type:4, devIntrtype:4, devSts:0)
  2. OmniStackVC VM immediately entering a zombie state:
     WARNING: CpuSched: Automatic relation removal from ######(vmx-vcpu-0:OmniStackVC-##-##-##-##, zombie) to ######(LSI-######:0)
  3. Storage connectivity degradation with NFS timeouts appearing:
     NFS: Status:File system timeout (Ok to retry). Retrying synchronous write I/O 3 of 25 times
  4. Federation communication issues with SunRPC failures:
     SunRPC: Synchronous RPC cancel for client 0x########## IP ##.##.##.##.#.# proc 1 xid 0x###### attempt 1 of 3
  5. Full storage disconnection messages:
     NFS: Status:No connection. Retrying synchronous write I/O 1 of 25 times
     NFS: Status:No connection. Retrying synchronous write I/O 2 of 25 times
  6. Critical host heartbeat file failures triggering system backtraces:
     BC: write to host-####-hb (#### ## ######## ######## ######## ######## ######## ######## ######## ########) 8 bytes failed: File system timeout (Ok to retry)
     Log: Generating backtrace for ######: worker
  7. Storage failures escalating to additional system process failures:
     BC: write to host-####-hb (#### ## ######## ######## ######## ######## ######## ######## ######## ########) 8 bytes failed: No connection
     Log: Generating backtrace for ######: fdm
  8. Critical data loss warnings:
     ALERT: BC: File host-####-hb closed with dirty buffers. Possible data loss.
     WARNING: NFSLock: Unable to remove expired or lost primary lockfile .lck-############
  9. System services shutting down prior to reboot:
     Daemon amsd deactivated.
     Daemon ntpd deactivated.
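To confirm that a host followed this failure sequence, the vmkernel log can be searched for the signatures above. A minimal sketch, assuming the standard ESXi log location; the patterns are taken from the log excerpts in this article:

```shell
# check_vmk_signatures: search an ESXi vmkernel log for the failure
# signatures listed above (zombie OVC, NFS timeouts, lost connections,
# dirty heartbeat buffers, SunRPC cancels).
# Usage: check_vmk_signatures [logfile]   (defaults to /var/log/vmkernel.log)
check_vmk_signatures() {
    log="${1:-/var/log/vmkernel.log}"
    grep -E 'zombie|File system timeout|No connection|dirty buffers|Synchronous RPC cancel' "$log"
}
```

Run over SSH on the suspect host; any matching lines, in roughly the order shown above, indicate the host hit this failure sequence.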

Environment

  • VMware vSphere ESXi hosts in a SimpliVity hyperconverged infrastructure
  • Environment undergoing migration from standard virtual switches to distributed virtual switches
  • HPE SimpliVity OmniStack Virtual Controller (OVC) VMs

Cause

The migration from standard virtual switches to distributed virtual switches creates a circular dependency failure when not performed in the correct sequence:

  1. Network transition disrupts connectivity to the OmniStack Virtual Controller (OVC) VM
  2. The OVC VM enters a zombie state, compromising the hyperconverged data services
  3. Storage connectivity failures occur, preventing the ESXi host from maintaining heartbeat files
  4. System processes fail sequentially, eventually leading to a complete host reboot

This occurs because in a hyperconverged environment, the storage services are provided by VMs running on the same hosts that depend on that storage, creating a "chicken-and-egg" scenario when network changes affect both simultaneously.

Resolution

To resolve this issue for affected hosts:

  1. Verify basic network connectivity by confirming you can ping and SSH to the affected hosts.

  2. For hosts that are operational but showing certificate errors:

    1. Change vpxd.certmgmt.mode from vmca to thumbprint in vCenter Server so that host certificates are accepted by thumbprint.

    2. Regenerate certificates on the affected hosts.

    3. Restart the management agents.

  3. For hosts that are completely unresponsive:

    1. Access the host via SSH.

    2. Verify that the endpoint.conf file is correctly configured.

    3. Regenerate host certificates using the following command:

      /sbin/generate-certificates

    4. Restart the management agents:

      /etc/init.d/hostd restart
      /etc/init.d/vpxa restart

  4. After making these changes, wait approximately 30 minutes for the services to fully restart and establish connections.

  5. Once hosts reconnect to vCenter, restart the OmniStack Virtual Controller (OVC) VMs.

  6. Allow sufficient time (may be several hours) for the OVCs to synchronize and restore data services.
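For reference, the per-host recovery commands above can be collected into a single sketch. The commands themselves are the ones named in this article; the DRY_RUN guard is an addition so the sequence can be reviewed before anything is executed:

```shell
# Sketch of the per-host recovery sequence described above. Run over SSH
# on an affected ESXi host. With DRY_RUN=1 each command is only printed,
# not executed (an addition here for safe review).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

recover_host() {
    run /sbin/generate-certificates    # regenerate host certificates
    run /etc/init.d/hostd restart      # restart management agents
    run /etc/init.d/vpxa restart
}
```

After running this on each affected host, allow roughly 30 minutes for services to reconnect before restarting the OVC VMs, as described above.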

To prevent this issue when planning future migrations:

  1. Review HPE SimpliVity documentation for the proper sequence of migrating hyperconverged environments to distributed virtual switches.

  2. Ensure OVC connectivity is maintained throughout the migration by:

    1. Creating the distributed switch and port groups first.

    2. Migrating one physical uplink at a time.

    3. Validating connectivity at each step.

    4. Always keeping at least one management network connection active.
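The connectivity check between uplink moves can be sketched as a simple gate. The target addresses are placeholders to replace with your OVC management IP and default gateway; vmkping is the standard ESXi utility, and the PING_CMD override is an addition for illustration:

```shell
# Validate management-network reachability before migrating the next
# physical uplink. Takes a space-separated list of IPs to check, e.g.
# the OVC management IP and the default gateway (placeholders -- supply
# your own). Returns non-zero on the first unreachable target.
validate_connectivity() {
    targets="$1"
    ping_cmd="${PING_CMD:-vmkping -c 3}"   # vmkping on ESXi; overridable
    for ip in $targets; do
        if $ping_cmd "$ip" >/dev/null 2>&1; then
            echo "OK: $ip reachable"
        else
            echo "FAIL: $ip unreachable -- stop and roll back before moving the next uplink"
            return 1
        fi
    done
}
```

Running this after each uplink move, and stopping on the first failure, keeps the OVC reachable throughout the migration as step 2 above requires.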

Additional Information

  • The time required for OVC synchronization depends on the amount of data in the environment.
  • Host issues may resolve at different rates; some hosts may recover more quickly than others.
  • In severe cases, it may be necessary to temporarily move critical VMs to hosts that have already recovered.