Recover HCX Manager and Fleet Appliances after datastore failure

Article ID: 328982

Products

VMware HCX

Issue/Introduction

HCX Manager and the IX/NE appliances may still be active on the network and reply to ICMP requests.
HCX-IX/NE tunnels may remain up because the data path is not impacted.
HCX Manager shows "Read-error on swap-device" on the VM Web Console or Remote Console.
HCX Fleet Appliances (IX/NE) show the following messages on the VM Web Console or Remote Console:

You are in emergency mode
EXT4-fs error (sda#)
Remounting filesystem read-only
Detected aborted journal
The HCX IX/NE appliance transport tunnel remains up, but the appliance shows "System state is critical".
The following messages may also appear:
"Partition root has problem to create new files. Err: open \/fscheck####: read-only file system",
"Partition log has problem to create new files. Err: open \/var\/log\/fscheck####: read-only file system"
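
As a rough aid for triage, the symptoms above can be matched against saved console or log text. The following is a minimal sketch, not an HCX tool: the function name `find_signatures`, the label names, and the assumption that the console output has been captured to a plain-text string are all hypothetical. The patterns are taken from the messages quoted in this article.

```python
import re

# Patterns copied from the messages quoted in this article.
# The dictionary keys are hypothetical labels chosen for illustration.
SIGNATURES = {
    "swap_read_error": re.compile(r"Read-error on swap-device", re.IGNORECASE),
    "emergency_mode": re.compile(r"You are in emergency mode", re.IGNORECASE),
    "ext4_error": re.compile(r"EXT4-fs error", re.IGNORECASE),
    "remount_read_only": re.compile(r"Remounting filesystem read-only", re.IGNORECASE),
    "aborted_journal": re.compile(r"Detected aborted journal", re.IGNORECASE),
    "read_only_fs": re.compile(r"read-only file system", re.IGNORECASE),
}


def find_signatures(text: str) -> list[str]:
    """Return the sorted labels of all known signatures found in the text."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern.search(text))
```

Calling `find_signatures` on captured console text returns the labels of the messages it contains. An empty list means none of the messages listed here were found, which does not by itself prove the appliance is healthy.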

Environment

VMware HCX

Cause

The guest operating system transitioned to a read-only file system state because of underlying storage issues. The OS typically does this as a protective measure when it detects I/O errors or potential file system corruption, to prevent further damage.
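
On a Linux guest, a file system that has been remounted read-only appears with the `ro` option in `/proc/mounts`. The sketch below shows one way to spot that state from text in the `/proc/mounts` format. It assumes shell access to the guest and an available Python interpreter, neither of which this article confirms for HCX appliances. The function name `read_only_mounts` is hypothetical.

```python
def read_only_mounts(mounts_text: str, fs_types=("ext4",)) -> list[str]:
    """Return mount points of the given file system types mounted read-only.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; "ro" marks a read-only mount.
        if fs_type in fs_types and "ro" in options.split(","):
            result.append(mount_point)
    return result
```

To use it on a live system, pass in the contents of `/proc/mounts`. Any ext4 mount point in the result is currently read-only, which is consistent with the cause described above.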

Resolution

To resolve this issue, first address the underlying storage problem. Once storage is healthy again, reboot the affected VMs.

If the issue persists after the reboot, open a support case with Broadcom Support and reference this KB article.
For more information, see Creating and managing Broadcom support cases.

NOTE:
If the recovery process fails, restore the HCX VM from backup.
If no backup is available, redeployment is necessary.

For Fleet Appliances (IX/NE), redeploy the appliance.
For NE in HA, use "RECOVER", which attempts to return the HA group to a Healthy state.
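
The recovery order described in this section can be summarized as a small decision helper. This is one reading of the steps above, sketched for illustration only. The function `next_step`, its parameters, and the returned labels are hypothetical, and treating "HCX VM" as the HCX Manager VM is an assumption.

```python
def next_step(
    appliance: str,            # "manager", "IX", or "NE" (hypothetical labels)
    storage_fixed: bool,
    rebooted: bool = False,
    healthy_after_reboot: bool = False,
    backup_available: bool = False,
    in_ha_group: bool = False,
) -> str:
    """Return the next recovery action, following the order in this article."""
    if not storage_fixed:
        return "fix-storage"            # storage must be addressed first
    if not rebooted:
        return "reboot-vm"
    if healthy_after_reboot:
        return "done"
    if appliance == "manager":
        # Assumption: "HCX VM" in the note refers to the HCX Manager VM.
        # A support case is the first resort; these are the fallbacks.
        return "restore-from-backup" if backup_available else "redeploy"
    if appliance == "NE" and in_ha_group:
        return "ha-recover"             # the "RECOVER" operation for NE in HA
    return "redeploy"                   # Fleet Appliances (IX/NE)
```

For example, `next_step("NE", storage_fixed=True, rebooted=True, in_ha_group=True)` yields the `"ha-recover"` label, mirroring the guidance to use "RECOVER" for NE in HA.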

Additional Information

All HCX management services may be down because the system cannot boot.
NE appliances remain operational and the L2C data path continues to forward traffic.
No migration or configuration workflows will be serviced.
There is no risk in executing the workaround procedure, as the VM may already be considered unrecoverable.
