ESXi hosts may experience a Purple Screen of Death (PSOD) and subsequent VM downtime due to outdated network interface card (NIC) firmware and drivers.
Typical error messages in vmkernel logs may include:
WARNING: [Device] FW went bad, stop waiting for queue flush
WARNING: [Device] Timeout waiting for queue flush response
WARNING: [Device] Failed to get coredump segment list
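To check whether a host is logging these warnings, the log can be scanned for the three message fragments above. The following is a minimal sketch, not a supported tool: the function name `count_nic_warnings` is hypothetical, and the sample lines are built from the messages listed above rather than taken from a real host.

```python
from collections import Counter

# Message fragments copied from the warnings listed above.
SIGNATURES = (
    "FW went bad, stop waiting for queue flush",
    "Timeout waiting for queue flush response",
    "Failed to get coredump segment list",
)


def count_nic_warnings(lines):
    """Count WARNING lines that contain one of the known fragments."""
    counts = Counter()
    for line in lines:
        if "WARNING" not in line:
            continue
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts


# Illustrative input only; in practice, pass the lines of a vmkernel log copy.
sample = [
    "WARNING: [Device] FW went bad, stop waiting for queue flush",
    "WARNING: [Device] Timeout waiting for queue flush response",
    "WARNING: [Device] Timeout waiting for queue flush response",
    "an unrelated log line",
]
for signature, hits in count_nic_warnings(sample).items():
    print(hits, signature)
```

To scan a real log, open a copy of the host's vmkernel log (typically /var/log/vmkernel.log) and pass the file object to the function in place of `sample`.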
The PSOD is caused by unstable NIC operation resulting from significantly outdated firmware and drivers. The instability first appears as network communication issues and queue flush failures, and ultimately crashes the host.
To resolve this issue and prevent future occurrences, follow these steps to update the NIC firmware and drivers:
1. Schedule a maintenance window for all affected ESXi hosts.
2. Download the latest firmware and drivers recommended for your NICs in the VMware Compatibility Guide (VCG).
3. Update the NIC firmware:
a. Access the host's management interface.
b. Navigate to the firmware update section.
c. Upload and install the new firmware.
4. Update the NIC driver:
a. Put the host in maintenance mode.
b. Upload the new driver to the host.
c. Install the driver using the esxcli command or vSphere Update Manager (vSphere Lifecycle Manager in vSphere 7.0 and later).
5. Reboot the host to apply all updates.
6. Exit maintenance mode and verify that VMs can be powered on successfully.
7. Monitor the host's performance and stability after the update, and check the vmkernel log for any recurrence of the warnings listed above.
8. Repeat steps 3-7 for all affected hosts in the cluster.
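For the esxcli path, steps 4 through 6 map onto a short command sequence. The sketch below only builds and prints that sequence so it can be reviewed before anything is run on a host; it does not execute anything. The function name `driver_update_plan` and the bundle path are hypothetical, and the right install command depends on how the vendor packages the driver (an offline bundle of VIBs, or a component on ESXi 7.0 and later), so treat this as an assumption to check against the driver's release notes.

```python
import shlex


def driver_update_plan(bundle_path, use_component=False):
    """Return the esxcli command lines for steps 4 to 6, in order.

    bundle_path must be an absolute path to the offline bundle on the host.
    """
    if use_component:
        # Component-based packaging (ESXi 7.0 and later).
        install = ["esxcli", "software", "component", "apply", "-d", bundle_path]
    else:
        # Offline bundle containing VIBs.
        install = ["esxcli", "software", "vib", "install", "-d", bundle_path]
    steps = [
        ["esxcli", "system", "maintenanceMode", "set", "--enable", "true"],
        install,
        ["reboot"],
        # The remaining commands are run in a new session after the reboot.
        ["esxcli", "system", "maintenanceMode", "set", "--enable", "false"],
        ["esxcli", "network", "nic", "list"],
    ]
    return [shlex.join(step) for step in steps]


# Hypothetical datastore path, for illustration only.
for command in driver_update_plan("/vmfs/volumes/datastore1/nic-driver-bundle.zip"):
    print(command)
```

The final `esxcli network nic list` call is a quick check that the NICs are present and report the expected driver after the reboot.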
- Always check the VMware Compatibility Guide for the latest recommended firmware and driver versions for your hardware.
- Implement a regular maintenance schedule to check for and apply firmware and driver updates.
- Consider updating firmware and drivers for other components such as storage adapters to ensure overall system stability.
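A regular check can compare what each NIC reports against the versions recorded from the VMware Compatibility Guide. The sketch below parses the "Driver Info" fields from text in the style of `esxcli network nic get -n <vmnic>` output and lists the fields that differ from a baseline. The helper names, the sample text, and all version strings are hypothetical; the parser assumes the "Driver", "Version", and "Firmware Version" labels appear once each, and it tests for equality only rather than ordering versions.

```python
def parse_driver_info(nic_get_output):
    """Extract driver name, driver version, and firmware version."""
    wanted = {
        "Driver": "driver",
        "Version": "driver_version",
        "Firmware Version": "firmware_version",
    }
    info = {}
    for line in nic_get_output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in wanted:
            info[wanted[key]] = value.strip()
    return info


def needs_update(installed, recommended):
    """Return the baseline fields whose installed value differs."""
    return sorted(k for k, v in recommended.items() if installed.get(k) != v)


# Illustrative text only, not captured from a real host.
SAMPLE = """\
   Driver Info:
         Bus Info: 0000:02:00:0
         Driver: examplenic
         Firmware Version: 1.0.0
         Version: 1.2.3
   Link Detected: true
"""

installed = parse_driver_info(SAMPLE)
# Hypothetical baseline recorded from the compatibility guide.
baseline = {"driver_version": "2.0.0", "firmware_version": "1.0.0"}
print(needs_update(installed, baseline))
```

The same comparison could be extended to storage adapters, given an equivalent source of installed versions.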