Redundant uplinks are available to the iSCSI/vSAN port group on the ESXi host.
Physical switches and storage controllers are set to Active-Active mode.
The expected behavior was that when one storage controller or physical switch went down, the other would remain active, so the virtual machines should not have seen an impact that resulted in a reboot.
The issue was noticed on Red Hat Linux virtual machines that were running Oracle database instances. Windows virtual machines were unaffected.
vmware.log:
2025-03-22T10:35:48.904Z In(05) vmx - VigorTransport_ServerSendResponse opID=lro--1648638810-12ec7059-01-01-bb-274a seq=14922074: Completed GuestStats request.
2025-03-22T10:36:07.812Z In(05) vcpu-0 - Vix: [vmxCommands.c:7182]: VMAutomation_HandleCLIHLTEvent. Do nothing.
2025-03-22T10:36:07.812Z In(05) vcpu-0 - MsgHint: msg.monitorevent.halt
2025-03-22T10:36:07.812Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
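The halt event in the excerpt above can be located programmatically when several vmware.log files need to be reviewed. The following is a minimal Python sketch, not a supported tool: the function name find_halt_events is hypothetical, and it assumes every relevant line starts with the UTC timestamp format shown above and carries the msg.monitorevent.halt hint.

```python
import re
from datetime import datetime, timezone

# Marker string taken from the vmware.log excerpt above.
HALT_MARKER = "msg.monitorevent.halt"
# Assumption: each line starts with a UTC timestamp such as 2025-03-22T10:36:07.812Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z")


def find_halt_events(lines):
    """Return the UTC timestamps of lines that carry the guest halt hint."""
    events = []
    for line in lines:
        if HALT_MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append(stamp.replace(tzinfo=timezone.utc))
    return events


# Two lines copied from the excerpt above; only the second carries the hint.
sample = [
    "2025-03-22T10:36:07.812Z In(05) vcpu-0 - Vix: [vmxCommands.c:7182]: "
    "VMAutomation_HandleCLIHLTEvent. Do nothing.",
    "2025-03-22T10:36:07.812Z In(05) vcpu-0 - MsgHint: msg.monitorevent.halt",
]
halts = find_halt_events(sample)
for stamp in halts:
    print(stamp.isoformat())
```

To scan a real file, pass an open file handle instead of the sample list, for example find_halt_events(open("vmware.log")).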
VMware vSphere 7.0
VMware vSphere 8.0
Below is an example from a physical switch upgrade.
While physical switch #1 was being upgraded, the connections passing through that switch went down:
/var/log/syslog.log:
2025-03-22T10:05:33.216Z iscsid[2099496]: connection 3:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 if=default addr=xxx.xxx.1.98:3260 (TPGT:1 ISID:0x1) (T1 C0)) Nop-out timeout after 10 sec in state (3). <========= This was expected due to physical switch upgrade
2025-03-22T10:05:33.219Z iscsid[2099496]: connection 1:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 if=default addr=xxx.xxx.1.97:3260 (TPGT:1 ISID:0x1) (T0 C0)) Nop-out timeout after 10 sec in state (3). <========= This was expected due to physical switch upgrade
However, a few minutes later there was an unexpected 10-second timeout on the remaining working connections:
2025-03-22T10:13:49.643Z iscsid[2099496]: connection 4:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 if=default addr=xxx.xxx.1.96:3260 (TPGT:1 ISID:0x2) (T1 C1)) Nop-out timeout after 10 sec in state (3).
2025-03-22T10:13:49.652Z iscsid[2099496]: connection 2:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 if=default addr=xxx.xxx.1.95:3260 (TPGT:1 ISID:0x2) (T0 C1)) Nop-out timeout after 10 sec in state (3).
The Nop-out operation is explained in this KB: Lost connectivity to datastore due to iSCSI Nop-out timeouts
When all the connections were down for a window of more than 10 seconds, the Red Hat Linux virtual machines running Oracle databases went down.
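To see which connections timed out together, the iscsid entries can be grouped by time. The following Python sketch is illustrative only: the regular expression is written against the four syslog lines quoted above and may not match other iscsid message variants, and the 5-second grouping gap and the names parse_timeouts and group_bursts are assumptions, not part of any VMware tooling.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the iscsid lines quoted above (an assumption about format).
NOP_TIMEOUT = re.compile(
    r"^(?P<ts>\S+)Z iscsid\[\d+\]: connection (?P<conn>\d+:\d+) "
    r"\((?P<target>\S+) if=\S+ addr=(?P<addr>\S+) .*\) "
    r"Nop-out timeout after (?P<secs>\d+) sec"
)


def parse_timeouts(lines):
    """Extract (time, connection, address, seconds) tuples, sorted by time."""
    events = []
    for line in lines:
        m = NOP_TIMEOUT.match(line)
        if m:
            when = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((when, m.group("conn"), m.group("addr"), int(m.group("secs"))))
    return sorted(events)


def group_bursts(events, gap=timedelta(seconds=5)):
    """Group events whose timestamps lie within `gap` of the previous event."""
    bursts = []
    for event in events:
        if bursts and event[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(event)
        else:
            bursts.append([event])
    return bursts


PREFIX = "iscsid[2099496]: connection"
SUFFIX = "Nop-out timeout after 10 sec in state (3)."
sample = [
    f"2025-03-22T10:05:33.216Z {PREFIX} 3:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 "
    f"if=default addr=xxx.xxx.1.98:3260 (TPGT:1 ISID:0x1) (T1 C0)) {SUFFIX}",
    f"2025-03-22T10:05:33.219Z {PREFIX} 1:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 "
    f"if=default addr=xxx.xxx.1.97:3260 (TPGT:1 ISID:0x1) (T0 C0)) {SUFFIX}",
    f"2025-03-22T10:13:49.643Z {PREFIX} 4:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 "
    f"if=default addr=xxx.xxx.1.96:3260 (TPGT:1 ISID:0x2) (T1 C1)) {SUFFIX}",
    f"2025-03-22T10:13:49.652Z {PREFIX} 2:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 "
    f"if=default addr=xxx.xxx.1.95:3260 (TPGT:1 ISID:0x2) (T0 C1)) {SUFFIX}",
]
bursts = group_bursts(parse_timeouts(sample))
for burst in bursts:
    print(burst[0][0].time(), [conn for _, conn, _, _ in burst])
```

With the sample lines, the two connections through the upgraded switch fall into one group and the two remaining connections into a second, later group, which is the pattern described above.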
The Red Hat team investigated the Linux logs and found that the short disk timeout had reached 27 seconds, which is the value configured on these guest VMs. The CPU halted when the 27-second timeout was reached, resulting in a guest VM reboot.
Note: VMware Tools sets the Linux guest disk timeout to 180 seconds by default, and that default was in effect in this case.
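The guest disk timeout can be confirmed from inside a Linux guest. The sketch below assumes the note refers to the per-device SCSI command timeout that Linux exposes at /sys/block/<device>/device/timeout, and that the disks appear as sd* devices; the function name read_disk_timeouts is hypothetical. It does not read the 27-second application-level value, which is configured separately within the guest.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device: timeout_seconds} for sd* disks found under sys_block."""
    timeouts = {}
    for entry in sorted(Path(sys_block).glob("sd*")):
        timeout_file = entry / "device" / "timeout"
        try:
            timeouts[entry.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Skip devices without a readable numeric timeout attribute.
            continue
    return timeouts


# On a Linux guest this lists each sd* disk with its current timeout in seconds.
for device, seconds in read_disk_timeouts().items():
    print(f"{device}: {seconds}s")
```

A value of 180 for each virtual disk would be consistent with the default described in the note.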
As a first step, refer to the article: Virtual Machine rebooted with the following event: "The CPU has been disabled by the guest operating system"
The recommendation for these Red Hat VMs (versions 6/7/8) was to increase the CSS Misscount and Disk Misscount values from 27 to 90 seconds within the guest OS during maintenance that triggers path failover.
Open a case with the Linux vendor and the application vendor to ask about the proper methods to check the current values and to raise them for the duration of the hardware maintenance.