Virtual Machines rebooted while performing upgrades on redundant Storage Array Controllers or Top of the Rack Switches

Article ID: 393304


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Redundant uplinks are available to the iSCSI/vSAN port groups on the ESXi hosts.
Physical switches and storage controllers are set to Active-Active mode.

The expected behavior was that when one storage controller or physical switch went down, the other would remain active, so the virtual machines should not have seen an impact severe enough to cause a reboot.
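
Before and during such maintenance, redundancy can be verified on the ESXi host with commands along the following lines. This is a generic sketch only; the adapter, VMkernel interface, port group, and target addresses are placeholders and will differ per environment.

# List all storage paths and their state (expect multiple active paths per device)
esxcli storage core path list

# Show the multipathing policy and working paths for each device
esxcli storage nmp device list

# Check the teaming/failover policy of the iSCSI/vSAN port group (port group name is a placeholder)
esxcli network vswitch standard portgroup policy failover get -p "iSCSI-PG"

# Confirm each storage target is reachable over each VMkernel interface (interface and address are placeholders)
vmkping -I vmk1 xxx.xxx.1.95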

The issue was noticed on Red Hat Linux virtual machines running Oracle database instances. Windows virtual machines were unaffected.

VMware.log:
2025-03-22T10:35:48.904Z In(05) vmx - VigorTransport_ServerSendResponse opID=lro--1648638810-12ec7059-01-01-bb-274a seq=14922074: Completed GuestStats request.
2025-03-22T10:36:07.812Z In(05) vcpu-0 - Vix: [vmxCommands.c:7182]: VMAutomation_HandleCLIHLTEvent. Do nothing.
2025-03-22T10:36:07.812Z In(05) vcpu-0 - MsgHint: msg.monitorevent.halt
2025-03-22T10:36:07.812Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
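
To confirm whether a particular virtual machine hit this condition, its vmware.log can be searched for the halt event. The datastore and VM directory names below are placeholders.

grep -i -e "monitorevent.halt" -e "disabled by the guest" /vmfs/volumes/<datastore>/<vm-name>/vmware.log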

Environment

VMware vSphere 7.0
VMware vSphere 8.0

Cause

Below is an example from an upgrade of the physical switches.

While physical switch #1 was being upgraded, the iSCSI connections routed through that switch went down:

/var/log/syslog.log: 
2025-03-22T10:05:33.216Z iscsid[2099496]: connection 3:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 if=default addr=xxx.xxx.1.98:3260 (TPGT:1 ISID:0x1)  (T1 C0)) Nop-out timeout after 10 sec in state (3).    <=========  This was expected due to physical switch upgrade
2025-03-22T10:05:33.219Z iscsid[2099496]: connection 1:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 if=default addr=xxx.xxx.1.97:3260 (TPGT:1 ISID:0x1)  (T0 C0)) Nop-out timeout after 10 sec in state (3).    <=========  This was expected due to physical switch upgrade

However, a few minutes later there was an unexpected 10-second timeout on the remaining working connections:
2025-03-22T10:13:49.643Z iscsid[2099496]: connection 4:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node2 if=default addr=xxx.xxx.1.96:3260 (TPGT:1 ISID:0x2)  (T1 C1)) Nop-out timeout after 10 sec in state (3).
2025-03-22T10:13:49.652Z iscsid[2099496]: connection 2:0 (iqn.1986-03.com.ibm:xxxx.customer-xxxx.node1 if=default addr=xxx.xxx.1.95:3260 (TPGT:1 ISID:0x2)  (T0 C1)) Nop-out timeout after 10 sec in state (3).

The Nop-out operation is explained in this KB article: Lost connectivity to datastore due to iSCSI Nop-out timeouts
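
The 10-second value in the messages above corresponds to the Nop-out timeout configured on the software iSCSI adapter. A minimal sketch for reviewing these settings on the ESXi host is shown below; the adapter name vmhba64 is a placeholder, and any change should only be made per the KB referenced above and the storage vendor's recommendations.

# Identify the software iSCSI adapter
esxcli iscsi adapter list

# Review the current Nop-out interval/timeout values (adapter name is a placeholder)
esxcli iscsi adapter param get -A vmhba64

# Example only: raise the Nop-out timeout (confirm the value with support/vendor before changing)
esxcli iscsi adapter param set -A vmhba64 -k NoopOutTimeout -v 30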


When all of the connections were down for a window of more than 10 seconds, the Red Hat Linux virtual machines running Oracle databases went down.
The Red Hat team investigated the Linux logs and found that the short disk timeout of 27 seconds, the value configured on these guest VMs, had been reached. The CPU halted when the 27-second timeout was hit, resulting in a guest VM reboot.

Note: The Linux guest disk timeout set by VMware Tools defaults to 180 seconds and was at that default in this case.
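
To check which disk timeout is actually in effect inside a Red Hat guest, the per-device value exposed in sysfs can be read as sketched below; the device names are examples and the udev rule location can vary by open-vm-tools version.

# Show the current SCSI disk timeout (in seconds) for each disk
for d in /sys/block/sd*/device/timeout; do echo "$d: $(cat $d)"; done

# Locate the VMware Tools / open-vm-tools udev rule that applies the 180-second default
grep -r "device/timeout" /etc/udev/rules.d/ /usr/lib/udev/rules.d/ 2>/dev/null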

Resolution

As a first step, refer to the article: Virtual Machine rebooted with the following event: "The CPU has been disabled by the guest operating system"

The recommendation for these Red Hat VMs (versions 6/7/8) was to increase the CSS Misscount and Disk Misscount values from 27 to 90 seconds within the guest OS for the duration of the maintenance path failover, as sketched below.
Open a case with the Linux vendor and the application vendor to confirm the proper method to check the current values and to increase them to a higher value for the period of the hardware maintenance.
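
As an illustration only: on Oracle Clusterware these values are typically inspected and changed with crsctl. The parameter names misscount and disktimeout are assumed here to correspond to the article's "CSS Misscount" and "Disk Misscount", and the 90-second value and the appropriateness of any change should be confirmed with Oracle and Red Hat support before use.

# Run from the Grid Infrastructure home (path is an example; set may require root)
$GRID_HOME/bin/crsctl get css misscount
$GRID_HOME/bin/crsctl get css disktimeout

# Example only: raise the values for the maintenance window, then revert per vendor guidance
$GRID_HOME/bin/crsctl set css misscount 90
$GRID_HOME/bin/crsctl set css disktimeout 90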