VM became Read Only and not responding

Article ID: 392676

Products

VMware vSphere ESXi

Issue/Introduction

At least two VMs reside on the same datastore: one runs a Windows guest OS and the other runs Linux.

  • Only the Linux VM stopped responding.
  • The vobd log (/var/run/log/vobd.log) contains messages similar to the following:

YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423055380828us: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423055588380us: [vob.net.lacp.uplink.transition.down] LACP warning: Uplink vmnic# on VDS DvsPortset-# moved out of the link aggregation group.

YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423412994848us: [vob.net.vmnic.linkstate.up] vmnic vmnic# linkstate up
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423876249365us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic# linkstate is up
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423876249555us: [esx.clear.net.dvport.redundancy.restored] Uplink redundancy restored on DVPorts: . Physical NIC vmnic# is up recently
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423878248334us: [esx.clear.net.connectivity.restored] Network connectivity restored on virtual switch , portgroups: . Physical NIC vmnic# is up
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423878248381us: [esx.clear.net.redundancy.restored] Uplink redundancy restored on virtual switch , portgroups: . Physical NIC vmnic# is up
YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423416471229us: [vob.net.lacp.uplink.connected] LACP info: Uplink vmnic# on VDS DvsPortset-3 was connected.
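
To estimate how long each uplink was down, the microsecond counters in messages like those above can be compared. The following is a minimal sketch, not a supported tool. It assumes the log lines follow the layout of the samples shown here, and the function name link_outages and the NIC name vmnic2 in the demo are hypothetical.

```python
import re

# Layout assumed from the sample messages in this article:
#   <timestamp>: [netCorrelator] <counter>us: [<event id>] <message>
LINE_RE = re.compile(
    r"\[netCorrelator\]\s+(?P<us>\d+)us:\s+\[(?P<event>[\w.]+)\]\s+(?P<msg>.*)"
)
NIC_RE = re.compile(r"\bvmnic\d+")

DOWN_EVENT = "vob.net.vmnic.linkstate.down"
UP_EVENT = "vob.net.vmnic.linkstate.up"


def link_outages(lines):
    """Return a list of (nic, seconds_down) for each down/up pair found."""
    down_at = {}   # NIC name -> counter value (microseconds) at link down
    outages = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        nic = NIC_RE.search(match.group("msg"))
        if not nic:
            continue
        name = nic.group()
        counter_us = int(match.group("us"))
        event = match.group("event")
        if event == DOWN_EVENT:
            down_at[name] = counter_us
        elif event == UP_EVENT and name in down_at:
            outages.append((name, (counter_us - down_at.pop(name)) / 1_000_000))
    return outages


if __name__ == "__main__":
    # Hypothetical demo lines: counters copied from the samples in this
    # article, NIC name invented, timestamps left as placeholders.
    demo = [
        "YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423055380828us: "
        "[vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down",
        "YYYY-MM-DDTHH:MM:SS.SSSSZ: [netCorrelator] 5423412994848us: "
        "[vob.net.vmnic.linkstate.up] vmnic vmnic2 linkstate up",
    ]
    for nic_name, seconds in link_outages(demo):
        print(f"{nic_name} was down for about {seconds:.1f} s")
```

Feeding the function the lines of a copied vobd log gives one entry per completed down/up pair, which can then be compared against the time the guest reported I/O errors.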

Cause

One of the storage network switches was rebooted.

  • The host uplinks use LACP, and the two vmnics that went down appear to have been handled as expected: the downed NICs were removed from the link aggregation group, then re-added after the links came back up, which restored redundancy.
  • However, each guest OS responds differently to a stun and to loss of storage access. The Windows VM may not have been affected by the brief interruption while traffic moved between the LACP NICs, or it may have been using the other NICs in the LACP group at the time. Either way, the Linux VM appears to have been stunned long enough to enter a read-only (unresponsive) state. While the VM can still buffer writes in memory, it does not go read-only until that memory fills, so the point at which a VM reaches that state depends on its memory size and the amount of I/O it generates.

Resolution

  • Make changes to the environment, including reboots of network switches (especially storage switches), during planned maintenance windows to avoid unexpected outages. Confirm that all teams that could be affected (e.g., Networking, vSphere) are made aware of the planned change so they can prepare.

  • Work with the internal networking team to confirm that LACP failovers take place as expected (e.g., by performing tests), and determine whether anything in the environment can be changed to reduce the impact on ESXi host storage communication.

  • Work with the VM guest OS vendor to see whether anything can be done within the guest OS to avoid this behavior.
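
When discussing the behavior with the guest OS vendor, it can help to confirm which filesystems the Linux guest actually remounted read-only. The sketch below is one possible check, run inside the guest. It assumes the standard /proc/mounts layout (device, mount point, filesystem type, options) and only looks at /dev-backed mounts. The function name read_only_mounts is hypothetical.

```python
from pathlib import Path


def read_only_mounts(mounts_text):
    """Return (device, mount_point, fs_type) for /dev-backed mounts with 'ro' set."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Ignore pseudo filesystems such as proc, sysfs, and tmpfs.
        if not device.startswith("/dev/"):
            continue
        if "ro" in options.split(","):
            found.append((device, mount_point, fs_type))
    return found


if __name__ == "__main__":
    mounts_file = Path("/proc/mounts")
    if mounts_file.exists():
        for device, mount_point, fs_type in read_only_mounts(mounts_file.read_text()):
            print(f"{device} on {mount_point} ({fs_type}) is mounted read-only")
```

Mounts that are read-only by design (for example, optical media) also appear in the output, so the result should be compared against the guest's normal configuration.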
