OpenShift Container Platform Nodes are in NotReady status

Article ID: 419994

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • When the oc get nodes command is run on an OCP node, one or more nodes show a status of NotReady.

  • The node that is marked as NotReady is unreachable over the network.
  • The uplink shows as void for the impacted node when either of the following commands is run on the ESXi host:
    • netdbg vswitch instance list
    • nsxdp-cli vswitch instance list

  • vmxnet.log (inside the Guest OS) shows the following entries at the time of the issue:
vmxnet3 0000:02:01.0 ens33: tx hang
vmxnet3 0000:02:01.0 ens33: resetting
vmxnet3 0000:02:01.0 ens33: intr type 3, mode 0, 9 vectors allocated
vmxnet3 0000:02:01.0 ens33: Failed to activate dev: error 1
  • The vmware.log file of the impacted VM reports the following entries:
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: Quiesce device 0.
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: UPT support is not requested
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - Ethernet0 MAC Address: 00:50:56:##:##:##
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: Ethernet0 RSS fields requested by vmx: f
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: Activate device 0.
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: failed to activate 'Ethernet0': invalid magic value
YYYY-MM-DDTHH:MM:SS.SSSZ In(05) vcpu-14 - VMXNET3 user: Activate request failed for device 0.
  • Tx descriptor ring corruption is observed on the switchport to which the Virtual Network Adapter of the impacted node is connected. The following dump can be seen in /var/run/log/vmkernel.log:
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488471)Vmxnet3: 2835: Tq: 1 port: 0x6000021 start desc: 451 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488471)Vmxnet3: 2846: Tq: 1 port: 0x6000021 start desc: 455 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488471)Vmxnet3: 2835: Tq: 1 port: 0x6000021 start desc: 459 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488471)Vmxnet3: 2846: Tq: 1 port: 0x6000021 start desc: 463 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu0:2488472)Vmxnet3: 2835: Tq: 2 port: 0x6000021 start desc: 126 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488474)Vmxnet3: 2835: Tq: 4 port: 0x6000021 start desc: 316 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488474)Vmxnet3: 2846: Tq: 4 port: 0x6000021 start desc: 320 hexdump: 0xff00000000 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu0:2488472)Vmxnet3: 2846: Tq: 2 port: 0x6000021 start desc: 130 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu0:2488472)Vmxnet3: 2835: Tq: 2 port: 0x6000021 start desc: 134 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488474)Vmxnet3: 2835: Tq: 4 port: 0x6000021 start desc: 324 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488474)Vmxnet3: 2846: Tq: 4 port: 0x6000021 start desc: 328 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu0:2488472)Vmxnet3: 2846: Tq: 2 port: 0x6000021 start desc: 138 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488475)Vmxnet3: 2835: Tq: 5 port: 0x6000021 start desc: 110 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488475)Vmxnet3: 2846: Tq: 5 port: 0x6000021 start desc: 114 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488475)Vmxnet3: 2835: Tq: 5 port: 0x6000021 start desc: 118 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu4:2488475)Vmxnet3: 2846: Tq: 5 port: 0x6000021 start desc: 122 hexdump: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff
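
When many hosts or large logs have to be reviewed, the vmkernel.log signature above can be searched for programmatically. The following is a minimal sketch, not a supported tool. It assumes the log lines follow the format quoted in this article (other ESXi builds may format them differently), and the function name suspect_tx_dumps is hypothetical. It counts, per switchport and transmit queue, the descriptor dumps that contain an all-ones word.

```python
import re
from collections import Counter

# Pattern modeled on the vmkernel.log lines quoted in this article.
# Assumption: other ESXi builds may format these messages differently.
TX_DUMP = re.compile(
    r"Vmxnet3: \d+: Tq: (?P<tq>\d+) port: (?P<port>0x[0-9a-fA-F]+) "
    r"start desc: (?P<desc>\d+) hexdump: (?P<dump>(?:0x[0-9a-fA-F]+\s*)+)"
)

# A 64-bit word with every bit set, as seen in the corrupted descriptors.
ALL_ONES = "0xffffffffffffffff"


def suspect_tx_dumps(lines):
    """Count descriptor dumps per (port, tx queue) that contain an all-ones word."""
    hits = Counter()
    for line in lines:
        match = TX_DUMP.search(line)
        if match is None:
            continue  # not a Vmxnet3 descriptor dump line
        words = match.group("dump").split()
        if ALL_ONES in words:
            hits[(match.group("port"), int(match.group("tq")))] += 1
    return hits


# Example usage against a copy of the log:
# with open("vmkernel.log") as log:
#     for (port, tq), count in sorted(suspect_tx_dumps(log).items()):
#         print(port, tq, count)
```

A non-empty result only indicates that lines matching this signature are present. It does not by itself confirm the root cause described in this article.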

Environment

VMware vSphere ESXi

Server with AMD CPUs

Red Hat Enterprise Linux Operating System 

Cause


  • To help Guest Operating Systems utilize IOMMU features, the hypervisor provides a virtualized IOMMU (vIOMMU) to the virtual machine.
  • This vIOMMU functions as an intermediary layer, translating requests between the Guest OS's IOMMU drivers and the physical IOMMU hardware used by the hypervisor to manage device communications.
  • Linux programs the I/O page tables using a page table level that is not supported by the virtual IOMMU that the ESXi hypervisor exposes, which is a violation of the AMD IOMMU specification.
  • This causes a discrepancy between the memory seen by the hypervisor and the memory seen by the Guest OS.
  • ESXi detects the corruption when, while pulling packets from the Guest ring, it finds garbage values in the transmit ring descriptors (as seen in the vmkernel.log excerpt above). ESXi then stops the transmit queues of the Virtual Network adapter (vmxnet3/e1000/e1000e) and sends an interrupt to the Guest OS to reset the adapter, which makes the Virtual Machine unavailable on the network.

NOTE: The issue has currently been isolated to Red Hat, and their engineering team is actively working on it.

Reference: VMware guest with large memory hangs

Resolution

Two workarounds are currently available:

1. Set iommu=pt as a kernel boot parameter in the Guest OS.

For assistance configuring this parameter, contact Red Hat customer support.
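
After the node has been rebooted with the new parameter, it can be useful to confirm that the running kernel actually received it. The following is a minimal sketch that inspects a kernel command line string such as the contents of /proc/cmdline inside the Guest OS. The function name is hypothetical, and the handling of comma-separated iommu= options is an assumption about how the parameter may be written.

```python
def iommu_passthrough_requested(cmdline):
    """Return True if a kernel command line string contains iommu=pt."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == "iommu" and sep:
            # Assumption: several options may be comma-separated,
            # for example iommu=soft,pt.
            if "pt" in value.split(","):
                return True
    return False


# Example usage inside the Guest OS:
# with open("/proc/cmdline") as f:
#     print(iommu_passthrough_requested(f.read()))
```

The check only reports what was passed on the command line. It does not verify how the kernel applied the setting.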

2. Set the parameter vvtd.enable to false in the vmx configuration file of the impacted Virtual Machine as follows:

vvtd.enable = "FALSE"

For assistance configuring this parameter, open a case with Broadcom Support and reference this Knowledge Base article.
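
For illustration only, the vmx change can be expressed as a small text transformation. The following is a minimal sketch that works on the contents of a .vmx file held in a string. It assumes the usual key = "value" line format, assumes that keys can be matched case-insensitively, and uses a hypothetical function name. It does not replace the supported ways of editing VM advanced settings, and it assumes the change is made while the Virtual Machine is powered off.

```python
import re


def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key = "value" set, replacing an existing entry if present."""
    new_line = f'{key} = "{value}"'
    # Assumption: one option per line, in the form key = "value".
    pattern = re.compile(
        rf"^[ \t]*{re.escape(key)}[ \t]*=.*$",
        re.IGNORECASE | re.MULTILINE,
    )
    if pattern.search(vmx_text):
        # Replace the existing entry (a function avoids escape handling).
        return pattern.sub(lambda _match: new_line, vmx_text)
    # Otherwise append the option on its own line.
    separator = "" if vmx_text.endswith("\n") or not vmx_text else "\n"
    return f"{vmx_text}{separator}{new_line}\n"


# Example usage on text already read from a copy of the .vmx file:
# updated = set_vmx_option(original_text, "vvtd.enable", "FALSE")
```

Applying the function to text that already contains vvtd.enable = "FALSE" returns it unchanged, so the step is safe to repeat.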