Nodes in TKGm clusters crash unexpectedly and are then reported as NotReady by the control plane.
When in this state, the affected nodes exhibit the following behaviors:
Complete loss of network accessibility (SSH attempts fail).
No response to standard interactions via the vSphere cluster interface.
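As a rough aid for confirming the NotReady symptom, the sketch below filters the JSON produced by `kubectl get nodes -o json` for nodes whose Ready condition is anything other than "True". The function name `not_ready_nodes` and the sample node names are hypothetical and only illustrate the idea; they do not come from the affected environment.

```python
import json
from typing import List


def not_ready_nodes(nodes_json: str) -> List[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    result = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Ready can be "True", "False", or "Unknown"; a missing Ready
        # condition is treated here as not ready as well.
        if ready is None or ready.get("status") != "True":
            result.append(node["metadata"]["name"])
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "worker-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "worker-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
})

print(not_ready_nodes(SAMPLE))
```

In practice the input would be the saved output of `kubectl get nodes -o json` run against the workload cluster rather than the inline sample.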
3.x
The node failure is caused by an underlying storage issue in which the operating system encounters corruption or a read/write failure on an ext4-formatted filesystem attached via a Persistent Volume Claim (PVC).
Specifically, a Java application process (comm java) attempts to read from or write to its mapped disk. Because the filesystem on the underlying block device (e.g., /dev/sdd) can no longer be read, the storage becomes inaccessible and the node kernel locks up or crashes.
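If kernel log text from the node is available (for example, captured from the VM console or collected after a restart), a small script can pick out the ext4 errors and show which device and process were involved. The sketch below assumes the usual kernel message shape `EXT4-fs error (device <dev>): ... comm <name>: ...`. The sample lines are illustrative and are not taken from an affected node, and `find_ext4_errors` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Matches kernel messages such as:
#   EXT4-fs error (device sdd): <function>:<line>: inode #2: comm java: ...
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\):.*?\bcomm (?P<comm>[^:]+):"
)


def find_ext4_errors(log_text: str) -> List[Dict[str, str]]:
    """Return the device and process name for each EXT4-fs error line."""
    hits = []
    for line in log_text.splitlines():
        match = EXT4_ERROR.search(line)
        if match:
            hits.append({
                "device": match.group("device"),
                "comm": match.group("comm").strip(),
            })
    return hits


# Illustrative sample only; real messages will differ in detail.
SAMPLE_LOG = """\
[12345.678901] EXT4-fs error (device sdd): ext4_find_entry:1455: inode #2: comm java: reading directory lblock 0
[12345.679100] some unrelated kernel message
"""

for hit in find_ext4_errors(SAMPLE_LOG):
    print(f"ext4 error on /dev/{hit['device']} triggered by process {hit['comm']}")
```

Matching the reported device against the volume backing the PVC helps confirm that the failure originates from the persistent volume rather than from the node's own disks.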
Because the node becomes completely inaccessible via SSH or standard cluster commands, a hard restart is required.