"failed to mount" as ext4 "already contains unknown data" on cluster workloads in TKGI/PKS clusters

Article ID: 298594


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • Workloads running on TKGI/PKS clusters may become stuck in ContainerCreating or Init state
  • PersistentVolumes used by the pod workloads are successfully attached to the node, but the subsequent mount fails
  • When describing the pod, or when viewing csi-node-service.stderr.log in /var/vcap/sys/logs/csi-node-service on the affected node, the following error is seen:

    MountVolume.MountDevice failed for volume "......" failed to mount as ext4: it already contains unknown data, probably partitions
    Mount error failed: exit status 32 mount: wrong fs type, bad option, bad superblock on /dev/sdd, missing codepage or helper program
  • After SSHing to the node on which the failing pod is scheduled and running the lsblk command, the affected device (/dev/sdd in this example; the disk letter may differ) appears with no mount point, for example (a consolidated command sketch follows the output):

    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
    sda      8:0    0    5G  0 disk
    └─sda1   8:1    0    5G  0 part /var/vcap/data        ---------> This directory will contain different folders/files depending on node customizations
                                    /home
                                    /
    sdb      8:16   0   50G  0 disk
    └─sdb1   8:17   0   50G  0 part /var/vcap/data/        ---------> This directory will contain different folders/files depending on node customizations
                                    /var/tmp
                                    /tmp
                                    /opt
                                    /var/opt
                                    /var/log
    sdc      8:32   0  100G  0 disk
    └─sdc1   8:33   0  100G  0 part 
    sdd      8:48   0  30G  0 disk
    └─sdd1   8:49   0  30G  0 part                    -----> This should show a mount point under normal conditions.  
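
The symptoms above can be gathered with the commands below. This is a minimal sketch: the pod name and namespace are placeholders, the device letter will vary, and the log path follows the location referenced above.

    # From a workstation with kubectl access to the cluster:
    kubectl get pods -A -o wide                      # find pods stuck in ContainerCreating/Init and their node
    kubectl describe pod <pod-name> -n <namespace>   # the mount error appears in the pod events

    # From an SSH session on the affected worker node (as root):
    tail -n 100 /var/vcap/sys/logs/csi-node-service/csi-node-service.stderr.log
    lsblk                                            # confirm the affected device shows no mount point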



Environment

Product Version: 1.5+

Cause

This error is caused by disk corruption, which can happen when disks are not unmounted gracefully. Potential causes include:

  1. Power Outage.
  2. Unexpected ESXi reboots.
  3. Storage Controller failures.
  4. iSCSI/FiberChannel fabric disconnections.
  5. Network connectivity failures for storage solutions that depend on network availability.

Resolution

The following steps provide a workaround that may fix the issue. Please note, this is not a guarantee that disk corruption can be fixed or that data can be recovered (it depends on how badly the disk is corrupted). While this procedure may or may not fix disk corruption, it is a last step prior to restoring data from backups. A consolidated command sketch follows the numbered steps.

  1. Identify the worker node where the pod (having the mount issue) is running. Run `kubectl get pods -o wide` to identify the worker IP address.
  2. Identify the device (e.g., /dev/sdd) that has the disk issue. Run `kubectl describe pod <pod-name>` and identify the device from the log.
  3. SSH into the worker you identified in step #1.
  4. Once on the worker, run the tests below to confirm that this node has the bad device:

    - Run (as root) `dumpe2fs /dev/sdd` (substituting the actual device if it is not /dev/sdd). If this is the bad device, the command will error out with "Couldn't find valid filesystem superblock".

    - Run `dmesg -H` and confirm that the device reports errors when mount attempts are made.


  5. Once the bad device is identified, run (as root) `e2fsck /dev/sdd` (substituting the actual device) to fix it. Answer 'yes' to all prompts. You can also use the -y option, which defaults to "yes" for all questions asked by the e2fsck command, for example:

         # e2fsck -y /dev/sdd
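
Putting steps 1 through 5 together, the following is a minimal sketch of the on-node portion of the workaround, run as root from the affected worker. It assumes the affected device is /dev/sdd; substitute the device identified in step 2.

    # Confirm the suspect device has a corrupted superblock
    # (expect "Couldn't find valid filesystem superblock" if it is the bad device)
    dumpe2fs /dev/sdd

    # Review kernel messages for errors logged during mount attempts on the device
    dmesg -H

    # Repair the filesystem, answering "yes" to all prompts
    e2fsck -y /dev/sdd

The -y option mirrors step 5 and keeps the repair from stopping at each e2fsck prompt.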

Additional Information

See KB 369366 for more details on this issue.