vIDM partition corruption and dracut emergency shell following failed root (/) filesystem expansion

Article ID: 432979

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

When operating VMware Identity Manager, you experience a cluster disruption and observe the following symptoms:

  • The UI displays the error: Failed to fetch identity providers. Identity Internal Server Error

  • A disk full event occurs on the primary node.

  • The /etc/hosts file becomes unreadable or appears to be nulled out.

  • Attempts to increase the disk size in vCenter and then run resizefs fail with an error.

  • Upon rebooting the appliance (e.g., during a scheduled CSP patch), the system experiences a kernel panic and drops into a dracut emergency shell.

Environment

VMware Identity Manager 3.3.7

Cause

This issue originates from the PostgreSQL high-availability monitoring stack, which writes rotated log files in 51MB increments (/var/log/pgService/auto-recovery.log.*). These files are retained for too long and gradually fill the fixed-size 12GB / partition (sda4).
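To gauge how much of the root partition these logs occupy, something like the following could be run on the affected node. It is a minimal sketch rather than a supported tool: log_usage_mib is a hypothetical helper name, and the snippet assumes GNU find (for -printf) and awk are available on the appliance.

```shell
# Hypothetical helper: total size, in whole MiB, of the files in directory $1
# whose names match the pattern $2. Prints 0 if nothing matches.
log_usage_mib() {
    find "$1" -maxdepth 1 -type f -name "$2" -printf '%s\n' 2>/dev/null |
        awk '{ total += $1 } END { printf "%d\n", total / 1048576 }'
}

# Space held by the auto-recovery logs named in this article.
log_usage_mib /var/log/pgService 'auto-recovery.log.*'

# Overall usage of the root filesystem, for comparison.
df -h /
```

If the first number is close to the used space that df reports for /, the rotated logs are the likely cause of the disk full event.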

The subsequent cluster failure and bootloader mismatch are caused by manual intervention attempts:

  1. Filesystem Unmapping: A manual fdisk operation executed on the guest OS to address the space exhaustion writes incorrect (identical) start and end blocks to the partition table. This collapses the sda4 logical boundary and unmaps the ext4 filesystem. Consequently, core networking files like /etc/hosts cannot be read, which breaks Pgpool-II quorum and application routing.

  2. Bootloader Desynchronization: Reconstructing the partition table generates a new Partition UUID (PARTUUID). During a subsequent reboot, the GRUB bootloader passes the old, invalid PARTUUID to the kernel, resulting in a kernel panic and a dracut emergency shell.
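The effect of identical start and end blocks can be illustrated with a little sector arithmetic. The sketch below uses the sector values given in the procedure for a default installation and assumes 512-byte logical sectors (confirm with fdisk -l on the appliance). partition_bytes is a hypothetical helper introduced only for this illustration.

```shell
# Hypothetical helper: size in bytes of a partition that spans sectors
# $1 (first) through $2 (last), inclusive. $3 is the sector size and
# defaults to 512 bytes, which is an assumption about this disk.
partition_bytes() {
    echo $(( ($2 - $1 + 1) * ${3:-512} ))
}

# Sector range listed in the procedure for the default / partition.
expected=$(partition_bytes 21239808 46405631)

# The same start sector used as both start and end, as in the failed fdisk run.
collapsed=$(partition_bytes 21239808 21239808)

echo "expected:  $expected bytes ($(( expected / 1073741824 )) GiB)"
echo "collapsed: $collapsed bytes"
# expected:  12884901888 bytes (12 GiB)
# collapsed: 512 bytes
```

Under these assumptions the correct range is exactly 12 GiB, while the collapsed entry covers a single sector, which is why the partition appears as roughly 0KB and the ext4 filesystem can no longer be mapped.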

Resolution

To restore the partition mapping and correct the bootloader configuration, perform the following steps:

Prerequisites

  • You have access to the vSphere Web Console for the impacted appliance with Remote Console / VMRC.

Procedure

  1. Recreate the sda4 partition with fdisk (run against /dev/sda), using the exact sectors at which the partition originally started and ended.

    Note: This immediately restores access to the ext4 filesystem and the intact /etc/hosts file, allowing the Pgpool-II cluster to recover on its own.
    1. Delete the malformed 0KB partition (if you created it previously):

      • Command: d
      • Partition number: 4
    2. Recreate partition 4 with the precise sector locations for a default installation (12GB for the / partition, device 8:4):

      • Command: n
      • Select Type: p (Primary)
      • Partition Number: 4
      • First Sector: 21239808
      • Last Sector: 46405631
    3. CRITICAL: If prompted to remove the ext4 signature, answer N.
    4. Write the corrected partition table to disk and exit fdisk:

      • Command: w

  2. Reclaim space on the restored /dev/sda4 root partition:

    1. Isolate files larger than 15M:

      find / -xdev -type f -size +15M -exec ls -lh {} \;
    2. If numerous *.backup files exist for /var/log/pgService/auto-recovery.log.*, remove them with the following command:

      rm -f /var/log/pgService/auto-recovery.log.1-*
    3. The root partition should now have enough free space. If it does not, the /opt/vmware/opensearch/logs/gc.log.## files are also safe to delete.

  3. Prevent kernel panics by updating the PARTUUID in the GRUB configuration to match the new partition:

    1. Create a backup of the /boot/grub2/grub.cfg file:

      cp -p /boot/grub2/grub.cfg /tmp
    2. Extract the newly generated PARTUUID from the repaired partition:

      NEW_PARTUUID=$(blkid -s PARTUUID -o value /dev/sda4)
    3. Dynamically find the old PARTUUID in the GRUB config and replace it with the new one:

      sed -i "s/set rootpartition=PARTUUID=[a-zA-Z0-9-]\+/set rootpartition=PARTUUID=${NEW_PARTUUID}/g" /boot/grub2/grub.cfg
    4. Verify the configuration file now reflects the correct, new UUID:

      grep "rootpartition=PARTUUID=" /boot/grub2/grub.cfg
    5. Rebuild the initial ramdisk to ensure the boot environment recognizes the block changes:

      dracut --force
  4. Restart the appliance and confirm that it boots fully without dropping into the dracut emergency shell.
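Because the in-place sed edit in step 3 rewrites the boot configuration, it may be worth previewing the substitution before applying it. The following is a sketch under stated assumptions, not part of the documented procedure: preview_partuuid_swap is a hypothetical helper, and the expression uses \{1,\}, the POSIX spelling of the \+ repetition used in step 3.

```shell
# Hypothetical helper: show, as a diff, what replacing the rootpartition
# PARTUUID in GRUB config $1 with the value $2 would change.
# The file itself is not modified.
preview_partuuid_swap() {
    sed "s/set rootpartition=PARTUUID=[a-zA-Z0-9-]\{1,\}/set rootpartition=PARTUUID=$2/g" "$1" |
        diff "$1" -
}

# Only runs where the appliance paths from this article actually exist.
if [ -r /boot/grub2/grub.cfg ] && [ -b /dev/sda4 ]; then
    preview_partuuid_swap /boot/grub2/grub.cfg \
        "$(blkid -s PARTUUID -o value /dev/sda4)"
fi
```

An empty diff would suggest that the configuration already carries the current PARTUUID. A diff that touches lines other than the rootpartition assignment would be a reason to stop and inspect the file before running the in-place edit.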