"A problem with one or more vFAT bootbank partitions was detected", Corrupted vFAT partitions causing upgrade or pre-check failures

Products

VMware vSphere ESXi
VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Issue/Introduction

This KB provides guidance on recovering from issues with one or more vFAT partitions during an ESXi upgrade.

During an ESXi upgrade, the vCenter Server UI reports the following pre-check message:

Hardware precheck of the profile <ProfileName> failed with errors: <VFAT_CORRUPTION ERROR: A problem with one or more vFAT bootbank partitions was detected. Please refer to KB 91136 and run dosfsck on bootbank partitions.

The lifecycle log on the ESXi host (/var/run/log/lifecycle.log) shows entries similar to the following:

yyyy-mm-ddThh:mm:ssZ In(14) lifecycle[pid]: runcommand:199 runcommand called with: args = ['/bin/dosfsck', '-V', '-n', '/dev/disks/naa.<id>:<partition>'], outfile = None, returnoutput = True, timeout = 10.
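
To see which partitions the pre-check probed, the device paths can be pulled out of log entries like the one above. This is a minimal sketch, assuming the log line format shown in this article; the function name precheck_devices is illustrative and not a VMware tool.

```shell
# Hypothetical helper: list the device paths that the pre-check passed to dosfsck.
# Assumes lifecycle.log lines look like the sample entry in this article.
precheck_devices() {
    log_file="${1:-/var/run/log/lifecycle.log}"
    # Keep only the dosfsck invocations, then extract the quoted /dev/disks/... argument.
    grep "'/bin/dosfsck'" "$log_file" \
        | sed -n "s|.*'\(/dev/disks/[^']*\)'.*|\1|p" \
        | sort -u
}
```

Called without an argument, the function reads /var/run/log/lifecycle.log; pass another path to inspect a copy of the log.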

When upgrading to ESXi 7.0 Update 3l or ESXi 8.0 Update 1 or later, the operation fails with a purple diagnostic screen and an error such as:

An error occurred while backing up VFAT partition files before re-partitioning: Failed to calculate size for temporary Ramdisk: <error>.
An error occurred while backing up VFAT partition files before re-partitioning: Failed to copy files to Ramdisk: <error>.

Corrupted vFAT partitions may cause upgrades from ESXi 6.5 and 6.7 to versions up to ESXi 7.0 Update 3k or ESXi 8.0c to exhibit the following symptoms:

- The vmkwarning.log file contains entries indicating that the ramdisk (root) is full.
- The host unexpectedly reverts to ESXi 6.5 or 6.7 following an upgrade to ESXi 7.0.x or ESXi 8.0.
- After a host upgrade, /bootbank and /altbootbank are not linked to the boot banks, and OSDATA is missing.
- The jumpstart-native-stdout.log file contains a backtrace similar to the following:

YYYY-MM-DDTHH:MM:SS SystemStorage t10.ATA_____<ID>___________________________________<ID>: upgrading partition layout...
Traceback (most recent call last):
  File "/bin/initSystemStorage", line 1354, in <module>
    storage.setupSystemPartitions()
  File "/bin/initSystemStorage", line 659, in setupSystemPartitions
    self.upgradePartitionTable(bootDisk)
  File "/bin/initSystemStorage", line 413, in upgradePartitionTable
    upgradeBackup()
  File "/lib64/python3.8/site-packages/systemStorage/upgradeUtils.py", line 307, in upgradeBackup
  File "/lib64/python3.8/site-packages/systemStorage/upgradeUtils.py", line 201, in calculateDirMiBSize
  File "/lib64/python3.8/genericpath.py", line 50, in getsize
FileNotFoundError: [Errno 2] No such file or directory: '/vmfs/volumes/########-########-####-########/log/\x03\x05\x03\x01yd\x1fy.\######\#####'
YYYY-MM-DDTHH:MM:SS.523Z Plugin system-storage failed Invoking method start (rc=1)

After the upgrade is complete, the host may fail with a PSOD (purple diagnostic screen).

Environment

8.0.x
7.0.x

Cause

Dirty bit: ESXi does not use the vFAT dirty bit, so when it is set, it was set by another operating system. It indicates that the partition was mounted without a corresponding unmount operation.

The cause of the other vFAT failures is currently under investigation.

Resolution

To resolve the issue, follow the steps below to repair the faulty vFAT partitions using dosfsck.

Identify all vFAT partitions:

Each ESXi 6.5 or ESXi 6.7 host has 4 or 5 vFAT partitions: 2 bootbanks, scratch, and locker.

# esxcli storage filesystem list

Mount Point                                        Volume Name  UUID                                 Mounted  Type            Size          Free
-------------------------------------------------  -----------  -----------------------------------  -------  ------  ------------  ------------
/vmfs/volumes/########-########-####-############  datastore1   ########-########-####-############     true  VMFS-6  129385889792  127599116288
/vmfs/volumes/########-########-####-############               ########-########-####-############     true  vfat       299712512     108437504
/vmfs/volumes/########-########-####-############               ########-########-####-############     true  vfat       261853184      88797184
/vmfs/volumes/########-########-####-############               ########-########-####-############     true  vfat      4293591040    4079943680
/vmfs/volumes/########-########-####-############               ########-########-####-############     true  vfat       261853184     261849088
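
To avoid picking the vFAT rows out of the listing by eye, the output can be filtered. The following is a minimal sketch, assuming the column layout shown above (mount point in the first column, file system type somewhere later on the line); vfat_mount_points is an illustrative name, and the case-insensitive match is an assumption in case a release prints the type in upper case.

```shell
# Hypothetical helper: print the mount point of every vFAT volume.
# Reads `esxcli storage filesystem list` output on stdin.
vfat_mount_points() {
    awk '{
        # The Volume Name column may be empty, so scan all fields for the type.
        for (i = 2; i <= NF; i++)
            if (tolower($i) == "vfat") { print $1; break }
    }'
}

# Usage on the host (not run here):
#   esxcli storage filesystem list | vfat_mount_points
```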

Use each mount point to identify the disk and partition:

# vmkfstools -P /vmfs/volumes/########-#######-####-############

vfat-0.04 (Raw Major Version: 0) file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 299712512 (36586 file blocks * 8192), 108437504 (13237 blocks) avail, max supported file size 0
Disk Block Size: 512/0/0
UUID: ########-########-####-############
Partitions spanned (on "disks"):
mpx.vmhba0:C0:T0:L0:8
Is Native Snapshot Capable: NO

The disk and partition id is mpx.vmhba0:C0:T0:L0:8.

Note: The "mpx" IDs are just examples; on your host, you might see "naa.***" IDs instead.

Repeat this step for all vFAT partitions. You will end up with a list like this:

mpx.vmhba0:C0:T0:L0:2 (scratch)
mpx.vmhba0:C0:T0:L0:5 (bootbank 1)
mpx.vmhba0:C0:T0:L0:6 (bootbank 2)
mpx.vmhba0:C0:T0:L0:8 (locker)
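
Building this list can be partly automated by reading the line that follows "Partitions spanned" in the vmkfstools -P output. This is a minimal sketch, assuming the output format shown above with a single spanned partition per volume; spanned_partition is an illustrative name.

```shell
# Hypothetical helper: print the "disk:partition" ID that follows the
# 'Partitions spanned' line in `vmkfstools -P` output read from stdin.
spanned_partition() {
    awk '
        found { print $1; exit }             # first line after the marker
        /^Partitions spanned/ { found = 1 }
    '
}

# Usage on the host (not run here), once per vFAT mount point:
#   vmkfstools -P /vmfs/volumes/<UUID> | spanned_partition
```

The role of each partition (scratch, bootbank, locker) still has to be matched manually, since the command output does not name it.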

Enter maintenance mode and stop all daemons

Note: This step is only required for upgrades from 6.5 and 6.7.

To avoid any interference between the following steps and any daemon writing to the disk, it is required to check for open file handles and close them.

Stop crond, which periodically schedules backup.sh, updating the active bootbank
```
# kill $(cat /var/run/crond.pid)
```
Stop vmsyslogd, which has open file handles on /scratch (log files)
```
# /usr/lib/vmware/vmsyslog/bin/shutdown.sh
```

Check for further daemons having open file handles on the scratch partition and stop these daemons

# lsof |grep scratch
1001391762  vmfstracegd           FILE                        4   /scratch/vmfstraces/vmfsGlobalTrace.trace.0.gz

# /etc/init.d/vmfstraced stop
watchdog-vmfstracegd: Terminating watchdog process with PID 1001391748
vmfstracegd stopped
# lsof |grep scratch

-- note: ########-########-####-########### is the UUID of the scratch partition
# lsof |grep ########-########-####-############
1001391489  rhttpproxy            FILE                       18   /vmfs/volumes/########-########-####-###########/log/rhttpproxy-##########-################-lo0-1.pcap
1001391489  rhttpproxy            FILE                       19   /vmfs/volumes/########-########-####-###########/log/rhttpproxy-##########-################-vmk0-1.pcap
# /etc/init.d/rhttpproxy stop

# lsof | grep var/run/log
2101088    python               FILE                       5  /var/run/log/vsandevicemonitord.log

# /etc/init.d/vsandevicemonitord stop
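
The lsof checks above can be condensed into one filter that names the processes still holding files under a given path. This is a minimal sketch, assuming the ESXi lsof column layout shown above (process name in the second column, file path in the last column, no spaces in paths); open_file_holders is an illustrative name.

```shell
# Hypothetical helper: list the names of processes that hold open files
# under a given path prefix. Reads ESXi `lsof` output on stdin.
open_file_holders() {
    prefix="$1"
    # index(...) == 1 means the path in the last column starts with the prefix.
    awk -v p="$prefix" 'index($NF, p) == 1 { print $2 }' | sort -u
}

# Usage on the host (not run here):
#   lsof | open_file_holders /scratch
#   lsof | open_file_holders /vmfs/volumes/<scratch UUID>
```

The process name does not always match the init script name (in the example above, a python process belongs to vsandevicemonitord), so check the open file path before deciding which daemon to stop.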

Perform any of the solutions below to recover the corrupted vFAT partitions.

===========================================================================

Solution 1 (Preferred solution) - Use dosfsck as a first solution

For all identified vFAT partitions, check the file system integrity and repair the disk as needed.

Check the health of the vFAT partition
```
# dosfsck -Vv /dev/disks/<disk and partition id>
```
The disk and partition ID was derived in the previous step.

For instance, the output for a healthy partition:

# dosfsck -Vv /dev/disks/mpx.vmhba0\:C0\:T0\:L0:2
dosfsck 2.11 (12 Mar 2005)
dosfsck 2.11, 12 Mar 2005, FAT32, LFN
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "MSDOS5.0"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
     65536 bytes per cluster
         2 reserved sectors
First FAT starts at byte 1024 (sector 2)
         2 FATs, 16 bit entries
    131072 bytes per FAT (= 256 sectors)
Root directory starts at byte 263168 (sector 514)
       512 root directory entries
Data area starts at byte 279552 (sector 546)
     65515 data clusters (4293591040 bytes)
32 sectors/track, 64 heads
         0 hidden sectors
   8386560 sectors total
Starting check/repair pass.
Checking for unused clusters.
Starting verification pass.
Checking for unused clusters.
/dev/disks/mpx.vmhba0:C0:T0:L0:2: 222 files, 3279/65515 clusters

- If the command reports any failures or hangs, then try to repair the partition
  
  # dosfsck -a -w /dev/disks/<disk and partition id>

- If the command reports any orphaned files, delete the file(s). Then, write the changes.

- Repeat step 1. If dosfsck still reports failures, proceed with the next step to re-create the partition
- After you have checked all ESXi partitions, reboot the ESXi host (this will restart all previously stopped daemons)
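
The check can be looped over the whole partition list before deciding which partitions need repair. The following is a minimal sketch under stated assumptions: it uses the read-only flags -V -n seen in the lifecycle.log pre-check entry, it treats a nonzero exit status as "needs repair" (upstream dosfstools documents this, but it has not been confirmed here for the ESXi build), and it does not handle a hanging dosfsck. The function name and the DOSFSCK override are illustrative.

```shell
# Hypothetical helper: run a read-only dosfsck on each partition ID passed as
# an argument and report which ones need attention.
# Assumption: a nonzero exit status means dosfsck found a problem.
# DOSFSCK can be set to another command name, for example for a dry run.
check_vfat_partitions() {
    failed=0
    for part in "$@"; do
        if "${DOSFSCK:-dosfsck}" -V -n "/dev/disks/$part" >/dev/null 2>&1; then
            echo "OK $part"
        else
            echo "REPAIR $part"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on the host (not run here):
#   check_vfat_partitions mpx.vmhba0:C0:T0:L0:2 mpx.vmhba0:C0:T0:L0:5
```

For any partition reported as REPAIR, re-run dosfsck -Vv manually to see the details before repairing.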

Solution 2: Use ESXi ISO to repair the boot partition

If the option above fails to repair the disk, repair it using an ESXi ISO.

- Download the ESXi iso (same build as installed on host) from Broadcom Portal. Refer to VMware vSphere downloads, OEM custom images, patches and addons in the Broadcom Support Portal
- Log in to vCenter Server and put the host in maintenance mode
- Mount the ISO to the host via console (iDRAC,iLO etc)
- Open the remote console for the host and proceed to reset the server
- While the "Loading VMware ESXi" screen is displayed, press Shift+O on your keyboard
- Clear the existing values and enter the value "cdromBoot"

- After the boot from the ISO is complete, run the following command

# dosfsck -v -a /dev/disks/<disk and partition id>

- Unmount the ISO and proceed to reboot the host

If Partitions spanned (on "disks") is of the format: t10.NVMe<Vendor>_____________________________a5##############:5

[root@esxi:/] dosfsck -Vv /dev/disks/t10.NVMe____<Vendor>____________________________a5##############:5

CP850//TRANSLIT: Invalid argument
CP850: Invalid argument
fsck.fat 4.1+git (2017-01-24)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "MSDOS5.0"
Media byte 0xf8 (hard disk)
512 bytes per logical sector
65536 bytes per cluster
2 reserved sectors
First FAT starts at byte 1024 (sector 2)
2 FATs, 16 bit entries
131072 bytes per FAT (= 256 sectors)
Root directory starts at byte 263168 (sector 514)
512 root directory entries
Data area starts at byte 279552 (sector 546)
65515 data clusters (4293591040 bytes)
32 sectors/track, 64 heads
0 hidden sectors
8386560 sectors total
Starting check/repair pass.
Orphaned long file name part "mfg_net"

For all identified vFAT partitions, check the file system integrity and repair the disk as needed

1. Run the following command to check whether the vFAT partition is corrupted: # dosfsck -Vv /dev/disks/<disk and partition id> (Note: The disk and partition ID was derived in the previous step)
2. Select: Delete (Note: This option only repairs the corrupted vFAT partition)
3. Select: Keep the Changes
4. Then, proceed with the ESXi upgrade.

===========================================================================

Re-create a corrupted Scratch (vFAT) partition

Back up all files. In this example, we will back up /scratch and keep a copy on datastore1

# cp -r /scratch/ /vmfs/volumes/datastore1/scratchBackup

(At this point it is very likely that the cp command returns a failure. Note that the file system is corrupted and one or more files or file names will be invalid. In that case, copy folder by folder or file by file and leave the corrupted files on the disk. After re-formatting, those files will be lost!)

(Re-)Format the corrupted partition

# vmkfstools -C vfat /dev/disks/mpx.vmhba0:C0:T0:L0:2
create fs deviceName:'/dev/disks/mpx.vmhba0:C0:T0:L0:2', fsShortName:'vfat', fsName:'(null)'
deviceFullPath:/dev/disks/mpx.vmhba0:C0:T0:L0:2 deviceFile:mpx.vmhba0:C0:T0:L0:2
Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
Creating vfat file system on "mpx.vmhba0:C0:T0:L0:2" with blockSize 1048576 and volume label "none".
Successfully created new volume: 640748a7-########-####-########46fa

(Note: If the command returns a busy error, this indicates that a file on this disk is still open. See above steps to identify the open handles.)

Restore the content

Get the volume ID from the previous command (e.g., 640748a7-########-####-##########fa)

# cp -r /vmfs/volumes/datastore1/scratchBackup/* /vmfs/volumes/640748a7-########-####-##########fa/

Reboot the ESXi host after you have checked and repaired all vFAT partitions.

===========================================================================

If Partitions spanned (on "disks") is of the format: t10.ATA______CISCO_VD___________________________________

If the output of Step 1 prompts for an action, select option 1 (write changes) to repair the partition.

Once done, re-run the same command to check the file system integrity:

# dosfsck -Vv /dev/disks/<disk and partition id>

Additional Information

If the following message is displayed after running the command, choose 'No action'.

# dosfsck -Vv /dev/disks/<disk and partition id>

0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.

1) Remove dirty bit

2) No action

[12?1]

The partition will be repaired by the command below:

# dosfsck -a -w /dev/disks/<disk and partition id>

Note: This issue has been permanently fixed in ESXi 8.0u3b (see techdocs.broadcom.com).