Symptoms:
1. Configuration changes made on the ESXi host are lost after the server reboots.
2. The bootbank and altbootbank symlinks point to a folder under /tmp instead of the partition where ESXi is installed. This can be confirmed by running ls -l on the root ("/") folder:
lrwxrwxrwx 1 root root bootbank -> /tmp/_bootbank3tljsb_d
[root@localhost:~] esxcli storage core adapter list <====== lists only the adapters that are not in passthru mode and are claimed by the kernel
Output:
HBA Name  Driver    Link State  UID          Capabilities  Description
--------  --------  ----------  -----------  ------------  -----------
vmhba0    vmw_ahci  link-n/a    sata.vmhba0                (0000:00:17.0) Intel Corporation Cannon Lake PCH-H AHCI Controller
[root@localhost:~] lspci | grep -i vmhba <======== lists all storage adapters discovered at the PCI layer
Output:
0000:00:17.0 SATA controller: Intel Corporation Cannon Lake PCH-H AHCI Controller [vmhba0]
0000:02:00.0 RAID bus controller: Broadcom PERC H330 Adapter [vmhba1]
[root@localhost:~] esxcli hardware pci pcipassthru list <========= shows that the storage adapter is marked as a passthrough device
Output:
Device ID Enabled
------------ -------
0000:00:08.0 false
0000:02:00.0 true
0000:04:00.0 false
0000:05:00.0 false
0000:05:00.1 false
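The symptom checks above can be sketched as two small shell helpers. This is a minimal sketch, not a supported tool: the function names `points_to_tmp` and `unclaimed_adapters` are hypothetical, and the parsing assumes the output formats shown in the examples above (a `[vmhbaN]` suffix on each lspci line and the HBA name in the first column of the adapter list).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of ESXi.

# Succeed when a bootbank symlink target lives under /tmp.
# $1: symlink target, for example /tmp/_bootbank3tljsb_d
points_to_tmp() {
    case "$1" in
        /tmp/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print vmhba names that lspci reports but that the kernel has not claimed.
# $1: output of `lspci | grep -i vmhba`
# $2: output of `esxcli storage core adapter list`
unclaimed_adapters() {
    # HBA names claimed by the kernel (first column of the adapter list).
    claimed=$(printf '%s\n' "$2" | awk '$1 ~ /^vmhba[0-9]+$/ { print $1 }')
    # HBA names seen at the PCI layer (the [vmhbaN] suffix of each line).
    printf '%s\n' "$1" |
        sed -n 's/.*\[\(vmhba[0-9][0-9]*\)\]$/\1/p' |
        while read -r hba; do
            printf '%s\n' "$claimed" | grep -qx "$hba" || printf '%s\n' "$hba"
        done
}
```

With the sample outputs above, `unclaimed_adapters` prints vmhba1, the PERC H330 controller that is visible at the PCI layer but missing from the kernel's adapter list. On a host, the symlink target could be obtained with something like `readlink /bootbank`, assuming `readlink` is available in the ESXi shell.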
Cause:
The ESXi host boots from a storage controller or device that was marked as passthrough after ESXi was installed on it.
When ESXi is installed on a device, it creates VFAT partitions on that device, and the bootloader uses the device for every subsequent boot.
In some scenarios an administrator may accidentally mark the boot device or its controller as a passthrough device, which prevents ESXi from discovering and claiming the boot device.
Once the system has booted, all configuration changes are written to /tmp because the boot device is unavailable, so no change persists across a reboot.
Resolution:
To resolve this issue, follow these steps:
1. Log in to the host over SSH and open its web GUI (Host Client), then place the host in Maintenance Mode, because a reboot will be required.
2. Find the PCI ID of the storage device or adapter that has passthru enabled by running: esxcli hardware pci pcipassthru list
3. The output should be similar to the following:
Device ID Enabled
------------ -------
0000:00:08.0 false
0000:02:00.0 true
0000:04:00.0 false
0000:05:00.0 false
0000:05:00.1 false
Here the adapter with device ID 0000:02:00.0 has passthru enabled. To see its details, run: lspci | grep 0000:02:00.0
This lists the details of the card/adapter/device, for example: 0000:02:00.0 RAID bus controller: Broadcom PERC H330 Adapter [vmhba1]
4. Disable passthru for the adapter/device using the Host UI: go to “Configure” -> “Hardware” -> “PCI Devices” and click “Toggle Passthrough” under the passthrough-enabled devices section.
Note: Select only the device identified in step 3 as the one for which passthru needs to be disabled.
5. A device rescan will now show the device as available, but the issue is not resolved yet. Complete the following sub-steps:
a. Rescan the storage adapters using the Host UI or the vCenter UI for the affected host. The same can be done with the command: esxcli storage core adapter rescan --all
b. Check whether the storage adapter is now listed among the adapters claimed by the kernel: esxcli storage core adapter list
If the adapter is visible, proceed with the next steps. Otherwise, recheck the device ID and disable passthrough for the correct device.
6. Set passthrough to disabled for the adapter/device by running: esxcli hardware pci pcipassthru set -d 0000:02:00.0 -a -e=0
Note: Replace "0000:02:00.0" with the device ID obtained in step 3 for the affected storage device/adapter.
7. Check the passthru status. The storage device/adapter should now be shown as passthru disabled: esxcli hardware pci pcipassthru list
Device ID Enabled
------------ -------
0000:00:08.0 false
0000:02:00.0 false
0000:04:00.0 false
0000:05:00.0 false
0000:05:00.1 false
8. Rescan the storage adapters using the command: esxcli storage core adapter rescan --all
9. Sync the changes to the bootbank using: vim-cmd hostsvc/firmware/sync_config
10. Reboot the ESXi host.
Upon reboot, ESXi loads from the boot device because it is no longer in passthrough mode, and the bootbank symlinks point to their respective mount points.
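The CLI portion of the steps above can be sketched as a small shell helper. This is a minimal sketch, not a supported tool: the function names `passthru_enabled_ids` and `print_fix_commands` are hypothetical, and the second helper only prints the commands from steps 6 to 9 for review instead of running them. It assumes the two-column output format shown in step 3.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of ESXi.

# Print the device IDs whose "Enabled" column is "true".
# $1: output of `esxcli hardware pci pcipassthru list`
passthru_enabled_ids() {
    printf '%s\n' "$1" | awk '$2 == "true" { print $1 }'
}

# Print (do not run) the commands from steps 6 to 9 for one device ID.
# $1: PCI device ID identified in step 3, for example 0000:02:00.0
print_fix_commands() {
    printf 'esxcli hardware pci pcipassthru set -d %s -a -e=0\n' "$1"
    printf 'esxcli hardware pci pcipassthru list\n'
    printf 'esxcli storage core adapter rescan --all\n'
    printf 'vim-cmd hostsvc/firmware/sync_config\n'
}
```

On a host, `passthru_enabled_ids "$(esxcli hardware pci pcipassthru list)"` would list the candidate devices. Confirm with lspci that an ID belongs to the boot controller before changing it, because other devices may be passed through deliberately.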
Log snippets similar to the following are observed in boot.log:
cpu0:1048576)PCIE: 608: 0000:02:00.0: PCIe v2 PCI Express Endpoint
cpu0:1048576)PCI: 789: ARI-capable device 0000:02:00.0 under non-ARI-capable bridge 0000:00:01.1
cpu0:1048576)PCI: 352: Found physical slot 0x1 (peer 0x0) from SMBIOS for 0000:02:00.0
cpu0:1048576)PCI: 1001: 0000:02:00.0: probing 1000:005f 1028:1f44 0104 0002
cpu7:1048961)PCIE: 194: 0000:02:00.0: Bypassing non-ACS capable device in hierarchy
cpu7:1048961)PCIPassthru: PCIPassthruAttachDev:233: Attached to device 0000:02:00.0
cpu5:1048855)Activating Jumpstart plugin system-storage.
ESC[31;1m cpu5:1049154)ALERT: Failed to find boot device after 120 secondsESC[0m
ESC[31;1m cpu5:1049154)ALERT: No persistent storage available for system logs and data. ESX is operating with limited system storage space, logs and system data will be lost on reboot.ESC[0m
cpu5:1048855)Jumpstart plugin system-storage activated.
cpu2:1048857)Activating Jumpstart plugin backup-configurations.
cpu0:1048845)Activating Jumpstart plugin tag-boot-bank.
cpu7:1048845)Jumpstart plugin tag-boot-bank activation failed: tag-boot-bank->start() failed: exited with code 1
Note: The PCI ID will differ depending on the bus/device/function (B/D/F) ID of the PCI slot/card used by the adapter/device.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
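The log signatures above can be searched for with a short shell sketch. The function names `boot_device_missing` and `passthru_attached_ids` are hypothetical, both read log text on standard input, and the patterns are taken only from the example lines shown here, so they may need adjusting for other ESXi builds.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative. Both read log text on stdin.

# Succeed when the log contains the "boot device not found" alert.
boot_device_missing() {
    grep -q 'Failed to find boot device'
}

# Print each PCI ID that the passthrough layer attached during boot.
passthru_attached_ids() {
    sed -n 's/.*PCIPassthruAttachDev:[0-9]*: Attached to device \([0-9a-f:.]*\).*/\1/p'
}
```

For example, `passthru_attached_ids < boot.log` prints 0000:02:00.0 for the excerpt above. The location of boot.log depends on the environment and is left as a relative path here.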
If the bootbank points to /tmp because of slow storage discovery, follow KB: https://knowledge.broadcom.com/external/article/318029