After Interrupted Reboot High Log Congestion Bandwidth Or Log Congestion Experienced

Article ID: 315504

Products

VMware vSAN

Issue/Introduction


Symptoms:
  • A host reboot is interrupted during the vSAN disk group recovery phase or during a disk group mount operation.
  • After the subsequent reboot, extreme latency develops, and investigation shows high Log Congestion Bandwidth or Log Congestion.

Impact/Risks:
  • This can cause extreme latency for virtual machines that have a component residing on the impacted host.
  • The latency is caused by high Log Congestion Bandwidth (activates before Log Congestion, when LLOG is between 16 GB and 22 GB) and Log Congestion (activates once LLOG is above 22 GB).
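
The two thresholds above can be illustrated with a minimal Python sketch. It is not a vSAN tool or API: the function name, the constant names, and the returned labels are hypothetical, and the LLOG size is assumed to have been obtained separately. The article does not say how the exact 16 GB and 22 GB boundary values are handled, so treating them as part of the Log Congestion Bandwidth range is an assumption.

```python
# Thresholds come from the article; every name below is hypothetical.
LOG_CONGESTION_BANDWIDTH_MIN_GB = 16  # Log Congestion Bandwidth range starts here
LOG_CONGESTION_MIN_GB = 22            # Log Congestion activates above this value


def classify_llog(llog_gb: float) -> str:
    """Return which congestion type the article associates with an LLOG size in GB."""
    if llog_gb > LOG_CONGESTION_MIN_GB:
        return "log_congestion"
    # Assumption: the boundary values 16 and 22 count as Log Congestion Bandwidth.
    if llog_gb >= LOG_CONGESTION_BANDWIDTH_MIN_GB:
        return "log_congestion_bandwidth"
    return "none"


if __name__ == "__main__":
    for size_gb in (10, 18, 23):
        print(size_gb, classify_llog(size_gb))
```

A sketch like this is only useful for reasoning about which congestion type to expect for a given LLOG size; it does not diagnose or repair the leak.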

Environment

VMware vSAN builds prior to 7.0 U3i, and 8.0 builds prior to 8.0 U1

Cause

An LLOG leak occurs when all of the following happen: a write operation (including unmap) and its commitTransaction are spread across multiple segments; during disk group recovery (at boot or remount) the write operation is freed but the commitTransaction is not; and the recovery process is then interrupted (reboot, unmount, etc.).

The relog process is expected to relog the LSOM operation entry. However, the component that the commitTransaction is tied to enters an LSOM_COMP_INVALID_METADATA state because the log is corrupt. The log is corrupt because the incomplete write/unmap operation for the commitTransaction leaves a null component commitEntry and a non-null CF commitEntry.

Resolution


This issue is fixed in newer 6.7 versions, 7.0 U3i, and 8.0 U1. It has rarely been seen in customer environments and has been seen once in-house by engineering during testing.

Workaround:
Currently, the only workaround is to destroy and recreate the disk group.