vSAN OSA Cache-tier devices have a set amount of space reserved for special-purpose system usage (LLOG+PLOG). If for any reason this space becomes exhausted, the affected Disk-Group may be unable to process IOs in a timely manner, resulting in performance impact.
VMware vSAN 8.x
In some rare circumstances LLOG traversal may fail to progress due to a problematic reference in LLOG. After some time (typically in the hours-to-days range, depending on workload), this can result in excessively high LLOG usage, which in turn causes Log congestion and/or congestion-related bandwidth throttling on the affected Disk-Group.
In some circumstances, LLOG buildup can be attributed to LLOG processing stalling on a reference to a Capacity-tier device that is not functioning normally or is unavailable.
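If a Capacity-tier device in the affected Disk-Group is suspected to be unhealthy or unavailable, its state as seen by the host can be reviewed with the following (a quick check only; the exact fields reported vary between builds). Devices reported as unhealthy or as no longer present in CMMDS warrant further investigation:

# esxcli vsan debug disk list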
The amount of space currently consumed by LLOG+PLOG can be checked by running the following on any node. Values in the low single-digit GiB range are typically normal (especially under load); values above 10 GiB, or values that do not decrease over a prolonged period of time, may be indicative of an issue:
# while true; do echo "================================================"; date; for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done; for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by LLOG"|awk -F \: '{print $2}');plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by PLOG"|awk -F \: '{print $2}');llogGib=$(echo $llogTotal |awk '{print $1 / 1073741824}');plogGib=$(echo $plogTotal |awk '{print $1 / 1073741824}');allGibTotal=$(expr $llogTotal \+ $plogTotal|awk '{print $1 / 1073741824}');echo $ssd;echo " LLOG consumption: $llogGib";echo " PLOG consumption: $plogGib";echo " Total log consumption: $allGibTotal";done;sleep 30; done;
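For a one-off spot check of a single Disk-Group, rather than the looping command above, the same counters can be read directly, where <Disk Group UUID> is a placeholder for a Group UUID reported by 'localcli vsan storage list':

# vsish -e get /vmkModules/lsom/disks/<Disk Group UUID>/info | grep -E "Congestion|Log space consumed"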
If there are failed disks following a reboot of a node, the remaining vSAN references to these disks should be removed before taking the node out of Maintenance Mode. First validate that all data objects are accessible, then follow the relevant steps in the articles vSAN Disk group reported warning as Unhealthy for the vSAN Capacity Disk (to identify the failed disk) and How to manually manage and configure a vSAN disk group using esxcli commands (to remove it).
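A condensed outline of that workflow is sketched below for reference only; the linked articles remain the authoritative steps, and <VSAN UUID> is a placeholder for the failed Capacity-tier device's vSAN UUID.

Validate that all data objects are accessible:
# esxcli vsan debug object health summary get

Identify the failed device and note its vSAN UUID:
# esxcli vsan storage list

Remove the failed device from its Disk-Group:
# esxcli vsan storage remove -u <VSAN UUID>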
To avoid impact from this issue, upgrade to vSAN 8.0 U3, which includes significant changes to how LSOM handles conditions such as this; remediation is triggered automatically so as to avoid subsequent congestion:
VMware vSAN 8.0 Update 3 Release Notes
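After upgrading, the running version and build can be confirmed on each host, for example with the following; the build corresponding to vSAN 8.0 Update 3 is listed in the release notes above:

# vmware -vl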