In an environment using vSAN Data Protection (vSAN DP), snapshot tasks may fail when vSAN Skyline Health reports a red status.
The KB https://knowledge.broadcom.com/external/article/393764/ explains how to verify current health status from the vCenter UI. However, when investigating historical snapshot failures, the Skyline Health status at that time may no longer be visible in the UI.
This KB provides a method to identify the root cause of past vSAN DP snapshot failures directly from logs.
vSAN Data Protection performs a health check before executing snapshot tasks. If any Skyline Health check is in a red state, the snapshot task will fail.
Follow the steps below to troubleshoot:
Step 1: Identify the failure time window
From vCenter:
Check Tasks & Events
Identify the time when vSAN DP snapshot tasks failed
Step 2: Access snapshot service logs
SSH into the appliance used for snapshot operations:
Relevant log files:
snap-service.log
snap-service.log.X.gz (rotated logs)
Note: snap-service.log is in JSON format and is hard to read directly. The following command writes a more readable, pretty-printed copy:
jq . snap-service.log > /tmp/snap-service.log.txt
Step 3: Locate snapshot execution time
Search for log entries around the snapshot schedule time:
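One way to run this search across both the current and the rotated logs is sketched below. It is not an official command: search_snap_logs is a hypothetical helper name, the log directory is passed in by you, and the search string is assumed to be a timestamp prefix copied exactly as it appears in your own log lines.

```shell
# Sketch: print every line containing a literal string (for example a
# timestamp prefix copied from the logs) from snap-service.log and from
# the rotated snap-service.log.X.gz files in the given directory.
# search_snap_logs is a hypothetical helper name, not a product command.
search_snap_logs() {
    log_dir=$1
    needle=$2
    for f in "$log_dir"/snap-service.log "$log_dir"/snap-service.log.*.gz; do
        # Skip patterns that matched no file.
        [ -f "$f" ] || continue
        case "$f" in
            *.gz) gzip -dc "$f" ;;   # rotated logs are read without extracting them to disk
            *)    cat "$f" ;;
        esac | grep -F -- "$needle" | sed "s|^|$(basename "$f"): |"
    done
}
```

Call it as search_snap_logs <log directory> '<timestamp prefix>'. Each matching line is prefixed with the name of the file it came from, which helps when the failure window spans a log rotation.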
Step 4: Check Skyline Health evaluation in logs
In snap-service.log or the rotated logs (snap-service.log.X.gz; decompress them first if necessary), look for entries similar to:
Example:
Step 5: Identify the failing health check
Focus on entries where "TestHealth": "red"
Identify which Skyline Health test failed
Example:
vSAN HCL DB up-to-date
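The filter on "TestHealth": "red" can be sketched as a small shell helper. This is an illustration only: find_red_health is a hypothetical name, the log directory is supplied by you, and the sketch assumes the "TestHealth" key and its value appear on the same line (with or without a space after the colon), which holds for both the raw and the jq-formatted logs only if that pair is not split across lines.

```shell
# Sketch: list lines reporting a red Skyline Health result in
# snap-service.log and the rotated snap-service.log.X.gz files.
# find_red_health is a hypothetical helper name, not a product command.
find_red_health() {
    log_dir=$1
    # Matches "TestHealth":"red" and "TestHealth": "red".
    pattern='"TestHealth"[[:space:]]*:[[:space:]]*"red"'
    for f in "$log_dir"/snap-service.log "$log_dir"/snap-service.log.*.gz; do
        # Skip patterns that matched no file.
        [ -f "$f" ] || continue
        case "$f" in
            *.gz) gzip -dc "$f" ;;   # read rotated logs without extracting them to disk
            *)    cat "$f" ;;
        esac | grep -E "$pattern" | sed "s|^|$(basename "$f"): |"
    done
}
```

Call it as find_red_health <log directory>. The name of the failing test should appear in or near each matching entry; if you work from the pretty-printed copy, open the file at the matching line to read the surrounding fields.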
Step 6: Correlate with vCenter Skyline Health history
In vCenter:
Navigate to Cluster → Monitor → Skyline Health
In the Health score trend pane, click CUSTOM and specify a time range covering the failure window.
Verify the following:
The same Skyline Health test was red during the snapshot failure window
The test later returned to green, after which snapshots resumed normally
When investigating historical vSAN DP snapshot failures: