Identifying historical vSAN Data Protection snapshot failures using logs

Article ID: 435936

Products

VMware vSAN

Issue/Introduction

In an environment using vSAN Data Protection (vSAN DP), snapshot tasks may fail when vSAN Skyline Health reports a red status.

KB article https://knowledge.broadcom.com/external/article/393764/ explains how to verify the current health status in the vCenter UI. When investigating historical snapshot failures, however, the Skyline Health status from that time may no longer be visible in the UI.

This KB provides a method to identify the root cause of past vSAN DP snapshot failures directly from logs.

Environment

  • vSAN ESA cluster running vSAN 8.0 U3 or VMware Cloud Foundation (VCF) 9
  • vSAN Data Protection (vSAN DP) is in use.

Cause

vSAN Data Protection performs a health check before executing snapshot tasks. If any Skyline Health check is in a red state, the snapshot task will fail.

Resolution

Follow the steps below to troubleshoot:

 

Step 1: Identify the failure time window

From vCenter:

  • Check Tasks & Events

  • Identify the time when vSAN DP snapshot tasks failed

 

Step 2: Access snapshot service logs

SSH into the appliance used for snapshot operations:

  • For vSAN 8.0 U3 → vSAN DP appliance
  • For VCF 9 → Live Recovery appliance

Relevant log files:

  • snap-service.log

  • snap-service.log.X.gz (rotated logs)

Note: snap-service.log is in JSON format and is not easy to read directly. The following command generates a more readable version:

jq . snap-service.log > /tmp/snap-service.log.txt
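
For rotated logs, the same idea can be sketched as a small shell helper that accepts either a plain or a gzip-compressed file. This is only an illustration, not a product tool: the function name pretty_snap_log is hypothetical, and it assumes that jq and gzip are available on the appliance.

```shell
# Hypothetical helper: pretty-print a snap-service log to stdout.
# Usage: pretty_snap_log <log-file>
pretty_snap_log() {
  # gzip -dcf decompresses .gz input and passes plain files through unchanged
  gzip -dcf "$1" | jq .
}
```

For example, pretty_snap_log snap-service.log.1.gz > /tmp/snap-service.log.1.txt, where the rotation index 1 and the output path are placeholders.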

 

Step 3: Locate snapshot execution time

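List the log files that contain entries for the date of the failure:

zgrep -l "<date>" snap-service.log snap-service.log*.gz

Replace <date> with the relevant date (for example, 2026-01-01). The -l option prints only the names of the matching files; search those files for entries around the scheduled snapshot time.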
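
When several files match, it can help to see how many lines in each file mention the date before deciding which one to open first. The helper below is a minimal sketch: count_date_hits is a hypothetical name, and it assumes the date appears in the log lines in the same form used with zgrep.

```shell
# Hypothetical helper: print "<file><TAB><matching line count>" for each file.
# Usage: count_date_hits <date> <file>...
count_date_hits() {
  date_str=$1
  shift
  for f in "$@"; do
    # gzip -dcf reads both plain and .gz files; grep -F treats the date literally
    n=$(gzip -dcf "$f" | grep -cF -e "$date_str")
    printf '%s\t%s\n' "$f" "$n"
  done
}
```

For example: count_date_hits 2026-01-01 snap-service.log snap-service.log*.gz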
 

Step 4: Check Skyline Health evaluation in logs

In snap-service.log or the rotated logs (snap-service.log.X.gz, decompressed if necessary), look for entries similar to:

"message": "QueryVcClusterHealthSummary",    <-- Indicates Skyline Health check was executed
"GroupHealth": "red"    <-- Overall Skyline Health status is red
"TestName": "xxx",    <-- Name of the specific health check
"TestHealth": "red"    <-- Result of the specific test is red

Example:

"TestName": "vSAN HCL DB up-to-date",
"TestHealth": "red"
 
or
 
"TestName": "Storage space",
"TestHealth": "red",
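
One way to find these entries quickly is to search a readable copy of the log for all of the markers at once and print their line numbers. The following is a minimal sketch, assuming the pretty-printed file generated in Step 2 and that the keys appear exactly as in the excerpts above; show_health_hits is a hypothetical helper name.

```shell
# Hypothetical helper: print, with line numbers, every health-check query
# marker and every red group or test result found in the given file.
# Usage: show_health_hits <pretty-printed-log>
show_health_hits() {
  grep -nE '"message": *"QueryVcClusterHealthSummary"|"(GroupHealth|TestHealth)": *"red"' "$1"
}
```

For example: show_health_hits /tmp/snap-service.log.txt. The reported line numbers can then be used to open the file at the matching entry and read the surrounding TestName and timestamp fields.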

 

Step 5: Identify the failing health check

  • Focus on entries where "TestHealth": "red"

  • Identify which Skyline Health test failed

Example:

  • vSAN HCL DB up-to-date

  • Storage space
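
To summarize which tests were red across a whole log file, the entries can be tallied with awk. This is a sketch under stated assumptions rather than a supported tool: it expects a pretty-printed file (one key per line, as produced by jq in Step 2), assumes that each "TestName" line precedes its "TestHealth" line as in the excerpts in Step 4, and uses the hypothetical function name list_red_tests.

```shell
# Hypothetical helper: print "<count><TAB><test name>" for every test that
# reported red, most frequent first.
# Usage: list_red_tests <pretty-printed-log>
list_red_tests() {
  awk '
    # Remember the most recent TestName value seen.
    /"TestName":/ {
      name = $0
      sub(/^.*"TestName": *"/, "", name)
      sub(/".*$/, "", name)
    }
    # Attribute a red TestHealth to that most recent TestName (assumption).
    /"TestHealth": *"red"/ && name != "" { count[name]++ }
    END { for (n in count) printf "%d\t%s\n", count[n], n }
  ' "$1" | sort -rn
}
```

For example: list_red_tests /tmp/snap-service.log.txt. If the key order in your logs differs from the excerpts, the attribution will be wrong, so spot-check a few entries manually.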

 

Step 6: Correlate with vCenter Skyline Health history

In vCenter:

  • Navigate to Cluster → Monitor → Skyline Health

  • In the Health score trend pane, click CUSTOM and specify a time range covering the failure window.

  • Then click the graph at a specific timestamp to view detailed results.

Verify the following:

  • The same Skyline Health test was red during the snapshot failure window

  • The test later returned to green, after which snapshots resumed normally

Additional Information

When investigating historical vSAN DP snapshot failures:

  • Do not rely solely on current Skyline Health status
  • Use vSAN DP logs to determine the health state at the time of failure
  • Any red Skyline Health status will prevent snapshot execution by design