Collecting vSAN Performance Service data for vSAN performance issues

Article ID: 326959

Products

VMware vSAN

Issue/Introduction

This article describes how to collect the vSAN Performance Service data that must be uploaded to a VMware Support (GS) ticket so that the performance metrics of your vSAN/VxRail clusters can be analyzed and reviewed.

Symptoms:
  • Customer Experience Improvement Program (CEIP) cannot be enabled due to security concerns
  • vCenter Server does not have a connection to the internet to upload CEIP data to VMware
  • You see deteriorated performance on the vSAN cluster and need VMware Support to identify and resolve performance-related issues in vSAN/VxRail environments.


Environment

VMware vSAN 6.6.x
VMware vSAN 6.2.x
VMware vSAN 7.0.x
VMware vSAN 6.5.x

Cause

Observing any of the following:
  • high latency on vSAN performance charts
  • VMs experiencing high latency
  • slow throughput
  • very high outstanding I/O

Resolution

Notes:
  • The metrics collected by this process cover only the last 2 days.
  • The collection operation invokes a number of queries and can take a while to finish.
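Because only the last 2 days of metrics are retained, it can help to confirm that the performance event still falls inside that window before starting a collection. The following is a minimal sketch of that check, not part of any VMware tooling; the function name `is_within_retention` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Retention window taken from the note above: only the last 2 days are kept.
RETENTION = timedelta(days=2)


def is_within_retention(event_time: datetime,
                        collection_time: datetime,
                        retention: timedelta = RETENTION) -> bool:
    """Return True if metrics from event_time should still be available
    when the collection runs at collection_time."""
    if event_time > collection_time:
        raise ValueError("event_time must not be later than collection_time")
    return collection_time - event_time <= retention


if __name__ == "__main__":
    # Hypothetical timestamps for illustration only.
    event = datetime(2024, 1, 1, 12, 0)
    collect = datetime(2024, 1, 3, 11, 0)
    print(is_within_retention(event, collect))
```

If the check returns False, the event is older than the retention window, and reproducing the issue before collecting is likely the better option.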
Collection steps:
  1. Ensure the performance issue is actively occurring.
  2. Enable Verbose and Network Diagnostics mode via vSAN cluster > Configure > vSAN > Services > Performance Service > Edit
  3. Let the system run for 10 to 15 minutes.
  4. Collect the logs from every host in the cluster, as well as the vCenter logs.
  5. Disable the Verbose and Network Diagnostics mode.
  6. Upload the logs to VMware for analysis.
Notes:
  • It is highly recommended to collect the host logs before the vCenter logs. Host logs tend to wrap faster than vCenter logs, so collecting them first ensures that the required host logging from the time of the event is captured.
  • When troubleshooting performance issues, collect the logs from the Master/Leader host first so that the performance data from the time of the event is captured.
  • On large clusters (20+ hosts), when collecting logs via vCenter, collect the host logs in small batches of no more than 5 hosts at a time so that the logs do not become corrupted.
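The ordering and batching guidance in the notes above can be sketched as a small planning helper. This is an illustrative sketch only, not a VMware tool: the function name `plan_log_batches` and the host names are hypothetical, and the helper only produces the collection order. It does not collect any logs itself.

```python
from typing import List, Optional


def plan_log_batches(hosts: List[str],
                     master: Optional[str] = None,
                     batch_size: int = 5) -> List[List[str]]:
    """Return hosts grouped into batches of at most batch_size,
    with the Master/Leader host (if given) placed first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Remove duplicates while keeping the original order.
    ordered = list(dict.fromkeys(hosts))
    if master is not None:
        if master not in ordered:
            raise ValueError(f"master {master!r} is not in the host list")
        # Move the Master host to the front so its bundle is collected first.
        ordered.remove(master)
        ordered.insert(0, master)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


if __name__ == "__main__":
    # Hypothetical 22-host cluster with esx07 as the Master host.
    cluster = [f"esx{n:02d}" for n in range(1, 23)]
    for number, batch in enumerate(plan_log_batches(cluster, master="esx07"), 1):
        print(f"Batch {number}: {', '.join(batch)}")
```

Each printed batch can then be selected in a single log export from vCenter, starting with the batch that contains the Master host.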

If the issue is not actively occurring but is reproducible on demand, do the following:
  1. Enable Verbose and Network Diagnostics mode via vSAN cluster -> Configure -> vSAN -> Services -> Performance Service -> Edit
  2. Reproduce the performance issue
  3. Let the system run for 10 to 15 minutes.
  4. Collect the ESXi host logs from all hosts in the cluster and also logs from vCenter
  5. Disable the Verbose and Network Diagnostics mode
  6. Upload the logs to VMware for analysis

If the performance issue is affecting the log collection, VMware Support needs, at a bare minimum, the logs from the Master host in the cluster, because the Master host's bundle contains the performance statistics for the cluster.

To determine the Master host in the cluster, check the following:
vCenter -> vSAN Cluster -> Monitor -> vSAN -> Health -> Performance Service -> Stats Master Election


Additional Information

Impact/Risks:
  • For the required performance data to be accessible, the vSAN Performance Service must be enabled on all production vSAN deployments.
  • This helps to identify bottlenecks/issues at various layers of the vSAN stack.
  • Verbose mode produces large log files, which are placed on the scratch partition and in the perf stats DB object.
  • Collecting the ESXi log bundles may take a long time to finish.