vSAN vmk interface intermittently spiking with high network traffic. vSAN performance metrics show coinciding spikes in IOPS.

Article ID: 426456

Updated On:

Products

VMware vSAN

VMware Live Recovery

Issue/Introduction

Network monitoring tools may alert on high levels of network traffic on the vSAN vmk interface of a vSAN node.  This can be observed live in esxtop while the issue is occurring.

The vSAN performance charts show a correspondingly high level of IOPS during the same periods.

 

The intermittent spikes may show fairly consistent timing and duration.  They may or may not negatively impact production VM performance.
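One way to confirm that the spikes are regular is to export throughput samples from the monitoring tool and measure the gaps between spike starts. The following is a minimal sketch only: the sample format (seconds since start, megabits per second), the threshold, and the function names are assumptions for illustration and do not correspond to any vSAN or esxtop output format.

```python
from statistics import mean, pstdev


def spike_starts(samples, threshold):
    """Return the timestamps at which throughput first rises above threshold.

    samples: list of (seconds_since_start, megabits_per_second) tuples,
    sorted by time. The format and threshold are illustrative assumptions.
    """
    starts = []
    above = False
    for t, mbps in samples:
        if mbps >= threshold and not above:
            starts.append(t)  # rising edge: a new spike begins here
        above = mbps >= threshold
    return starts


def interval_summary(starts):
    """Return (mean gap, population std deviation of gaps) between spike starts."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if not gaps:
        return None  # fewer than two spikes, nothing to compare
    return mean(gaps), pstdev(gaps)


# Hypothetical data: one sample per minute for an hour,
# with a two-minute burst every 15 minutes.
samples = [(t, 4000 if t % 900 < 120 else 200) for t in range(0, 3600, 60)]

starts = spike_starts(samples, threshold=2000)
print(starts)
print(interval_summary(starts))
```

A small deviation relative to the mean gap suggests a scheduled, cyclical source of IO rather than random workload bursts.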

No VM registered to this vSAN DR cluster has application-side operations running jobs with significant IO during those times.

Running a vSAN Skyline Health test may not report any alerts.  See:

Environment

VMware vSAN

vSphere Replication

Cause

This behavior can occur in vSAN clusters used as destinations for VMs replicated via vSphere Replication.  It depends on several factors, such as:

  • Source side VM churn rate
  • vSphere Replication VM RPO settings
  • vSAN RAID policy
  • vSAN component placement
  • Backend vSAN network traffic

Depending on these factors, a vSAN node in the destination cluster may be working to satisfy vSphere Replication data copies intended for components that reside on a disk group on that specific node.  Because of the RPO specified in vSphere Replication for each VM, the vSAN network and IO spikes may be cyclical and fairly consistent as vSphere Replication operates to meet the configured RPO requirements.
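To illustrate how churn rate and RPO together shape the size of each burst, the following rough sketch estimates the data shipped per replication cycle. It assumes a constant churn rate and one sync per RPO interval, and it ignores repeated writes to the same blocks, so the result should be read as an upper bound. The function name and the figures are hypothetical and are not taken from vSphere Replication.

```python
def data_per_cycle_gb(churn_gb_per_hour, rpo_minutes):
    """Estimate the data accumulated between syncs, in GB.

    Assumes constant churn and exactly one sync per RPO interval.
    Real replication may sync more often and only ships changed blocks,
    so treat this as an upper bound.
    """
    return churn_gb_per_hour * rpo_minutes / 60


# Hypothetical VM with 20 GB/hour of churn at three different RPO settings.
for rpo in (240, 60, 15):
    print(f"RPO {rpo:>3} min: up to {data_per_cycle_gb(20, rpo):.1f} GB per cycle")
```

Under these assumptions the total data replicated per day is unchanged. Only the size and frequency of each burst differ.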

 

Resolution

This is expected behavior.  However, if the intermittent high network utilization on a single vSAN node is a concern and the goal is to spread the vSphere Replication IO more uniformly across the vSAN network, several strategies exist:

  • Identify VMs replicated with vSphere Replication that transfer a large amount of data in each RPO window.  Consider slightly reducing the RPO on a few of them and monitoring the result.  This can lead to smaller amounts of data being replicated more often, as opposed to a large amount of data being replicated less often.
  • The DR vSphere Replication appliances "randomly" choose a DR host through which to send the replicated data to the vSAN datastore.  A single host may have been chosen for two or more VMs that have a large amount of replicated data, causing a lopsided DR workload.
    • Rebooting the host causes the DR vSphere Replication appliances to redistribute the DR workload, which may result in a more even distribution.
    • Rebooting the DR vSphere Replication appliances can also help with the redistribution, although rebooting the host is preferable.
  • vSphere Replication that uses Enhanced Replication (introduced in vSphere Replication 9.0) has an improved method of load balancing replicated data on the DR side as opposed to Legacy Replication methods.  More details: