Troubleshooting timer drift in vSAN 7.0 GA

Article ID: 314308

Products

VMware vSAN

Issue/Introduction

This article describes the timer drift issue in vSAN 7.0 GA and explains how to resolve it.

Symptoms:
When running a vSAN 7.0 release earlier than 7.0 Update 1 (ESXi build 16850804), a host can develop timer drift that grows the longer the host stays up. The drift manifests as a vSAN performance issue with one of the following two symptoms:
  1. LSOM LLOG is 100% busy regardless of the workload running on the system.
  2. All I/Os are delayed, and the delay grows longer over time.
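To check whether a host falls in the affected range, the build number can be compared against the 7.0 Update 1 build given above. The sketch below is an illustration, not a VMware tool: it assumes the version banner has the shape printed by `vmware -v` on ESXi (for example `VMware ESXi 7.0.0 build-15843807`), and it assumes that build numbers within the 7.0 line increase monotonically. The function name `is_affected` is hypothetical.

```python
import re

# ESXi build number of vSAN/ESXi 7.0 Update 1, as stated in this article.
FIXED_BUILD = 16850804


def is_affected(version_banner: str) -> bool:
    """Return True if the banner describes a 7.0 build older than 7.0 U1.

    Assumes a banner like "VMware ESXi 7.0.0 build-15843807"; adjust the
    pattern if your output differs.
    """
    match = re.search(r"ESXi (\d+)\.(\d+)\.\d+ build-(\d+)", version_banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {version_banner!r}")
    major, minor, build = (int(g) for g in match.groups())
    # Only 7.0.x is in scope; within 7.0, builds are assumed to increase over time.
    return (major, minor) == (7, 0) and build < FIXED_BUILD


print(is_affected("VMware ESXi 7.0.0 build-15843807"))  # True
print(is_affected("VMware ESXi 7.0.1 build-16850804"))  # False
```

Matching on the major and minor version as well as the build keeps builds from other release lines out of scope, since raw build numbers are not comparable across release lines.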


Environment

VMware vSAN 7.0.x

Cause

In vmkernel, time can be represented in two different ways:
  1. Timer cycles.
  2. Nanoseconds since system boot, converted from timer cycles.
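The relationship between the two representations can be sketched as a pair of integer conversions. This is not vmkernel code: the timer frequency `TIMER_HZ` is an assumed value chosen for illustration, and the helper names are hypothetical. Both directions use truncating integer division, which is where precision can be lost.

```python
# Assumed timer frequency for illustration only; the real frequency varies by host.
TIMER_HZ = 2_100_000_007  # timer cycles per second (hypothetical)
NS_PER_SEC = 1_000_000_000


def cycles_to_ns(cycles: int) -> int:
    """Nanoseconds since boot derived from a timer-cycle count (truncating)."""
    return cycles * NS_PER_SEC // TIMER_HZ


def ns_to_cycles(ns: int) -> int:
    """Timer cycles corresponding to a nanosecond value (truncating)."""
    return ns * TIMER_HZ // NS_PER_SEC


uptime_cycles = 3_600 * TIMER_HZ  # one hour of uptime, expressed in cycles
print(cycles_to_ns(uptime_cycles))  # 3600000000000
```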

Prior to vSAN 7.0 GA, nanoseconds were used to specify timer expiry times. Starting with vSAN 7.0 GA, timer cycles are used directly to specify the expiry time, which saves the CPU cycles otherwise spent on conversion.

Accordingly, vSAN started to track time in timer cycles, but nanoseconds are still used in some places. For some timers, the expiry time is specified in timer cycles that were converted back from nanoseconds.

Converting timer cycles to nanoseconds and then back to timer cycles can introduce a cumulative drift, which can grow large or underflow. As a result, timer events fire earlier or later than intended.
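How a round trip can accumulate into drift is shown by the simplified model below. It is a hypothetical illustration, not the vSAN code path. It assumes a self-rearming timer whose next expiry is computed by converting the current expiry from cycles to nanoseconds, adding a period, and converting back. The frequency, period, and function names are all assumptions. Because both conversions truncate, each rearm can lose a few cycles, and the loss adds up. The second helper shows one way an underflow could surface: if a rounded-down deadline lands before "now", an unsigned 64-bit subtraction wraps to a huge value instead of going negative.

```python
# Hypothetical model; vmkernel's real frequency and code paths are not shown here.
TIMER_HZ = 2_100_000_007  # timer cycles per second (assumed)
NS_PER_SEC = 1_000_000_000
PERIOD_NS = 1_000  # period of a hypothetical self-rearming timer
U64_MASK = (1 << 64) - 1


def cycles_to_ns(cycles: int) -> int:
    return cycles * NS_PER_SEC // TIMER_HZ  # truncates


def ns_to_cycles(ns: int) -> int:
    return ns * TIMER_HZ // NS_PER_SEC  # truncates


def rearm_drift(rearms: int) -> int:
    """Cycles by which a timer rearmed through a ns round trip falls behind
    the deadline computed once from the exact nanosecond value."""
    expiry = 0
    for _ in range(rearms):
        # Mixed path: cycles -> ns, add the period in ns, ns -> cycles again.
        expiry = ns_to_cycles(cycles_to_ns(expiry) + PERIOD_NS)
    ideal = ns_to_cycles(rearms * PERIOD_NS)  # single conversion, no accumulation
    return ideal - expiry


def cycles_until(deadline: int, now: int) -> int:
    """'Time remaining' computed like an unsigned 64-bit subtraction in C."""
    return (deadline - now) & U64_MASK


print(rearm_drift(10), rearm_drift(100_000))  # drift grows with the number of rearms
print(cycles_until(deadline=4195, now=4197))  # 18446744073709551614, not -2
```

In this model the drifted expiry is always at or before the ideal one, so the timer fires early. The wrapped subtraction instead makes a timer wait far longer than intended, which matches the "earlier or later" behavior described above.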

Resolution


Upgrade to vSAN 7.0 U1 or later.

Customers are encouraged to upgrade at their earliest convenience, as the issue's impact on vSAN performance can become significant over time.

Workaround:
Rebooting all hosts in the vSAN cluster without upgrading mitigates the timer drift temporarily. However, the drift builds up again as time progresses.

Additional Information

Impact/Risks:
The timer drift issue can degrade vSAN performance, and resolving it requires a host reboot.