Troubleshooting timekeeping issues in Linux guest operating systems

Article ID: 307984

Products

VMware VMware Desktop Hypervisor VMware vSphere ESXi

Issue/Introduction

This article provides steps for troubleshooting timekeeping issues that occur when running Linux guest operating systems in a virtual machine. Typical symptoms are:

- Time in the virtual machine jumps forward and back
- Time in the virtual machine runs slowly
- Time in the virtual machine runs quickly

Resolution

Validate that each troubleshooting step below is true for the environment. Each step provides instructions or a link to an article to help eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.

Apply the timekeeping best practices documented in Linux timekeeping best practices.

For ESXi, run NTP on the host.

For hosted products, run w32time or NTP on the host as appropriate. Use Workstation 6.5, Fusion 2.0, Server 2.0, Player 2.0, or a later version of any of these products. These releases contain a number of fixes to address issues with host TSC synchronization.

Check for timer interrupt delivery falling behind.

The guest operating system typically uses timekeeping interrupts to determine the current time. If the hypervisor raises them at a lower rate than the guest operating system requested, the time reported by the virtual hardware differs from real time. See Timekeeping in VMware Virtual Machines for an in-depth description of timer interrupt delivery and what it means for it to fall behind. Even if timer interrupts are delivered at the correct rate, that is, the virtual hardware is reporting the correct time, time in the guest may still be incorrect due to issues in the guest operating system; the later steps, starting with the NTP check, address those issues. However, if timer interrupt delivery is falling behind, little can be done to correct this from within the guest, so addressing this first is important.

The best way to measure how far timer interrupt delivery has fallen behind is to enable TimeTrackerStats. TimeTrackerStats are covered in detail in the Turn On Additional Logging section of Timekeeping in VMware Virtual Machines.

For the purposes of this article, add:

timeTracker.periodicStats = TRUE

timeTracker.statInterval = 5

to the virtual machine's configuration (.vmx) file, either directly or by using VI Client.

To determine whether the timekeeping problems that are observed are caused by interrupt delivery falling behind, reproduce the timekeeping problem and look at the TimeTrackerStats messages that correspond in time with the problem. The part of the message that is relevant is the behind by portion:

TimeTrackerStats behind by 2246 us; ...

In this case, TimeTrackerStats are reporting that interrupt delivery is behind by only 2246 microseconds, which is good. If timer interrupt delivery is behind by a significant amount, the log may instead contain a message such as:

TimeTrackerStats behind by 6929841 us; ...

In this case, TimeTrackerStats are reporting that interrupt delivery is behind by 6929841 microseconds, or 6.9 seconds.
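When a log contains many of these messages, a small filter can make the large values easier to spot. The following is a minimal sketch, not a VMware tool: the function name behind_by_report is hypothetical, the 1-second default threshold is an assumed cutoff, and it assumes the messages keep the "behind by N us" wording shown above.

```shell
#!/bin/sh
# behind_by_report: read log lines on stdin and print, for every
# "TimeTrackerStats behind by N us" message, the lag in seconds and
# whether it exceeds a threshold in seconds (default 1, an assumed cutoff).
behind_by_report() {
    awk -v limit="${1:-1}" '
        /TimeTrackerStats behind by [0-9]+ us/ {
            # Locate the number that follows the words "behind by".
            for (i = 1; i < NF; i++) {
                if ($i == "behind" && $(i + 1) == "by") {
                    secs = $(i + 2) / 1000000
                    status = (secs > limit) ? "BEHIND" : "ok"
                    printf "%.3f s %s\n", secs, status
                }
            }
        }'
}

# Example using the two messages quoted in this article.
# Prints "0.002 s ok" and then "6.930 s BEHIND".
printf '%s\n' \
    'TimeTrackerStats behind by 2246 us; ...' \
    'TimeTrackerStats behind by 6929841 us; ...' | behind_by_report
```

Assuming the messages are written to the virtual machine's vmware.log, the function could be run as behind_by_report 2 < vmware.log to flag only values above two seconds.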

If TimeTrackerStats reports that interrupt delivery is behind by a significant amount (more than a second or two):
1. Check whether the VMkernel is paging guest memory to disk.
  
  To do this:
  1. Start esxtop.
  2. Type m to switch to the memory view.
  3. Look at the line starting with SWAP.
    
    It should look like:
    
    SWAP /MB: 0 curr, 0 target: 0.00 r/s, 0.00 w/s
    
    If any of the numbers are non-zero, then the VMkernel has swapped some of the guest memory to disk for at least one virtual machine on the host.
    
    If no VMkernel swapping is occurring, but interrupt delivery is still falling behind, continue to the next check and ensure the virtual machine has sufficient CPU resources.
2. Ensure the virtual machine has sufficient CPU resources.
  
  To do this:
  1. Start esxtop.
  2. Type e and the GID of the virtual machine in question. Press Enter.
  3. Look at the %RDY time for the vmm worlds.
    
    If the %RDY is high, the virtual machine is not getting as much CPU time as it is requesting.
    
    Here is an example from ESX 3.5 where the VM RHEL5.2-0 is expanded to show individual vmm worlds, such as vmm0:RHEL5.2-0. Each of them has a %RDY of about 50%, which matches the 2x CPU overcommitment present on the host.
    
      ID  GID  NAME                NWLD   %USED    %RUN  %SYS   %WAIT    %RDY  %IDLE  %OVRLP
    1141   28  vmware-vmx             1    0.06    0.06  0.00   99.71    0.30   0.00    0.00
    1142   28  vmm0:RHEL5.2-0         1   50.54   51.00  0.01    0.68   48.37   0.00    0.46
    1143   28  vmm1:RHEL5.2-0         1   49.69   50.16  0.00    1.38   48.52   0.00    0.46
    1144   28  vmm2:RHEL5.2-0         1   50.56   51.01  0.00    2.80   46.24   0.00    0.45
    1145   28  vmm3:RHEL5.2-0         1   50.52   50.97  0.00    2.51   46.56   0.00    0.40
    1146   28  vmware-vmx             1    0.00    0.00  0.00  100.00    0.00   0.00    0.00
    1147   28  mks:RHEL5.2-0          1    0.60    0.59  0.02   95.23    4.25   0.00    0.00
    1148   28  vcpu-0:RHEL5.2-0       1    0.01    0.01  0.00   99.99    0.00   0.00    0.00
    1149   28  vcpu-1:RHEL5.2-0       1    0.00    0.00  0.00  100.00    0.00   0.00    0.00
    1150   28  vcpu-2:RHEL5.2-0       1    0.00    0.00  0.00  100.00    0.00   0.00    0.00
    1151   28  vcpu-3:RHEL5.2-0       1    0.00    0.00  0.00  100.00    0.00   0.00    0.00
    1169   28  Worker#0:RHEL5.2-0     1    0.01    0.01  0.00   99.98    0.00   0.00    0.00
      29   29  RHEL5.2-1            12  188.39  189.68  0.02  797.75  213.04   0.00    1.30
      30   30  RHEL5.2-2             5    4.50    4.52  0.00  487.59    8.09  88.99    0.02
      31   31  RHEL5.2-3            11  187.19  188.70  0.00  706.60  205.30   0.19    1.48
      32   32  RHEL5.2-4            12  211.10  211.47  0.00  803.07  185.59   0.00    1.29
    
    If %RDY is high, there are two ways to address the issue:
    1. Reduce host load. This is the most straightforward solution.
      
      OR
    2. Apply CPU reservations to the virtual machine. This is useful if only some of the virtual machines need to have accurate timekeeping, or if some of the virtual machines need more CPU resources to keep time accurately.
      
      If the virtual machine's %RDY is low, but timer interrupt delivery is still falling behind, continue to the next step.
3. If timer interrupt delivery still falls significantly behind, file a support request.
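The two esxtop checks above can also be applied to text copied from the esxtop screen and saved to a file. The sketch below is illustrative only: the function names swap_status and high_rdy_worlds are hypothetical, the default %RDY limit of 10 is an arbitrary value chosen for the example (this article does not define what counts as high), and it assumes world names contain no spaces.

```shell
#!/bin/sh
# swap_status: read esxtop memory-view text on stdin and report whether
# any number on the line starting with SWAP is non-zero.
swap_status() {
    awk '
        /^SWAP/ {
            for (i = 1; i <= NF; i++) {
                v = $i
                gsub(/[,:]/, "", v)        # drop trailing punctuation
                if (v ~ /^[0-9]+(\.[0-9]+)?$/ && v + 0 != 0) found = 1
            }
        }
        END { print (found ? "swapping" : "no swapping") }'
}

# high_rdy_worlds: read an expanded esxtop CPU table on stdin and print
# the name and %RDY of each vmm world whose %RDY exceeds the limit
# (first argument, default 10, an assumed value).
high_rdy_worlds() {
    awk -v limit="${1:-10}" '
        # Header row: remember which columns hold NAME and %RDY.
        $1 == "ID" {
            for (i = 1; i <= NF; i++) {
                if ($i == "NAME") name_col = i
                if ($i == "%RDY") rdy_col = i
            }
            next
        }
        # Data rows: report only vmm worlds above the limit.
        rdy_col && ($name_col ~ /^vmm[0-9]+:/) && ($rdy_col + 0 > limit) {
            print $name_col, $rdy_col
        }'
}
```

For example, high_rdy_worlds 20 < esxtop-cpu.txt lists the vmm worlds with more than 20% ready time, where esxtop-cpu.txt is a hypothetical file holding the copied table.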

Check that NTP is running properly in the guest and on the host.

To view ntpd's status, run the command ntpq -p to print the list of peers that ntpd is in communication with. Make sure that there is a currently selected peer (its name is preceded by a "*"). Ideally, other servers are marked with a "+", which indicates that they are acceptable as well.

For example:

bash$ ntpq -p

 remote            refid         st  t  when  poll  reach  delay   offset  jitter
=================================================================================
+ntps2.gslabs.org  192.168.0.72   2  u   149   256    377  0.212  -18.115  11.359
+ntps3.gslabs.org  192.168.0.72   2  u   185   256    377  0.207  -82.106  14.625
*ntps1.gslabs.org  192.168.0.72   2  u   175   256    377  0.266   65.871  21.401
 ntps4.gslabs.org  192.168.10.2   3  u    55   256    377  0.284  -20.468  19.470
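
The peer check described above can be scripted for repeated use. This is a minimal sketch under the assumption that the ntpq -p output has the layout shown in the example; the function name ntp_peer_summary is hypothetical.

```shell
#!/bin/sh
# ntp_peer_summary: read "ntpq -p" output on stdin, report the selected
# peer (line starting with "*") and count the other acceptable peers
# (lines starting with "+"). Returns non-zero if no peer is selected.
ntp_peer_summary() {
    awk '
        /^\*/ { selected = substr($1, 2) }   # strip the leading "*"
        /^\+/ { candidates++ }
        END {
            if (selected != "") {
                printf "selected peer: %s, other candidates: %d\n", selected, candidates
            } else {
                print "no selected peer"
                exit 1
            }
        }'
}
```

It can be used as ntpq -p | ntp_peer_summary, and the exit status makes it usable in a monitoring script.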

Collect time in the guest versus time reported by a reference source.

/usr/sbin/ntpdate -q <timeserver> reports how far the clock on the client (where ntpdate is executed) is ahead of or behind the NTP server specified by <timeserver>. Positive offsets indicate that time on the client is behind time on the server. Negative offsets indicate that time on the client is ahead of time on the server.

For example:

bash$ /usr/sbin/ntpdate -q 0.vmware.pool.ntp.org

server 65.182.224.39, stratum 2, offset -0.002269, delay 0.04424
server 66.79.167.34, stratum 2, offset 0.004515, delay 0.03171
server 72.18.205.156, stratum 2, offset 0.004714, delay 0.04095
server 72.167.54.201, stratum 2, offset 0.000994, delay 0.04677
server 128.10.252.10, stratum 2, offset -0.019049, delay 0.08801
28 Apr 20:25:20 ntpdate[1217]: adjust time server 66.79.167.34 offset 0.004515 sec
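
Because the sign convention is easy to misread, the following sketch restates the final summary line in words. It assumes the summary line ends in "offset N sec" as in the example above; the function name describe_offset is hypothetical.

```shell
#!/bin/sh
# describe_offset: read "ntpdate -q" output on stdin and state in words
# whether the client is behind or ahead of the server, based on the
# offset in the final summary line (positive = client is behind).
describe_offset() {
    awk '
        / offset / && $NF == "sec" {
            for (i = 1; i < NF; i++) if ($i == "offset") off = $(i + 1)
        }
        END {
            if (off == "") { print "no offset found"; exit 1 }
            v = off + 0
            if (v > 0)      printf "client is behind the server by %.6f s\n", v
            else if (v < 0) printf "client is ahead of the server by %.6f s\n", -v
            else            print "client matches the server"
        }'
}
```

For example: /usr/sbin/ntpdate -q 0.vmware.pool.ntp.org | describe_offset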

This can be used to collect data on how time on the client varies:

bash$ while true; do /usr/sbin/ntpdate -q 0.vmware.pool.ntp.org | tail -n -1; sleep 1; done

28 Apr 20:35:21 ntpdate[5112]: adjust time server 66.79.167.34 offset 0.004764 sec
28 Apr 20:35:27 ntpdate[5116]: adjust time server 66.79.167.34 offset 0.004872 sec
28 Apr 20:35:33 ntpdate[5119]: adjust time server 66.79.167.34 offset 0.004834 sec
28 Apr 20:35:39 ntpdate[5123]: adjust time server 66.79.167.34 offset 0.004871 sec
28 Apr 20:35:44 ntpdate[5127]: adjust time server 66.79.167.34 offset 0.004857 sec
28 Apr 20:35:50 ntpdate[5147]: adjust time server 66.79.167.34 offset 0.004909 sec
28 Apr 20:35:56 ntpdate[5150]: adjust time server 66.79.167.34 offset 0.004858 sec
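
To make the collected lines easier to load into a spreadsheet, they can be reduced to two comma-separated columns. This is a minimal sketch that assumes the line layout shown above (day, month, time, then "offset N sec" at the end); the function name offsets_to_csv and the column headings are hypothetical.

```shell
#!/bin/sh
# offsets_to_csv: convert ntpdate summary lines on stdin into CSV with a
# timestamp column and an offset column (in seconds).
offsets_to_csv() {
    awk '
        BEGIN { print "time,offset_seconds" }
        $NF == "sec" {
            for (i = 1; i < NF; i++) {
                # Fields 1-3 are day, month, and HH:MM:SS.
                if ($i == "offset") print $1 " " $2 " " $3 "," $(i + 1)
            }
        }'
}
```

The output of the collection loop can be saved to a file and then converted, for example offsets_to_csv < offsets.log > offsets.csv, where both file names are placeholders.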

This output can then be imported into a spreadsheet and the offset graphed over time. If the graph contains sudden jumps in time, this is most likely due to corrections applied by time synchronization utilities within the guest, like NTP or VMware Tools time sync. When troubleshooting, it can be useful to temporarily disable all time synchronization utilities so that the underlying issue can be seen separately from the utilities' attempts to correct the time.
- If the resulting line is straight, the problem is most likely hardware drift at a rate greater than NTP can correct.
- If the resulting line is not straight, the problem is most likely one of two issues.
  - If the resulting line indicates time loss, that is, time in the guest is moving further behind real time, the issue is most likely lost ticks.
  - If the resulting line indicates time gain, that is, time in the guest is moving further ahead of real time, the issue is most likely lost tick overcompensation.
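
As a rough numerical companion to the graph, the direction and rate of drift can be estimated with a least-squares fit over the collected samples. The sketch below is an illustration, not part of any VMware or NTP tool: the function name offset_trend is hypothetical, it assumes all samples were taken on the same day, and it relies on the sign convention stated earlier (a growing positive offset means the guest is falling behind). It does not judge whether the line is straight; that is still best checked on the graph.

```shell
#!/bin/sh
# offset_trend: fit a straight line through (time, offset) samples taken
# from ntpdate summary lines on stdin and report its slope.
# Assumes every sample falls on the same day (no midnight rollover).
offset_trend() {
    awk '
        $NF == "sec" {
            off = ""
            for (i = 1; i < NF; i++) if ($i == "offset") off = $(i + 1)
            if (off == "") next
            # Field 3 is HH:MM:SS; convert to seconds since the first sample.
            split($3, t, ":")
            secs = t[1] * 3600 + t[2] * 60 + t[3]
            if (n == 0) first = secs
            x = secs - first
            n++; sx += x; sy += off; sxx += x * x; sxy += x * off
        }
        END {
            d = n * sxx - sx * sx
            if (n < 2 || d == 0) { print "not enough samples"; exit 1 }
            slope = (n * sxy - sx * sy) / d    # seconds of offset per second
            rate = slope * 1000000             # microseconds per second
            if (rate > 0)      printf "guest is losing time: %.1f us per second\n", rate
            else if (rate < 0) printf "guest is gaining time: %.1f us per second\n", -rate
            else               print "no trend"
        }'
}
```

Run it on data collected with time synchronization temporarily disabled, for example offset_trend < offsets.log, where offsets.log is a placeholder for the saved loop output.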
