Troubleshooting timekeeping issues in Linux guest operating systems

Article ID: 307984

Products

VMware Desktop Hypervisor, VMware vSphere ESXi

Issue/Introduction

This article provides steps for troubleshooting timekeeping issues that occur when running Linux guest operating systems in a virtual machine.

Symptoms:
  • Time in the virtual machine jumps forward and back
  • Time in the virtual machine runs slowly
  • Time in the virtual machine runs quickly


Resolution

Work through each of the troubleshooting steps below for your environment. Each step provides instructions, or a link to an article, to help you eliminate possible causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
  1. Apply the timekeeping best practices documented in Linux timekeeping best practices (1006427).

    For ESX, run NTP on the host.

    For hosted products, run w32time or NTP on the host, as appropriate for the host operating system. Use Workstation 6.5, Fusion 2.0, Server 2.0, Player 2.0, or a later version of any of these products. These releases contain a number of fixes for issues with host TSC (Time Stamp Counter) synchronization.
     
  2. Check for timer interrupt delivery falling behind.

    Guest operating systems typically use timer interrupts to keep track of the current time. If the hypervisor raises these interrupts at a lower rate than the guest operating system requested, the time that the virtual hardware reports to the guest differs from real time. See the Timekeeping in VMware Virtual Machines white paper for an in-depth description of timer interrupt delivery and what it means for it to fall behind. If timer interrupts are delivered at the correct rate (that is, the virtual hardware is reporting the correct time), time in the guest may still be incorrect because of issues in the guest operating system; Steps 3 and onward address those issues. However, if timer interrupt delivery is falling behind, there is little that can be done to correct it from within the guest, so it is important to address this first.
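As a simplified illustration, assume a guest that advances its clock by a fixed amount (1/requested rate) for every timer interrupt it actually receives. Many Linux kernels also use other clock sources, so treat this as a model of the effect rather than a description of any particular kernel. The Python sketch below, with hypothetical function names, computes how far such a guest falls behind real time when fewer interrupts arrive than it requested:

```python
# Simplified model (an assumption, not how every kernel keeps time):
# the guest adds 1/requested_hz seconds to its clock for each timer
# interrupt it actually receives.


def guest_elapsed_seconds(delivered_ticks: int, requested_hz: int) -> float:
    """Time that a tick-counting guest believes has passed."""
    return delivered_ticks / requested_hz


def drift_seconds(real_seconds: float, delivered_hz: float, requested_hz: int) -> float:
    """How far the guest clock ends up behind real time (positive = behind)."""
    delivered_ticks = int(real_seconds * delivered_hz)
    return real_seconds - guest_elapsed_seconds(delivered_ticks, requested_hz)


if __name__ == "__main__":
    # Guest requests 1000 Hz, but only 900 interrupts per second are delivered.
    print(f"{drift_seconds(60, 900, 1000):.1f} s behind after 60 s")
```

With a requested rate of 1000 Hz and 900 interrupts per second delivered, the modeled guest counts only 54 seconds during 60 seconds of real time, so it ends up 6 seconds behind.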

    The best way to measure how far timer interrupt delivery has fallen behind is to enable TimeTrackerStats. TimeTrackerStats are covered in detail in the Turn On Additional Logging section of Timekeeping in VMware Virtual Machines.

    For the purposes of this article, add the following lines to the virtual machine's configuration (.vmx) file, either directly or by using the VI Client:

    timeTracker.periodicStats = TRUE
    timeTracker.statInterval = 5
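If you prefer to script the change, the following Python sketch updates or appends the two entries in the text of a .vmx file. It is a hypothetical helper, not a VMware tool. It assumes one entry per line, writes values in the quoted key = "value" form commonly used in .vmx files, and assumes the file is edited while the virtual machine is powered off.

```python
from __future__ import annotations

# Hypothetical helper: set or add `key = "value"` entries in .vmx file text.
# Assumes one entry per line; edit the file only while the VM is powered off.


def set_vmx_options(vmx_text: str, options: dict[str, str]) -> str:
    lines = vmx_text.splitlines()
    remaining = dict(options)  # entries not yet found in the file
    for i, line in enumerate(lines):
        if "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key in remaining:
            # Replace an existing entry in place.
            lines[i] = f'{key} = "{remaining.pop(key)}"'
    # Append any entries that were not already present.
    for key, value in remaining.items():
        lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = 'displayName = "RHEL5.2-0"\ntimeTracker.statInterval = "60"\n'
    updated = set_vmx_options(original, {
        "timeTracker.periodicStats": "TRUE",
        "timeTracker.statInterval": "5",
    })
    print(updated, end="")
```

Replacing existing entries in place, rather than always appending, avoids leaving two conflicting values for the same key in the file.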

    To determine whether the observed timekeeping problems are caused by interrupt delivery falling behind, reproduce the problem and look at the TimeTrackerStats messages logged at the time the problem occurs. The relevant part of each message is the "behind by" portion:

    TimeTrackerStats behind by 2246 us; ...

    In this case, TimeTrackerStats are reporting that interrupt delivery is behind by only 2246 microseconds (about 2 milliseconds), which is good. If timer interrupt delivery is behind by a significant amount, you may see something like:

    TimeTrackerStats behind by 6929841 us; ...

    In this case, TimeTrackerStats are reporting that interrupt delivery is behind by 6929841 microseconds, or 6.9 seconds.
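To scan a log for these messages, a short Python sketch such as the following can extract the "behind by" value from each TimeTrackerStats line and report the values above a threshold. The function names and the default one-second threshold are assumptions chosen to match the guidance that follows. To scan a real log, pass an open file object instead of the sample list, for example significant_lag(open("vmware.log")), where vmware.log is assumed to be the log in the virtual machine's directory that records these messages.

```python
import re

# Matches the "behind by <N> us" portion of a TimeTrackerStats message.
BEHIND_RE = re.compile(r"TimeTrackerStats behind by (\d+) us")


def behind_seconds(log_lines):
    """Yield how far interrupt delivery is behind, in seconds, per stats line."""
    for line in log_lines:
        match = BEHIND_RE.search(line)
        if match:
            yield int(match.group(1)) / 1_000_000  # microseconds -> seconds


def significant_lag(log_lines, threshold_s=1.0):
    """Return the lags (in seconds) that exceed threshold_s."""
    return [lag for lag in behind_seconds(log_lines) if lag > threshold_s]


if __name__ == "__main__":
    sample = [
        "TimeTrackerStats behind by 2246 us; ...",
        "TimeTrackerStats behind by 6929841 us; ...",
    ]
    print(significant_lag(sample))
```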

    If TimeTrackerStats reports that interrupt delivery is behind by a significant amount (more than a second or two):
     
    1. Check whether the VMkernel is paging guest memory to disk.

      To do this:
       
      1. Start esxtop.
      2. Type m to switch to the memory view.
      3. Look at the line starting with SWAP.

        It should look like:

        SWAP /MB: 0 curr, 0 target: 0.00 r/s, 0.00 w/s

        If any of the numbers are non-zero, the VMkernel has swapped some guest memory to disk for at least one virtual machine on the host. See Time falls behind in a virtual machine when the memory of the virtual machine is paged from disk by the VMKernel (1005861) for background information and ways to address the issue.

        If no VMkernel swapping is occurring but interrupt delivery is still falling behind, continue with the next substep.
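When checking many hosts, the SWAP line can also be tested programmatically. This Python sketch, with hypothetical helper names, assumes a line copied from the esxtop memory view in exactly the format shown above; other esxtop versions may lay out the line differently.

```python
from __future__ import annotations

import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def swap_values(esxtop_line: str) -> list[float]:
    """Return the numbers on an esxtop 'SWAP /MB:' line, in order."""
    if not esxtop_line.lstrip().startswith("SWAP"):
        raise ValueError("not an esxtop SWAP line")
    _, _, rest = esxtop_line.partition(":")  # drop the "SWAP /MB" label
    return [float(n) for n in NUMBER_RE.findall(rest)]


def vmkernel_is_swapping(esxtop_line: str) -> bool:
    """True if any SWAP counter (curr, target, r/s, w/s) is non-zero."""
    return any(value != 0 for value in swap_values(esxtop_line))


if __name__ == "__main__":
    print(vmkernel_is_swapping("SWAP /MB: 0 curr, 0 target: 0.00 r/s, 0.00 w/s"))
```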
         
    2. Ensure the virtual machine has sufficient CPU resources.

      To do this:
       
      1. Start esxtop.
      2. In the CPU view (the default view; type c to return to it), type e, enter the GID of the virtual machine in question, and press Enter.
      3. Look at the %RDY time for the vmm worlds.

        If %RDY is high, the virtual machine is not getting as much CPU time as it needs.

        Here is an example from ESX 3.5 in which the virtual machine RHEL5.2-0 (GID 28) is expanded to show its individual vmm worlds, such as vmm0:RHEL5.2-0. Each vmm world has a %RDY of about 50%, which matches the 2x CPU overcommitment present on the host.

        ID    GID  NAME                NWLD  %USED   %RUN    %SYS    %WAIT   %RDY    %IDLE   %OVRLP
        1141  28   vmware-vmx          1     0.06    0.06    0.00    99.71   0.30    0.00    0.00
        1142  28   vmm0:RHEL5.2-0      1     50.54   51.00   0.01    0.68    48.37   0.00    0.46
        1143  28   vmm1:RHEL5.2-0      1     49.69   50.16   0.00    1.38    48.52   0.00    0.46
        1144  28   vmm2:RHEL5.2-0      1     50.56   51.01   0.00    2.80    46.24   0.00    0.45
        1145  28   vmm3:RHEL5.2-0      1     50.52   50.97   0.00    2.51    46.56   0.00    0.40
        1146  28   vmware-vmx          1     0.00    0.00    0.00    100.00  0.00    0.00    0.00
        1147  28   mks:RHEL5.2-0       1     0.60    0.59    0.02    95.23   4.25    0.00    0.00
        1148  28   vcpu-0:RHEL5.2-0    1     0.01    0.01    0.00    99.99   0.00    0.00    0.00
        1149  28   vcpu-1:RHEL5.2-0    1     0.00    0.00    0.00    100.00  0.00    0.00    0.00
        1150  28   vcpu-2:RHEL5.2-0    1     0.00    0.00    0.00    100.00  0.00    0.00    0.00
        1151  28   vcpu-3:RHEL5.2-0    1     0.00    0.00    0.00    100.00  0.00    0.00    0.00
        1169  28   Worker#0:RHEL5.2-0  1     0.01    0.01    0.00    99.98   0.00    0.00    0.00
        29    29   RHEL5.2-1           12    188.39  189.68  0.02    797.75  213.04  0.00    1.30
        30    30   RHEL5.2-2           5     4.50    4.52    0.00    487.59  8.09    88.99   0.02
        31    31   RHEL5.2-3           11    187.19  188.70  0.00    706.60  205.30  0.19    1.48
        32    32   RHEL5.2-4           12    211.10  211.47  0.00    803.07  185.59  0.00    1.29
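The following Python sketch pulls the %RDY value for each vmm world out of rows like those above. It assumes whitespace-separated columns in the order shown in the header and world names without spaces. The helper names are hypothetical, and the 10% default threshold is an arbitrary placeholder, not a VMware recommendation, so choose a value appropriate for your environment.

```python
from __future__ import annotations

# Column order as in the esxtop CPU view header shown above.
HEADER = "ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP".split()
NAME_COL = HEADER.index("NAME")
RDY_COL = HEADER.index("%RDY")


def vmm_ready(rows: str) -> dict[str, float]:
    """Map each vmm world name to its %RDY value."""
    ready = {}
    for line in rows.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything with an unexpected shape.
        if len(fields) != len(HEADER) or not fields[0].isdigit():
            continue
        name = fields[NAME_COL]
        if name.startswith("vmm"):
            ready[name] = float(fields[RDY_COL])
    return ready


def starved_worlds(rows: str, threshold: float = 10.0) -> list[str]:
    """Names of vmm worlds whose %RDY exceeds the (placeholder) threshold."""
    return sorted(name for name, rdy in vmm_ready(rows).items() if rdy > threshold)


if __name__ == "__main__":
    sample = """ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP
1141 28 vmware-vmx 1 0.06 0.06 0.00 99.71 0.30 0.00 0.00
1142 28 vmm0:RHEL5.2-0 1 50.54 51.00 0.01 0.68 48.37 0.00 0.46
1143 28 vmm1:RHEL5.2-0 1 49.69 50.16 0.00 1.38 48.52 0.00 0.46"""
    print(starved_worlds(sample))
```

Only the vmm worlds are checked because, as described above, their %RDY is what indicates whether the virtual CPUs are waiting for physical CPU time.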


        If %RDY is high, there are two ways to address the issue:
         
        1. Reduce host load. This is the most straightforward solution.

          OR
           
        2. Apply CPU reservations to the virtual machine. This is useful if only some of the virtual machines need to have accurate timekeeping, or if some of the virtual machines need more CPU resources to keep time accurately.

          If the virtual machine's %RDY is low but timer interrupt delivery is still falling behind, continue with the next substep.
           
    3. Check whether the issue described in Time falls behind in a virtual machine when the guest operating system writes to previously unwritten regions of its virtual disk (1008284) applies. If it does, apply one of the solutions described in that article.
       
    4. If timer interrupt delivery still falls significantly behind, file a support request.
       
  3. Check that NTP is running properly in the guest and on the host. To view the status of ntpd, run ntpq -p, which prints the list of peers that ntpd is communicating with. Make sure that there is a currently selected peer (its name is preceded by a "*"). Ideally, other servers are marked with a "+", which indicates that they are also acceptable.

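    For example, in the following output ntps1.gslabs.org is the selected peer, and ntps2.gslabs.org and ntps3.gslabs.org are marked as acceptable: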

    bash$ ntpq -p

     remote           refid         st t when  poll  reach  delay   offset   jitter
    ===============================================================================
    +ntps2.gslabs.org 192.168.0.72  2  u 149   256   377    0.212   -18.115  11.359
    +ntps3.gslabs.org 192.168.0.72  2  u 185   256   377    0.207   -82.106  14.625
    *ntps1.gslabs.org 192.168.0.72  2  u 175   256   377    0.266   65.871   21.401
     ntps4.gslabs.org 192.168.10.2  3  u 55    256   377    0.284   -20.468  19.470
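The check for a selected peer can also be scripted. This Python sketch reads ntpq -p output and records the tally character in the first column of each peer line ("*" for the selected peer, "+" for acceptable candidates). The helper names are hypothetical; the sketch assumes unindented output as printed by ntpq, and treats a line whose tally column was lost (for example, by copy and paste) as having a blank tally.

```python
from __future__ import annotations

# Tally characters that ntpq prints in the first column of `ntpq -p` output.
TALLY_CODES = set(" x.-+#*o")


def ntpq_peers(output: str) -> dict[str, str]:
    """Map each remote name to its tally character (' ' if none)."""
    peers = {}
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header, and the ===== separator.
        if not stripped or stripped.startswith(("remote", "=")):
            continue
        if line[0] in TALLY_CODES:
            tally, rest = line[0], line[1:]
        else:  # tally column lost, e.g. by copy and paste
            tally, rest = " ", line
        peers[rest.split()[0]] = tally
    return peers


def selected_and_candidates(output: str) -> tuple[str | None, list[str]]:
    """Return the selected peer (or None) and the list of '+' candidates."""
    peers = ntpq_peers(output)
    selected = next((name for name, t in peers.items() if t == "*"), None)
    candidates = [name for name, t in peers.items() if t == "+"]
    return selected, candidates


if __name__ == "__main__":
    sample = (
        "+ntps2.gslabs.org 192.168.0.72  2 u 149 256 377 0.212 -18.115 11.359\n"
        "*ntps1.gslabs.org 192.168.0.72  2 u 175 256 377 0.266  65.871 21.401\n"
    )
    print(selected_and_candidates(sample))
```

If the selected peer comes back as None, ntpd has not synchronized to any server, which is the condition this step asks you to rule out.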

     
  4. Collect time in the guest versus time reported by a reference source.

    /usr/sbin/ntpdate -q <timeserver> queries the NTP server specified by <timeserver> without setting the clock, and reports how far time on the client (the machine where ntpdate is run) is ahead of or behind the server. Positive offsets indicate that time on the client is behind time on the server. Negative offsets indicate that time on the client is ahead of time on the server.

    For example:

    bash$ /usr/sbin/ntpdate -q 0.vmware.pool.ntp.org

    server 65.182.224.39, stratum 2, offset -0.002269, delay 0.04424
    server 66.79.167.34, stratum 2, offset 0.004515, delay 0.03171
    server 72.18.205.156, stratum 2, offset 0.004714, delay 0.04095
    server 72.167.54.201, stratum 2, offset 0.000994, delay 0.04677
    server 128.10.252.10, stratum 2, offset -0.019049, delay 0.08801
    28 Apr 20:25:20 ntpdate[1217]: adjust time server 66.79.167.34 offset 0.004515 sec
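A small Python sketch can turn the per-server lines of ntpdate -q output into numbers and apply the sign convention described above. The regular expression assumes the exact "server ..., stratum ..., offset ..., delay ..." format shown in the example, and the function names are hypothetical.

```python
import re

# Matches the per-server lines printed by `ntpdate -q`.
SERVER_RE = re.compile(
    r"server ([^,\s]+), stratum (\d+), offset (-?\d+\.\d+), delay (\d+\.\d+)"
)


def parse_ntpdate(output: str):
    """Return (server, stratum, offset_s, delay_s) tuples."""
    return [
        (m.group(1), int(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in SERVER_RE.finditer(output)
    ]


def describe(offset_s: float) -> str:
    # ntpdate sign convention: positive offset = client clock is behind the server.
    if offset_s > 0:
        return f"client is {offset_s * 1000:.3f} ms behind"
    if offset_s < 0:
        return f"client is {-offset_s * 1000:.3f} ms ahead"
    return "client matches the server"


if __name__ == "__main__":
    sample = """server 65.182.224.39, stratum 2, offset -0.002269, delay 0.04424
server 66.79.167.34, stratum 2, offset 0.004515, delay 0.03171"""
    for server, _stratum, offset, _delay in parse_ntpdate(sample):
        print(server, describe(offset))
```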


    Running the ntpdate -q query repeatedly in a loop collects data on how time on the client varies:

    bash$ while true; do /usr/sbin/ntpdate -q 0.vmware.pool.ntp.org | tail -n 1; sleep 1; done

    28 Apr 20:35:21 ntpdate[5112]: adjust time server 66.79.167.34 offset 0.004764 sec
    28 Apr 20:35:27 ntpdate[5116]: adjust time server 66.79.167.34 offset 0.004872 sec
    28 Apr 20:35:33 ntpdate[5119]: adjust time server 66.79.167.34 offset 0.004834 sec
    28 Apr 20:35:39 ntpdate[5123]: adjust time server 66.79.167.34 offset 0.004871 sec
    28 Apr 20:35:44 ntpdate[5127]: adjust time server 66.79.167.34 offset 0.004857 sec
    28 Apr 20:35:50 ntpdate[5147]: adjust time server 66.79.167.34 offset 0.004909 sec
    28 Apr 20:35:56 ntpdate[5150]: adjust time server 66.79.167.34 offset 0.004858 sec


    You can then import this output into a spreadsheet and graph the offset over time. If the graph shows sudden jumps in time, they are most likely corrections applied by time synchronization utilities within the guest, such as NTP or VMware Tools time synchronization. When troubleshooting, it can be useful to temporarily disable all time synchronization utilities, so that the underlying issue can be seen separately from the utilities' attempts to correct the time.
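Instead of copying the loop output by hand, you can convert it with a script. This Python sketch, with hypothetical function names, parses lines in the format shown above, renders them as CSV text for import into a spreadsheet, and lists the samples where the offset changed by more than a threshold between consecutive readings. The 0.1-second default threshold is an arbitrary assumption; adjust it to the size of jump you are looking for.

```python
import csv
import io
import re

# Matches lines such as:
# 28 Apr 20:35:21 ntpdate[5112]: adjust time server 66.79.167.34 offset 0.004764 sec
LINE_RE = re.compile(
    r"^(\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) ntpdate\[\d+\]: .* offset (-?\d+\.\d+) sec"
)


def parse_samples(lines):
    """Return (timestamp, offset_s) pairs from the loop's output."""
    samples = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            samples.append((match.group(1), float(match.group(2))))
    return samples


def to_csv(samples) -> str:
    """Render samples as CSV text for import into a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["time", "offset_s"])
    writer.writerows(samples)
    return buf.getvalue()


def find_jumps(samples, threshold_s=0.1):
    """Return (timestamp, change_s) wherever the offset moved by more than threshold_s."""
    jumps = []
    for (_, prev), (stamp, cur) in zip(samples, samples[1:]):
        if abs(cur - prev) > threshold_s:
            jumps.append((stamp, cur - prev))
    return jumps


if __name__ == "__main__":
    output = """28 Apr 20:35:21 ntpdate[5112]: adjust time server 66.79.167.34 offset 0.004764 sec
28 Apr 20:35:27 ntpdate[5116]: adjust time server 66.79.167.34 offset 0.004872 sec"""
    samples = parse_samples(output.splitlines())
    print(to_csv(samples), end="")
    print("jumps:", find_jumps(samples))
```

Comparing consecutive samples, rather than each sample against zero, separates a sudden correction from a steady offset that simply drifts.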


Additional Information

This article is also available in Simplified Chinese and Japanese.