Corrupted FSDB files due to unrealistic timestamp - Aria Operations (formerly vRealize Operations)

Article ID: 340117

Products

VMware Aria Suite

Issue/Introduction

  • The Aria Operations (formerly vRealize Operations) analytics service crashes continuously on one or more nodes.
  • An alert related to FSDB corruption appears on the Alerts tab in Aria Operations (formerly vRealize Operations).
  • analytics-wrapper.log, located in /data/vcops/log, contains entries similar to the following:

    DEBUG  | wrapper  | 2020/11/13 07:42:28 | Pending Pings 5
    DEBUG  | wrapper  | 2020/11/13 07:42:37 | Signal trapped.  Details:
    DEBUG  | wrapper  | 2020/11/13 07:42:37 |   signal number=17 (SIGCHLD), source="unknown"
    DEBUG  | wrapper  | 2020/11/13 07:42:37 | Received SIGCHLD, checking JVM process status.
    STATUS | wrapper  | 2020/11/13 07:42:37 | JVM received a signal SIGSEGV (11)

  • The analytics service appears to restart regularly.
  • fsdb-accessor-uuid.log, located in /data/vcops/log, contains entries similar to the following:

    FSDB throws an exception: CorruptedFileException: The file /usr/lib/vmware-vcops/data/8/8028/144998775_03_8028.dat was corrupted loadHeader: Header version mismatch 0 != 2 and failed to repair No any data could be repaired, the file '/usr/lib/vmware-vcops/data/8/8028/144998775_03_8028.dat' was deleted.

  • MP4H can also be affected by this issue: it stops processing MP4H metrics or, in some cases, a subset of them. Following the resolution in this KB resolves this as well.
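
The log symptoms above can be checked quickly from a shell on each node. The following is a minimal sketch, not an official procedure: the helper name count_matches is hypothetical, and the fsdb-accessor-*.log glob is an assumption about how the accessor logs are named.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a log file that contain a fixed string.
count_matches() {
    # $1 = fixed string to search for, $2 = log file
    grep -cF -- "$1" "$2" 2>/dev/null || true
}

LOG_DIR=/data/vcops/log   # log directory named in this article

# Number of JVM segmentation faults recorded by the wrapper.
count_matches 'JVM received a signal SIGSEGV' "$LOG_DIR/analytics-wrapper.log"

# FSDB accessor logs that mention a corrupted file
# (the fsdb-accessor-*.log glob is an assumption about the file naming).
grep -lF 'CorruptedFileException' "$LOG_DIR"/fsdb-accessor-*.log 2>/dev/null || true
```

A non-zero count from the first command, together with one or more file names from the second, matches the symptoms described in this article.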


Environment

VMware vRealize Operations 8.x
Aria Operations 8.x

Cause

Environmental issues, such as NTP problems or a lack of disk space, can corrupt the date in the name of an FSDB .dat file. The affected FSDB file(s) then carry an unrealistic date far in the future.
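
As a rough way to look at the two environmental factors named above, the sketch below prints the disk usage of the filesystem that holds the FSDB data and the node's current time. It assumes that /storage/db (taken from the paths used in this article) is the relevant filesystem; the helper name disk_use_percent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage of the filesystem holding a path.
disk_use_percent() {
    # df -P prints one header line, then one line per filesystem; column 5 is the capacity.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $5 }'
}

disk_use_percent /storage/db   # assumed location of the FSDB data
date -u                        # compare this against a trusted time source
```

A nearly full filesystem or a clock that differs noticeably from the expected time is worth investigating before the cleanup steps are carried out.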

Resolution

Note: Before attempting the following steps, ensure that snapshots have been taken of all nodes in the Aria Operations cluster, as described in How to take a Snapshot of VMware Aria Operations.

  1. Run the command below on all analytics nodes to check for corrupted FSDB files with an unrealistic timestamp.

    Note: The command should also return the following two files, which can be ignored; they are left in the results to confirm that the command ran successfully:
    /storage/db/vcops/data/cache/0.cache  and  /storage/db/vcops/data/cache/.lock

    find /storage/db/vcops/data/ -type f -not -regex '.*/[2][0][2][0-9]_[0-9][0-9]_.*.dat' -and ! -regex '.*/[2][0][1][7-9]_[0-9][0-9]_.*.dat' -and ! -name '*dtr' -and ! -name 'mps_*'


    An example of a corrupted file is as follows:
    /usr/lib/vmware-vcops/data/8/8584/163603858_02_8584.dat
    Note the unrealistic, far-future timestamp (163603858_02) at the start of the file name.

    An example of a valid file is as follows:
    /usr/lib/vmware-vcops/data/10/10718/2020_10_10718.dat
    Note the realistic timestamp (2020_10) at the start of the file name.

    In rare cases, you may see files dated before the year 2000; these can be deleted.



  2. If any such files are found, note down their names and the node on which each was found.
  3. Take the cluster offline via the /admin UI.
  4. Power off the analytics VMs and create snapshots.
  5. Power on the analytics VMs.
  6. While the cluster is still offline, connect to each affected node over SSH and remove the files found in step 1 using the rm command.

    Example: "rm /usr/lib/vmware-vcops/data/8/8584/163603858_02_8584.dat"

  7. Delete the FSDB cache from all analytics nodes using the command below:

    rm -i $(find /usr/lib/vmware-vcops/data/cache/ -name '*.cache' | grep -v '/0.cache')

  8. Bring the cluster online.
  9. Confirm that the cluster is fully operational, then remove the snapshots.
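
The file-name check behind step 1 can be illustrated with a small sketch. It mirrors the year range that the find command accepts (2017 to 2029) but, as a simplification, looks only at the base name of each file. The helper name is_plausible_fsdb_name is hypothetical, and the two paths are the examples given in step 1.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the file name starts with a year in 2017-2029,
# followed by a two-digit month, and ends in .dat.
is_plausible_fsdb_name() {
    case "$(basename "$1")" in
        201[7-9]_[0-9][0-9]_*.dat|202[0-9]_[0-9][0-9]_*.dat) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify the two example paths from this article; nothing is deleted.
for f in \
    /usr/lib/vmware-vcops/data/8/8584/163603858_02_8584.dat \
    /usr/lib/vmware-vcops/data/10/10718/2020_10_10718.dat
do
    if is_plausible_fsdb_name "$f"; then
        echo "plausible: $f"
    else
        echo "suspect:   $f"
    fi
done
```

The first path is reported as suspect and the second as plausible. The sketch only classifies names; removal should still follow the steps above, with the cluster offline and snapshots in place.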

Additional Information

If the command below returns cache files other than 0.cache, delete them as well.
For example: /storage/db/vcops/data/cache/1.cache
 
find /storage/db/vcops/data/ -type f -not -regex '.*/[2][0][2][0-9]_[0-9][0-9]_.*.dat' -and ! -regex '.*/[2][0][1][7-9]_[0-9][0-9]_.*.dat' -and ! -name '*dtr' -and ! -name 'mps_*'
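
To review only the extra cache files before deleting anything, a narrower listing can help. This is a minimal sketch under the assumption that the cache directory is /storage/db/vcops/data/cache/, as in the example above; the helper name list_extra_cache_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list *.cache files under a directory, skipping 0.cache.
list_extra_cache_files() {
    find "$1" -type f -name '*.cache' ! -name '0.cache' 2>/dev/null
}

# Cache directory taken from the example path in this article; this only lists files.
list_extra_cache_files /storage/db/vcops/data/cache/
```

Any path printed here is a cache file other than 0.cache and is a candidate for deletion while the cluster is offline.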
 
Impact/Risks:
  • The cluster may continue to crash until the issue is rectified.
  • MP4H will show no metrics, or only a subset of metrics, because the metrics cannot be saved to disk while the FSDB issues persist.