High memory for log-store-vms after unexpected ESXi host failure



Article ID: 403955


Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

  • One or more log-store-vms show high memory consumption in the Ops Manager UI when viewing the App Metrics tile -> Status tab (a BOSH CLI check is shown below).
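
    The same vitals can also be checked from the BOSH CLI. This is a minimal sketch; it assumes the BOSH CLI is targeted at the environment and that the deployment follows the appMetrics-<DEPLOYMENT_ID> naming used in the commands below:

    # bosh -d appMetrics-<DEPLOYMENT_ID> vms --vitals      ----------> lists memory, swap, and CPU usage per instance, including log-store-vms
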
  • In the recent past, an ESXi host on the vSphere infrastructure failed, or a storage outage occurred on the host or datastore that the log-store-vm was running on.
  • Using bosh ssh to connect to the log-store-vm, you may find that the monit-managed log-store process is restarting.

    # bosh ssh -d appMetrics-<DEPLOYMENT_ID> log-store-vms/<INSTANCE_ID>
    # sudo su
    # monit summary
    # monit status      ----------> check the uptime of the log-store process to see when it last restarted

  • Running the top command from a bosh ssh session on the log-store-vm shows the log-store process consuming memory; depending on how degraded the VM is, it may also show memory swapping. A command-line example is shown below.

    Example of top output viewing memory (original screenshot not included).
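
    As a command-line approximation, top can be run in batch mode sorted by memory (a minimal sketch, assuming the procps-ng top shipped on Ubuntu stemcells):

    # top -b -n 1 -o %MEM | head -n 20      ----------> single sample with processes sorted by memory usage
    # free -m                               ----------> overall memory and swap usage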



  • It is possible that the log-store process will consume all memory on the VM, leading to 'Out Of Memory' (OOM) failures. If so, /var/log/kern.log will report oom_reaper warnings like the following (a search command is shown below):

    Out of memory: Killed process <ID> (log-store)
    oom_reaper: reaped process <ID> (log-store) 
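
    These messages can be searched for directly (assuming the default kern.log path shown above):

    # grep -iE "out of memory|oom_reaper" /var/log/kern.log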

Environment

This problem was seen with Metric Store 1.7.0 in App Metrics version 2.3.0-build.4. The problem relates to the underlying InfluxDB and is not necessarily version dependent.

Cause

In this instance, the ESXi host that failed was participating in a vSAN cluster. The host failure led to a datastore connectivity failure, which caused filesystem corruption on the log-store-vm. The corruption affected the TSM file being written at the time of the interruption, leaving InfluxDB unable to load the file, which in turn caused unrestrained memory consumption and eventual OOM errors.
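
As a supporting check (not part of the original diagnosis), the kernel log on the log-store-vm can be searched for storage or filesystem errors around the time of the outage; the exact messages vary by filesystem and storage driver, so treat these patterns only as examples:

# grep -iE "ext4-fs error|i/o error" /var/log/kern.log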

Resolution

  • To investigate the log-store service's data, use the influx_inspect verify command to output the health status of each TSM file:

 

# for partition in /var/vcap/store/log-store/influxdb/*; do echo "Processing: $partition"; sudo /var/vcap/packages/influx-inspect/influx_inspect verify -dir "$partition"; done


Processing /var/vcap/store/log-store/influxdb/15/

/var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000357-000000002.tsm: got 2097282685 but expected 39641503 for key [108 111 .... <omitted> .... 103 101], block 12808

/var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000359-000000001.tsm: healthy

 

  • In the above example output, only the 000000357-000000002.tsm file is impacted in the /var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/ folder, while the 000000359-000000001.tsm file is healthy. 
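
    Before moving anything, the contents of the affected shard directory can be listed to confirm file names and sizes (the path below is the example path from the verify output above):

    # ls -lh /var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/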

 

  • To correct this condition, move the 000000357-000000002.tsm file to a backup directory while the log-store service is stopped:
    1. Stop the log-store process:

      # monit stop log-store

    2. Create a backup directory:

      # mkdir -p /var/vcap/store/log-store/influxdb/backups

    3. Move the 000000357-000000002.tsm file to the backup directory:

      # mv /var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000357-000000002.tsm /var/vcap/store/log-store/influxdb/backups/

    4. Restart the log-store service:

      # monit start log-store
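
    Optionally, confirm the recovery by re-using commands shown earlier in this article: check that monit reports log-store as running, and re-run the verify against the affected partition:

      # monit summary
      # sudo /var/vcap/packages/influx-inspect/influx_inspect verify -dir /var/vcap/store/log-store/influxdb/15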

 

This is the fastest method of recovery. The data contained in the corrupted 000000357-000000002.tsm file will be removed from InfluxDB and will not be retained for App Metrics.