# bosh -d appMetrics-<DEPLOYMENT_ID> ssh log-store-vms/<INSTANCE_ID>
# sudo su
# monit summary
# monit status
In the monit status output, check the uptime of the log-store process to see when it last restarted.

Running top from a bosh ssh session on the log-store-vm shows that the log-store process is consuming memory. It may also show memory swapping, depending on how degraded the VM is.

/var/log/kern.log will report oom_reaper warnings like:
Out of memory: Killed process <ID> (log-store)
oom_reaper: reaped process <ID> (log-store)

This problem was seen on Metric Store 1.7.0 for App Metrics version 2.3.0-build.4. The problem relates to the underlying influxdb and is not necessarily version dependent.

In this instance, the ESXi host that failed was participating in a vSAN cluster. The failure led to a loss of datastore connectivity, which caused filesystem corruption on the log-store-vm. The corruption affected the TSM file being written at the time of the interruption, leaving influxdb unable to load the file. This caused unrestrained memory consumption and eventual OOM errors.
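The kern.log symptom described above can be checked with a short helper. This is a minimal sketch: the function name count_oom_kills is hypothetical (it is not part of BOSH, monit, or Metric Store), and it assumes the kernel log lines follow the "Out of memory: Killed process <ID> (log-store)" pattern quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count OOM kills of a process recorded in a kernel log.
# Arguments: $1 = path to the log file, $2 = process name (defaults to log-store).
count_oom_kills() {
  local log_file="$1" proc="${2:-log-store}"
  # grep -c prints the number of matching lines (it prints 0 when none match).
  grep -cE "Out of memory: Killed process [0-9]+ \(${proc}\)" "$log_file"
}
```

On the log-store-vm it could be called as `count_oom_kills /var/log/kern.log`. A count above zero suggests the log-store process has been OOM-killed at least once since the log was last rotated.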
# for partition in /var/vcap/store/log-store/influxdb/*; do echo "Processing: $partition"; sudo /var/vcap/packages/influx-inspect/influx_inspect verify -dir "$partition"; done
Processing /var/vcap/store/log-store/influxdb/15/
/var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000357-000000002.tsm: got 2097282685 but expected 39641503 for key [108 111 .... <omitted> .... 103 101], block 12808
/var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000359-000000001.tsm: healthy
In this output, the 000000357-000000002.tsm file in the /var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/ folder is impacted, while the 000000359-000000001.tsm file is healthy.
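When many partitions are scanned, the impacted files can be picked out of the verify output automatically. The following is a sketch only: list_corrupt_tsm is a hypothetical helper name, and it assumes every per-file line has the form "<path>.tsm: healthy" or "<path>.tsm: <error text>", as in the sample output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read influx_inspect verify output on stdin and print
# the path of every .tsm file whose status is anything other than "healthy".
list_corrupt_tsm() {
  # Split each line on ": "; field 1 is the file path, field 2 the status text.
  # sort -u collapses repeats when one file reports several bad blocks.
  awk -F': ' '/\.tsm: / && $2 != "healthy" { print $1 }' | sort -u
}
```

For example, the output of the verify loop can be saved to a file (such as a hypothetical /tmp/verify.out) and then passed in with `list_corrupt_tsm < /tmp/verify.out`.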
To correct this condition, the 000000357-000000002.tsm file should be moved to a backup directory while the log-store service is stopped.

Stop the log-store service:
# monit stop log-store

Create the backup directory:
# mkdir -p /var/vcap/store/log-store/influxdb/backups

Move the 000000357-000000002.tsm file to the backup directory:
# mv /var/vcap/store/log-store/influxdb/15/data/logs/default/1751414400000000000/000000357-000000002.tsm /var/vcap/store/log-store/influxdb/backups/

Start the log-store service:
# monit start log-store
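The move step can be wrapped with a couple of safety checks. This is an illustrative sketch, not part of the documented procedure: quarantine_tsm is a hypothetical function name, and it only handles the file move. Stopping and starting log-store with monit is still done by the operator as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move a TSM file into a backup directory without
# overwriting an earlier backup that has the same file name.
# Arguments: $1 = path to the .tsm file, $2 = backup directory.
quarantine_tsm() {
  local src="$1" backup_dir="$2" name
  if [ ! -f "$src" ]; then
    echo "not a regular file: $src" >&2
    return 1
  fi
  name=$(basename "$src")
  mkdir -p "$backup_dir" || return 1
  if [ -e "$backup_dir/$name" ]; then
    echo "refusing to overwrite existing backup: $backup_dir/$name" >&2
    return 1
  fi
  mv "$src" "$backup_dir/"
}
```

With log-store stopped, it would be called with the impacted file path as the first argument and /var/vcap/store/log-store/influxdb/backups as the second.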
This is the fastest method of recovery. The data contained in the corrupted 000000357-000000002.tsm file will be removed from influxdb and will no longer be available to appMetrics.