Troubleshooting corrupted Cassandra commit logs in VMware Cloud Foundation

Article ID: 317069

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:
  • The Cassandra service is running but in degraded mode.
  • The LCM service fails to start with a message similar to the following: "lcm.pid not readable"
 # service lcm status
* lcm.service - LCM app
  Loaded: loaded (/etc/systemd/system/lcm.service; enabled; vendor preset: enabled)
  Active: failed (Result: resources) since Thu 2018-05-17 17:04:12 UTC; 12s ago
 Process: 50860 ExecStart=/home/vrack/lcm/lcm-app/bin/lcm-service.sh start (code=exited, status=0/SUCCESS)
 
May 17 17:04:09 sddc-manager.mv.rackspace.com systemd[1]: Starting LCM app...
May 17 17:04:09 sddc-manager.mv.rackspace.com systemd[1]: lcm.service: PID file /home/vrack/lcm/logs/lcm.pid not readable (yet?) after start: No such file or directory
May 17 17:04:12 sddc-manager.mv.rackspace.com systemd[1]: lcm.service: Daemon never wrote its PID file. Failing.
May 17 17:04:12 sddc-manager.mv.rackspace.com systemd[1]: Failed to start LCM app.
May 17 17:04:12 sddc-manager.mv.rackspace.com systemd[1]: lcm.service: Unit entered failed state.
May 17 17:04:12 sddc-manager.mv.rackspace.com systemd[1]: lcm.service: Failed with result 'resources'.
  • For VMware Cloud Foundation for Service Providers 2.4.x, you see messages similar to the following in /var/opt/cassandra/logs/debug.log on the SDDC Manager Controller VM:
ERROR [main] 2018-05-29 17:36:40,103 JVMStabilityInspector.java:82 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Could not read commit log descriptor in file /var/opt/cassandra/data/commitlog/CommitLog-6-1527302840621.log
  • For VMware Cloud Foundation for Integrated Systems 2.2.x and 2.3.x, you see messages similar to the following in /opt/vmware/cassandra/apache-cassandra-2.2.4/logs/debug.log on the SDDC Manager Controller VM:
ERROR [main] 2018-08-14 18:32:36,075 JVMStabilityInspector.java:78 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /opt/vmware/cassandra/apache-cassandra-2.2.4/bin/../data/commitlog/CommitLog-5-1534191227161.log
  • For VMware Cloud Foundation for Integrated Systems 3.x, you see messages similar to the following in /var/opt/cassandra/logs/debug.log on the SDDC Manager Controller VM:
ERROR [main] 2018-12-14 14:38:57,488 JVMStabilityInspector.java:102 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Could not read commit log descriptor in file /var/opt/cassandra/data/commitlog/CommitLog-6-1541589396779.log

Note: The preceding log excerpts are only examples. Dates, times, host names, and file names vary depending on your environment.

Environment

VMware Cloud Foundation 2.2.x
VMware Cloud Foundation 2.3.x
VMware Cloud Foundation 3.0.x
VMware Cloud Foundation 3.5.x

Cause

If the SDDC Manager Controller VM runs out of disk space, the Cassandra database starts overwriting the CommitLog-#-######.log files, which can corrupt several commit logs. When Cassandra then tries to commit these logs to its database, it runs in degraded mode, which causes the LCM service to stop running.
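
Because the corruption follows from a full disk, it can help to check free space on the filesystem that holds the commit log directory before and after the cleanup. The following is a minimal sketch and not part of the documented procedure. The function names used_pct and check_space and the default 90% threshold are assumptions chosen for illustration.

```shell
# Print the percentage of space used on the filesystem that holds path $1.
used_pct() {
    # -P forces POSIX one-line-per-filesystem output, so field 5 is the capacity column.
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Warn when usage is at or above a threshold ($2, default 90; a hypothetical value).
check_space() {
    pct=$(used_pct "$1")
    # Bail out if df gave no numeric capacity (for example, the path does not exist).
    case "$pct" in
        ''|*[!0-9]*) echo "could not determine usage for $1" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: filesystem for $1 is ${pct}% full"
        return 1
    fi
    echo "OK: filesystem for $1 is ${pct}% full"
}
```

For example, check_space /var/opt/cassandra/data/commitlog reports usage for the commit log directory used by 2.4.x and 3.x.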

Resolution

To resolve this issue:

For VMware Cloud Foundation for Service Providers 2.4.x
  1. Run the following command to check the status of the Cassandra service:
service cassandra status
  2. Inspect /var/opt/cassandra/logs/debug.log on the SDDC Manager Controller VM and identify the file name of the corrupted commit log.
less /var/opt/cassandra/logs/debug.log

org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Could not read commit log descriptor in file /var/opt/cassandra/data/commitlog/CommitLog-6-1527302840621.log
  3. Run the following commands to stop the watchdogserver service and the Cassandra service:
/home/vrack/lcm/lcm-app/bin/lcm-watchdogserver.sh stop
service cassandra stop
  4. Navigate to the /var/opt/cassandra/data/commitlog directory and remove the corrupted commit log file that was identified in Step 2.
cd /var/opt/cassandra/data/commitlog
rm CommitLog-#-######.log
  5. Run the following commands to start the watchdogserver service and the Cassandra service:
/home/vrack/lcm/lcm-app/bin/lcm-watchdogserver.sh start
service cassandra start
  6. Repeat Steps 1 to 5 until all corrupted commit logs are deleted. There is no automated remediation for commit log corruption, so each corrupted file must be removed manually.
  7. The LCM service will start once no corrupted commit logs remain.
  8. If the service does not start successfully, check debug.log for further commit log failures and remove the offending commit logs.
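
Rather than paging through debug.log with less, the corrupted file names from Step 2 can be pulled out with grep. This is a sketch only. The function name list_corrupt_commitlogs is hypothetical, and the pattern assumes the "Could not read commit log descriptor in file" message shown in the excerpts above.

```shell
# List each commit log file that debug.log ($1) reports as unreadable, once per file.
list_corrupt_commitlogs() {
    # -o prints only the matching part of the line; the path is the last field.
    grep -o 'Could not read commit log descriptor in file [^ ]*\.log' "$1" \
        | awk '{ print $NF }' \
        | sort -u
}
```

For example, list_corrupt_commitlogs /var/opt/cassandra/logs/debug.log prints one path per line. The log may still mention files that were already removed in earlier passes, so compare the output against the contents of the commitlog directory.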
 
For VMware Cloud Foundation for Integrated Systems 2.2.x and 2.3.x
  1. Run the following command to check the status of the Cassandra service:
systemctl status cassandra
  2. Inspect /opt/vmware/cassandra/apache-cassandra-2.2.4/logs/debug.log on the SDDC Manager Controller VM and identify the file name of the corrupted commit log.
less /opt/vmware/cassandra/apache-cassandra-2.2.4/logs/debug.log

org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /opt/vmware/cassandra/apache-cassandra-2.2.4/bin/../data/commitlog/CommitLog-5-1534191227161.log
  3. Run the following commands to stop the scs service and the Cassandra service:
systemctl stop scs
systemctl stop cassandra
  4. Navigate to the /opt/vmware/cassandra/apache-cassandra-2.2.4/bin/../data/commitlog directory and remove the corrupted commit log file that was identified in Step 2.
cd /opt/vmware/cassandra/apache-cassandra-2.2.4/bin/../data/commitlog
rm CommitLog-#-######.log
  5. Run the following commands to start the scs service and the Cassandra service:
systemctl start scs
systemctl start cassandra
  6. Repeat Steps 1 to 5 until all corrupted commit logs are deleted. There is no automated remediation for commit log corruption, so each corrupted file must be removed manually.
  7. The LCM service will start once no corrupted commit logs remain.
  8. If the service does not start successfully, check debug.log for further commit log failures and remove the offending commit logs.
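
On 2.2.x and 2.3.x the path reported in debug.log contains a bin/.. segment. As a precaution before running rm, a small helper can normalize the path and confirm that it points to an existing regular file. This is a sketch under stated assumptions and not part of the documented procedure. The function name canonical_commitlog is hypothetical.

```shell
# Print the normalized path of the commit log named in $1, or fail if it is not a file.
canonical_commitlog() {
    # Resolve the directory part physically (pwd -P) inside a command substitution,
    # so the caller's working directory is left unchanged.
    dir=$(cd "$(dirname "$1")" 2>/dev/null && pwd -P) || return 1
    file="$dir/$(basename "$1")"
    # Only report regular files; anything else returns non-zero and prints nothing.
    [ -f "$file" ] || return 1
    printf '%s\n' "$file"
}
```

A non-zero return means the reported file is not present at that location, in which case there is nothing to remove for that entry.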

For VMware Cloud Foundation for Integrated Systems 3.x
  1. Run the following command to check the status of the Cassandra service:
systemctl status cassandra
  2. Inspect /var/opt/cassandra/logs/debug.log on the SDDC Manager Controller VM and identify the file name of the corrupted commit log.
less /var/opt/cassandra/logs/debug.log

org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /var/opt/cassandra/data/commitlog/CommitLog-5-1534191227161.log
  3. Run the following commands to stop the scs service and the Cassandra service:
systemctl stop scs
systemctl stop cassandra
  4. Navigate to the /var/opt/cassandra/data/commitlog/ directory and remove the corrupted commit log file that was identified in Step 2.
cd /var/opt/cassandra/data/commitlog/
rm CommitLog-#-######.log
  5. Run the following commands to start the scs service and the Cassandra service:
systemctl start scs
systemctl start cassandra
  6. Repeat Steps 1 to 5 until all corrupted commit logs are deleted. There is no automated remediation for commit log corruption, so each corrupted file must be removed manually.
  7. The LCM service will start once no corrupted commit logs remain.
  8. If the service does not start successfully, check debug.log for further commit log failures and remove the offending commit logs.
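
After the final restart, the state of the affected services can be checked in one pass. The sketch below assumes systemd-managed units, as in the systemctl commands above. The function name units_active is hypothetical, and the unit name lcm is an assumption taken from the lcm.service output in the Symptoms section.

```shell
# Print the state of each named systemd unit; return non-zero if any is not active.
units_active() {
    rc=0
    for unit in "$@"; do
        # "systemctl is-active" prints the unit state and exits 0 only when it is active.
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        echo "$unit: ${state:-unknown}"
        [ "$state" = "active" ] || rc=1
    done
    return $rc
}
```

For example, units_active cassandra scs lcm prints one line per unit and returns non-zero while any of them is still failing, which indicates that another pass through Steps 1 to 5 may be needed.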