"Problem: Kafka javaservice is not running" displays when running check-service-health.sh command in platform node for VCF Operations for Networks

Article ID: 406221

Products

VCF Operations for Networks

Issue/Introduction

Kafka service is failing on platform node 1.

When checking the service health on platform node 1 using the command:

./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d

All services report "running" or "running and healthy" except for the Kafka service, which displays the status:

Problem: Kafka javaservice is not running.

instead of:

Kafka is running

Note:  VCF Operations for Networks was formerly named Aria Operations for Networks (AON), and prior to that was named vRealize Network Insight (vRNI).

Environment

Aria Operations for Networks 6.13
Aria Operations for Networks 6.14
Aria Operations for Networks 6.14.1

Cause

The replication offset checkpoint file used by the Kafka service has a malformed file format.

Upon examination of the platform node 1 /var/log/arkin/kafka/kafka.log, we see the following:

java.io.IOException: Malformed line in checkpoint file (/var/lib/kafka/kafka-logs/replication-offset-checkpoint): '52 197063175'

indicating that the replication offset checkpoint is malformed.
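To check whether a node is hitting this cause, the log can be searched for the error string quoted above. The following is a minimal sketch; the function name find_checkpoint_errors is hypothetical, and the default path is the log location given in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print every
# "Malformed line in checkpoint file" error found in a Kafka log.
# With no argument it reads the log path quoted in this article.
find_checkpoint_errors() {
    local log_file="${1:-/var/log/arkin/kafka/kafka.log}"
    # -F treats the pattern as a literal string rather than a regex.
    # grep exits non-zero when no matching line is found.
    grep -F "Malformed line in checkpoint file" "$log_file"
}
```

Running find_checkpoint_errors with no argument on platform node 1 prints the matching log lines, if any. A non-zero exit status means the error string was not found in that log.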

Resolution

Deleting the replication offset checkpoint file lets the Kafka service re-create it on restart, after which the service resumes functioning correctly.
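Before deleting the file, it can be useful to see which entries Kafka is likely to reject. The sketch below assumes the usual Kafka checkpoint layout (a version line, an entry-count line, then one "topic partition offset" entry per line); that layout is not stated in this article, so treat the check as illustrative only. The function name check_checkpoint_format is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list checkpoint entries that do not have three fields.
# ASSUMPTION: lines 1 and 2 hold the version and the entry count, and every
# later non-empty line should read "topic partition offset".
check_checkpoint_format() {
    local file="$1"
    awk '
        # Report any non-empty entry line whose field count is not 3.
        NR > 2 && NF > 0 && NF != 3 {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        # Exit status 1 if at least one suspect line was reported.
        END { exit (bad ? 1 : 0) }
    ' "$file"
}
```

Under that assumption, an entry such as '52 197063175' from the log excerpt has only two fields and would be reported. The function only reads the file; it does not modify or repair it.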

  1. SSH into Platform node 1 using the support user.

  2. Check the service health by running the following commands:
    ub
    ./run_all.sh sudo /home/ubuntu/check-service-health.sh --uptime

    You should see the following line in the results:

    Problem: Kafka javaservice is not running.

  3. Verify that the replication-offset-checkpoint file exists by running the following commands:
    cd /var/lib/kafka/kafka-logs
    ls -lrth

    You should see a line similar to the following on the list for "replication-offset-checkpoint":

    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint

  4. Create a backup of the replication offset checkpoint file by running the following commands:

    cp replication-offset-checkpoint replication-offset-checkpoint.bak
    ls -lrth

    You should now see the original file named "replication-offset-checkpoint" and a new file named "replication-offset-checkpoint.bak":

    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint
    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint.bak

  5. Delete the replication offset checkpoint file using the following command:
    sudo rm -rf replication-offset-checkpoint
    ls -lrth

    You should now only see the "replication-offset-checkpoint.bak" file listed:

    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint.bak

  6. The Kafka service is likely not running in the environment. In case it is, run the following command to stop it:
    ./run_all.sh sudo systemctl stop kafka.service

    You may only see a list of your platform nodes, or you may see more details showing that the Kafka service has been stopped.

  7. Start the kafka service by running the following command:
    ./run_all.sh sudo systemctl start kafka.service

    You should see a list of your platform nodes.

  8. Confirm that the "replication-offset-checkpoint" file was re-created successfully by running the following command:
    ls -lrth

    You should see two files listed again, similar to:

    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint
    -rw-r--r-- 1 ubuntu ubuntu 17 Mar 15 04:39 replication-offset-checkpoint.bak

  9. Check the service health again, using the following command:

    ./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d

    You should see "Kafka is running" in the services list with a short uptime:

    Kafka is running
    Uptime: 00:01:05

  10. Wait 24 hours, then verify service stability by re-running the service health check from step 9. Once the Kafka uptime shows a little over 24 hours, you can delete the backup file (replication-offset-checkpoint.bak).
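Steps 3 through 5 can also be sketched as one guarded function that refuses to delete the original unless the backup is byte-identical. This is an illustration, not a supported tool: the function name backup_and_remove_checkpoint is hypothetical, the default directory is the one used in this article, and on the appliance it would need to run as a user with write access to that directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper covering steps 3-5: verify the checkpoint file exists,
# copy it to a .bak file, confirm the copy matches, then delete the original.
backup_and_remove_checkpoint() {
    local dir="${1:-/var/lib/kafka/kafka-logs}"
    local file="$dir/replication-offset-checkpoint"
    local backup="$file.bak"

    # Step 3: the file must exist before anything else happens.
    if [ ! -f "$file" ]; then
        echo "not found: $file" >&2
        return 1
    fi

    # Step 4: create the backup, preserving mode and timestamps.
    cp -p "$file" "$backup" || return 1

    # Only delete when the backup is byte-identical to the original.
    if ! cmp -s "$file" "$backup"; then
        echo "backup differs from original; nothing deleted" >&2
        return 1
    fi

    # Step 5: remove the original so Kafka can re-create it on restart.
    rm -f "$file"
}
```

After the function returns successfully, only replication-offset-checkpoint.bak remains in the directory, matching the listing shown in step 5. Stopping and starting kafka.service (steps 6 and 7) is still done separately.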

Additional Information

If you see the error "Grid processing stopped since kafka cluster is not available" in the GUI after you have completed the repair of the Kafka service, open a support case with Broadcom Support and refer to this KB article. For more information, see Creating and managing Broadcom support cases.