Checking DX O2 Kafka health

Article ID: 272231

Products

DX Operational Observability

Issue/Introduction

The following is a high-level list of techniques and suggestions to use when troubleshooting common Jarvis Kafka performance and configuration issues.

Environment

DX O2 OnPremise

Resolution

1. Checklist

 

1) Check the Kafka logs for ERROR or WARN messages
 
Kafka logs are available from: <NFS>/jarvis/api/<jarvis-apis-pod>/*.log

Here is an example of a message indicating a Kafka-to-ZooKeeper connectivity issue:
 
Client session timed out, have not heard from server in 20010ms for sessionid 0x1709fbc4e26000a, closing socket connection and attempting reconnect
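
The log check above can be scripted. The following is a minimal sketch, assuming the logs are plain text and that the severity appears as the literal word ERROR or WARN on each line. The function name count_log_issues is hypothetical (it is not part of DX O2), and the directory argument stands for the <NFS>/jarvis/api/<jarvis-apis-pod> path.

```shell
#!/usr/bin/env bash
# Sketch: count ERROR lines, WARN lines, and ZooKeeper session timeouts
# in every *.log file of a directory.
# count_log_issues is a hypothetical helper name, not a DX O2 tool.
count_log_issues() {
  local log_dir="$1"
  local file errors warns timeouts
  for file in "$log_dir"/*.log; do
    [ -f "$file" ] || continue   # skip when the glob matches nothing
    # grep -c prints the number of matching lines (0 when there are none)
    errors=$(grep -c -w 'ERROR' "$file")
    warns=$(grep -c -w 'WARN' "$file")
    timeouts=$(grep -c 'Client session timed out' "$file")
    printf '%s ERROR=%s WARN=%s TIMEOUT=%s\n' "$file" "$errors" "$warns" "$timeouts"
  done
}
```

Usage: count_log_issues "<NFS>/jarvis/api/<jarvis-apis-pod>". A non-zero TIMEOUT count points at the Kafka-to-ZooKeeper connectivity issue shown in the example message.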
 
 
 
2) Check for possible LAG
 
a) Connect to a Kafka pod
 
kubectl exec -ti <jarvis-kafka-pod> -n<namespace> -- bash
 
Example:
kubectl exec -ti jarvis-kafka-0 -ndxi -- bash
 
b) Run the following commands to identify whether there is a LAG:
 
 
List all available topics:

/opt/ca/kafka/bin/kafka-topics.sh --bootstrap-server jarvis-kafka-0:9092,jarvis-kafka-1:9092,jarvis-kafka-2:9092 --list

List all consumer groups:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka-0:9092,jarvis-kafka-1:9092,jarvis-kafka-2:9092 --list

Check for a possible LAG in Jarvis. Recommendation: the LAG column should not be greater than 0.

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka-0:9092,jarvis-kafka-1:9092,jarvis-kafka-2:9092 --describe --group indexer

If there is a delay in AXA data reporting, see KB 238179 and check the axa.transformer consumer group:

/opt/ca/kafka/bin/kafka-consumer-groups.sh --bootstrap-server jarvis-kafka-0:9092,jarvis-kafka-1:9092,jarvis-kafka-2:9092 --describe --group axa.transformer
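
The LAG check can also be automated by filtering the --describe output. The sketch below assumes the output has a header row containing the TOPIC, PARTITION and LAG column names, as kafka-consumer-groups.sh normally prints. The function name report_lag is hypothetical. It reads the command output on standard input, prints each partition whose LAG is greater than 0, and returns status 1 when any lag is found.

```shell
#!/usr/bin/env bash
# Sketch: read `kafka-consumer-groups.sh --describe` output on stdin and
# print every partition whose LAG is greater than 0.
# report_lag is a hypothetical helper name.
report_lag() {
  awk '
    # Until the header row is seen, look for the column positions by name.
    !lag_col {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") topic_col = i
        if ($i == "PARTITION") part_col = i
        if ($i == "LAG") lag_col = i
      }
      next
    }
    # Data rows: a numeric LAG above 0 means the consumer is behind.
    $lag_col ~ /^[0-9]+$/ && $lag_col + 0 > 0 {
      printf "%s partition %s lag %s\n", $topic_col, $part_col, $lag_col
      found = 1
    }
    END { exit (found ? 1 : 0) }
  '
}

# Example (run inside the Kafka pod):
# /opt/ca/kafka/bin/kafka-consumer-groups.sh \
#   --bootstrap-server jarvis-kafka-0:9092,jarvis-kafka-1:9092,jarvis-kafka-2:9092 \
#   --describe --group indexer | report_lag
```

Looking up the columns by header name rather than by fixed position keeps the filter working if the column order differs between Kafka versions.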

 

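A LAG condition shows as a value greater than 0 in the LAG column of the --describe output for one or more partitions.

A consumer disconnection condition shows as a group with no active consumers: the CONSUMER-ID, HOST, and CLIENT-ID columns contain "-" instead of a consumer identifier.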

Recommendation: Restart the jarvis-lean-indexer service using the following kubectl commands:

a) Scale down the indexer deployment (the exact deployment name can be confirmed with "kubectl get deployments -n<namespace>"):

kubectl scale --replicas=0 deployment jarvis-lean-indexer -n<namespace>

b) Wait for the pod to stop

c) Scale the deployment back up, using the same deployment name as in step a):

kubectl scale --replicas=1 deployment jarvis-lean-indexer -n<namespace>

d) Check for possible LAG again
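
The restart steps a) to c) can be combined into one script. This is a minimal sketch, not a supported tool: the function name restart_deployment, the KUBECTL variable, and the polling limits (60 tries, 5 seconds apart) are hypothetical. It assumes that the deployment's .status.replicas field being empty or 0 is a good enough signal that the pod has stopped. A pod may still be terminating at that point, so checking with kubectl get pods is still worthwhile.

```shell
#!/usr/bin/env bash
# Sketch: restart a deployment by scaling it to 0 and back to 1.
# restart_deployment and KUBECTL are hypothetical names.
KUBECTL="${KUBECTL:-kubectl}"

restart_deployment() {
  local deployment="$1" namespace="$2"
  local max_tries="${3:-60}" delay="${4:-5}"
  local tries=0 replicas

  # a) Scale down
  "$KUBECTL" scale --replicas=0 deployment "$deployment" -n "$namespace" || return 1

  # b) Wait until the deployment reports no replicas (field empty or 0)
  while :; do
    replicas=$("$KUBECTL" get deployment "$deployment" -n "$namespace" \
      -o jsonpath='{.status.replicas}') || return 1
    if [ -z "$replicas" ] || [ "$replicas" = "0" ]; then
      break
    fi
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "timed out waiting for $deployment to scale down" >&2
      return 1
    fi
    sleep "$delay"
  done

  # c) Scale up, then wait for the rollout to complete
  "$KUBECTL" scale --replicas=1 deployment "$deployment" -n "$namespace" || return 1
  "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace"
}
```

Usage: restart_deployment <indexer-deployment-name> <namespace>, followed by the LAG check from step d).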

 

2. What to collect if the problem persists?

If the problem persists after applying the above checks and recommendations, collect the logs below and contact Broadcom Support:

<NFS>/jarvis/api/<jarvis-apis-pod>/*.log
<NFS>/jarvis/indexer/<jarvis-lean-indexer-pod>/*.log
<NFS>/jarvis/kafka-logs/kafka-<#>/*.log
<NFS>/jarvis/esutils/*.log

kubectl logs jarvis-zookeeper-0 -ndxi
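
To hand the files over in one piece, the log paths above can be bundled into a single archive. The following is a sketch under stated assumptions: the function name collect_jarvis_logs is hypothetical, the first argument stands for the <NFS> root, the pod-name placeholders are matched with wildcards, and the tar command supports the -T (read file list) option, as GNU tar and bsdtar do.

```shell
#!/usr/bin/env bash
# Sketch: bundle the Jarvis log files into one tar.gz archive.
# collect_jarvis_logs is a hypothetical helper name.
#   $1 = NFS root directory (the <NFS> placeholder)
#   $2 = archive to create (an absolute path is safest)
collect_jarvis_logs() {
  local nfs_root="$1" archive="$2"
  local list status
  list=$(mktemp) || return 1

  # Record paths relative to the NFS root so the archive keeps the layout.
  (
    cd "$nfs_root" || exit 1
    for f in jarvis/api/*/*.log jarvis/indexer/*/*.log \
             jarvis/kafka-logs/kafka-*/*.log jarvis/esutils/*.log; do
      if [ -f "$f" ]; then
        printf '%s\n' "$f"
      fi
    done
  ) > "$list"

  if [ ! -s "$list" ]; then
    echo "no log files found under $nfs_root/jarvis" >&2
    rm -f "$list"
    return 1
  fi

  tar -czf "$archive" -C "$nfs_root" -T "$list"
  status=$?
  rm -f "$list"
  return "$status"
}
```

Usage: collect_jarvis_logs <NFS> /tmp/jarvis-logs.tar.gz. The ZooKeeper pod log is not on the NFS share, so the kubectl logs output still has to be saved separately, for example by redirecting it to a file.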