Watchtower: Confluent fails to start because the retention policy of the schema topic _schemas is 'delete' instead of 'compact'

Article ID: 407506


Products

WatchTower Platform

Issue/Introduction

After a cluster upgrade, the Confluent schema registry fails to come back up with the following errors:

 WARN The replication factor of the schema topic _schemas is less than the desired one of 3. If this is a production environment, it's crucial to add more brokers and increase the replication factor of the topic. (io.confluent.kafka.schemaregistry.storage.KafkaStore:263)
 ERROR The retention policy of the schema topic _schemas is incorrect. You must configure the topic to 'compact' cleanup policy to avoid Kafka deleting your schemas after a week. Refer to Kafka documentation for more details on cleanup policies (io.confluent.kafka.schemaregistry.storage.KafkaStore:279)
 ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:81)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
        at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:411)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:79)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:105)
        at io.confluent.rest.Application.configureHandler(Application.java:324)
        at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:228)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:44)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: The retention policy of the schema topic _schemas is incorrect. Expected cleanup.policy to be 'compact' but it is delete
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.verifySchemaTopic(KafkaStore.java:284)
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:185)
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:122)
        at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:409)
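
The reported policy can be confirmed directly against the broker before attempting a fix. The following is a minimal sketch, assuming the Kafka broker pod is named kafka-0 and the standard Kafka CLI tools are on its PATH; adjust the pod name, namespace, and port to the actual deployment:

    # Hypothetical check: list the effective configs of _schemas and filter for the cleanup policy
    kubectl exec -n NAMESPACE kafka-0 -- kafka-configs --bootstrap-server localhost:9092 --describe --all --entity-type topics --entity-name _schemas | grep cleanup.policy

If the output shows cleanup.policy=delete, the schema registry will keep refusing to start until the policy is changed back to 'compact'.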

Environment

Watchtower 1.2, 1.3

Cause

 

New metrics from SYSVIEW were enabled; they caused a flood of data records that overwhelmed Watchtower. To resolve the resulting lag, it was recommended to clean up the Kafka PVC. The steps that were followed are listed below and caused the _schemas topic to end up with the wrong cleanup policy.

  1. Watchtower is up
  2. Kafka at 1 replica
  3. Confluent schema registry at 1 replica
  4. Scale Kafka to 0 replicas
  5. Delete the Kafka PVC
  6. Scale Kafka back to 1 replica

Because the Kafka PVC was deleted and the broker came back with empty storage while the Confluent schema registry was still running, the _schemas topic was recreated with the broker's default cleanup.policy of 'delete' rather than the 'compact' policy the schema registry applies when it creates the topic itself. As long as the schema registry stayed up it kept working, but once it was rescaled (the cluster upgrade event) it re-verified the topic configuration and failed to start:

  1. Confluent schema registry 0 replicas
  2. Confluent schema registry 1 replicas 

    All other deployments could have been down. If Kafka and the Confluent schema registry follow the sequence above, the issue occurs regardless of the state of the other Kafka-dependent deployments.

 

 

Resolution

Method 1 - Fix the cleanup policy of the _schemas topic

  1. Scale down deployments to 0 replicas
    1. Scale confluent-deployment, data-insights-dbloader, data-insights-ingestor, datastream-hub-deployment, datastream-maas-deployment, ml-insights-profiler-alarm-manager, and ml-insights-profiler-notifier to 0
      kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-dbloader\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=0
  2. Scale down ADE sts to 0 replicas 
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=0
  3. Apply the job below
    kubectl apply -n NAMESPACE -f fix-confluent.yaml
    1. The logs of the pod created by this job will display the following expected, harmless error; the last line is what matters:

      log4j:ERROR Could not read configuration file from URL [file:/tmp/data/config/log4j.properties].
      java.io.FileNotFoundException: /tmp/data/config/log4j.properties (No such file or directory)
      ...
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Completed updating config for topic _schemas.

    2. fix-confluent.yaml (its source is not reproduced here; an illustrative sketch of what such a job typically contains follows these steps)
  4. Scale up deployments to 1 replica
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=1
  5. Scale up db-loader to 3 replicas, as the diagnostics dump indicates 3 running replicas
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "data-insights-dbloader"|awk -F ' ' '{print $1}') --replicas=3
  6. Scale up ADE to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=1
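
The source of fix-confluent.yaml is not reproduced in this article. As an illustration only, a job of this kind typically runs the Kafka CLI against the broker to switch the _schemas topic to the 'compact' cleanup policy. In the sketch below, the job name, container image, bootstrap service name, and port are assumptions that must be matched to the actual environment:

    # Illustrative sketch only - not the fix-confluent.yaml shipped with Watchtower
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: fix-confluent
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: fix-schemas-cleanup-policy
              # Assumed image providing the Kafka CLI; reuse the Kafka image already deployed in the cluster
              image: confluentinc/cp-kafka:7.4.0
              command:
                - /bin/sh
                - -c
                # Assumed Kafka bootstrap service name and port
                - >-
                  kafka-configs --bootstrap-server common-service-kafka:9092
                  --alter --entity-type topics --entity-name _schemas
                  --add-config cleanup.policy=compact

On success, kafka-configs prints "Completed updating config for topic _schemas.", which is the last line of the expected job output shown above.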

Method 2 - Kafka PVC deletion (note that this removes all data stored in Kafka)

  1. Scale down deployments to 0 replicas:
    1. Scale confluent-deployment, data-insights-dbloader, data-insights-ingestor, datastream-hub-deployment, datastream-maas-deployment, ml-insights-profiler-alarm-manager, and ml-insights-profiler-notifier to 0
      kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-dbloader\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=0
  2. Scale down ADE and Kafka sts to 0 replicas 
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep "kafka\|ml-insights-profiler-ade"|awk -F ' ' '{print $1}') --replicas=0
  3. Delete Kafka PVC 
    kubectl delete -n NAMESPACE pvc common-service-kafka-pvc-kafka-0
  4. Scale up Kafka to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep kafka|awk -F ' ' '{print $1}') --replicas=1
  5. Scale up deployments to 1 replica
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=1
  6. Scale up db-loader to 3 replicas, as the diagnostics dump indicates 3 running replicas
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "data-insights-dbloader"|awk -F ' ' '{print $1}') --replicas=3
  7. Scale up ADE to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=1
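
After either method, the Confluent schema registry should start cleanly. The following is a hypothetical post-fix check; the deployment name is taken from the scale commands above and the 8081 REST port is an assumption, so adjust both if the actual deployment differs:

    # Confirm the schema registry pod is Running and no longer crash-looping
    kubectl get pods -n NAMESPACE | grep confluent
    # Query the schema registry REST API; a healthy registry returns a JSON list of registered subjects
    kubectl port-forward -n NAMESPACE deploy/confluent-deployment 8081:8081 &
    curl -s http://localhost:8081/subjects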