Watchtower: Confluent fails to start because the retention policy of the schema topic _schemas is 'delete' instead of 'compact'

Article ID: 407506


Products

WatchTower Platform

Issue/Introduction

After a cluster upgrade, the Confluent schema registry fails to come back up with the following errors:

 WARN The replication factor of the schema topic _schemas is less than the desired one of 3. If this is a production environment, it's crucial to add more brokers and increase the replication factor of the topic. (io.confluent.kafka.schemaregistry.storage.KafkaStore:263)
 ERROR The retention policy of the schema topic _schemas is incorrect. You must configure the topic to 'compact' cleanup policy to avoid Kafka deleting your schemas after a week. Refer to Kafka documentation for more details on cleanup policies (io.confluent.kafka.schemaregistry.storage.KafkaStore:279)
 ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:81)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
        at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:411)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:79)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:105)
        at io.confluent.rest.Application.configureHandler(Application.java:324)
        at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:228)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:44)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: The retention policy of the schema topic _schemas is incorrect. Expected cleanup.policy to be 'compact' but it is delete
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.verifySchemaTopic(KafkaStore.java:284)
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:185)
        at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:122)
        at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:409)
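
The reported policy can be confirmed directly against the broker before attempting a fix. The following is a minimal sketch, assuming the Kafka broker pod is named kafka-0 and the standard Kafka CLI tools are on its PATH; adjust the pod name, namespace, and port to the actual deployment:

    # Hypothetical check: list the effective configs of _schemas and filter for the cleanup policy
    kubectl exec -n NAMESPACE kafka-0 -- kafka-configs --bootstrap-server localhost:9092 --describe --all --entity-type topics --entity-name _schemas | grep cleanup.policy

If the output shows cleanup.policy=delete, the schema registry will keep refusing to start until the policy is changed back to 'compact'.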

Environment

Watchtower 1.2, 1.3

Cause

 

New metrics from SYSVIEW were enabled; they caused a flood of data records that overwhelmed Watchtower. To resolve the resulting lag, it was recommended to clean up the Kafka PVC. The steps that were followed are listed below and caused the _schemas topic to end up with the wrong cleanup policy.

  1. Watchtower is up
  2. Kafka at 1 replica
  3. Confluent schema registry at 1 replica
  4. Scale Kafka to 0 replicas
  5. Delete the Kafka PVC
  6. Scale Kafka back to 1 replica

Because the Kafka PVC was deleted and the broker came back with empty storage while the Confluent schema registry was still running, the _schemas topic was recreated with the broker's default cleanup.policy of 'delete' rather than the 'compact' policy the schema registry applies when it creates the topic itself. As long as the schema registry stayed up it kept working, but once it was rescaled (the cluster upgrade event) it re-verified the topic configuration and failed to start:

  1. Confluent schema registry 0 replicas
  2. Confluent schema registry 1 replicas 

    All other deployments could have been down. If Kafka and the Confluent schema registry follow the sequence above, the issue occurs regardless of the state of the other Kafka-dependent deployments.

 

 

Resolution

Method 1 - Fix the cleanup policy of the _schemas topic

  1. Scale down deployments to 0 replicas
    1. Scale confluent-deployment, data-insights-dbloader, data-insights-ingestor, datastream-hub-deployment, datastream-maas-deployment, ml-insights-profiler-alarm-manager, and ml-insights-profiler-notifier to 0
      kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-dbloader\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=0
  2. Scale down ADE sts to 0 replicas 
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=0
  3. Apply the job below
    kubectl apply -n NAMESPACE -f fix-confluent.yaml
    1. The logs of the pod created by this job will display the following expected, harmless error; the last line is what matters:

      log4j:ERROR Could not read configuration file from URL [file:/tmp/data/config/log4j.properties].
      java.io.FileNotFoundException: /tmp/data/config/log4j.properties (No such file or directory)
      ...
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Completed updating config for topic _schemas.

    2. fix-confluent.yaml (its source is not reproduced here; an illustrative sketch of what such a job typically contains follows these steps)
  4. Scale up deployments to 1 replica
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=1
  5. Scale up db-loader to 3 replicas, as the diagnostics dump indicates 3 running replicas
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "data-insights-dbloader"|awk -F ' ' '{print $1}') --replicas=3
  6. Scale up ADE to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=1
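
The source of fix-confluent.yaml is not reproduced in this article. As an illustration only, a job of this kind typically runs the Kafka CLI against the broker to switch the _schemas topic to the 'compact' cleanup policy. In the sketch below, the job name, container image, bootstrap service name, and port are assumptions that must be matched to the actual environment:

    # Illustrative sketch only - not the fix-confluent.yaml shipped with Watchtower
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: fix-confluent
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: fix-schemas-cleanup-policy
              # Assumed image providing the Kafka CLI; reuse the Kafka image already deployed in the cluster
              image: confluentinc/cp-kafka:7.4.0
              command:
                - /bin/sh
                - -c
                # Assumed Kafka bootstrap service name and port
                - >-
                  kafka-configs --bootstrap-server common-service-kafka:9092
                  --alter --entity-type topics --entity-name _schemas
                  --add-config cleanup.policy=compact

On success, kafka-configs prints "Completed updating config for topic _schemas.", which is the last line of the expected job output shown above.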

Method 2 - Kafka PVC deletion (note that this removes all data stored in Kafka)

  1. Scale down deployments to 0 replicas:
    1. Scale confluent-deployment, data-insights-dbloader, data-insights-ingestor, datastream-hub-deployment, datastream-maas-deployment, ml-insights-profiler-alarm-manager, and ml-insights-profiler-notifier to 0
      kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-dbloader\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=0
  2. Scale down ADE and Kafka sts to 0 replicas 
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep "kafka\|ml-insights-profiler-ade"|awk -F ' ' '{print $1}') --replicas=0
  3. Delete Kafka PVC 
    kubectl delete -n NAMESPACE pvc common-service-kafka-pvc-kafka-0
  4. Scale up Kafka to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep kafka|awk -F ' ' '{print $1}') --replicas=1
  5. Scale up deployments to 1 replica
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "confluent-deployment\|data-insights-ingestor\|datastream-hub-deployment\|datastream-maas-deployment\|ml-insights-profiler-alarm-manager\|ml-insights-profiler-notifier"|awk -F ' ' '{print $1}') --replicas=1
  6. Scale up db-loader to 3 replicas, as the diagnostics dump indicates 3 running replicas
    kubectl scale -n NAMESPACE deploy $(kubectl get deploy -n NAMESPACE|grep "data-insights-dbloader"|awk -F ' ' '{print $1}') --replicas=3
  7. Scale up ADE to 1 replica
    kubectl scale -n NAMESPACE sts $(kubectl get sts -n NAMESPACE|grep ml-insights-profiler-ade|awk -F ' ' '{print $1}') --replicas=1
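
After either method, the Confluent schema registry should start cleanly. The following is a hypothetical post-fix check; the deployment name is taken from the scale commands above and the 8081 REST port is an assumption, so adjust both if the actual deployment differs:

    # Confirm the schema registry pod is Running and no longer crash-looping
    kubectl get pods -n NAMESPACE | grep confluent
    # Query the schema registry REST API; a healthy registry returns a JSON list of registered subjects
    kubectl port-forward -n NAMESPACE deploy/confluent-deployment 8081:8081 &
    curl -s http://localhost:8081/subjects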