VCF Operations for Networks GUI shows high processing lag and indexer lag

Article ID: 435817

Products

VCF Operations for Networks

Issue/Introduction

  • VCF Operations for Networks GUI shows high processing lag and indexer lag.

    Refer to the GUI screenshot showing high lags.



  • The platform log at /var/log/arkin/hadoop-yarn/containers/application_#############_0004/container_#############_0004_01_000002/taskmanager.log shows the following errors and exceptions:

    YYYY-MM-DDT14:14:13.700Z INFO kafka.clients.FetchSessionHandler Source Data Fetcher for Source: SDMProcessSRC -> GenSDM -> Filter -> MetStoreMap -> (Sink: RAW_METRIC_SINK, Sink: FlinkKafkaProducer, async wait operator -> Timestamps/Watermarks -> Flat Map, Filter -> Map) (5/8)_1 handleError:445 [Consumer clientId=vrniflink-4, groupId=vrniflink] Error sending fetch request (sessionId=1450162592, epoch=1) to node 0: {}.
    org.apache.kafka.common.errors.DisconnectException: null

    The full error stack trace is as follows:

    YYYY-MM-DDT14:19:58.401Z ERROR runtime.taskexecutor.TaskExecutor flink-akka.actor.default-dispatcher-17 onFatalError:2112 Fatal error occurred in TaskExecutor akka.tcp://flink@localhost:39797/user/rpc/taskmanager_0.
    org.apache.flink.util.FlinkException: The TaskExecutor's registration at the ResourceManager akka.tcp://flink@localhost:34411/user/rpc/resourcemanager__ has been rejected: Rejected TaskExecutor registration at the ResourceManager because: The ResourceManager does not recognize this TaskExecutor.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor_ResourceManagerRegistrationListener.onRegistrationRejection(TaskExecutor.java:2293) _[flink-dist_2.12-1.14.6.jar:1.14.6]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor_ResourceManagerRegistrationListener.onRegistrationRejection(TaskExecutor.java:2248) _[flink-dist_2.12-1.14.6.jar:1.14.6]
            at org.apache.flink.runtime.taskexecutor.TaskExecutorToResourceManagerConnection.onRegistrationRejection(TaskExecutorToResourceManagerConnection.java:109) _[flink-dist_2.12-1.14.6.jar:1.14.6]
            at org.apache.flink.runtime.taskexecutor.TaskExecutorToResourceManagerConnection.onRegistrationRejection(TaskExecutorToResourceManagerConnection.java:40) _[flink-dist_2.12-1.14.6.jar:1.14.6]
            at org.apache.flink.runtime.registration.RegisteredRpcConnection.lambda_createNewRegistration_0(RegisteredRpcConnection.java:269) _[flink-dist_2.12-1.14.6.jar:1.14.6]
            at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) _[_:_]
            at java.util.concurrent.CompletableFuture_UniWhenComplete.tryFire(CompletableFuture.java:841) _[_:_]
            at java.util.concurrent.CompletableFuture_Completion.run(CompletableFuture.java:482) _[_:_]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda_handleRunAsync_4(AkkaRpcActor.java:455) _[flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) _[flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455) _[flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) _[flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) _[flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at scala.PartialFunction.applyOrElse_(PartialFunction.scala:122) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at scala.PartialFunction_OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at scala.PartialFunction_OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at scala.PartialFunction_OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.actor.Actor.aroundReceive_(Actor.scala:535) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_2bd0469e-3e4b-45ed-8fb4-ab5b8ca56b8d.jar:1.14.6]
            at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [_:_]
            at java.util.concurrent.ForkJoinPool_WorkQueue.topLevelExec(ForkJoinPool.java:1182) [_:_]
            at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [_:_]
            at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [_:_]
            at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [_:_]
    YYYY-MM-DDT14:19:58.408Z ERROR runtime.taskexecutor.TaskManagerRunner flink-akka.actor.default-dispatcher-17 onFatalError:330 Fatal error occurred while executing the TaskManager. Shutting it down...
    org.apache.flink.util.FlinkException: The TaskExecutor's registration at the ResourceManager akka.tcp://flink@localhost:34411/user/rpc/resourcemanager__ has been rejected: Rejected TaskExecutor registration at the ResourceManager because: The ResourceManager does not recognize this TaskExecutor.
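    To confirm that a given taskmanager.log contains the two signatures quoted above, a small shell helper along these lines may be useful. This is only a sketch: the function name count_lag_signatures and its output format are illustrative and not part of the product, and the log path must be supplied by you, since the application and container IDs differ per deployment.

```shell
# Sketch: count the two error signatures from this article in a taskmanager.log.
# count_lag_signatures is a hypothetical helper name, not a product command.
count_lag_signatures() {
  local log_file="$1"
  local disconnects rejections

  # grep -c prints the number of matching lines and -F treats the pattern as a
  # fixed string. grep exits non-zero when nothing matches, hence "|| true".
  disconnects=$(grep -cF -- 'org.apache.kafka.common.errors.DisconnectException' "$log_file" || true)
  rejections=$(grep -cF -- 'The ResourceManager does not recognize this TaskExecutor' "$log_file" || true)

  echo "DisconnectException=${disconnects} RegistrationRejected=${rejections}"
}
```

    For example, run count_lag_signatures taskmanager.log from the container log directory. Non-zero counts for both signatures are consistent with the symptoms described here, but they are not a diagnosis on their own.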

Environment

  • VCF Operations for Networks 6.14.0
  • VCF Operations for Networks 6.14.1

Cause

The root cause of this issue is unknown. However, the errors and exceptions above point to a problem with Kafka, so restarting the Kafka services should resolve the issue, after which the lags are expected to settle gradually.

Resolution

To resolve this issue, perform the following steps:

  1. Open an SSH session (for example, with PuTTY) to the platform node.

  2. Log in with the username support.

  3. Enter the following commands to stop the services:

    ub
    sudo systemctl stop kafka.service
    sudo systemctl stop zookeeper-server.service
  4. Wait for 5 minutes.

  5. Run the following commands to start the services:

     sudo systemctl start kafka.service
     sudo systemctl start zookeeper-server.service

     

  6. Wait 5 minutes, then run the following command to validate that all services are running and healthy:
    ./check-service-health.sh -p -d

    Note: All services should report as running and healthy when the above command is executed.

  7. Once all services show as running and healthy, wait 24 to 36 hours and observe the lags as they settle down.

  8. After 24 to 48 hours, the GUI should show that the lags have settled, and both system health and infrastructure health should show Good.

    Refer to the screenshot below from the VCF Operations for Networks GUI.
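
The stop, wait, start, and validate sequence in steps 3 to 6 can be sketched as a single shell function. This is a minimal sketch, not a supported script: restart_kafka_stack and the DRY_RUN variable are hypothetical names introduced here, while the systemctl commands, their order, and the 5-minute waits are copied from the steps above. It assumes you have already logged in as support and run ub, and that it is run from the directory where ./check-service-health.sh is available.

```shell
# Sketch of steps 3 to 6. restart_kafka_stack and DRY_RUN are hypothetical
# names; the commands and their order come from the resolution steps.
restart_kafka_stack() {
  local wait_seconds="${1:-300}"   # 300 s = the 5-minute waits in steps 4 and 6
  local run="${DRY_RUN:+echo}"     # if DRY_RUN is set, print commands instead of running them

  # Step 3: stop the services.
  $run sudo systemctl stop kafka.service || return 1
  $run sudo systemctl stop zookeeper-server.service || return 1

  # Step 4: wait.
  $run sleep "$wait_seconds"

  # Step 5: start the services.
  $run sudo systemctl start kafka.service || return 1
  $run sudo systemctl start zookeeper-server.service || return 1

  # Step 6: wait again, then validate service health.
  $run sleep "$wait_seconds"
  $run ./check-service-health.sh -p -d
}
```

Running DRY_RUN=1 restart_kafka_stack first prints the commands without executing them, which makes it easy to review the sequence before running restart_kafka_stack for real. The 24 to 48 hour observation period in steps 7 and 8 still applies afterwards.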