/var/log/syslog shows intermittent flapping of the manager service:
2025-02-23T13:36:08.826Z NSX_Manager MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7## of service MANAGER are down.
2025-02-23T13:36:13.826Z ambrcpnsxmg01 NSX 75472 MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7## of service MANAGER are up.
The proton restart log (/var/log/proton/proton_restart.log) shows GMLE safety violations:
INFO application-restartor restartor 2894022 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2678549 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2700052 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2711738 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (Disconnected from database) =====
INFO GmleClientNonBlockingOpsThread-1 RenewingState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="renew-state" subcomp="manager"] Failed to renew lease before lease expiration for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member de####-a5e4-####97a7-####86f#### of group 1####dcf-35####edd-f7####67fe##, invoking the client safety violation handler
INFO GmleClientNonBlockingOpsThread-1 LeaseState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="lease-state" subcomp="manager"] Stopping FSM, transitioning to StoppedState after safety violation for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member d####-a5e4-####-97a7-####6f26## of group 1f###56-7###-35###-bedd-f7a###7fe##
INFO application-restartor ContainerConfigServiceImpl 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Sending application restart request.
The vmkernel or vmkwarning logs (/var/run/log) show degraded storage performance:
Wa(180) vmkwarning: cpu71 WARNING: ScsiDeviceIO: 1780: Device naa.60002ac00########0002a## performance has deteriorated. I/O latency increased from average value of 793 microseconds to 16834 microseconds.
Check /var/log/corfu/corfu.9000.log on all three manager nodes for the following strings. The log shows the error:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file
Or error:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
...
...
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
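The log checks above can be partly automated. The following is a minimal Python sketch, not a supported tool: it looks for the message strings quoted in the excerpts above and extracts the two latency values from the vmkwarning message. The labels in SIGNATURES and the function names are hypothetical, chosen only for illustration; the search strings themselves are copied from the log lines in this article.

```python
import re

# Substrings copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not NSX terminology.
SIGNATURES = {
    "cluster_unavailable": 'eventType="cluster_unavailable"',
    "gmle_violation": "GMLE leadership safety violation handler triggered",
    "db_disconnect": "Disconnected from database",
    "lease_renew_failed": "Failed to renew lease before lease expiration",
    "storage_latency": "performance has deteriorated",
    "corfu_corruption": "DataCorruptionException",
    "corfu_invalid_tag": "Protocol message contained an invalid tag (zero)",
}

# Matches the latency sentence of the vmkwarning message shown above.
LATENCY_RE = re.compile(
    r"I/O latency increased from average value of (\d+) microseconds "
    r"to (\d+) microseconds"
)


def classify(line):
    """Return the labels of every signature found in one log line."""
    return [label for label, needle in SIGNATURES.items() if needle in line]


def latency_increase(line):
    """Return (old_us, new_us) from a latency warning, or None if absent."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def scan(lines):
    """Count how many lines matched each signature label."""
    counts = {}
    for line in lines:
        for label in classify(line):
            counts[label] = counts.get(label, 0) + 1
    return counts
```

To use it, open one of the log files named above (for example /var/log/corfu/corfu.9000.log, copied off the manager node) and pass the file object to scan(). A plain substring match is used rather than a full log parser because the excerpts only need to be detected, not decoded.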