Corfu data file corruption seen in corfu.9000.log: "Checksum mismatch detected while trying to read file" or "Can't parse metadata"

Article ID: 303324

Products

VMware NSX

Issue/Introduction

  • The Datastore service cluster is in a degraded state
  • The following errors are seen in the NSX Manager logs:
    • /var/log/corfu/corfu.9000.log shows one of the following errors:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file

OR:

<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
...
...
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    • /var/log/corfu/corfu-compactor-audit.log may show the message "Tried to get layout from <node with corruption IP>:9000 but failed by timeout"

    • /var/log/syslog reports errors such as the following:

      2026-05-19T05:01:23.622Z nsx-mgr01 NSX 3489 - - getClusterStatus: Error while fetching layout from 10.##.##.##:9000. Exception: 
      2026-05-19T05:01:23.994Z nsx-mgr01 NSX 2342 - - Connect Async 10.##.##.##:9000
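
The two corfu.9000.log signatures above can be searched for with a short script. The following is a minimal sketch, not a supported tool: the function names and the regular expression are illustrative assumptions, and only the message text and the log path come from this article.

```python
import re

# Substrings copied from the error messages quoted in this article.
SIGNATURES = {
    "checksum_mismatch": "Checksum mismatch detected while trying to read file",
    "metadata_parse_failure": "Can't parse metadata",
}

# Assumption: the segment path is followed by ". " as in the sample line.
SEGMENT_RE = re.compile(r"Segment File: (\S+?)\.\s")


def find_corruption(lines):
    """Return (line_number, signature_name, segment_file_or_None) per match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "DataCorruptionException" not in line:
            continue
        for name, text in SIGNATURES.items():
            if text in line:
                match = SEGMENT_RE.search(line)
                hits.append((number, name, match.group(1) if match else None))
    return hits


def scan_file(path="/var/log/corfu/corfu.9000.log"):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_corruption(handle)
```

Calling scan_file() on an NSX Manager returns an empty list when neither signature is present. A non-empty result only indicates that the messages were logged; confirm the state of the node before acting on it.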

Cause

When the Corfu server starts, it loads its binary data files and verifies their checksums. If a data file is corrupted, Corfu cannot recover it on its own. Running Corfu as a cluster of nodes protects against this scenario: when the Corfu data files on one node are corrupted, that node must be removed and replaced with a new node.
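
To illustrate the general idea of verifying a checksum when data is loaded, here is a generic sketch. It is not Corfu's on-disk format or code: the record layout (4-byte length, 4-byte CRC32, payload) and the function names are assumptions made only for this example.

```python
import struct
import zlib


def write_record(payload: bytes) -> bytes:
    """Frame a payload as: 4-byte length, 4-byte CRC32, then the payload."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload


def read_record(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the stored CRC32 differs."""
    length, expected = struct.unpack(">II", blob[:8])
    payload = blob[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch")
    return payload
```

A reader of such a file can detect that a record was altered on disk, but it cannot reconstruct the original bytes from the checksum alone, which is why a healthy copy on another node is needed.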

The issue can also occur in greenfield VCF deployments if the host running the NSX Managers has underlying storage issues, and if remnant data exists on the LUN.

Resolution

If more than one NSX Manager node in a three-node cluster shows indications of Corfu corruption, gather full NSX support bundles from all three NSX Managers, copy the resulting files from their default location (/image/vmware/nsx/file-store/), and open a case with Broadcom Support for further assistance. To generate a bundle, run the following command from the admin shell of each NSX Manager:

admin> get support-bundle file <nsx-manager-name>.tgz

 

If only one NSX Manager node shows signs of Corfu corruption, continue below. 

Process to remove and replace the corrupted Manager node:

  • Record the IP and FQDN of the Manager node with corrupted data
  • Confirm whether the NSX Manager VM to be redeployed is the orchestrator node:

    • nsxmanager1> get service install-upgrade
      Service name:      install-upgrade
      Service state:     running
      Enabled on:        #.#.#.#   <<< orchestrator node
  • If the node to be replaced is the orchestrator, move the orchestrator role to a Manager appliance that is not being replaced (NOTE: Run this command from one of the NSX Manager nodes that you are not going to replace or detach.)

    • nsxmanager2> set repository-ip

  • Record the configuration settings of the node to be replaced

  • Find the node UUID of the NSX Manager to be replaced

    • nsxmanager2> get nodes

  • Detach the node from the cluster via the admin CLI of a node that is not being replaced

    • nsxmanager2> detach node <UUID>

  • Check cluster status to ensure the node has been removed from all cluster services

    • nsxmanager2> get cluster status

  • Power off the detached node and delete it from disk via the vSphere UI

  • Deploy a new NSX Manager appliance via the NSX UI: System > Appliances > Add NSX Appliance

  • After the cluster stabilizes in the NSX UI, wait for repo_sync to complete (this process can take some time)
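
The orchestrator check in the procedure above can be sketched as a small parser over the text printed by get service install-upgrade. This is an illustrative helper only: the function names are hypothetical, it assumes the output contains an "Enabled on:" line as in the sample shown earlier, and it does not run any NSX command itself.

```python
import re


def orchestrator_ip(cli_output: str):
    """Return the address on the 'Enabled on:' line, or None if absent."""
    match = re.search(r"^\s*Enabled on:\s*(\S+)", cli_output, re.MULTILINE)
    return match.group(1) if match else None


def must_move_orchestrator(cli_output: str, node_to_replace: str) -> bool:
    """True when the node being replaced is the current orchestrator."""
    return orchestrator_ip(cli_output) == node_to_replace
```

For example, with output whose last line reads "Enabled on: 192.0.2.11" (a placeholder address), must_move_orchestrator(output, "192.0.2.11") is true, which means the orchestrator has to be changed from a surviving node before the detach step.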

Alternatively:

If the failed NSX Manager was auto-deployed through the NSX UI, the corrupt NSX Manager can instead be deleted in the NSX Manager UI under System > Appliances by using the Delete option for the corrupt node. A new NSX Manager node can then be deployed to return the cluster to three nodes.

Workaround:
If the Corfu data files on all three NSX Manager nodes are corrupted, the only recourse is to restore from a valid NSX backup. See the reference documentation at Restore a Backup.
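
The decision points in this article can be summarized in a few lines of code. This is one reading of the guidance for a three-node cluster, written as a hypothetical helper; the function name and return strings are not part of any NSX tooling. The "more than one node" rule also covers the all-nodes case, so collecting support bundles and opening a support case may still apply there.

```python
CLUSTER_SIZE = 3  # this article describes a three-node NSX Manager cluster


def next_step(corrupted_nodes: int) -> str:
    """Map the number of Managers showing Corfu corruption to the guidance."""
    if not 0 <= corrupted_nodes <= CLUSTER_SIZE:
        raise ValueError("corrupted_nodes must be between 0 and 3")
    if corrupted_nodes == 0:
        return "no action"
    if corrupted_nodes == 1:
        return "detach and replace the corrupted node"
    if corrupted_nodes == CLUSTER_SIZE:
        return "restore from a valid NSX backup"
    return "collect support bundles from all nodes and open a support case"
```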