Corfu data file corruption seen in corfu.9000.log: "Checksum mismatch detected while trying to read file" or "Can't parse metadata"

Article ID: 303324

Products

VMware NSX

Issue/Introduction

This article describes how to recover a Corfu cluster after data corruption on one or more cluster members.

Symptoms:

/var/log/corfu/corfu.9000.log shows the following error:

2024-09-28T03:46:37.880Z | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error:
org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file.

Or the following error:

2024-09-28T03:46:37.880Z | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666

/var/log/corfu/corfu-compactor-audit.log may also show messages such as "Tried to get layout from <node with corruption IP>:9000 but failed by timeout".
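
The two error signatures above can be searched for with a short script. The following is a minimal sketch in Python, assuming the log lines match the samples shown in this article. The names `find_corfu_corruption` and `scan_log` and the regular expression are illustrative only and are not part of any NSX tooling.

```python
import re

# Signatures copied from the sample log lines in this article.
CHECKSUM_MSG = "Checksum mismatch detected while trying to read file"
METADATA_RE = re.compile(
    r"Can't parse metadata\. Segment File: (?P<segment>\S+)\. "
    r"File size: (?P<size>\d+)\. File position: (?P<position>\d+)"
)


def find_corfu_corruption(lines):
    """Return a list of dicts, one per corruption error found in the log lines."""
    findings = []
    for number, line in enumerate(lines, start=1):
        # Both signatures are reported as a DataCorruptionException.
        if "DataCorruptionException" not in line:
            continue
        if CHECKSUM_MSG in line:
            findings.append({"line": number, "kind": "checksum_mismatch"})
            continue
        match = METADATA_RE.search(line)
        if match:
            findings.append({
                "line": number,
                "kind": "metadata_parse_failure",
                "segment": match.group("segment"),
                "size": int(match.group("size")),
                "position": int(match.group("position")),
            })
    return findings


def scan_log(path="/var/log/corfu/corfu.9000.log"):
    """Scan a Corfu log file on disk (hypothetical helper, path from this article)."""
    with open(path, errors="replace") as handle:
        return find_corfu_corruption(handle)
```

Calling `scan_log()` on a copy of the log taken from each NSX Manager gives a quick view of which nodes report either signature, which helps decide which branch of the resolution below applies.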

Environment

VMware NSX-T

VMware NSX

Cause

When the Corfu server starts, it loads its binary data files and verifies their checksums. If a data file is corrupted (for example, because a file was modified or a disk failed), Corfu cannot recover that file on its own.

The Corfu cluster protects against this scenario because the data is also held on the other nodes. If the Corfu data files on a node are corrupted, that node must be removed and replaced with a new node.

Resolution

If more than one NSX Manager node in a 3-node cluster shows signs of Corfu corruption, gather full NSX support bundles from all three NSX Managers by running the following command from the admin shell on each node:

admin> get support-bundle file <nsx-manager-name>.tgz

Copy the resulting files from the default location where they are written (/image/vmware/nsx/file-store/), then open a case with Broadcom Support for further assistance.

If only one NSX Manager node shows signs of Corfu corruption, continue below.

Resolution:

Process to remove and replace the corrupted Manager node:

1. Record the corrupted node's IP address and FQDN.
2. Detach the corrupted node to form a 2-node cluster. Run the following detach command from the admin shell on one of the healthy (not corrupted) NSX Manager nodes. See https://docs.vmware.com/en/VMware-Cloud-Foundation/5.2/vcf-admin/GUID-3FA1E29E-50AD-4AF3-B46E-24A623D7B4B1.html for additional guidance on running this command. The failed_node_uuid can be found in the 'Manager' service section of the 'get cluster status' command output.

admin> detach node failed_node_uuid

3. Shut down and delete the corrupted NSX Manager node.
4. Deploy a replacement Manager node via OVA deployment through the vCenter Server.
5. Join the newly deployed node to the existing 2-node cluster by running the following command from the admin shell of the newly deployed NSX Manager. See https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.2/installation/GUID-9F3C8273-FA5F-41C8-85CA-436F8D34977D.html for additional guidance on running this join command.

admin> join <Manager-IP> cluster-id <cluster-id> username <Manager-username> password <Manager-password> thumbprint <Manager-thumbprint>
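
Because the join command takes five positional values, a missing or mistyped value is easy to overlook. The following is a minimal Python sketch that assembles the command line from the template above and rejects obviously malformed input. The function name `build_join_command` is hypothetical, and the rule that no value may contain whitespace is an assumption made for this sketch, not a documented NSX constraint.

```python
import ipaddress


def build_join_command(manager_ip, cluster_id, username, password, thumbprint):
    """Return the 'join' command line for the NSX admin shell as a string."""
    # Raises ValueError if manager_ip is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(manager_ip)

    fields = {
        "cluster-id": cluster_id,
        "username": username,
        "password": password,
        "thumbprint": thumbprint,
    }
    for name, value in fields.items():
        # Assumption for this sketch: values are single, non-empty tokens.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{name} must be non-empty and contain no whitespace")

    parts = ["join", manager_ip]
    for name, value in fields.items():
        parts.extend([name, value])
    return " ".join(parts)
```

The returned string is meant to be pasted at the `admin>` prompt of the newly deployed node. Since it contains the password in clear text, it should not be written to logs or shared files.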

Alternatively:

If the failed NSX Manager was auto-deployed through the NSX UI, the corrupted NSX Manager can instead be deleted in the NSX Manager UI under System > Appliances by using the Delete option for that node. A new NSX Manager node can then be deployed to return the cluster to 3 nodes.

Workaround:
If the Corfu data files on all 3 NSX Manager nodes are corrupted, the only recourse is to restore from a valid NSX backup.
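
The choice of recovery path in this article depends only on how many of the three NSX Manager nodes show Corfu corruption. The following minimal Python sketch restates that decision, assuming a 3-node cluster as described here. The function name `recommended_action` and the returned wording are illustrative.

```python
CLUSTER_SIZE = 3  # this article assumes a 3-node NSX Manager cluster


def recommended_action(corrupted_nodes):
    """Map the number of corrupted Manager nodes to the path described in this article."""
    if not 0 <= corrupted_nodes <= CLUSTER_SIZE:
        raise ValueError(f"corrupted_nodes must be between 0 and {CLUSTER_SIZE}")
    if corrupted_nodes == 0:
        return "No Corfu corruption detected; no action from this article applies."
    if corrupted_nodes == 1:
        return "Detach the corrupted node, delete it, deploy a replacement, and join it to the cluster."
    if corrupted_nodes == CLUSTER_SIZE:
        return ("Collect support bundles from all three nodes and open a support case; "
                "the only recourse is to restore from a valid NSX backup.")
    return "Collect support bundles from all three nodes and open a support case."
```

For example, `recommended_action(1)` returns the detach-and-replace path, while `recommended_action(2)` returns the support-case path.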

Additional Information

https://docs.vmware.com/en/VMware-Cloud-Foundation/5.2/vcf-admin/GUID-3FA1E29E-50AD-4AF3-B46E-24A623D7B4B1.html
