Troubleshooting NSX Datastore (CorfuDB) Issues

Products

VMware NSX

Issue/Introduction

CorfuDB issue symptoms:

The UI has become inaccessible (e.g. an "Application not ready" error after logging in)
/config or /nonconfig utilization is 10% or higher, or is growing consistently
CBM and/or Corfu is unresponsive or logging out-of-memory (OOM) errors (check with grep -i "out of memory" /var/log/cbm/tanuki.log and grep -i "out of memory" /var/log/corfu/tanuki.log)

Environment

VMware NSX

Resolution

When troubleshooting CorfuDB, start with the following:

Check /config partition utilization.
- root@nsx-mngr-01:~# df -h
  - Usage of 10% or more should be considered a red flag.
  - Usage that differs across the 3 NSX-T Manager VMs should also be considered a red flag.
- Check whether one or more Corfu tables have inflated (table size > 2G), which causes the checkpoint to fail.
  - Corfu checkpoint and trim logs are written to /var/log/corfu/corfu-compactor.log.
  - Search the corfu-compactor logs for the "completed checkpoint" and "Trim completed" log messages.
  - A checkpoint runs for every Corfu table individually, while a trim runs once for the whole address space. Currently, the log message prints only the UUID of the Corfu table that completed its checkpoint.
  - In problematic cases such as AD table bloat, the checkpoint for the AD-specific tables has been seen to take ~1 hr and then fail.
  - Use the command below to quickly find out whether any Corfu table is >= 2G in size and taking a long time to checkpoint.
    - grep "completed checkpoint" corfu-compactor-audit.* | awk '$19 > 1000000 {print $0}'
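The partition check above can be scripted so that the same test is applied on every manager node. The following is a minimal sketch, not an official tool: the function name flag_partition_usage is hypothetical, and it assumes the POSIX "df -P" column layout (Capacity in column 5, mount point in column 6).

```shell
# Hypothetical helper (not part of NSX): reads `df -P` output on stdin and
# prints the /config and /nonconfig mount points whose usage is 10% or more.
flag_partition_usage() {
    awk 'NR > 1 && ($6 == "/config" || $6 == "/nonconfig") {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= 10) print $6, $5
    }'
}

# Usage on a manager node:
#   df -P | flag_partition_usage
```

No output means both partitions are still in single-digit usage. Running it on all 3 managers and comparing the results also covers the check for usage that differs between nodes.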
Check the LAYOUT_CURRENT.ds Epoch number across all 3 NSX-T Managers. The Epoch number should be the same on all three manager nodes.
- cat /config/corfu/LAYOUT_CURRENT.ds
  - "layoutServers": [
        "###.###.###.###:9000",
        "###.###.###.###:9000",
        "###.###.###.###:9000"
    ],
    "sequencers": [
        "###.###.###.###:9000",
        "###.###.###.###:9000",
        "###.###.###.###:9000"
    ],
    "segments": [
        {
          "replicationMode": "CHAIN_REPLICATION",
          "start": 0,
          "end": -1,
          "stripes": [
            {
              "logServers": [
                "###.###.###.###:9000",
                "###.###.###.###:9000",
                "###.###.###.###:9000"
              ]
            }
          ]
        }
    ],
    "unresponsiveServers": [],
    "epoch": 143,
    "clusterId": "<UUID>"
    }
- If the Epoch numbers differ across the 3 manager nodes, evaluate the LAYOUT_CURRENT.ds details further.
- A high Epoch number (greater than 2000) could indicate an underlying storage or networking issue between the manager nodes. It does not necessarily mean that an issue is currently present, because the Epoch number never decrements.
  - An increase in the Epoch number within LAYOUT_CURRENT.ds means that something has changed in the segments section or that a server has become unresponsive (possibly because of a DBsync issue or a Corfu server losing cluster quorum). Such a change could be the cause of the issue.
  - To determine whether the LAYOUTs are changing frequently:
    - root@nsx-mngr-01:~# ls -ltr /config/corfu/LAYOUT*
      - -rw-r----- 1 corfu corfu 604 Aug 23 15:50 /config/corfu/LAYOUTS_2365.ds
        -rw-r----- 1 corfu corfu 871 Aug 23 15:50 /config/corfu/LAYOUTS_2366.ds
        -rw-r----- 1 corfu corfu 610 Aug 23 15:50 /config/corfu/LAYOUTS_2367.ds
        -rw-r----- 1 corfu corfu 604 Aug 23 18:05 /config/corfu/LAYOUTS_2368.ds
        -rw-r----- 1 corfu corfu 871 Aug 23 18:05 /config/corfu/LAYOUTS_2369.ds
        -rw-r----- 1 corfu corfu 610 Aug 23 18:05 /config/corfu/LAYOUTS_2370.ds
        -rw-r----- 1 corfu corfu 604 Aug 23 19:40 /config/corfu/LAYOUTS_2371.ds
        -rw-r----- 1 corfu corfu 871 Aug 23 19:40 /config/corfu/LAYOUTS_2372.ds
        -rw-r----- 1 corfu corfu 610 Aug 23 19:40 /config/corfu/LAYOUTS_2373.ds
        -rw-r----- 1 corfu corfu 604 Aug 23 20:45 /config/corfu/LAYOUTS_2374.ds
        -rw-r----- 1 corfu corfu 871 Aug 23 20:45 /config/corfu/LAYOUTS_2375.ds
        -rw-r----- 1 corfu corfu 610 Aug 23 20:45 /config/corfu/LAYOUT_CURRENT.ds
        -rw-r----- 1 corfu corfu 610 Aug 23 20:45 /config/corfu/LAYOUTS_2376.ds
    - The above output shows that the layout has changed several times on this day, so further investigation will be needed.
    - For any node(s) listed in the "unresponsiveServers" section, run "service corfu-server status" to make sure the corfu-server service is running.
      - If the corfu-server service is stopped, start it with "service corfu-server start".
      - Watch the log with "tail -F /var/log/corfu/corfu.9000.log" and wait for the service to initialize.
      - It may take several moments for the service to start and for the log to begin scrolling. Look for stack traces in the log, which indicate that the corfu-server service has crashed or failed to start.
      - An unresponsive Corfu server could mean there is a problem with the underlying infrastructure (i.e. network and/or storage), which should be investigated.
- Other conditions can also cause a Corfu server to become unresponsive.
  - For example, the Corfu JVM might hit an OOM condition. To check for a Corfu OOM condition, run the command below.
    - grep -i "out of memory" /var/log/corfu/tanuki.log
  - Check for corfu-server status:
    - root@nsx-mngr-01:~# service corfu-server status
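The Epoch comparison and the layout-change check described in this section can be scripted as follows. This is a sketch under stated assumptions rather than a supported tool: the function names layout_epoch and layout_changes_per_day are hypothetical, the layout file is assumed to contain an "epoch": <number> field as in the sample above, and GNU find (for -printf) is assumed to be available on the manager node.

```shell
# Hypothetical helper: print the epoch recorded in a layout file.
# Assumes the file contains a JSON field of the form "epoch": <number>.
layout_epoch() {
    sed -n 's/.*"epoch"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Hypothetical helper: count LAYOUTS_<epoch>.ds files per modification date
# in the given directory. Many files on one date suggest frequent changes.
layout_changes_per_day() {
    find "$1" -maxdepth 1 -name 'LAYOUTS_*.ds' -printf '%TY-%Tm-%Td\n' \
        | sort | uniq -c
}

# Usage on each manager node:
#   layout_epoch /config/corfu/LAYOUT_CURRENT.ds
#   layout_changes_per_day /config/corfu
```

Run layout_epoch on all 3 managers and compare the values by hand; they should match. The per-day counts give the same information as reading the "ls -ltr" timestamps, in a form that is easier to compare between days.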

Online Diagnostic System (ODS) Documentation: Debugging NSX at Runtime

ODS CLI command:

nsx-mngr-01> get runbook CorfuServer help
Mon Oct 02 2023 UTC 15:40:14.189
Runbook ID     : CorfuServer
Descrption     : Corfu Server runbook to find server side issues.
Parameters
    Name           : lookback_days
    Title          : Specify a time window
    Constraint     : <integer>
    Default        : 1
 
    Name           : lookback_hours
    Title          : Specify a time window
    Constraint     : <integer>
    Default        : 0

nsx-mngr-01> start invocation runbook CorfuServer runbook-arg --lookback_days <NUM_OF_DAYS> --lookback_hours <NUM_OF_HOURS>

nsx-mngr-01> start invocation runbook CorfuServer runbook-arg --lookback_days 2 --lookback_hours 8
Runbook Invocation Report
 
Invocation ID   : 72fab7c6-####-####-####-488fbdc34bdb
Timestamp       : 2023-10-02 15:43:14
System Info
    Host Name       : nsx-mngr-01
    OS Name         : Linux
    OS Version      : 5.15.92-nn12-server
    Arch            : x86_64
Runbook Info
    Runbook ID      : CorfuServer
    Version         : 1.0
    Publisher       : VMware, Inc.
Report Type     : VALID
Conclusion      : Finished running the CorfuServer Runbook.
Recommendation  : If there is any failure in the runbook steps, please collect the support bundles and reach out to the support team <https://www.vmware.com/support.html>.
Artifact Bundle : <none>
Steps
 
    Step Number     : 1
    Step Action     : This step checks Corfu Layout changes in the given time window (default is 24h)
    Step Result     : The result of the Corfu Layout Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': 'Layout changes are normal. Found 0 layout changes during the last 56.0 hours. (Thresholds are bad_node_unresponsive_percentage: 50%, unstable_layout_changes_per_hour: 10).', 'data': {'detected_layout_changes_per_hour': 0.0}}
 
    Step Number     : 2
    Step Action     : Check /var/log/stats/ping.stats.
    Step Result     : The result of the Infra Ping Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': 'Infra ping stats are normal (below thresholds packet_loss_threshold_percentage: 30%, avg_rtt_threshold_ms: 10).', 'data': '{}'}
 
    Step Number     : 3
    Step Action     : Check /var/log/stats/sys_threads.stats and analyze CPU load average in the given time window (default is 24h)
    Step Result     : The result of the Infra Load Average Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': 'Infra load averages are normal (below threshold 20).', 'data': '{}'}
 
    Step Number     : 4
    Step Action     : Check trim token movement in the given time window (default is 24h)
    Step Result     : The result of the Corfu Trim Token Movement Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': 'Detected a successful log trim.', 'data': {'last_trim_date': '2023-10-02 15:38:44.854000+00:00'}}
 
    Step Number     : 5
    Step Action     : Check fsync latency metrics in the given time window (default is 24h)
    Step Result     : The result of the Corfu Fsync Latency Metrics Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': "Corfu metrics fsync disk latencies are normal (below thresholds {'0.5': '150000', '0.75': '175000', '0.95': '195000', '0.99': '200000'}).", 'data': '{}'}
 
    Step Number     : 6
    Step Action     : Check failure detector ping latency metrics in the given time window (default is 24h).
    Step Result     : The result of the Corfu Failure Detector Ping Latency Metrics Check is {'result': <Result.SUCCESS: 'SUCCESS'>, 'message': 'Corfu failure detector ping latencies are normal (below threshold 200.0ms).', 'data': '{}'}
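When the report is long, it can help to list only the steps that did not succeed. The sketch below assumes the report text has been saved to a file and that a successful step always contains the string Result.SUCCESS in its "Step Result" line, as in the output above. The function name failed_runbook_steps is hypothetical, and how a failing step is worded is an assumption, because the sample report shows only successful steps.

```shell
# Hypothetical helper: reads a saved runbook report on stdin and prints the
# step numbers whose "Step Result" line does not contain Result.SUCCESS.
failed_runbook_steps() {
    awk '
        /Step Number/ {
            split($0, parts, ":")
            step = parts[2]
            gsub(/[[:space:]]/, "", step)   # keep only the step number
        }
        /Step Result/ && $0 !~ /Result\.SUCCESS/ { print step }
    '
}

# Usage, assuming the report was saved as report.txt:
#   failed_runbook_steps < report.txt
```

An empty result means every step reported success. Any step number printed should be reviewed in the full report before collecting support bundles.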

Known Issues:

Additional Information

Handling Log Bundles for offline review with Broadcom support:

Collect Support Bundles for Troubleshooting NSX-T
Uploading files to cases on the Broadcom Support Portal
Creating and managing Broadcom support cases

Article ID: 378470