Troubleshooting VMware NSX Datastore (CorfuDB) Issues

Article ID: 378470

Products

VMware NSX

Issue/Introduction

CorfuDB issue symptoms:

  • The UI has become inaccessible (e.g., an "Application not ready" error after logging in).
  • /config or /nonconfig utilization is above single-digit percentages or is growing consistently.
  • CBM and/or Corfu is unresponsive or showing OOMs. To check, run grep -i "out of memory" /var/log/cbm/tanuki.log and grep -i "out of memory" /var/log/corfu/tanuki.log.
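
The OOM check in the last symptom can be wrapped in a small helper that reports a hit count per log file. This is a minimal sketch rather than an NSX utility: count_oom is a hypothetical function name, and the only paths assumed are the two tanuki.log files named above.

```shell
#!/bin/sh
# Sketch only: count_oom is a hypothetical helper, not part of NSX.
# For each log file given, print "<path> <count>", where <count> is the
# number of lines containing "out of memory" (case-insensitive).
count_oom() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            # grep -c prints the number of matching lines (0 when none match).
            printf '%s %s\n' "$log" "$(grep -ci 'out of memory' "$log")"
        else
            printf '%s unreadable\n' "$log"
        fi
    done
}
```

Example invocation on a manager node: count_oom /var/log/cbm/tanuki.log /var/log/corfu/tanuki.log. A non-zero count on either file matches the OOM symptom described above.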

Environment

VMware NSX

Resolution

When troubleshooting CorfuDB, start with the following:

  • Check the /config partition utilization.
    • root@nsx-mngr-01:~# df -h

      • Any usage above single digits (10% or more) should be considered a red flag.
      • If the usage is not the same across all 3 NSX-T Manager VMs, that should also be considered a red flag.
    • One or more Corfu tables may inflate, causing the checkpoint to fail (table size > 2G).
      • Corfu checkpoint and trim logs are written into /var/log/corfu/corfu-compactor.log.
      • Search the corfu-compactor logs for the "completed checkpoint" and "Trim completed" log messages.
      • Checkpoint happens for every Corfu table individually, and Trim happens once for the whole address space. Currently, the log message prints only the UUID of the Corfu table that completed its checkpoint.
      • In problematic cases such as AD table bloat, the checkpoint for the AD-specific tables has sometimes taken ~1 hr and then failed.
      • Use the command below to quickly find out whether any Corfu table is >= 2G in size and taking longer to checkpoint.
        • grep "completed checkpoint" corfu-compactor-audit.* | awk '$19 > 1000000 {print $0}'
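
The /config utilization check at the start of this step can be sketched as a small filter over df output. This is a minimal sketch, not an NSX tool: flag_usage is a hypothetical helper name, the default threshold of 10 simply mirrors the single-digit guidance, and the field positions assume the POSIX "df -P" column layout.

```shell
#!/bin/sh
# Sketch only: flag_usage is a hypothetical helper, not part of NSX.
# Reads `df -P` output on stdin and prints "<mount> <pct>%" for /config
# and /nonconfig when usage is at or above THRESHOLD percent (default 10).
flag_usage() {
    threshold="${1:-10}"
    awk -v t="$threshold" '
        # Skip the header row; $5 is Capacity and $6 is the mount point.
        NR > 1 && ($6 == "/config" || $6 == "/nonconfig") {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= t) print $6, pct "%"
        }'
}
```

Example invocation: df -P | flag_usage 10. Running it on each of the three managers and comparing the results side by side also covers the check that usage should be the same across nodes.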
  • Check the LAYOUT_CURRENT.ds Epoch number across all 3 NSX-T Managers. The Epoch number should be the same on all three manager nodes.
    • cat /config/corfu/LAYOUT_CURRENT.ds

      • {
          "layoutServers": [
            "###.###.###.###:9000",
            "###.###.###.###:9000",
            "###.###.###.###:9000"
          ],
          "sequencers": [
            "###.###.###.###:9000",
            "###.###.###.###:9000",
            "###.###.###.###:9000"
          ],
          "segments": [
            {
              "replicationMode": "CHAIN_REPLICATION",
              "start": 0,
              "end": -1,
              "stripes": [
                {
                  "logServers": [
                    "###.###.###.###:9000",
                    "###.###.###.###:9000",
                    "###.###.###.###:9000"
                  ]
                }
              ]
            }
          ],
          "unresponsiveServers": [],
          "epoch": 143,
          "clusterId": "<UUID>"
        }

    • If the Epoch numbers differ across the 3 manager nodes, the LAYOUT_CURRENT.ds details should be evaluated further.
    • A high Epoch number (greater than 2000) could mean that there is an underlying storage or networking issue between the manager nodes. It does not necessarily mean that an issue is currently present, because the Epoch number never decrements.
      • An increase in the Epoch number within LAYOUT_CURRENT.ds means that something has changed in the segments section or in the unresponsive servers list (possibly caused by a DBsync issue or a Corfu server losing cluster quorum), which could be the cause of the issue.
      • To determine whether the LAYOUTs are changing frequently:
        • root@nsx-mngr-01:~# ls -ltr /config/corfu/LAYOUT*

          • -rw-r----- 1 corfu corfu 604 Aug 23 15:50 /config/corfu/LAYOUTS_2365.ds
            -rw-r----- 1 corfu corfu 871 Aug 23 15:50 /config/corfu/LAYOUTS_2366.ds
            -rw-r----- 1 corfu corfu 610 Aug 23 15:50 /config/corfu/LAYOUTS_2367.ds
            -rw-r----- 1 corfu corfu 604 Aug 23 18:05 /config/corfu/LAYOUTS_2368.ds
            -rw-r----- 1 corfu corfu 871 Aug 23 18:05 /config/corfu/LAYOUTS_2369.ds
            -rw-r----- 1 corfu corfu 610 Aug 23 18:05 /config/corfu/LAYOUTS_2370.ds
            -rw-r----- 1 corfu corfu 604 Aug 23 19:40 /config/corfu/LAYOUTS_2371.ds
            -rw-r----- 1 corfu corfu 871 Aug 23 19:40 /config/corfu/LAYOUTS_2372.ds
            -rw-r----- 1 corfu corfu 610 Aug 23 19:40 /config/corfu/LAYOUTS_2373.ds
            -rw-r----- 1 corfu corfu 604 Aug 23 20:45 /config/corfu/LAYOUTS_2374.ds
            -rw-r----- 1 corfu corfu 871 Aug 23 20:45 /config/corfu/LAYOUTS_2375.ds
            -rw-r----- 1 corfu corfu 610 Aug 23 20:45 /config/corfu/LAYOUT_CURRENT.ds
            -rw-r----- 1 corfu corfu 610 Aug 23 20:45 /config/corfu/LAYOUTS_2376.ds
        • The above output shows that the layout has changed several times on this day, so further investigation will be needed.
        • For any node listed in the "unresponsiveServers" section, run "service corfu-server status" to make sure the corfu-server service is running.
            • If the corfu-server service is stopped, start it with "service corfu-server start".
            • Watch the corfu.9000.log with "tail -F /var/log/corfu/corfu.9000.log" and wait for the service to initialize.
              • It may take several moments for the service to start and the logging to scroll. Look for stack traces that may come up in the log if the corfu-server service crashes or fails to start.
            • An unresponsive Corfu server could mean there is a problem with the underlying infrastructure (i.e. network and/or storage) and should be investigated.
    • There might be other conditions that could cause a Corfu server to go unresponsive.
      • For example, the Corfu JVM might hit an OOM condition. To check for a Corfu OOM condition, run the command below.
        • grep -i "out of memory" /var/log/corfu/tanuki.log
      • Check the corfu-server status:
        • root@nsx-mngr-01:~# service corfu-server status
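
The Epoch comparison and the layout-churn check described above can be sketched with a few shell helpers. This is a sketch under stated assumptions, not an NSX utility: layout_epoch, epochs_match, and recent_layouts are hypothetical names. The Epoch is pulled out with sed rather than a JSON parser, and the -maxdepth and -mmin options assume GNU find.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of NSX.

# layout_epoch FILE: print the numeric "epoch" value from a Corfu layout file.
layout_epoch() {
    sed -n 's/.*"epoch"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# epochs_match FILE...: return 0 only if every file reports the same Epoch,
# 1 if any file differs, and 2 if the first file has no readable Epoch.
epochs_match() {
    first=$(layout_epoch "$1")
    [ -n "$first" ] || return 2
    for f in "$@"; do
        [ "$(layout_epoch "$f")" = "$first" ] || return 1
    done
    return 0
}

# recent_layouts DIR MINUTES: count LAYOUTS_*.ds files in DIR that were
# modified within the last MINUTES minutes (assumes GNU find).
recent_layouts() {
    find "$1" -maxdepth 1 -name 'LAYOUTS_*.ds' -mmin "-$2" | wc -l | tr -d ' '
}
```

One possible workflow is to copy LAYOUT_CURRENT.ds from each manager into local files (the names mgr1.ds, mgr2.ds, and mgr3.ds are placeholders) and run epochs_match mgr1.ds mgr2.ds mgr3.ds. On a manager node, recent_layouts /config/corfu 1440 counts layout files written in the last 24 hours. A count in the double digits, as in the listing above, suggests the layout is changing frequently.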

 

 
