NSX management cluster status Degraded due to IDPS_REPORTING down on one or more manager nodes

Products

VMware vDefend Firewall

Issue/Introduction

The NSX management cluster status shows as DEGRADED.
The IDPS_REPORTING service shows as "down" in the UI or in the output of the CLI command get cluster status.
However, checking the service status directly via CLI returns "running":
- Command to confirm the status from the NSX manager node: get service idps-reporting

Environment

NSX-T 3.X and NSX 4.X.

Cause

This occurs because the /nonconfig/diskonlycorfutable/ directory is missing on the manager node; it was not created when the node was created. This can be confirmed by the log entries below:

Error code MP370017 is reported on the NSX manager node in /var/log/syslog, as shown below:

2023-08-13T23:34:53.382Z ABCDEFG9842 NSX 1935793 - [nsx@6876 comp="nsx-manager" errorCode="MP370017" level="ERROR" subcomp="idps-reporting"] Exception occurred while creating tables - java.lang.Exception: org.corfudb.runtime.exceptions.unrecoverable.UnrecoverableCorfuError: java.lang.reflect.InvocationTargetException
2023-08-13T23:34:53.384Z ABCDEFG9842 NSX 1935793 - [nsx@6876 comp="nsx-manager" errorCode="MP370017" level="ERROR" subcomp="idps-reporting"] Exception occurred while creating tables - com.vmware.nsx.securitydataservice.common.SecurityDataServiceException: Failed to create corfu tables

In /var/log/idps-reporting/idps.log, the same error appears together with a "No such file or directory" message:

2023-08-13T11:42:22.812Z ERROR WrapperSimpleAppMain IDSEventDataServiceImpl 3309335 - [nsx@6876 comp="nsx-manager" errorCode="MP370017" level="ERROR" subcomp="idps-reporting"] Exception occurred while creating tables - java.lang.Exception: org.corfudb.runtime.exceptions.unrecoverable.UnrecoverableCorfuError: java.lang.reflect.InvocationTargetException
java.lang.Exception: org.corfudb.runtime.exceptions.unrecoverable.UnrecoverableCorfuError: java.lang.reflect.InvocationTargetException

Caused by: org.corfudb.runtime.exceptions.unrecoverable.UnrecoverableCorfuError: org.rocksdb.RocksDBException: while open a file for lock: /nonconfig/diskonlycorfutable/idps/t_ids_event_data/LOCK: No such file or directory

Caused by: org.rocksdb.RocksDBException: while open a file for lock: /nonconfig/diskonlycorfutable/idps/t_ids_event_data/LOCK: No such file or directory

NOTE: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
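
The log check described above can be sketched as a small shell helper. This is only an illustrative sketch, not an official diagnostic: the function name has_idps_table_error is hypothetical, and it assumes a root shell on the manager node and that the log file contains entries in the same form as the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): succeeds when a log file contains
# both the MP370017 error code and the RocksDB "No such file or directory"
# failure for a path under /nonconfig/diskonlycorfutable/.
# Return codes: 0 = signature found, 1 = not found, 2 = log unreadable.
has_idps_table_error() {
    log_file="$1"
    [ -r "$log_file" ] || return 2
    # The error code must be present somewhere in the file.
    grep -F 'errorCode="MP370017"' "$log_file" >/dev/null || return 1
    # The missing-directory message must appear on a line naming the path.
    grep -F '/nonconfig/diskonlycorfutable/' "$log_file" \
        | grep -F 'No such file or directory' >/dev/null
}
```

For example, has_idps_table_error /var/log/idps-reporting/idps.log && echo "signature found" prints the message only when both patterns are present in that file.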

Resolution

There are two workarounds for this issue:

First - create the folder manually:

On each affected node (any node that does not have the /nonconfig/diskonlycorfutable/idps directory), run the following commands as root:

mkdir -p /nonconfig/diskonlycorfutable/idps
chown -R nsx-idps:nsx-idps /nonconfig/diskonlycorfutable/idps

Once the folder is created, switch to the admin CLI (use the command "su admin") and restart the IDPS service:

restart service idps-reporting
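
The directory check and creation in this first workaround can be sketched as a single function that is safe to run repeatedly. This is a minimal sketch under stated assumptions: the function name ensure_idps_dir and the IDPS_DIR and IDPS_OWNER variables are hypothetical, their defaults are the path and owner given in the commands above, and the function is meant to be run from a root shell on the manager node.

```shell
#!/bin/sh
# Hypothetical wrapper around the mkdir and chown commands above; run as root.
# Defaults are the path and owner used in this article.
IDPS_DIR="${IDPS_DIR:-/nonconfig/diskonlycorfutable/idps}"
IDPS_OWNER="${IDPS_OWNER:-nsx-idps:nsx-idps}"

ensure_idps_dir() {
    # Leave the node untouched when the directory already exists.
    if [ -d "$IDPS_DIR" ]; then
        echo "present: $IDPS_DIR"
        return 0
    fi
    # Otherwise create it and set ownership, as in the manual steps.
    mkdir -p "$IDPS_DIR" || return 1
    chown -R "$IDPS_OWNER" "$IDPS_DIR" || return 1
    echo "created: $IDPS_DIR"
}
```

Calling ensure_idps_dir on each manager node reports whether the directory was already present or had to be created. The sketch does not restart the service: restart service idps-reporting is an NSX CLI command and still has to be run from the admin CLI on any node where the directory was created.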

Second - deploy a new node:

Delete the node(s) missing the /nonconfig/diskonlycorfutable directory and redeploy these nodes. This can be done via the following methods:

If the node was deployed via the UI, use the "Delete" option on the System tab of the NSX UI.
If the node was deployed via OVA, use the CLI detach and join commands:

Detach the failed node missing /nonconfig/diskonlycorfutable:

https://docs.vmware.com/en/VMware-Cloud-Foundation/5.1/com.vmware.vcf.vxrail.doc/GUID-3FA1E29E-50AD-4AF3-B46E-24A623D7B4B1.html

Join the newly deployed node with the necessary directory:

https://docs.vmware.com/en/VMware-Cloud-Foundation/5.1/com.vmware.vcf.vxrail.doc/GUID-9973B27F-5DD4-4B16-B89A-0321F7003B7A.html