CORFU /config disk space may grow beyond 20% usage upon upgrading to NSX-T 3.0.0


Article ID: 318273


Products

VMware NSX

Issue/Introduction

Symptoms:
  • The /config partition on NSX Manager nodes may grow to 100% after upgrading to NSX-T 3.0.0
  • Only encountered when upgrading from a previous version of NSX-T with the AppDiscovery feature enabled


Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

The issue occurs because the checkpointing process on the NSX-T Manager nodes (a process that keeps track of the CORFU database tables) is unable to checkpoint the AppDiscovery tables, as the feature was made obsolete in NSX-T 3.0.0.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.0.1, available at VMware Downloads.

 


Workaround:

Option #1: Before Upgrading to NSX-T 3.0.0

1) Run the following API call to check whether any AppDiscovery sessions have been collected. Note that this API is not available starting with NSX-T Data Center 3.0.0.

$ GET /api/v1/app-discovery/sessions

{
  "results" : [ {
    "status" : "FINISHED",
    "reclassification" : "NOT_REQUIRED",
    "start_timestamp" : 1541181098384,
    "end_timestamp" : 1541181148659,
    "id" : "f36e3055-6d04-4150-99f4-4547e8c38ce0",
    "_protection" : "NOT_PROTECTED"
  } ],
  "result_count" : 1,
  "sort_by" : "start_timestamp",
  "sort_ascending" : false
}

If result_count in the response is greater than 0, proceed with the remaining steps. Otherwise, you can upgrade to NSX-T 3.0.0 using the normal upgrade procedure.
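The decision in step 1 can be sketched in Python as follows. This is a minimal illustration, assuming the response body has already been retrieved and is available as text; the function name needs_cleanup is hypothetical and is not part of the attached script.

```python
import json


def needs_cleanup(response_text):
    """Return True if the sessions response reports at least one session."""
    body = json.loads(response_text)
    # result_count is the field shown in the sample response above.
    return body.get("result_count", 0) > 0


# Abbreviated copy of the sample response from step 1.
sample = """{
  "results": [{"status": "FINISHED",
               "id": "f36e3055-6d04-4150-99f4-4547e8c38ce0"}],
  "result_count": 1
}"""

if needs_cleanup(sample):
    print("Sessions found: run preUpgradeCleanup.py before upgrading")
else:
    print("No sessions found: proceed with the normal upgrade")
```

With the sample response, result_count is 1, so the sketch takes the cleanup branch.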

2) The pre-upgrade script (attached to this KB) MUST be run before upgrading to NSX-T 3.0.0. Run the attached preUpgradeCleanup.py script to clean up all AppDiscovery sessions in the database. The script takes the 3 arguments listed below; when run, it fetches all AppDiscovery sessions and deletes the entries.

  • endpoint-ip:  IP address of the NSX Manager

  • user-name:   Optional. Admin user of the NSX Manager; the default is admin

  • password:  User password

 

Here is an example of how to run the script:

$ python preUpgradeCleanup.py --endpoint-ip <nsxmgr-ip> --user-name admin --password <adminpasswd>


Output printed when there are no sessions found

Fetching AppDiscovery sessions
/Library/Python/2.7/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.92.166.59'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,

Found 0 AppDiscovery sessions.
Success!


Output printed when there are some sessions found

Fetching AppDiscovery sessions
/Library/Python/2.7/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.92.166.59'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
Found 1 AppDiscovery sessions.
Deleting AppDiscovery Session 600cbf06-e661-4c77-9e86-c57c84da5c4a
/Library/Python/2.7/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.92.166.59'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
Deleted AppDiscovery Session 600cbf06-e661-4c77-9e86-c57c84da5c4a Succesfully
Success!

 

3) After completing the steps above, proceed with the regular upgrade to NSX-T 3.0.0. If these steps were completed successfully before the upgrade, you do not need to run Option #2 after the upgrade.

-------------------------------------------------------------------------------------------------------------

 

Option #2: After Upgrading to NSX-T 3.0.0

If Option #1 was not carried out before the upgrade, run these steps after the upgrade on any one node of the NSX-T Manager cluster. Check /var/log/corfu/corfu-compactor-audit.log to see whether compaction is failing due to an AppProfileInstance deserialization error.

 

1) Run the command below and check whether the last three "Trim completed" messages all have the same sequence number.

$ grep -a "Trim completed" /var/log/corfu/corfu-compactor-audit.log
 

Note: If the command returns an error after copy/paste, retype the quotation marks.
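The check in step 1 can also be sketched in Python. This is an illustration only: the function names are hypothetical, and the sample lines are made-up examples of a stalled compactor, shortened from the "Trim completed" log format used in the audit log.

```python
import re

# Captures the sequence number from a "Trim completed" audit line.
SEQUENCE_RE = re.compile(r"Trim completed.*sequence=(\d+)")


def trim_sequences(lines):
    """Extract sequence numbers from 'Trim completed' lines, in log order."""
    found = []
    for line in lines:
        match = SEQUENCE_RE.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def compaction_stalled(lines, window=3):
    """True if the last `window` trim messages share one sequence number."""
    recent = trim_sequences(lines)[-window:]
    return len(recent) == window and len(set(recent)) == 1


# Hypothetical lines (not real output) showing a repeated sequence number.
stalled_sample = [
    "Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=100)).",
    "Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=100)).",
    "Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=100)).",
]
print(compaction_stalled(stalled_sample))
```

To run it against the real log, the lines could be read with open("/var/log/corfu/corfu-compactor-audit.log", errors="replace"), which tolerates the non-text bytes that the -a flag of grep works around.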
 

2) Check whether the logs contain "No binding for type: AppProfileInstance" errors; this is the known issue where the AppProfileInstance table is not cleared during the upgrade.

$ zgrep -a "No binding for type: AppProfileInstance" /var/log/corfu/corfu-compactor-audit*

3) Run df -h /config and if usage is above 85%, do not proceed further and engage VMware Support via a Support Request.
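The 85% threshold in step 3 can be expressed as a small Python check. This is a sketch under stated assumptions: the function names are hypothetical, and the percentage is computed as used / (used + available), rounded up, which approximates the Use% column of df but may differ from it slightly.

```python
import math
import shutil


def use_percent(used_bytes, free_bytes):
    """Approximate df's Use% column: used / (used + available), rounded up."""
    total = used_bytes + free_bytes
    if total == 0:
        return 0
    return math.ceil(100 * used_bytes / total)


def partition_use_percent(path="/config"):
    """Usage percentage of the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free)


def safe_to_continue(percent, limit=85):
    """Step 3: stop and contact support if usage is above the limit."""
    return percent <= limit


# Worked example with fixed numbers: 90 units used, 10 units available.
print(use_percent(90, 10), safe_to_continue(use_percent(90, 10)))
```

On an NSX Manager node, partition_use_percent() would be called with its default path of /config; the same helper, with a different limit, could serve the final verification at the end of this procedure.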

 

4) Query the database on any node of the cluster to confirm that there are entries in the AppProfileInstance table. Replace <node-ip> with the IP address of the node you are currently logged in to. This query also fails with the same "No binding" serialization exception for the AppProfileInstance table; this indicates that the table contains entries, but the browser cannot display them because the AppProfileInstance class was removed in NSX-T 3.0.0.

$ java -Dlog4j.configurationFile=/opt/vmware/corfu-tools/corfu-browser-log4j2.xml -cp "/opt/vmware/corfu-tools/corfu-browser-1.0-jar-with-dependencies.jar:/opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/*" com.vmware.nsx.management.tools.corfu.CorfuBrowserMain -hostname <node-ip> -port 9000 printTable -tableName 'nsx-manager AppProfileInstance f405'


Note: If the command returns an error after copy/paste, retype the quotation marks.
 

5) Unzip jarFiles.zip (attached to this KB), which produces app-discovery-1.0.jar and context-common-1.0.jar. Copy the two JAR files into /opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/ on all three MP nodes.

6) Run the next steps on any one MP node of the cluster. 

Query the database to see the contents of AppProfileInstance table, now that the missing class file is added to the classpath folder.  Now you should be able to see all the data in this table.

$ java -Dlog4j.configurationFile=/opt/vmware/corfu-tools/corfu-browser-log4j2.xml -cp "/opt/vmware/corfu-tools/corfu-browser-1.0-jar-with-dependencies.jar:/opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/*" com.vmware.nsx.management.tools.corfu.CorfuBrowserMain -hostname <node-ip> -port 9000 printTable -tableName 'nsx-manager AppProfileInstance f405'

7) Delete all the entries in this AppProfileInstance table (change the IP address to the node you are logged in to).

$ java -Xmx640m -Dlog4j.configurationFile=/opt/vmware/corfu-tools/corfu-browser-log4j2.xml -cp "/opt/vmware/corfu-tools/corfu-editor-1.0-jar-with-dependencies.jar:/opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/*" com.vmware.nsx.management.tools.corfu.CorfuEditorMain -hostname <node-ip> -port 9000 removeEntries -tableName 'nsx-manager AppProfileInstance f405' -cleanUp

8) Run the query from step #6 above again to make sure that the AppProfileInstance table is empty.

9) Wait for 3 compaction cycles to ensure that the data has been trimmed. To verify that the steps above succeeded, look for the most recent "Trim completed" messages in corfu-compactor-audit.log.

The sequence number at the end of each log line should be higher than in the previous line.


$ grep -a "Trim completed" /var/log/corfu/corfu-compactor-audit.log

2020-04-27T23:39:59.765Z  INFO main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=65083185)).

2020-04-27T23:55:00.560Z  INFO main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=65110210)).

2020-04-28T00:09:59.421Z  INFO main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Trim completed, elapsed(0s), appliance(nsx-manager), token(Token(epoch=423, sequence=65137146)).
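As a rough illustration of the verification in step 9, the following Python sketch checks that the sequence numbers in the most recent "Trim completed" lines strictly increase. The function name is hypothetical, and the sample lines are shortened copies of the log output shown above.

```python
import re

SEQUENCE_RE = re.compile(r"sequence=(\d+)")


def sequences_increasing(lines, cycles=3):
    """True if the last `cycles` trim messages have strictly rising sequences."""
    seqs = []
    for line in lines:
        if "Trim completed" not in line:
            continue
        match = SEQUENCE_RE.search(line)
        if match:
            seqs.append(int(match.group(1)))
    recent = seqs[-cycles:]
    # Require a full window so that a short log is not reported as healthy.
    return len(recent) == cycles and all(
        earlier < later for earlier, later in zip(recent, recent[1:])
    )


log_sample = [
    "2020-04-27T23:39:59.765Z Trim completed, elapsed(0s), token(Token(epoch=423, sequence=65083185)).",
    "2020-04-27T23:55:00.560Z Trim completed, elapsed(0s), token(Token(epoch=423, sequence=65110210)).",
    "2020-04-28T00:09:59.421Z Trim completed, elapsed(0s), token(Token(epoch=423, sequence=65137146)).",
]
print(sequences_increasing(log_sample))  # prints True
```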


10) You should no longer see these exceptions in corfu-compactor-audit.log now that the 2 JAR files from step 5 above have been copied.

$ grep -a "No binding for type: AppProfileInstance" /var/log/corfu/corfu-compactor-audit.log
 

11) Run the df -h /config command to verify that /config usage is below 10%.
 

12) Remove the copied JAR files from all nodes.

$ rm /opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/app-discovery-1.0.jar

$ rm /opt/vmware/proton-tomcat/webapps/nsxapi/WEB-INF/lib/context-common-1.0.jar


Additional Information

Impact/Risks:
This issue causes the /config partition to reach 100% usage, after which the NSX Management Cluster becomes unstable.

Attachments

jarFiles
preUpgradeCleanup