Unexpected layout changes in a stable NSX environment

Article ID: 415746

Products

VMware NSX

Issue/Introduction

Corfu servers trigger layout changes in an otherwise stable environment. To see how many layout files were written per minute, run the following command as root on each NSX Manager node:

ls -lathr /config/corfu/* | grep LAYOUTS_ | awk '{print $6, $7, $8}' | sort | uniq -c | grep -E "Sep"
      1 Sep 18 12:52
      2 Sep 18 12:53
      1 Sep 18 13:09
      3 Sep 18 13:11
      1 Sep 18 13:18
      2 Sep 18 13:19
      1 Sep 18 13:20
      3 Sep 19 07:00
      3 Sep 19 09:15
      3 Sep 19 13:15
      3 Sep 19 15:40
      3 Sep 20 03:40
      3 Sep 20 09:10
      3 Sep 20 18:50
      3 Sep 20 23:50
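
The command above hard-codes the month in the final grep. The following is a minimal sketch of a reusable variant that takes the directory and an optional month as arguments. It assumes the same ls -l column layout as the output above (month, day, and time in fields 6 to 8); the function name count_layout_changes and its arguments are illustrative and not part of NSX.

```shell
# Count Corfu layout files per modification minute.
# Usage: count_layout_changes [DIR] [MONTH]
#   DIR    directory to scan (assumed default: /config/corfu)
#   MONTH  optional three-letter month filter, e.g. Sep
# The function body runs in a subshell so its variables do not leak.
count_layout_changes() (
    dir="${1:-/config/corfu}"
    month="${2:-}"
    # LC_ALL=C keeps month names and the date format predictable.
    LC_ALL=C ls -lathr "$dir"/* 2>/dev/null \
        | grep 'LAYOUTS_' \
        | awk -v m="$month" 'm == "" || $6 == m { print $6, $7, $8 }' \
        | sort \
        | uniq -c
)
```

For example, to list only the September layout changes:

count_layout_changes /config/corfu Sep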

Environment

NSX-T Data Center 3.x
NSX 4.0.x - 4.1.x

Cause

This issue is caused by a combination of two factors:

Long GC Pauses: Over time, factors such as heap growth and memory fragmentation cause Java Garbage Collection (GC) pauses on the Corfu server to exceed 1.5 seconds. During a pause this long, the server fails its own failure detection checks even though no real failure has occurred.

CorfuDB Bug: A known bug causes the Corfu server to incorrectly aggregate these transient, GC-induced failure detection results.

The combination of a long GC pause and this bug leads the server to mistakenly believe a failure has occurred, triggering an unnecessary layout change.

Resolution

This issue is resolved in NSX 4.2.0 and later.

Workaround:
Tuning GC so that pauses consistently stay below the failure detection threshold is complex. The most effective mitigation is to prevent the conditions that lead to long GC pauses by performing a rolling restart of the Corfu server cluster. Run the following command as root on each NSX Manager node, one node at a time:

service corfu-server restart

This restart will reset the heap state, mitigate memory fragmentation, and significantly reduce the likelihood of long GC pauses.

Additional Information

The Corfu server GC logs are archived under /var/log/corfu/jvm/. Check these logs for GC pause durations. A pause longer than 1.5 seconds looks like this:

"2025-10-08T16:35:10.617+0000: Total time for which application threads were stopped: 1.5466943 seconds"
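
To find such entries without reading the logs by hand, the following sketch prints every GC log line whose pause exceeds a given threshold. It assumes the log lines follow the format shown above, with the pause length right after "stopped:". The function name find_long_gc_pauses is illustrative, and the file names under /var/log/corfu/jvm/ are not specified here, so the glob in the usage example is an assumption.

```shell
# Print GC log lines whose application-stopped time exceeds a threshold.
# Usage: find_long_gc_pauses THRESHOLD_SECONDS [FILE...]
# With no FILE arguments, input is read from stdin (useful for piping
# decompressed archives into the function).
find_long_gc_pauses() (
    threshold="$1"
    shift
    awk -v t="$threshold" '
        /Total time for which application threads were stopped:/ {
            # The pause length is the field right after "stopped:".
            for (i = 1; i < NF; i++) {
                if ($i == "stopped:") {
                    if ($(i + 1) + 0 > t + 0) print
                    break
                }
            }
        }
    ' "$@"
)
```

For example, to list pauses above the 1.5 second threshold:

find_long_gc_pauses 1.5 /var/log/corfu/jvm/*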