Corfu servers are observed to trigger layout changes in an otherwise stable environment. To list the layout changes by timestamp, run the following command as root on the NSX Manager nodes:
ls -lathr /config/corfu/* | grep LAYOUTS_ | awk '{print $6, $7, $8}' | sort | uniq -c | grep -E "Sep"
1 Sep 18 12:52
2 Sep 18 12:53
1 Sep 18 13:09
3 Sep 18 13:11
1 Sep 18 13:18
2 Sep 18 13:19
1 Sep 18 13:20
3 Sep 19 07:00
3 Sep 19 09:15
3 Sep 19 13:15
3 Sep 19 15:40
3 Sep 20 03:40
3 Sep 20 09:10
3 Sep 20 18:50
3 Sep 20 23:50
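The per-minute counts above can be rolled up into a per-day total to make the trend easier to read. The following is a minimal sketch, not part of the product tooling: the function name summarize_layout_changes_per_day is hypothetical, and it assumes input in the same "count month day time" form as the output shown above.

```shell
# Hypothetical helper: sum the per-minute counts from the listing above into per-day totals.
# Reads "count month day time" lines on stdin and prints "month day total".
summarize_layout_changes_per_day() {
    awk 'NF >= 3 { total[$2 " " $3] += $1 }          # key on "month day", add the count
         END { for (day in total) print day, total[day] }' | sort
}

# Example usage (same pipeline as above, with the final grep replaced by the helper):
# ls -lathr /config/corfu/* | grep LAYOUTS_ | awk '{print $6, $7, $8}' | sort | uniq -c | summarize_layout_changes_per_day
```

Applied to the sample output above, this would report 11 for Sep 18 and 12 each for Sep 19 and Sep 20.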
NSX-T Data Center 3.x
NSX 4.0.x - 4.1.x
This issue is caused by a combination of two factors:
Long GC Pauses: Over time, factors such as heap growth and memory fragmentation cause Java Garbage Collection (GC) pauses on the Corfu server to exceed 1.5 seconds. A pause of this length causes the server to fail its own failure detection checks even though no real failure has occurred.
CorfuDB Bug: A known bug causes the Corfu server to incorrectly aggregate these transient, GC-induced failure detection results.
The combination of a long GC pause and this bug leads the server to mistakenly believe a failure has occurred, triggering an unnecessary layout change.
This issue is resolved in NSX 4.2.0 and later.
Workaround:
Tuning GC to stay consistently below the failure detection threshold is complex, so the most effective mitigation is to prevent the conditions that lead to long GC pauses. We recommend a rolling restart of the Corfu server cluster, running the following command on one NSX Manager node at a time:
service corfu-server restart
This restart will reset the heap state, mitigate memory fragmentation, and significantly reduce the likelihood of long GC pauses.
The Corfu server GC log is archived under /var/log/corfu/jvm/. Check this log for GC pauses, for example:
2025-10-08T16:35:10.617+0000: Total time for which application threads were stopped: 1.5466943 seconds
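To find only the pauses that exceed the 1.5-second threshold mentioned above, the log lines can be filtered by duration. This is a minimal sketch under stated assumptions: the function name find_long_gc_pauses is hypothetical, the log lines are assumed to follow the format shown above, and the file names under /var/log/corfu/jvm/ are not specified here, so the glob in the usage line is a placeholder. Compressed archives would need to be decompressed first.

```shell
# Hypothetical helper: print GC log lines whose stopped time exceeds a threshold in seconds.
# Reads log text on stdin; the threshold defaults to 1.5.
find_long_gc_pauses() {
    awk -v limit="${1:-1.5}" '
        /Total time for which application threads were stopped:/ {
            # The pause duration is the field just before the first "seconds" token.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^seconds/) {
                    if (($(i - 1) + 0) > (limit + 0)) print
                    break
                }
            }
        }'
}

# Example usage (the file glob is an assumption; adjust it to the actual file names):
# cat /var/log/corfu/jvm/* | find_long_gc_pauses 1.5
```

If this prints many lines, or the timestamps of the long pauses cluster together, that supports scheduling the rolling restart described in the workaround.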