NSX 4.1.2.x restore from backup fails in a Kubernetes environment

Products

VMware NSX

Issue/Introduction

The environment is running NSX 4.1.2.x.
A large number of Kubernetes clusters (150 or more) are registered to NSX.
A restore from backup fails to deploy a new NSX Manager.
The NSX Manager log /var/log/proton/nsxapi.log contains errors similar to the following example:

<Date Time> ERROR task-executor-1-14 LcmRestClient 108084 FABRIC [nsx@6876 comp="nsx-manager" errorCode="MP31815" level="ERROR" subcomp="manager"] Error in rest call url= /api/v1/cluster/node?action=repo_sync , method= POST , response= null , error= [{"errorMessage":"Service is not running","errorData":{"errorCode":"503"}}]
org.springframework.web.client.ResourceAccessException: I/O error on POST request for "https://<MANAGER IP>/api/v1/cluster/node": The size of the handshake message (41085) exceeds the maximum allowed size (32768); nested exception is javax.net.ssl.SSLProtocolException: The size of the handshake message (41085) exceeds the maximum allowed size (32768)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:791) ~[?:?]
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:717) ~[?:?]
at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:608) ~[?:?]
at com.vmware.nsx.management.lcm.common.connection.client.rest.LcmRestRequestImpl.createEntityAndExchange(LcmRestRequestImpl.java:54) ~[?:?]
at com.vmware.nsx.management.common.rest.RestRequestImpl.doPost(RestRequestImpl.java:67) ~[?:?]
at com.vmware.nsx.management.lcm.common.connection.client.rest.LcmRestClient.makeRequest(LcmRestClient.java:482) ~[?:?]
at com.vmware.nsx.management.lcm.common.connection.client.rest.LcmRestClient.sendPostRequest(LcmRestClient.java:274) ~[?:?]
at com.vmware.nsx.management.appliance.utils.ApplianceDeploymentHelper.triggerRepoSync(ApplianceDeploymentHelper.java:205) ~[?:?]
at com.vmware.nsx.management.appliance.resolver.ManagerDeploymentResolver.resolve(ManagerDeploymentResolver.java:152) ~[?:?]
at com.vmware.nsx.management.resolver.service.ErrorResolverInvokerTask.run(ErrorResolverInvokerTask.java:39) ~[?:?]
at com.vmware.nsx.management.common.executor.TaskExecutorImpl$TaskWrapper$1.run(TaskExecutorImpl.java:240) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_372]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_372]
at com.vmware.nsx.management.common.executor.TaskExecutorImpl$TaskWrapper.run(TaskExecutorImpl.java:273) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
at com.vmware.nsx.util.concurrent.Executors$MeteredRunnable.run(Executors.java:353) ~[nsx-util.jar:?]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
Caused by: javax.net.ssl.SSLProtocolException: The size of the handshake message (41085) exceeds the maximum allowed size (32768)

<Date Time> ERROR task-executor-1-14 ManagerDeploymentResolver 108084 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP26044" level="ERROR" subcomp="manager"] Repair failed for MP appliance VM <UUID>.
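To check whether a manager is hitting this symptom, the log can be searched for the handshake-size exception shown above. The following is a minimal sketch, not an official diagnostic: the function name count_handshake_errors is illustrative, and the search string is taken from the example log entry.

```shell
#!/bin/sh
# Count log lines that report the oversized TLS handshake message.
# $1: path to a log file, for example /var/log/proton/nsxapi.log
# (count_handshake_errors is a hypothetical helper name.)
count_handshake_errors() {
    grep -c 'SSLProtocolException: The size of the handshake message' "$1"
}

# Example, run on the NSX Manager:
#   count_handshake_errors /var/log/proton/nsxapi.log
```

A non-zero count, together with the MP26044 "Repair failed" entry, suggests the manager matches this article.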

Resolution

This is a known issue impacting NSX 4.1.2.x.

Workaround

1. Log in to the NSX Manager over SSH.

2. Edit the following files:

/usr/tanuki/conf/proton-tomcat-wrapper.conf
/usr/tanuki/conf/proton-tomcat-wrapper_debug.conf

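Add the following property to both files (without surrounding quotes):

wrapper.java.additional.58=-Djdk.tls.maxHandshakeMessageSize=65536

The index 58 may differ in your environment. Find the last wrapper.java.additional.<n> property in each file and use the next number (<n> + 1). The property raises the maximum TLS handshake message size from the 32768 bytes reported in the error to 65536 bytes.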
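Choosing the index and appending the property can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, it treats the highest existing wrapper.java.additional index as the "last" one, and it writes a .bak copy before changing a file. Review the result by hand before restarting proton.

```shell
#!/bin/sh
# Property value taken from the article; everything else here is illustrative.
PROP_VALUE='-Djdk.tls.maxHandshakeMessageSize=65536'

# Print the next free wrapper.java.additional.<n> index for the file in $1.
# Assumption: the highest uncommented index is the last one in use.
next_additional_index() {
    last=$(sed -n 's/^[[:space:]]*wrapper\.java\.additional\.\([0-9][0-9]*\)=.*/\1/p' "$1" \
        | sort -n | tail -n 1)
    echo $(( ${last:-0} + 1 ))
}

# Append the property to the file in $1 unless it is already present.
add_handshake_property() {
    conf=$1
    if grep -q 'jdk\.tls\.maxHandshakeMessageSize' "$conf"; then
        echo "already set in $conf"
        return 0
    fi
    cp -p "$conf" "$conf.bak"            # keep a backup of the original file
    idx=$(next_additional_index "$conf")
    # Make sure the file ends with a newline before appending.
    [ -n "$(tail -c 1 "$conf")" ] && echo >> "$conf"
    printf 'wrapper.java.additional.%s=%s\n' "$idx" "$PROP_VALUE" >> "$conf"
    echo "added wrapper.java.additional.$idx to $conf"
}

# Example, run on the NSX Manager:
#   add_handshake_property /usr/tanuki/conf/proton-tomcat-wrapper.conf
#   add_handshake_property /usr/tanuki/conf/proton-tomcat-wrapper_debug.conf
```

The index is computed separately for each file, since the two files may not end at the same number.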

3. Restart the proton service on all NSX Managers. On each manager, run 'service proton restart'.