Symptoms:
- The dcsms process reaches very high memory utilization, which impacts normal processing.
- The Edge runs out of memory (OOM), which causes a reboot.
- HA switchover occurs repeatedly when the above commands are executed in a loop.
- Normal routing functionality is impacted when dcsms reaches very high memory utilization.
- In the NSX Manager vsm.log, you see messages similar to:
2019-03-06 14:46:35.141 UTC INFO SimpleAsyncTaskExecutor-1 StatusAndStatsUtil:822 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Propogating vse event EDGE_MEMORY_USAGE_JUMP_UP Module: vShield Edge Appliance Severity Critical
2019-03-06 13:42:32.199 UTC INFO http-nio-127.0.0.1-7441-exec-436 EdgeUtils:472 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] populateSystemEvent parameters : sourceName edge-xxx, morefIdOfObjectOnVc vm-xxxx, moduleName NSX Edge Health Check, eventCode EDGE_VM_HEALTHCHECK_NO_PULSE, severity Major, messageParams [vm-xxx] eventMetaData {edgeId=edge-xxx, edgeVmName=ESG-TEST-01, error=Configuration failed on NSX Edge VM vm-xxx. Kindly refer Edge and NSX Manager logs for more details., edgeVmVcUUId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, edgeVmId=vm-xxx}
2019-03-06 13:42:32.202 UTC INFO http-nio-127.0.0.1-7441-exec-436 EventServiceImpl:119 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [SystemEvent] Time:'Wed Mar 06 13:42:32.199 UTC 2019', Severity:'Medium', Event Source:'edge-xxx', Code:'30033', Event Message:'NSX Edge VM (vmId : vm-xxx) not responding to health check.', Module:'NSX Edge Health Check', Universal Object:'false'
- In the NSX Edge logs, you see entries similar to:
2019-03-06T18:12:28+00:00 ESG-TEST-01 syslog-ng[868]: [default]: [syslog.err] I/O error occurred while writing; fd='17', error='Network is unreachable (101)'
2019-03-06T18:12:46+00:00 ESG-TEST-01 syslog-ng[807]: [default]: [syslog.err] Connection failed; fd='21', server='AF_INET(198.162.246.80:514)', local='AF_INET(0.0.0.0:0)', error='Network is unreachable (101)'
2019-03-06T18:12:56+00:00 ESG-TEST-01 syslog-ng[807]: [default]: [syslog.err] Connection failed; fd='37', server='AF_INET(198.162.246.80:514)', local='AF_INET(0.0.0.0:0)', error='Network is unreachable (101)'
- Followed by a reboot once memory reaches the maximum threshold:
2019-03-06T18:12:22+00:00 ESG-TEST-01 kernel[]: [default]: [kern.warning] monit invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
...
2019-03-06T18:12:24+00:00 ESG-TEST-01 MsgMgr[1241]: [default]: [daemon.info] payload len:350 data:{"systemEvents":[{"severity":"Critical","metaData":{"message":"23665 60960 176576 VseEventProcess 1237 47200 176748 vmtoolsd 1241 7732 330896 msgmgr 1128 2944 96720 monit 868 2832 199864 syslog-ng "},"timestamp":1551895943,"moduleName":"vShield Edge Appliance","eventCode":30180,"message":"OOM happened, system rebooting in 3 seconds..."}]}
2019-03-06T18:12:26+00:00 ESG-TEST-01 shutdown[23781]: [default]: [user.notice] shutting down for system reboot
...
2019-03-06T18:12:46+00:00 ESG-TEST-01 routing[1029]: [default]: [daemon.info] All SMS configuration is complete.
2019-03-06T18:12:23+00:00 ESG-TEST-01 kernel[]: [default]: [kern.err] Killed process 1083 (dcsms) total-vm:1321552kB, anon-rss:778608kB, file-rss:224kB
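
To check an exported Edge log for the OOM signature shown above, a short script can help. The following is a minimal sketch, not a VMware-provided tool: the function name `find_oom_events` and the regular expressions are illustrative assumptions derived only from the two kernel lines quoted in this article, and other Edge builds may format these lines differently.

```python
import re

# Assumed pattern for the kernel "Killed process" line quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)
# Assumed pattern for the "<process> invoked oom-killer" line quoted above.
OOM_INVOKED_RE = re.compile(r"(?P<invoker>\S+) invoked oom-killer")


def find_oom_events(lines):
    """Return one dict per OOM-related line, in the order the lines appear."""
    events = []
    for line in lines:
        killed = KILLED_RE.search(line)
        if killed:
            events.append({
                "kind": "killed",
                "process": killed.group("name"),
                "pid": int(killed.group("pid")),
                "anon_rss_kb": int(killed.group("anon_rss")),
            })
            continue
        invoked = OOM_INVOKED_RE.search(line)
        if invoked:
            events.append({"kind": "oom_invoked", "invoker": invoked.group("invoker")})
    return events


# Sample input: the two kernel lines from the Edge log excerpt above.
sample = [
    "2019-03-06T18:12:22+00:00 ESG-TEST-01 kernel[]: [default]: [kern.warning] "
    "monit invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0",
    "2019-03-06T18:12:23+00:00 ESG-TEST-01 kernel[]: [default]: [kern.err] "
    "Killed process 1083 (dcsms) total-vm:1321552kB, anon-rss:778608kB, file-rss:224kB",
]

events = find_oom_events(sample)
for event in events:
    print(event)
```

A "killed" event whose process is dcsms, preceded by an "oom_invoked" event, corresponds to the symptom described in this article. To scan a real log, pass an open file object instead of the sample list.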