VMware Cloud Director is frequently sending database reconnect alerts to administrators

Products

VMware Cloud Director

Issue/Introduction

Frequent email alerts are sent with the message:

"VMware Cloud Director cell with IP address ##.##.##.## is successful in reconnecting to the database"

or

"VMware Cloud Director cell with IP address ##.##.##.## restored the connection to the database"
The VMware Cloud Director (VCD) cell service restarts unexpectedly.
The log file /opt/vmware/vcloud-director/logs/vmware-vcd-watchdog.log shows the service being restarted:

<YYYY-MM-DD> 09:16:29 | INFO | vmware-vcd-cell running
<YYYY-MM-DD> 09:21:30 | ALERT | vmware-vcd-cell is dead but /var/run/vmware-vcd-cell.pid exists, attempting to restart it
<YYYY-MM-DD> 09:21:40 | INFO | Started vmware-vcd-cell (pid=478962)
<YYYY-MM-DD> 09:21:40 | WARN | Server status returned HTTP/1.1 404
<YYYY-MM-DD> 09:22:40 | WARN | Server status returned HTTP/1.1 503
<YYYY-MM-DD> 09:23:40 | WARN | Server status returned HTTP/1.1 503
<YYYY-MM-DD> 09:24:40 | WARN | Server status returned HTTP/1.1 503
<YYYY-MM-DD> 09:26:41 | INFO | vmware-vcd-cell running
<YYYY-MM-DD> 09:31:41 | INFO | vmware-vcd-cell running
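
To find every restart in a longer watchdog log, the ALERT lines can be filtered with a short script. The following is a minimal sketch, assuming the pipe-delimited layout shown in the excerpt above; the function name find_restarts and the SAMPLE text are illustrative and not part of Cloud Director.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   <date> <HH:MM:SS> | <LEVEL> | <message>
LINE_RE = re.compile(
    r"^(?P<date>\S+)\s+(?P<time>\d{2}:\d{2}:\d{2})\s*\|\s*"
    r"(?P<level>\w+)\s*\|\s*(?P<msg>.*)$"
)

# Illustrative sample copied from the excerpt; a real log has actual dates.
SAMPLE = """\
<YYYY-MM-DD> 09:16:29 | INFO | vmware-vcd-cell running
<YYYY-MM-DD> 09:21:30 | ALERT | vmware-vcd-cell is dead but /var/run/vmware-vcd-cell.pid exists, attempting to restart it
<YYYY-MM-DD> 09:21:40 | INFO | Started vmware-vcd-cell (pid=478962)
<YYYY-MM-DD> 09:21:40 | WARN | Server status returned HTTP/1.1 404
"""


def find_restarts(lines):
    """Return (date, time, message) for each ALERT line reporting a dead cell."""
    restarts = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ALERT" and "is dead" in match.group("msg"):
            restarts.append((match.group("date"), match.group("time"), match.group("msg")))
    return restarts


for date, time, msg in find_restarts(SAMPLE.splitlines()):
    print(date, time, msg)
```

To run it against a real cell, pass the lines of vmware-vcd-watchdog.log to find_restarts instead of SAMPLE.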
Running the dmesg command on the appliance shows that the kernel's out-of-memory (OOM) killer was activated and killed a process:

[11#######] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/vmware-vcd.service,task=java,pid=41###,uid=1###
[11#######] Out of memory: Killed process 41#### (java) total-vm:16181468kB, anon-rss:5105644kB, file-rss:0kB, shmem-rss:16kB, UID:1003 pgtables:14688kB oom_score_adj:0
Running the journalctl command on the appliance shows that the kernel invoked the OOM killer (oom-kill) and logged a call trace that includes the RIP (instruction pointer register) value:

<cell>.example.com kernel: pool-jetty-1680 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
<cell>.example.com kernel: CPU: 2 PID: 3288855 Comm: pool-jetty-1680 Not tainted 5.10.224-3.ph4 #1-photon
<cell>.example.com kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
<cell>.example.com kernel: Call Trace:
<cell>.example.com kernel: dump_stack+0x70/0x8f
<cell>.example.com kernel: dump_header+0x4f/0x1fa
.....
.....
<cell>.example.com kernel: RIP: 0033:0x7fd3bc9e72b0
<cell>.example.com kernel: Code: Unable to access opcode bytes at RIP 0x7fd3bc9e7286.
.....
.....
<cell>.example.com kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/vmware-vcd.service,task=java,pid=3281,uid=1003
<cell>.example.com kernel: Out of memory: Killed process 3281 (java) total-vm:20708864kB, anon-rss:9066416kB, file-rss:0kB, shmem-rss:16kB, UID:1003 pgtables:24464kB oom_score_adj:0
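
The "Out of memory: Killed process" line in the dmesg and journalctl output records which process was killed and how much memory it held. The following is a minimal sketch that extracts those fields, assuming the line layout shown above; the function name parse_oom_kill and the returned keys are illustrative. The conversion treats the kernel's kB values as KiB.

```python
import re

# Matches the kernel's kill report regardless of the prefix
# (dmesg timestamp or journalctl host name).
KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)

# Sample line copied from the journalctl excerpt above.
SAMPLE_LINE = (
    "<cell>.example.com kernel: Out of memory: Killed process 3281 (java) "
    "total-vm:20708864kB, anon-rss:9066416kB, file-rss:0kB, shmem-rss:16kB, "
    "UID:1003 pgtables:24464kB oom_score_adj:0"
)


def parse_oom_kill(line):
    """Return the killed process details, or None if the line is not a kill report."""
    match = KILL_RE.search(line)
    if match is None:
        return None
    anon_rss_kb = int(match.group("anon_rss_kb"))
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_kb": int(match.group("total_vm_kb")),
        "anon_rss_kb": anon_rss_kb,
        # Resident anonymous memory in GiB, rounded for readability.
        "anon_rss_gib": round(anon_rss_kb / (1024 * 1024), 2),
    }


print(parse_oom_kill(SAMPLE_LINE))
```

For the sample line, the java process held roughly 8.65 GiB of resident anonymous memory when it was killed. Comparing that figure with the RAM assigned to the appliance helps judge whether the appliance is undersized.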

Environment

VMware Cloud Director 10.6.x

Cause

Memory was consumed faster than the appliance could handle. When RAM became critically low, the kernel terminated processes, including the Cloud Director java process, to prevent a total system crash.

Resolution

Either increase the sizing of the Cloud Director server group or limit the number of incoming requests.

Review the current sizing of the Cloud Director appliances in the server group and increase it to large or extra large (VVS), as outlined in the VMware Cloud Director Appliance Sizing Guidelines.

The procedure for resizing is documented here: Recommended Procedure for resizing VMware Cloud Director Appliances

If the appliances are already right-sized, limit the requests coming into Cloud Director. This must be done outside of VMware Cloud Director, at the load balancer level.

Additional Information

VMware Cloud Director frequently sends database reconnect alerts to administrators (Japanese)