We are seeing the following kinds of gateway log messages:
************Log Start**********
2022-03-12T02:22:04.838+0000 WARNING 122 com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] This node does not have a connection to Member [XX.XX.XX.XX]:8777 - 109a01dc-0b33-4303-b56c-2ceb7beeecad
2022-03-12T02:22:04.838+0000 WARNING 122 com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] This node does not have a connection to Member [XX.XX.XX.XX]:8777 - a2a883b1-e133-49cb-a3a8-bf69b863b485
2022-03-12T02:22:06.075+0000 WARNING 124 com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] Removing connection to endpoint [10.67.21.16]:8777 Cause => java.net.SocketTimeoutException {null}, Error-Count: 55
2022-03-12T02:22:06.176+0000 WARNING 109 com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] Removing connection to endpoint [XX.XX.XX.XX]:8777 Cause => java.net.SocketTimeoutException {null}, Error-Count: 55
**************Log End************
Below are the Hazelcast properties from the values.yaml file in the Helm chart.
**********Values Start***************
hazelcast:
  # If you wish to connect to an existing Hazelcast instance set enabled to false
  # external to true, and uncomment and set url.
  enabled: false
  external: false
  # url: hazelcast.example.com:5701
  image:
    tag: "3.12.8"
  cluster:
    memberCount: 2
  mancenter:
    enabled: false
  hazelcast:
    yaml:
      hazelcast:
        network:
          join:
            multicast:
              enabled: false
            kubernetes:
              enabled: true
              service-name: ${serviceName}
              namespace: ${namespace}
              resolve-not-ready-addresses: true
**********Values End***************
Release : 10.0
Component : API GATEWAY
CONTAINER GATEWAY
In Kubernetes, the pods (Gateway nodes) join and leave the cluster dynamically, each with a dynamically assigned IP. The cluster_info table, however, keeps the entries (IPs) of old Gateway nodes that are no longer in use.
As a result, each node sends requests to the other listed nodes to confirm cluster membership, and the requests to the stale entries produce these errors. The list grows with autoscaling: each time the deployment scales up and down, new nodes (IPs) are added to the table, and they persist until they are removed manually from the Gateway Dashboard or the cluster_info table is cleared.
To remove the stale entries, scale the Gateway deployment to zero and clear the cluster_info table.
You can also set the following system property to clean up inactive nodes older than the configured timeout. The cleanup runs as a background task on a hardcoded 24-hour interval.
com.l7tech.server.clusterStaleNodeCleanupTimeoutSeconds
systemProperties: |-
  # By default, FIPS module will block an RSA modulus from being used for encryption if it has been used for
  # signing, or vice versa. Set true to disable this default behaviour and remain backwards compatible.
  com.safelogic.cryptocomply.rsa.allow_multi_use=true
  # Specifies the type of Trust Store (JKS/PKCS12) provided by AdoptOpenJDK that is used by Gateway.
  # Must be set correctly when Gateway is running in FIPS mode. If not specified it will default to PKCS12.
  javax.net.ssl.trustStoreType=jks
  # Additional properties go here
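As a rough sketch, the cleanup property could be appended to the systemProperties block in values.yaml as shown here. The value 86400 (24 hours, assuming the value is given in seconds as the property name suggests) is only an illustrative assumption, not a recommended setting; choose a timeout that fits how often your deployment scales.

```yaml
systemProperties: |-
  # ... existing properties stay as they are ...
  # Additional properties go here
  # Assumed example value: treat nodes inactive for more than 86400 seconds (24 hours) as stale.
  com.l7tech.server.clusterStaleNodeCleanupTimeoutSeconds=86400
```

Because the background task runs only once every 24 hours, stale entries may remain in the cluster_info table for some time after the property is set.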