Hazelcast - java.net.SocketTimeoutException - CONGW 10.1

Article ID: 237051

Updated On:

Products

CA API Gateway

Issue/Introduction

We are seeing the following kinds of gateway log messages:

************Log Start**********

2022-03-12T02:22:04.838+0000 WARNING 122    com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] This node does not have a connection to Member [XX.XX.XX.XX]:8777 - 109a01dc-0b33-4303-b56c-2ceb7beeecad
2022-03-12T02:22:04.838+0000 WARNING 122    com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] This node does not have a connection to Member [XX.XX.XX.XX]:8777 - a2a883b1-e133-49cb-a3a8-bf69b863b485
2022-03-12T02:22:06.075+0000 WARNING 124    com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] Removing connection to endpoint [10.67.21.16]:8777 Cause => java.net.SocketTimeoutException {null}, Error-Count: 55
2022-03-12T02:22:06.176+0000 WARNING 109    com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler: [XX.XX.XX.XXX]:8777 [gateway] [3.12.5] Removing connection to endpoint [XX.XX.XX.XX]:8777 Cause => java.net.SocketTimeoutException {null}, Error-Count: 55

**************Log End************

Below are the Hazelcast properties from the values.yaml file in the Helm chart.

**********Values Start***************

hazelcast:
  # If you wish to connect to an existing Hazelcast instance set enabled to false
  # external to true, and uncomment and set url.
  enabled: false
  external: false
  # url: hazelcast.example.com:5701
  image:
    tag: "3.12.8"
  cluster:
    memberCount: 2
  mancenter:
    enabled: false
  hazelcast:
    yaml:
      hazelcast:
        network:
          join:
            multicast:
              enabled: false
            kubernetes:
              enabled: true
              service-name: ${serviceName}
              namespace: ${namespace}
              resolve-not-ready-addresses: true

**********Values End***************

Environment

Release : 10.0

Component : API GATEWAY

CONTAINER GATEWAY 

Cause

In Kubernetes, pods (Gateway nodes) join and leave the cluster dynamically, each with a dynamically assigned IP. The cluster_info table still holds entries for old Gateway nodes (IPs) that are no longer in use.
Each node therefore keeps sending cluster-membership requests to addresses that no longer exist, and those connection attempts produce the errors shown above. The list grows with autoscaling: every scale-up and scale-down adds new nodes (IPs) to the table, and the entries persist until they are removed manually, either from the Gateway Dashboard or by clearing the cluster_info table.

Resolution

Scale the Gateway to zero and clear the cluster_info table.
Alternatively, set the following system property so that inactive nodes older than the configured number of seconds are cleaned up automatically. The cleanup runs as a background task on a hardcoded 24-hour schedule.

com.l7tech.server.clusterStaleNodeCleanupTimeoutSeconds

Set it under config in values.yaml (line 141), in systemProperties:
  systemProperties: |-
    # By default, FIPS module will block an RSA modulus from being used for encryption if it has been used for
    # signing, or visa-versa. Set true to disable this default behaviour and remain backwards compatible.
    com.safelogic.cryptocomply.rsa.allow_multi_use=true
    # Specifies the type of Trust Store (JKS/PKCS12) provided by AdoptOpenJDK that is used by Gateway.
    # Must be set correctly when Gateway is running in FIPS mode. If not specified it will default to PKCS12.
    javax.net.ssl.trustStoreType=jks
    # Additional properties go here
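
As a minimal sketch, the cleanup property could be appended after the "Additional properties go here" comment. The value 86400 (24 hours, expressed in seconds as the property name suggests) is only an assumed example and is not a recommendation from this article. Choose a threshold that suits how often your pods are recycled.

```yaml
config:
  systemProperties: |-
    # ... existing properties left unchanged ...
    javax.net.ssl.trustStoreType=jks
    # Additional properties go here
    # Assumed example value: clean up nodes inactive for more than 86400 seconds (24 hours)
    com.l7tech.server.clusterStaleNodeCleanupTimeoutSeconds=86400
```

Because the background task runs only every 24 hours, stale entries may remain in cluster_info for some time after they pass the configured threshold.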