Enabling firewall hardening in Aria Operations disrupts cluster communication and leaves the cluster stuck in "Going Online" mode.

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Enabling Firewall Hardening in Aria Operations (via "Admin UI > Administrator Settings > Firewall Hardening") to mitigate vulnerabilities disrupts communication within the cluster, and the cluster remains stuck in the "Going Online" state.

The "gemfire_vRealize Ops Analytics-xxxx.log" file contains messages similar to:

INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - received FindCoordinatorResponse(coordinator=XX.XX.XX.XX(51563:locator)<ec><v0>:20003, fromView=true, viewId=0, registrants=[XX.XX.XX.XX(vRealize Ops Analytics-XX.XX.XX.XXXXX)<ec>:10010, XX.XX.XX.XX(vRealize Ops Analytics-XX.XX.XX.XX:52474)<ec>:10007, XX.XX.XX.XX(25595:locator)<ec>:20003], senderId=XX.XX.XX.XX(51563:locator)<ec><v0>:20003, network partition detection enabled=false, locators preferred as coordinators=false, view=View[XX.XX.XX.XX(51563:locator)<ec><v0>:20003|0] members: [XX.XX.XX.XX(51563:locator)<ec><v0>:20003]) from locator /XX.XX.XX.XX:6061
INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join
INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - Unable to contact locator /XX.XX.XX.XX:6061: java.net.SocketException: Protocol not available (connect failed)
INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - findCoordinator chose XX.XX.XX.XX(51563:locator)<ec><v0>:20003 out of these possible coordinators: [XX.XX.XX.XX(51563:locator)<ec><v0>:20003]
INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - Unable to contact locator /XX.XX.XX.XX:6061: java.net.ConnectException: Connection refused (Connection refused)
INFO [Geode Failure Detection Server thread 1 tid=yy] org.apache.geode.distributed.internal.membership.gms.Services.lambda$startTcpServer$3 - Started failure detection server thread on /XX.XX.XX.XX:10002.
INFO [Analytics Main Thread tid=xx] org.apache.geode.distributed.internal.membership.gms.Services.findCoordinator - Unable to contact locator /XX.XX.XX.XX:6061: java.net.ConnectException: Connection refused (Connection refused)

The data node's "analytics.log" file also contains an error similar to:

ERROR [Analytics Main Thread]  com.integrien.analytics.AnalyticsMain.createGemfireCache - Can not connect to gemfire: Problem starting up membership services
org.apache.geode.SystemConnectException: Problem starting up membership services
        at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:xxx) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:xxx) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:xxx) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:xxx) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:xxx) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.distributed.internal.InternalDistributedSystem$DefaultClusterDistributionManagerConstructor.create(InternalDistributedSystem.java:xxxx) ~[gemfire-core-10.0.1.jar:?]

Environment

Aria Operations 8.18.x

Cause

The firewall hardening script reads the primary node's address from the clusterMembership section of the casa.db.script file and resolves it with the dig command. dig fails to resolve the value when it is a short hostname instead of an FQDN. Aria Operations requires FQDNs for the primary and replica nodes, and the script is designed to work with FQDNs to ensure proper communication during firewall hardening.
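
The difference between a short name and an FQDN can be illustrated with a small Python sketch. It is not part of the hardening script: the function name classify_address and the sample hostnames are hypothetical, and the "contains a dot" rule is a simplifying assumption rather than a full DNS validity check. The article does not state whether an IP literal is acceptable in this field, so the sketch only labels it.

```python
import ipaddress


def classify_address(value: str) -> str:
    """Label a node address as 'ip', 'fqdn', or 'short' (heuristic only)."""
    candidate = value.strip().rstrip(".")
    try:
        # IPv4 and IPv6 literals are reported separately from hostnames.
        ipaddress.ip_address(candidate)
        return "ip"
    except ValueError:
        pass
    # Assumption: any hostname with a dot-separated domain part counts as an FQDN.
    return "fqdn" if "." in candidate else "short"


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real environment.
    for sample in ("vrops-primary", "vrops-primary.example.com", "192.0.2.10"):
        print(sample, "->", classify_address(sample))
```

A value labeled "short" corresponds to the condition described above, where the lookup performed by the script fails.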

Refer: Getting Started with VMware Aria Operations (8.18)

Resolution

1. Take a snapshot of all nodes.

2. Edit /storage/db/casa/webapp/hsqldb/casa.db.script so that the primary node entry in the clusterMembership section (the "ip_address" field) uses the FQDN instead of the short hostname.

Make this change on every Aria Operations node whose casa.db.script requires the update.

Incorrect:

INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"Name of cluster","is_ha_enabled":false,"ha_transition_state":null,"ca_state":"DISABLED","initialization_state":"NONE","remove_node_state":"NONE","document_version":20,"document_time":1731645460795,"online_state":"ONLINE","online_state_time":1731645460792,"online_state_reason":"","out_of_diskspace_slice":"","email":null,"cluster_members":[],"admin_slices":[],"installation_state":"DONE","fail_going_offline":false,"slices":{"XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX":{"slice_uuid":"XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","pair_uuid":null,"is_admin_node":true,"ip_address":"Node Short name","preferred_addresses":{},"slice_name":"master","membership_state":null,"region":null}}}')

INSERT INTO VERSION VALUES(1,3)

Correct:

INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"Name of cluster","is_ha_enabled":false,"ha_transition_state":null,"ca_state":"DISABLED","initialization_state":"NONE","remove_node_state":"NONE","document_version":20,"document_time":1731645460795,"online_state":"ONLINE","online_state_time":1731645460792,"online_state_reason":"","out_of_diskspace_slice":"","email":null,"cluster_members":[],"admin_slices":[],"installation_state":"DONE","fail_going_offline":false,"slices":{"XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX":{"slice_uuid":"XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","pair_uuid":null,"is_admin_node":true,"ip_address":"Node FQDN","preferred_addresses":{},"slice_name":"master","membership_state":null,"region":null}}}')

INSERT INTO VERSION VALUES(1,3)
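
As a read-only pre-check before editing, the following Python sketch lists slices whose "ip_address" value has no domain part. It assumes the clusterMembership row is a single INSERT line shaped like the examples above, and that any single quote inside the JSON is doubled in the script file. The function name short_name_slices is hypothetical, and the sketch does not modify the file.

```python
import json

# Assumed shape of the row, based on the example lines in this article.
PREFIX = "INSERT INTO CASA_DOCS VALUES('clusterMembership','"
SUFFIX = "')"


def short_name_slices(script_text: str) -> list:
    """Return (slice_name, ip_address) pairs whose address contains no dot."""
    findings = []
    for line in script_text.splitlines():
        line = line.strip()
        if not (line.startswith(PREFIX) and line.endswith(SUFFIX)):
            continue
        # Strip the SQL wrapper and undo doubled single quotes (assumption).
        payload = line[len(PREFIX):-len(SUFFIX)].replace("''", "'")
        document = json.loads(payload)
        for slice_info in document.get("slices", {}).values():
            address = slice_info.get("ip_address")
            # A value without any dot is neither an FQDN nor an IPv4 address.
            if address and "." not in address:
                findings.append((slice_info.get("slice_name"), address))
    return findings
```

It could be run against the file contents, for example `short_name_slices(open("/storage/db/casa/webapp/hsqldb/casa.db.script").read())`. An empty list suggests that no short names remain in the clusterMembership section of that node.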

Additional Information

Primary Networking Requirements:

1. The primary and replica nodes must use a static IP address, or a fully qualified domain name (FQDN) with a static IP address.
2. Data nodes can use Dynamic Host Configuration Protocol (DHCP).
3. Reverse DNS lookups must succeed for all nodes and return their FQDN, which is currently the node hostname.
4. Nodes deployed by OVF have their hostnames set to the retrieved FQDN by default.
5. All nodes must be bidirectionally routable by IP address or FQDN.
6. Do not separate analytics cluster nodes with network address translation (NAT), a load balancer, a firewall, or a proxy that inhibits bidirectional communication by IP address or FQDN.
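
Requirement 3 can be spot-checked with a sketch like the following, which resolves an FQDN to an IP address and confirms that the reverse lookup returns the same name. The function name reverse_matches_fqdn and the sample hostname are hypothetical. The resolver functions are passed in as parameters so the logic can be exercised without a live DNS server, and the defaults use Python's standard socket module (IPv4 only).

```python
import socket


def reverse_matches_fqdn(fqdn, forward=socket.gethostbyname,
                         reverse=socket.gethostbyaddr):
    """Return True if fqdn resolves to an IP whose reverse lookup names fqdn."""
    try:
        ip_address = forward(fqdn)
        primary_name, aliases, _addresses = reverse(ip_address)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    # Compare case-insensitively and ignore a trailing root dot.
    names = {primary_name.lower().rstrip(".")}
    names.update(alias.lower().rstrip(".") for alias in aliases)
    return fqdn.lower().rstrip(".") in names


if __name__ == "__main__":
    # Hypothetical node name; replace with a real FQDN from the cluster.
    print(reverse_matches_fqdn("vrops-primary.example.com"))
```

A False result for any node indicates that either the forward or the reverse record is missing or inconsistent, which is worth correcting before firewall hardening is enabled.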

Refer: Unable to Activate Firewall Hardening in Aria Operations