TKGi is configured with large clusters (1,000 or more pods per cluster).
Pods are configured to use a TCP liveness probe.
There is an unexpectedly large number of healthcheck DFW (distributed firewall) sections and rules. These are identified by the scope ncp/fw_sect_type and the tag healthcheck. The number of these rules is higher than the total number of pods across all clusters.
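The last symptom can be checked with a short sketch like the one below. It assumes the firewall sections have already been retrieved from NSX Manager into a list of dictionaries, each carrying a list of scope/tag pairs. The sample data, the display names, the non-healthcheck tag value, and the pod count are hypothetical and only illustrate the comparison.

```python
# Hypothetical sample of firewall sections. Real data would come from
# NSX Manager; every name and value below is made up for illustration.
sections = [
    {"display_name": "hc-pod-a",
     "tags": [{"scope": "ncp/fw_sect_type", "tag": "healthcheck"}]},
    {"display_name": "hc-pod-a-copy",
     "tags": [{"scope": "ncp/fw_sect_type", "tag": "healthcheck"}]},
    {"display_name": "hc-pod-b",
     "tags": [{"scope": "ncp/fw_sect_type", "tag": "healthcheck"}]},
    {"display_name": "unrelated-section",
     "tags": [{"scope": "ncp/fw_sect_type", "tag": "example_other_type"}]},
]


def is_healthcheck_section(section):
    """Return True if the section carries the healthcheck scope/tag pair."""
    return any(
        t.get("scope") == "ncp/fw_sect_type" and t.get("tag") == "healthcheck"
        for t in section.get("tags", [])
    )


def count_healthcheck_sections(all_sections):
    """Count the sections tagged as healthcheck sections."""
    return sum(1 for s in all_sections if is_healthcheck_section(s))


# Assumed total number of pods across all clusters (hypothetical value).
total_pods = 2

healthcheck_count = count_healthcheck_sections(sections)
print(f"healthcheck sections: {healthcheck_count}, pods: {total_pods}")
if healthcheck_count > total_pods:
    print("More healthcheck sections than pods: duplicates are likely.")
```

A healthcheck count that exceeds the pod count matches the symptom described above. It does not by itself identify which sections are the duplicates.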
Environment
VMware NSX-T Data Center 3.x
VMware NSX 4.x
Tanzu Kubernetes Grid Integrated Edition
Cause
If NSX firewall processing is slow, an NCP rule creation request may time out even though NSX eventually processes the initial request. After the timeout, NCP sends a retry, and the retry creates a duplicate DFW rule.
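The timeout-and-retry sequence can be illustrated with a toy simulation. This is a minimal sketch, not NCP or NSX code: the class SlowFirewall, the function create_with_retry, the rule name, and the timing values are all hypothetical. It only shows how a retry after a client-side timeout produces a second copy when the first request still completes.

```python
class SlowFirewall:
    """Toy stand-in for a slow rule-processing backend (not the NSX API)."""

    def __init__(self, processing_seconds):
        self.processing_seconds = processing_seconds
        self.rules = []

    def create_rule(self, name, client_timeout):
        # The backend always completes the request, even when the client
        # has already given up waiting for the response.
        self.rules.append(name)
        if self.processing_seconds > client_timeout:
            raise TimeoutError(f"no response within {client_timeout}s")
        return len(self.rules)


def create_with_retry(firewall, name, client_timeout, max_attempts=2):
    """Hypothetical client that retries a create request after a timeout."""
    for _ in range(max_attempts):
        try:
            return firewall.create_rule(name, client_timeout)
        except TimeoutError:
            # The client cannot tell that the first request succeeded,
            # so it sends the same create request again.
            continue
    return None


firewall = SlowFirewall(processing_seconds=10)
create_with_retry(firewall, "healthcheck-pod-a", client_timeout=5)
print(firewall.rules)  # ['healthcheck-pod-a', 'healthcheck-pod-a']
```

In this sketch, a backend that responds within the client timeout stores the rule once, while a slow backend stores it once per attempt.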
Resolution
This issue can be resolved by running the attached script, delete_duplicate_rules_v2.py.
Dry run
Run the script in dry-run (read-only) mode to identify whether duplicate healthcheck rules exist without taking any other action. A JSON file named duplicate_rule.json is created.