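Three processes are involved in applying Application Security Group (ASG) rules and Container to Container (C2C) Network Policy rules to Diego Cells: the sync of ASG data from CAPI to policy-server, the sync of ASG data from policy-server to vxlan-policy-agent, and the sync of C2C Network Policy data from policy-server to vxlan-policy-agent.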
During the initial load testing of dynamic ASGs, we identified two primary concerns for these processes when running at high scale (hundreds of Diego Cells, tens of thousands of apps, hundreds of thousands of iptables rules per Diego Cell).
As the number of iptables rules per Cell increases, so does the time taken by any command that updates or lists iptables rules (evaluating rules against inbound and outbound traffic packets is not affected in the same way). This significantly slows down the syncing process between policy-server and vxlan-policy-agent for ASG data.
The vxlan-policy-agent keeps ASG rules in memory. More ASG rules will result in less memory available for apps on a given Diego Cell. As a result, it may be necessary to adjust the amount of memory reserved for system processes on Diego Cells, and/or increase the amount of memory allocated to Diego Cell VMs. For more information, see the Tanzu Documentation on this topic.
To help you understand the performance of your ASG and C2C Network Policy syncing, Tanzu Platform for CF provides metrics for each sync process:
Metric Name | Healthwatch Metric Name | Measurement |
SecurityGroupsRetrievalFromCCTime | SecurityGroupsRetrievalFromCCTime | The time taken for policy-server to retrieve security group data from CAPI. |
SecurityGroupsStoreReplaceSuccessTime | SecurityGroupsStoreReplaceSuccessTime | The time taken for policy-server to store ASG data in the database. |
SecurityGroupsTotalSyncTime | SecurityGroupsTotalSyncTime | The total time taken for the entire sync from CAPI to policy-server. |
The total time for a sync of ASG data from CAPI to policy-server is reflected in the Policy_server_asg_syncer_security_groups_total_sync_time metric.
Metric Name | Healthwatch Metric Name | Measurement |
asgTotalPollTime | asgTotalPollTime | The time taken for vxlan-policy-agent to pull ASG data from policy-server’s API. |
asgIptablesEnforceTime | asgIptablesEnforceTime | The time taken for vxlan-policy-agent to apply ASG data as iptables rules on the Diego Cell. |
asgIptablesCleanupTime | asgIptablesCleanupTime | The time taken for vxlan-policy-agent to clean up iptables rules that are no longer needed on the Diego Cell. |
The total time for a sync of ASG data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.
Metric Name | Healthwatch Metric Name | Measurement |
totalPollTime | totalPollTime | The time taken for vxlan-policy-agent to pull C2C Network Policies from policy-server’s API. |
iptablesEnforceTime | iptablesEnforceTime | The time taken for vxlan-policy-agent to apply C2C Network Policies as iptables rules on the Diego Cell. |
The total time for a sync of C2C Network Policy data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.
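The summing described for the ASG and C2C syncs can be sketched as follows. This is a minimal illustration, not part of the product: the metric names come from the tables above, while the `total_sync_time` helper, the shape of the `samples` dictionary, and the sample values are hypothetical, and all values are assumed to share one unit (seconds here).

```python
# Metric names are taken from the tables above; everything else is illustrative.
ASG_SYNC_METRICS = ("asgTotalPollTime", "asgIptablesEnforceTime", "asgIptablesCleanupTime")
C2C_SYNC_METRICS = ("totalPollTime", "iptablesEnforceTime")


def total_sync_time(samples, metric_names):
    """Sum the named metrics from one Diego Cell's latest samples.

    `samples` maps metric name -> value; all values must use the same unit.
    """
    missing = [name for name in metric_names if name not in samples]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return sum(samples[name] for name in metric_names)


# Hypothetical sample values for a single Diego Cell, assumed to be in seconds.
samples = {
    "asgTotalPollTime": 12.0,
    "asgIptablesEnforceTime": 30.5,
    "asgIptablesCleanupTime": 2.5,
    "totalPollTime": 1.5,
    "iptablesEnforceTime": 2.0,
}

asg_total = total_sync_time(samples, ASG_SYNC_METRICS)
c2c_total = total_sync_time(samples, C2C_SYNC_METRICS)
print(f"ASG sync total: {asg_total}s, C2C sync total: {c2c_total}s")
```

Raising on a missing metric rather than treating it as zero avoids silently under-reporting the total when a metric has not been emitted.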
Metric Name | Healthwatch Metric Name | Measurement |
IPTablesRuleCount | IPTablesRuleCount | The number of iptables rules on a given Diego Cell (affects the time taken to apply rules on the Diego Cell). |
memoryStats.numBytesAllocated | memoryStats_numBytesAllocated | The amount of memory consumed by the vxlan-policy-agent. |
During steady state, these sync metrics are not emitted. Data is only synced from CAPI to policy-server when changes to the CAPI database are detected. Typically this is due to changes to ASG definitions, global ASG bindings, and org- or space-specific ASG bindings.
During steady state, this sync time should stay constant. As the number and size of ASGs increase, the size of the data sent from the policy-server to the vxlan-policy-agent on the Diego Cell also increases. During periods of heavy application container churn on Diego Cells, such as a platform upgrade, this sync time can spike above the default sync interval when there is a high volume (tens of thousands) of iptables rules present on a Cell.
The ramifications of this spike are that it will take longer for ASG changes to be applied to existing containers on that Cell, and that it will take longer for new containers to start up on that Cell. At no point would a container run without having its ASG data applied first.
During steady state, these sync times should stay constant. This sync time is largely unaffected by the number of ASG rules present for apps on a Diego Cell.
When sync times exceed their configured sync intervals, consider increasing the sync intervals, to avoid being in a state of constantly syncing data. These intervals can be configured via the om CLI, using the following properties in the Tanzu Platform for Cloud Foundry tile:
Property | .properties.policy_server_asg_syncer_interval |
Default | 60s |
Function | Controls how often policy-server syncs ASG data from CAPI. |
Usage | Increase this value to ensure sync times between policy-server and CAPI are within the window. |
Configurability | om CLI only |
Property | .properties.vxlan_policy_agent_asg_update_interval |
Default | 60s |
Function | Controls how often vxlan-policy-agent syncs and applies ASG data from policy-server. |
Usage | Increase this value to ensure sync times between vxlan-policy-agent and policy-server are within the window. |
Configurability | om CLI only |
Property | .properties.container_networking_interface_plugin.silk.policy_enforcement_poll_interval |
Default | 5s |
Function | Controls how often vxlan-policy-agent syncs and applies C2C Network Policy data from policy-server. |
Usage | Increase this value to ensure sync times of C2C policy between vxlan-policy-agent and policy-server are within the interval. |
Configurability | om CLI, and in Ops Manager under the Networking tab of the Tanzu Platform for CF tile as “Policies polling interval” |
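As a rough way to decide which of the three intervals above needs raising, the observed sync times can be compared against the configured intervals. The sketch below is illustrative only: the defaults are the ones listed above, but the `intervals_exceeded` helper and the observed values are hypothetical, and it assumes the observed sync times have already been converted to seconds.

```python
# Default intervals in seconds, as listed in the property tables above.
DEFAULT_INTERVALS_S = {
    "policy_server_asg_syncer_interval": 60,
    "vxlan_policy_agent_asg_update_interval": 60,
    "policy_enforcement_poll_interval": 5,
}


def intervals_exceeded(observed_sync_s, intervals_s=DEFAULT_INTERVALS_S):
    """Return {property: observed seconds} for syncs that outlast their interval."""
    return {
        name: observed
        for name, observed in observed_sync_s.items()
        if observed > intervals_s[name]
    }


# Hypothetical observed total sync times, in seconds, keyed by the interval
# property that governs each sync.
observed = {
    "policy_server_asg_syncer_interval": 20,
    "vxlan_policy_agent_asg_update_interval": 95,
    "policy_enforcement_poll_interval": 3,
}

for name, seconds in intervals_exceeded(observed).items():
    print(f"{name}: sync took {seconds}s, longer than the configured interval")
```

With these made-up numbers only the vxlan-policy-agent ASG sync outlasts its interval, so that is the one property worth considering for an increase.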
Property | .diego_cell.executor_memory_capacity |
Default | None |
Function | Allows operators to specify exactly how much Cell memory to make available to app containers running on the Cell, reserving the rest for system processes (including vxlan-policy-agent). |
Usage | Set this to Diego Cell memory - (memory consumed by the kernel + memory consumed by BOSH jobs on the VM) |
Configurability | om CLI, and in Ops Manager under the Advanced Features tab of the Tanzu Platform for CF tile as “Diego Cell memory capacity” |
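The Usage formula for the memory capacity property can be worked through as below. This is a sketch under stated assumptions: the `executor_memory_capacity` helper is hypothetical, the figures are invented for illustration, and the unit (MB here) is an assumption that should be checked against what the tile expects.

```python
def executor_memory_capacity(cell_memory, kernel_memory, bosh_jobs_memory):
    """Cell memory minus (kernel memory + BOSH job memory), all in one unit."""
    capacity = cell_memory - (kernel_memory + bosh_jobs_memory)
    if capacity <= 0:
        raise ValueError("system processes would consume all Cell memory")
    return capacity


# Hypothetical figures in MB: a 32 GB Cell, 1 GB for the kernel, and 3 GB for
# BOSH jobs (which include vxlan-policy-agent and its in-memory ASG rules).
capacity_mb = executor_memory_capacity(32768, 1024, 3072)
print(f"memory available to app containers: {capacity_mb} MB")
```

Because vxlan-policy-agent keeps ASG rules in memory, the BOSH jobs figure is best measured on a Cell that is carrying a realistic number of rules.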
The primary concern with increasing sync intervals is that it takes longer for rule changes to be applied to active containers.
The maximum time taken for ASG changes made in CAPI to be reflected in existing application containers running on Diego Cells is:
policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval
The maximum time taken for ASG changes made in CAPI before a new application is scheduled on a Diego Cell to be applied to that application's container is:
vxlan_policy_agent_asg_update_interval
If the ASG change was made during a time window when Diego is starting an app affected by that ASG change, the max time taken to apply it is:
policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval
There is a chance the container will start with the ASG rules that were defined before the change made while it was being started. However, within the next sync window, the rules would be updated to the latest version.
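The three worst-case formulas above can be expressed as a small calculation. This is an illustrative sketch: the `max_asg_propagation_s` helper and the scenario labels are hypothetical names introduced here, and the intervals are assumed to be given in seconds.

```python
def max_asg_propagation_s(syncer_interval_s, agent_interval_s, scenario):
    """Worst-case delay, in seconds, before an ASG change reaches a container.

    syncer_interval_s: policy_server_asg_syncer_interval
    agent_interval_s:  vxlan_policy_agent_asg_update_interval
    scenario: a hypothetical label for one of the three cases described above
    """
    if scenario in ("existing_container", "change_during_app_start"):
        # The change must first reach policy-server, then the Cell.
        return syncer_interval_s + agent_interval_s
    if scenario == "change_before_app_scheduled":
        # Only the sync from policy-server to the Cell applies.
        return agent_interval_s
    raise ValueError(f"unknown scenario: {scenario}")


# With both intervals at their 60s defaults:
for scenario in (
    "existing_container",
    "change_before_app_scheduled",
    "change_during_app_start",
):
    print(scenario, max_asg_propagation_s(60, 60, scenario))
```

Rerunning the calculation with candidate interval values shows how much additional propagation delay a proposed increase would introduce.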