Scaling Considerations for Dynamic ASGs

Products

VMware Tanzu Application Service, VMware Tanzu Application Service for VMs

Issue/Introduction

Overview

There are three processes involved in the application of Application Security Group (ASG) rules and Container to Container (C2C) Network Policy rules to Diego Cells:

Syncing ASG data from CAPI to policy-server.
Syncing ASG data from policy-server to vxlan-policy-agent and applying the ASG data to containers on the Diego Cell via iptables.
Syncing C2C Network Policy data from policy-server to vxlan-policy-agent and applying the Network Policies to containers on the Diego Cell via iptables.

During the initial load testing of dynamic ASGs, we identified two primary concerns for these processes when running at high scale (hundreds of Diego Cells, tens of thousands of apps, hundreds of thousands of iptables rules per Diego Cell).

Concern 1: Number of iptables rules per Diego Cell

As the number of iptables rules per Cell increases, so does the time taken by any command that updates or lists iptables rules (evaluating rules against inbound/outbound traffic packets is not affected in the same way). This significantly slows down the syncing process between policy-server and vxlan-policy-agent for ASG data.

Concern 2: Memory consumed by vxlan-policy-agent

The vxlan-policy-agent keeps ASG rules in memory. More ASG rules will result in less memory available for apps on a given Diego Cell. As a result, it may be necessary to adjust the amount of memory reserved for system processes on Diego Cells, and/or increase the amount of memory allocated to Diego Cell VMs. For more information, see the Tanzu Documentation on this topic (Diego Cell memory and disk overcommit).

Resolution

Metrics To Watch

To help you understand how ASG and C2C Network Policy syncing is performing, Tanzu Platform for CF emits a number of metrics for each sync process:

Syncing ASG Data from CAPI to Policy-Server

Metric Name	Healthwatch Metric Name	Measurement
SecurityGroupsRetrievalFromCCTime	SecurityGroupsRetrievalFromCCTime	The time taken for policy-server to retrieve security group data from CAPI.
SecurityGroupsStoreReplaceSuccessTime	SecurityGroupsStoreReplaceSuccessTime	The time taken for policy-server to store ASG data in the database.
SecurityGroupsTotalSyncTime	SecurityGroupsTotalSyncTime	The total time taken for the entire sync from CAPI to policy-server.

The total time for a sync of ASG data from CAPI to Policy-Server is reflected in the Policy_server_asg_syncer_security_groups_total_sync_time metric.

Syncing ASG Data from Policy-Server to Diego Cells and Applying Rules

Metric Name	Healthwatch Metric Name	Measurement
asgTotalPollTime	asgTotalPollTime	The time taken for vxlan-policy-agent to pull ASG data from policy-server’s API.
asgIptablesEnforceTime	asgIptablesEnforceTime	The time taken for vxlan-policy-agent to apply ASG data as iptables rules on the Diego Cell.
asgIptablesCleanupTime	asgIptablesCleanupTime	The time taken for vxlan-policy-agent to clean up iptables rules that are no longer needed on the Diego Cell.

The total time for a sync of ASG data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.
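The summation above can be sketched as follows. This is a minimal illustration, not part of the product: the function names and sample values are hypothetical, and it assumes the metric values have already been fetched from your monitoring system and converted to seconds.

```python
# Metric names taken from the table above.
ASG_SYNC_METRICS = (
    "asgTotalPollTime",
    "asgIptablesEnforceTime",
    "asgIptablesCleanupTime",
)


def total_asg_sync_seconds(samples):
    """Sum the three ASG sync metrics for one Diego Cell (values in seconds)."""
    return sum(samples[name] for name in ASG_SYNC_METRICS)


def exceeds_interval(total_seconds, interval_seconds=60.0):
    """Return True when a sync takes longer than the configured sync interval."""
    return total_seconds > interval_seconds


# Hypothetical sample values for a single Diego Cell, in seconds.
samples = {
    "asgTotalPollTime": 4.0,
    "asgIptablesEnforceTime": 41.5,
    "asgIptablesCleanupTime": 9.0,
}

total = total_asg_sync_seconds(samples)
print(total, exceeds_interval(total))  # 54.5 False
```

The same comparison applies to the C2C metrics by summing totalPollTime and iptablesEnforceTime and comparing against the C2C polling interval.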

Syncing C2C Data from Policy-Server to Diego Cells and Applying Rules

Metric Name	Healthwatch Metric Name	Measurement
totalPollTime	totalPollTime	The time taken for vxlan-policy-agent to pull C2C Network Policies from policy-server’s API.
iptablesEnforceTime	iptablesEnforceTime	The time taken for vxlan-policy-agent to apply C2C Network Policies as iptables rules on the Diego Cell.

The total time for a sync of C2C Network Policy data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.

Additional Capacity Metrics

Metric Name	Healthwatch Metric Name	Measurement
IPTablesRuleCount	IPTablesRuleCount	The number of iptables rules on a given Diego Cell (affects the time taken to apply rules on the Diego Cell).
memoryStats.numBytesAllocated	memoryStats_numBytesAllocated	The amount of memory consumed by the vxlan-policy-agent.

Behaviors to Expect

ASG Data from CAPI to policy-server

During steady state, these sync metrics are not emitted. Data is only synced from CAPI to policy-server when changes to the CAPI database are detected. Typically this is due to changes to ASG definitions, global ASG bindings, or org/space-specific ASG bindings.

ASG Data from policy-server to vxlan-policy-agent

During steady state, this sync time should stay constant. As the number and size of ASGs increase, the size of the data sent from the policy-server to the vxlan-policy-agent on the Diego Cell also increases. During periods of heavy application container churn on Diego Cells, such as a platform upgrade, the sync time can spike above the default sync interval when there is a high volume (tens of thousands) of iptables rules present on a Cell.

The ramifications of this spike are that it will take longer for ASG changes to be applied to existing containers on that Cell, and that it will take longer for new containers to start up on that Cell. At no point would a container run without having its ASG data applied first.

C2C Network Policy data from policy-server to vxlan-policy-agent

During steady state, these sync times should stay constant. This sync time is largely unaffected by the number of ASG rules present for apps on a Diego Cell.

Actions to Take As Sync Times Increase

When sync times exceed their configured sync intervals, consider increasing the sync intervals, to avoid being in a state of constantly syncing data. These intervals can be configured via the om CLI, using the following properties in the Tanzu Platform for Cloud Foundry tile:

Property	.properties.policy_server_asg_syncer_interval
Default	60s
Function	Controls how often policy-server syncs ASG data from CAPI.
Usage	Increase this value to ensure sync times between policy-server and CAPI are within the window.
Configurability	om CLI only

Property	.properties.vxlan_policy_agent_asg_update_interval
Default	60s
Function	Controls how often vxlan-policy-agent syncs and applies ASG data from policy-server.
Usage	Increase this value to ensure sync times between vxlan-policy-agent and policy-server are within the window.
Configurability	om CLI only

Property	.properties.container_networking_interface_plugin.silk.policy_enforcement_poll_interval
Default	5s
Function	Controls how often vxlan-policy-agent syncs and applies C2C Network Policy data from policy-server.
Usage	Increase this value to ensure sync times of C2C policy between vxlan-policy-agent and policy-server are within the interval.
Configurability	om CLI, and in Ops Manager under the Networking tab of the Tanzu Platform for CF tile as “Policies polling interval”

Property	.diego_cell.executor_memory_capacity
Default	None
Function	Allows operators to specify exactly how much of Cell memory to make available to app containers running on the cell, reserving the rest for system processes (including vxlan-policy-agent).
Usage	Set this to: Diego Cell memory - (memory consumed by the kernel + memory consumed by BOSH jobs on the VM).
Configurability	om CLI, and in Ops Manager under the Advanced Features tab of the Tanzu Platform for CF tile as “Diego Cell memory capacity”
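The usage formula for executor_memory_capacity can be worked through as in the sketch below. The function name and the sample figures are hypothetical, and the sketch assumes all inputs are measured in the same unit (MB here); confirm the unit the tile expects before applying a value.

```python
def executor_memory_capacity(cell_memory_mb, kernel_memory_mb, bosh_jobs_memory_mb):
    """Memory left for app containers after reserving system memory.

    Implements: Diego Cell memory - (kernel memory + BOSH job memory).
    All arguments are assumed to be in MB.
    """
    capacity = cell_memory_mb - (kernel_memory_mb + bosh_jobs_memory_mb)
    if capacity <= 0:
        raise ValueError("system processes consume all Diego Cell memory")
    return capacity


# Hypothetical Cell: 64 GB VM, 1 GB used by the kernel, 4 GB used by BOSH jobs
# (the BOSH job figure should include vxlan-policy-agent, whose footprint
# grows with the number of ASG rules).
print(executor_memory_capacity(65536, 1024, 4096))  # 60416
```

Measuring the BOSH job footprint at peak ASG rule count, rather than on a lightly loaded Cell, gives a more conservative capacity.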

Drawbacks of Increasing Sync Intervals

The primary drawback of increasing sync intervals is that rule changes take longer to be applied to active containers.

The maximum time taken for ASG changes made in CAPI to be reflected in existing application containers running on Diego Cells is:

policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval

The maximum time taken for ASG changes to be applied to a new application container, when the change was made in CAPI before the application was scheduled on a Diego Cell, is:

vxlan_policy_agent_asg_update_interval

If the ASG change was made during the time window in which Diego is starting an app affected by that change, the maximum time taken to apply it is:

policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval

In this case, there is a chance the container will start with the ASG rules that were defined before the change. However, the rules are updated to the latest version within the next sync window.
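The worst-case delays described above can be computed with a small sketch like the following. The function names are hypothetical; the parameter names mirror the interval properties listed earlier, and the values are assumed to be in seconds.

```python
def max_delay_existing_containers(policy_server_asg_syncer_interval,
                                  vxlan_policy_agent_asg_update_interval):
    """Worst case for an ASG change to reach containers that are already running.

    The same bound applies when the change is made while an affected app
    is being started.
    """
    return policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval


def max_delay_new_container(vxlan_policy_agent_asg_update_interval):
    """Worst case when the change was made before the app was scheduled."""
    return vxlan_policy_agent_asg_update_interval


# Default intervals from the property tables (60s each).
print(max_delay_existing_containers(60, 60))  # 120
print(max_delay_new_container(60))            # 60

# Doubling both intervals doubles the worst-case propagation delay.
print(max_delay_existing_containers(120, 120))  # 240
```

Running the numbers before changing an interval makes the trade-off explicit: longer intervals reduce sync pressure but lengthen the time during which containers run with outdated rules.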