Scaling Considerations for Dynamic ASGs

Article ID: 379833

Updated On:

Products

VMware Tanzu Application Service, VMware Tanzu Application Service for VMs

Issue/Introduction

Overview

There are three processes involved in the application of Application Security Group (ASG) rules and Container to Container (C2C) Network Policy rules to Diego Cells:

  • Syncing ASG data from CAPI to policy-server.
  • Syncing ASG data from policy-server to vxlan-policy-agent and applying the ASG data to containers on the Diego Cell via iptables.
  • Syncing C2C Network Policy data from policy-server to vxlan-policy-agent and applying the Network Policies to containers on the Diego Cell via iptables.

During the initial load testing of dynamic ASGs, we identified two primary concerns for these processes when running at high scale (hundreds of Diego Cells, tens of thousands of apps, hundreds of thousands of iptables rules per Diego Cell).

Concern 1: Number of iptables rules per Diego Cell

As the number of iptables rules per Cell increases, so does the time taken by any command that updates or lists iptables rules (evaluating rules against inbound/outbound traffic packets is not affected in the same way). This significantly slows down the syncing process between policy-server and vxlan-policy-agent for ASG data.

Concern 2: Memory consumed by vxlan-policy-agent 

The vxlan-policy-agent keeps ASG rules in memory. More ASG rules will result in less memory available for apps on a given Diego Cell. As a result, it may be necessary to adjust the amount of memory reserved for system processes on Diego Cells, and/or increase the amount of memory allocated to Diego Cell VMs. For more information, see the Tanzu Documentation on this topic.

Resolution

Metrics To Watch

To help you understand how ASG and C2C Network Policy syncing is performing, Tanzu Platform for CF emits the following metrics for each sync process:

Syncing ASG Data from CAPI to Policy-Server

| Metric Name | Healthwatch Metric Name | Measurement |
|---|---|---|
| SecurityGroupsRetrievalFromCCTime | SecurityGroupsRetrievalFromCCTime | The time taken for policy-server to retrieve security group data from CAPI. |
| SecurityGroupsStoreReplaceSuccessTime | SecurityGroupsStoreReplaceSuccessTime | The time taken for policy-server to store ASG data in the database. |
| SecurityGroupsTotalSyncTime | SecurityGroupsTotalSyncTime | The total time taken for the entire sync from CAPI to policy-server. |

The total time for a sync of ASG data from CAPI to Policy-Server is reflected in the Policy_server_asg_syncer_security_groups_total_sync_time metric.

 

Syncing ASG Data from Policy-Server to Diego Cells and Applying Rules

| Metric Name | Healthwatch Metric Name | Measurement |
|---|---|---|
| asgTotalPollTime | asgTotalPollTime | The time taken for vxlan-policy-agent to pull ASG data from policy-server’s API. |
| asgIptablesEnforceTime | asgIptablesEnforceTime | The time taken for vxlan-policy-agent to apply ASG data as iptables rules on the Diego Cell. |
| asgIptablesCleanupTime | asgIptablesCleanupTime | The time taken for vxlan-policy-agent to clean up iptables rules that are no longer needed on the Diego Cell. |

The total time for a sync of ASG data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.

 

Syncing C2C Data from Policy-Server to Diego Cells and Applying Rules

| Metric Name | Healthwatch Metric Name | Measurement |
|---|---|---|
| totalPollTime | totalPollTime | The time taken for vxlan-policy-agent to pull C2C Network Policies from policy-server’s API. |
| iptablesEnforceTime | iptablesEnforceTime | The time taken for vxlan-policy-agent to apply C2C Network Policies as iptables rules on the Diego Cell. |

The total time for a sync of C2C Network Policy data from policy-server to vxlan-policy-agent can be obtained by summing the above metrics.
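As a rough illustration of the two sums described above, the sketch below adds up one Diego Cell's latest timing values. The metric names come from the tables; the `cell_samples` mapping, its values, the assumed unit (seconds), and the function name are hypothetical, and fetching the values from your metrics store is left out.

```python
# Metric names are taken from the tables above; everything else is illustrative.
ASG_SYNC_METRICS = ("asgTotalPollTime", "asgIptablesEnforceTime", "asgIptablesCleanupTime")
C2C_SYNC_METRICS = ("totalPollTime", "iptablesEnforceTime")


def total_sync_time(samples, metric_names):
    """Sum the latest value of each named metric for one Diego Cell.

    `samples` is a hypothetical mapping of metric name -> latest value,
    with all values assumed to be in the same unit.
    """
    missing = [name for name in metric_names if name not in samples]
    if missing:
        raise KeyError(f"missing metrics: {', '.join(missing)}")
    return sum(samples[name] for name in metric_names)


# Made-up sample values for a single cell, assumed to be in seconds.
cell_samples = {
    "asgTotalPollTime": 2.0,
    "asgIptablesEnforceTime": 40.0,
    "asgIptablesCleanupTime": 3.0,
    "totalPollTime": 0.5,
    "iptablesEnforceTime": 1.5,
}

asg_total = total_sync_time(cell_samples, ASG_SYNC_METRICS)
c2c_total = total_sync_time(cell_samples, C2C_SYNC_METRICS)
print(f"ASG sync total: {asg_total}, C2C sync total: {c2c_total}")
```

Raising on a missing metric, rather than treating it as zero, avoids reporting a total that looks healthy only because one component was not collected.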

Additional Capacity Metrics

| Metric Name | Healthwatch Metric Name | Measurement |
|---|---|---|
| IPTablesRuleCount | IPTablesRuleCount | The number of iptables rules on a given Diego Cell (affects the time taken to apply rules on the Diego Cell). |
| memoryStats.numBytesAllocated | memoryStats_numBytesAllocated | The amount of memory consumed by the vxlan-policy-agent. |

 

Behaviors to Expect

ASG Data from CAPI to policy-server

During steady state, these sync metrics are not emitted, because data is synced from CAPI to policy-server only when changes to the CAPI database are detected. Typically these are changes to ASG definitions, global ASG bindings, or org/space-specific ASG bindings.

ASG Data from policy-server to vxlan-policy-agent

During steady state, this sync time should stay constant. As the number and size of ASGs increase, the size of the data sent from the policy-server to the vxlan-policy-agent on the Diego Cell also increases. During periods of heavy application container churn on Diego Cells, such as a platform upgrade, the sync time can spike above the default sync interval when a high volume (tens of thousands) of iptables rules is present on a Cell.

The ramifications of this spike are that it will take longer for ASG changes to be applied to existing containers on that Cell, and that it will take longer for new containers to start up on that Cell. At no point would a container run without having its ASG data applied first.

C2C Network Policy data from policy-server to vxlan-policy-agent

During steady state, these sync times should stay constant. This sync time is largely unaffected by the number of ASG rules present for apps on a Diego Cell.

Actions to Take As Sync Times Increase

When sync times exceed their configured sync intervals, consider increasing the sync intervals, to avoid being in a state of constantly syncing data. These intervals can be configured via the om CLI, using the following properties in the Tanzu Platform for Cloud Foundry tile:

Property: .properties.policy_server_asg_syncer_interval
Default: 60s
Function: Controls how often policy-server syncs ASG data from CAPI.
Usage: Increase this value to ensure sync times between policy-server and CAPI are within the window.
Configurability: om CLI only

 

Property: .properties.vxlan_policy_agent_asg_update_interval
Default: 60s
Function: Controls how often vxlan-policy-agent syncs and applies ASG data from policy-server.
Usage: Increase this value to ensure sync times between vxlan-policy-agent and policy-server are within the window.
Configurability: om CLI only

 

Property: .properties.container_networking_interface_plugin.silk.policy_enforcement_poll_interval
Default: 5s
Function: Controls how often vxlan-policy-agent syncs and applies C2C Network Policy data from policy-server.
Usage: Increase this value to ensure sync times of C2C policy between vxlan-policy-agent and policy-server are within the interval.
Configurability: om CLI, and in Ops Manager under the Networking tab of the Tanzu Platform for CF tile as “Policies polling interval”
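To make the interval properties above more concrete, here is a minimal sketch that suggests a larger interval when an observed sync time exceeds the current one, then wraps the result in the `product-properties` layout that om product configuration files use. The helper names, the 25% headroom, the observed value of 75 seconds, and the choice to express the value as a plain integer number of seconds are all assumptions. Confirm the expected value format against your tile's staged configuration before applying anything.

```python
import json
import math


def recommend_interval(observed_sync_seconds, current_interval_seconds, headroom=1.25):
    """Return a suggested interval in whole seconds.

    Keeps the current interval when the observed sync time fits inside it;
    otherwise suggests the observed time plus headroom, rounded up.
    The 25% headroom is an arbitrary illustrative choice.
    """
    if observed_sync_seconds <= current_interval_seconds:
        return current_interval_seconds
    return math.ceil(observed_sync_seconds * headroom)


def product_properties(updates):
    """Wrap property values in a product-properties mapping (value shape assumed)."""
    return {
        "product-properties": {name: {"value": value} for name, value in updates.items()}
    }


# Hypothetical observation: ASG sync on a cell takes 75s against the 60s default.
suggested = recommend_interval(observed_sync_seconds=75.0, current_interval_seconds=60)
fragment = product_properties(
    {".properties.vxlan_policy_agent_asg_update_interval": suggested}
)
print(json.dumps(fragment, indent=2))
```

The same helper could be applied to the other two interval properties. Any headroom you pick trades fewer back-to-back syncs against slower propagation of rule changes.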

 

Property: .diego_cell.executor_memory_capacity
Default: None
Function: Allows operators to specify exactly how much of Cell memory to make available to app containers running on the cell, reserving the rest for system processes (including vxlan-policy-agent).
Usage: Set this to Diego Cell memory - (memory consumed by the kernel + memory consumed by BOSH jobs on the VM).
Configurability: om CLI, and in Ops Manager under the Advanced Features tab of the Tanzu Platform for CF tile as “Diego Cell memory capacity”
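The usage formula for the memory capacity property can be worked through as simple arithmetic. The sketch below is illustrative only: the function name and the sample figures are made up, and megabytes are assumed as the unit, so check which unit the tile expects for this property.

```python
def executor_memory_capacity_mb(cell_memory_mb, kernel_memory_mb, bosh_jobs_memory_mb):
    """Memory to offer app containers: cell memory minus system overhead.

    Mirrors the usage formula: Diego Cell memory - (kernel + BOSH jobs).
    All arguments are assumed to be in megabytes.
    """
    overhead_mb = kernel_memory_mb + bosh_jobs_memory_mb
    capacity_mb = cell_memory_mb - overhead_mb
    if capacity_mb <= 0:
        raise ValueError("system overhead meets or exceeds total cell memory")
    return capacity_mb


# Made-up figures: a 64 GiB cell, 1 GiB for the kernel, and 4 GiB for BOSH jobs
# (which include vxlan-policy-agent and its in-memory ASG rules).
capacity = executor_memory_capacity_mb(
    cell_memory_mb=65536, kernel_memory_mb=1024, bosh_jobs_memory_mb=4096
)
print(capacity)
```

Because vxlan-policy-agent memory grows with the number of ASG rules, the BOSH jobs figure is best taken from measurements at peak rule count rather than from an idle cell.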

 

Drawbacks of Increasing Sync Intervals

The primary drawback of increasing sync intervals is that rule changes take longer to be applied to active containers.

The maximum time taken for ASG changes made in CAPI to be reflected in existing application containers running on Diego Cells is:

policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval

For ASG changes made in CAPI before a new application is scheduled on a Diego Cell, the maximum time taken for the changes to be applied is:

vxlan_policy_agent_asg_update_interval

If the ASG change was made during a time window when Diego is starting an app affected by that ASG change, the max time taken to apply it is:

policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval

In that case, there is a chance the container will start with the ASG rules that were defined before the change. However, within the next sync window its rules are updated to the latest definitions.
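The three worst-case bounds above can be computed directly from the two interval settings. This is a small sketch using the default values of 60 seconds from the property descriptions. The function name and the dictionary keys are hypothetical labels for the three scenarios.

```python
def max_asg_propagation_seconds(policy_server_asg_syncer_interval,
                                vxlan_policy_agent_asg_update_interval):
    """Return the worst-case ASG propagation time for each scenario, in seconds."""
    both_hops = policy_server_asg_syncer_interval + vxlan_policy_agent_asg_update_interval
    return {
        # Change must reach policy-server, then the agent on the cell.
        "existing_containers": both_hops,
        # Change was made before the new application was scheduled.
        "new_container_after_change": vxlan_policy_agent_asg_update_interval,
        # Change was made while Diego was starting the affected app.
        "container_starting_during_change": both_hops,
    }


bounds = max_asg_propagation_seconds(
    policy_server_asg_syncer_interval=60,
    vxlan_policy_agent_asg_update_interval=60,
)
for scenario, seconds in bounds.items():
    print(f"{scenario}: {seconds}s")
```

Recomputing these bounds with any proposed interval values gives a quick view of how much slower rule changes would propagate before you commit to the change.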