Preparing Deployment step takes a long time

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

The Preparing Deployment step takes a long time for the cf deployment (19 minutes 42 seconds in the example below), while the same step completes in about 2 seconds for a small deployment such as wavefront.

Cause

The cf deployment with the long template render time has an extremely high number of VMs. The number of VMs does have a direct effect on the template render time, but it is not always easy to reason about why.

This is especially true when you make a naive comparison based on the number of VMs, e.g.:

The wavefront deployment has 7 VMs and template rendering takes 2 seconds.
The cf deployment has 636 VMs. If we expected the "same" per-VM render time, we would expect it to take around 3 minutes ((2/7) * 636 = ~182 seconds), but it takes much longer than that: 19 minutes 42 seconds.
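
The naive per-VM extrapolation above can be reproduced with a small helper. This is only an illustrative sketch: the function name naive_estimate_seconds is hypothetical, and it assumes awk is available on the machine.

```shell
# Naive (and, as explained below, misleading) per-VM extrapolation.
# Arguments: reference prepare time in seconds, reference VM count, target VM count.
naive_estimate_seconds() {
  awk -v secs="$1" -v ref_vms="$2" -v vms="$3" \
    'BEGIN { printf "%.0f\n", (secs / ref_vms) * vms }'
}

# wavefront: 2 seconds for 7 VMs; cf: 636 VMs
naive_estimate_seconds 2 7 636   # prints 182
```

The observed time of 19 minutes 42 seconds is 1182 seconds, roughly 6.5 times this estimate, which is why the per-VM model does not hold up.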

This is confusing, but it is because we are not comparing apples to apples: rendering isn't done once per VM, it is done once per job on each VM. Job templates may need to include values specific to the VM they run on, so each job has to be rendered separately for every instance.

For what it's worth, there have been efforts (some successful) to improve performance in this area, since it is one of the more expensive parts of a deployment. It is extremely challenging to optimize, however, because job templates have no fixed API. Templates are currently permitted to execute arbitrary Ruby as part of the render process, so in general the problem of making templates render faster is the same as making Ruby run faster, which many smart core Ruby developers have been working on for years with mostly small, incremental improvements. Other approaches may bear fruit, e.g. some kind of smart caching that pre-processes the template so that subsequent VMs in the same instance group render faster, but that leads to fairly complex caching strategies, which could be very dangerous if edge cases are missed.

Challenges with optimization aside, we hope the per-template/per-job view makes the timing you are seeing with this deployment more understandable. With this principle in mind, we can compare the two deployments, and you will see that the per-template render time is both reasonable and consistent; there is just a lot more rendering in the cf deployment case.

A short bash function (it requires yq and jq) can crunch the numbers from an input manifest:

# Prints jobs x instances (= renders) for each instance group in a manifest, plus totals.
# Usage: analyze_manifest_renders <manifest.yml>
function analyze_manifest_renders() {
  yq -o=json '.instance_groups' "$1" | jq -r '
    . as $g
    | map(" \(.name) \(.instances) instance, \(.jobs | length) jobs = \((.jobs | length) * (.instances)) renders") | join("\n") as $out1
    | $g | " Total VMs: \(map(.instances) | add) Total renders: \(map((.jobs | length) * (.instances)) | add)" as $out2
    | "\($out1)\n\n\($out2)"'
}

Using this, you can look again at our wavefront vs. cf deployment comparison:

Wavefront

analyze_manifest_renders ./support_bundle_20251006173628/deployed_manifest_and_configs/wavefront-nozzle-35d1872438958f70241a/manifest_last_successful_20251006160138.yml
wavefront_proxy 1 instance, 2 jobs = 2 renders
pas-exporter-counter 1 instance, 2 jobs = 2 renders
pas-exporter-gauge 1 instance, 2 jobs = 2 renders
pas-exporter-container 1 instance, 2 jobs = 2 renders
pas-exporter-timer 1 instance, 2 jobs = 2 renders
tas-exporters 1 instance, 11 jobs = 11 renders
telegraf_agent 1 instance, 2 jobs = 2 renders
telegraf_standalone 0 instance, 2 jobs = 0 renders
Total renders: 23
Prepare Time: 2 seconds
Per-template-time: 2/23 = 0.08695652173 seconds (about 87.0 ms)

cf deployment

analyze_manifest_renders ./support_bundle_20251006173628/deployed_manifest_and_configs/cf-d0c56a4b0add97a015a0/manifest_last_successful_20251001141633.yml
database 0 instance, 1 jobs = 0 renders
blobstore 0 instance, 1 jobs = 0 renders
control 0 instance, 1 jobs = 0 renders
compute 0 instance, 1 jobs = 0 renders
nats 2 instance, 8 jobs = 16 renders
nfs_server 0 instance, 8 jobs = 0 renders
mysql_proxy 2 instance, 8 jobs = 16 renders
mysql 3 instance, 12 jobs = 36 renders
backup_restore 1 instance, 30 jobs = 30 renders
diego_database 3 instance, 14 jobs = 42 renders
uaa 8 instance, 9 jobs = 72 renders
cloud_controller 16 instance, 26 jobs = 416 renders
cloud_controller_worker 16 instance, 7 jobs = 112 renders
ha_proxy 0 instance, 1 jobs = 0 renders
diego_brain 2 instance, 15 jobs = 30 renders
router 22 instance, 8 jobs = 176 renders
tcp_router 6 instance, 8 jobs = 48 renders
mysql_monitor 1 instance, 7 jobs = 7 renders
diego_cell 470 instance, 25 jobs = 11750 renders
loggregator_trafficcontroller 16 instance, 10 jobs = 160 renders
log_cache 32 instance, 11 jobs = 352 renders
clock_global 2 instance, 32 jobs = 64 renders
doppler 32 instance, 10 jobs = 320 renders
credhub 2 instance, 8 jobs = 16 renders
Total renders: 13663
Prepare Time: 19m 42s (1182 seconds)
Per-template-time: 1182/13663 = 0.08651101515 seconds (86.5 ms)
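
To see at a glance which instance groups dominate a listing like the one above, the text output can be piped through a small ranking helper. This is a sketch under assumptions: top_render_groups is a hypothetical name, and it expects per-group lines in the "name N instance, M jobs = R renders" form shown in the listings.

```shell
# Hypothetical helper: rank instance groups by render count, largest first.
# Reads analyze_manifest_renders-style text on stdin; optional argument is
# the number of groups to show (default 5).
top_render_groups() {
  # Keep only per-group lines, emit "<renders> <name>", then sort descending.
  awk '$3 == "instance," && $NF == "renders" { print $(NF-1), $1 }' \
    | sort -rn \
    | head -n "${1:-5}"
}
```

For example, `analyze_manifest_renders manifest.yml | top_render_groups 3` would list the three instance groups that contribute the most renders; lines such as the totals are ignored because they do not match the per-group pattern.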

So, as you can see, both deployments have approximately the same per-template render time, just under 1/10th of a second. The cf deployment simply renders almost 600 times as many templates (13,663 versus 23), largely because of the number of diego cells being deployed: diego_cell alone accounts for 11,750 of the renders, about 86% of the total.
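
The same arithmetic can be scripted to check the per-render cost and to project a prepare time from render counts instead of VM counts. This is a rough sketch, not an official sizing tool: the function names per_render_ms and projected_prepare_seconds are hypothetical, and the projection assumes the per-render cost stays roughly constant between deployments, as it does in the two examples here.

```shell
# Per-render cost in milliseconds: prepare time (seconds) / total renders.
per_render_ms() {
  awk -v secs="$1" -v renders="$2" \
    'BEGIN { printf "%.1f\n", secs / renders * 1000 }'
}

# Projected prepare time in seconds for a target deployment, scaled by render
# count (not VM count) from a reference deployment's measured time.
# Arguments: reference seconds, reference renders, target renders.
projected_prepare_seconds() {
  awk -v secs="$1" -v ref_renders="$2" -v renders="$3" \
    'BEGIN { printf "%.0f\n", secs / ref_renders * renders }'
}

per_render_ms 2 23                     # wavefront: prints 87.0
per_render_ms 1182 13663               # cf: prints 86.5
projected_prepare_seconds 2 23 13663   # prints 1188
```

Projecting from the wavefront numbers gives about 1188 seconds for the cf deployment, within a few seconds of the observed 1182, whereas the per-VM extrapolation suggested only about 3 minutes.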

Resolution

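This is normal behaviour. The time spent in the Preparing Deployment step scales with the total number of template renders (jobs multiplied by instances, summed over all instance groups), not with the number of VMs alone.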