Preparing Deployment step takes a long time

Article ID: 415627

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

The Preparing Deployment step takes a long time, for example 19m 42s for the large cf deployment analysed in this article.

Cause

The cf deployment with the long template render time has an extremely high number of VMs. The number of VMs does have a direct effect on the template render time, but it is not always easy to reason about why.

This is especially true of a naive comparison based on the number of VMs alone, e.g.

the wavefront deployment has 7 VMs and template rendering takes 2 seconds.
the cf deployment has 636 VMs. If the per-VM render time were the same, rendering would take around 3 minutes ((2/7) * 636 = ~181 seconds), but it actually takes far longer: 19:42 (1182 seconds).
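
The naive per-VM extrapolation above can be reproduced with a small shell helper. This is only a sketch: the function name naive_estimate is hypothetical, and the inputs are the VM counts and render time quoted above.

```shell
# naive_estimate SMALL_VMS SMALL_SECS LARGE_VMS
# Scales the small deployment's render time linearly by VM count.
# awk is used because plain shell arithmetic is integer-only.
naive_estimate() {
    awk -v vms="$1" -v secs="$2" -v target="$3" \
        'BEGIN { printf "%.1f\n", (secs / vms) * target }'
}

naive_estimate 7 2 636   # prints 181.7 (seconds, about 3 minutes)
```

The observed 1182 seconds is more than six times this estimate, which is why the per-VM model does not hold.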

This is confusing, but only because the comparison is not apples-to-apples: rendering isn't per VM, it is per job, on every instance. Each job's templates are rendered separately for each instance because they may need to include values specific to that VM in order to manage that job.
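
As a rough illustration of the per-job principle, the number of renders for one instance group can be sketched as jobs multiplied by instances. The helper name group_renders is hypothetical; the figures are the diego_cell values from the cf manifest analysed later in this article.

```shell
# group_renders INSTANCES JOBS
# One render per job per instance in the instance group.
group_renders() {
    echo $(( $1 * $2 ))
}

group_renders 470 25   # diego_cell: prints 11750
group_renders 1 11     # tas-exporters: prints 11
```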

There have been efforts, some of them successful, to improve performance in this area, since it is one of the more expensive parts of a deployment. It is extremely challenging to optimize, however, because job templates have no fixed API: templates are permitted to execute arbitrary Ruby as part of the render process. In general, then, making templates render faster is the same problem as making Ruby run faster, which many smart core Ruby developers have been working on for years with mostly small, incremental improvements. Other approaches may bear fruit, e.g. some kind of smart caching that pre-processes a template so that subsequent VMs in the same instance group render faster, but these lead to somewhat complex caching strategies, which could be very dangerous if edge cases are missed.

Challenges with optimization aside, the per-job understanding should make the timing seen with this deployment more understandable. With this principle in mind, comparing the two deployments shows that the per-render time is both fairly reasonable and consistent; there is simply a lot more rendering in the cf deployment case:

The following bash function (it needs yq and jq on the PATH) crunches out the numbers from an input manifest:

# Prints renders (jobs * instances) per instance group, then the totals.
# Assumes the Go-based yq v4, whose -o=json flag emits JSON for jq.
function analyze_manifest_renders() {
    yq -o=json '.instance_groups' "$1" | jq -r '
        . as $g
        | map(" \(.name) \(.instances) instance, \(.jobs | length) jobs = \((.jobs | length) * (.instances)) renders") | join("\n") as $out1
        | $g | "  Total VMs: \(map(.instances) | add) Total renders: \(map((.jobs | length) * (.instances)) | add)" as $out2
        | "\($out1)\n\n\($out2)"'
}
 
Using this, you can look again at our wavefront vs. cf deployment comparison: 

Wavefront

 analyze_manifest_renders ./support_bundle_20251006173628/deployed_manifest_and_configs/wavefront-nozzle-35d1872438958f70241a/manifest_last_successful_20251006160138.yml
 wavefront_proxy 1 instance, 2 jobs = 2 renders
 pas-exporter-counter 1 instance, 2 jobs = 2 renders
 pas-exporter-gauge 1 instance, 2 jobs = 2 renders
 pas-exporter-container 1 instance, 2 jobs = 2 renders
 pas-exporter-timer 1 instance, 2 jobs = 2 renders
 tas-exporters 1 instance, 11 jobs = 11 renders
 telegraf_agent 1 instance, 2 jobs = 2 renders
 telegraf_standalone 0 instance, 2 jobs = 0 renders
 Total renders: 23
 Prepare Time: 2 seconds
 Per-template-time: 2/23 = 0.08695652173 seconds (87.0 ms)
 
cf deployment

 ❯ analyze_manifest_renders ./support_bundle_20251006173628/deployed_manifest_and_configs/cf-d0c56a4b0add97a015a0/manifest_last_successful_20251001141633.yml
  database 0 instance, 1 jobs = 0 renders
  blobstore 0 instance, 1 jobs = 0 renders
  control 0 instance, 1 jobs = 0 renders
  compute 0 instance, 1 jobs = 0 renders
  nats 2 instance, 8 jobs = 16 renders
  nfs_server 0 instance, 8 jobs = 0 renders
  mysql_proxy 2 instance, 8 jobs = 16 renders
  mysql 3 instance, 12 jobs = 36 renders
  backup_restore 1 instance, 30 jobs = 30 renders
  diego_database 3 instance, 14 jobs = 42 renders
  uaa 8 instance, 9 jobs = 72 renders
  cloud_controller 16 instance, 26 jobs = 416 renders
  cloud_controller_worker 16 instance, 7 jobs = 112 renders
  ha_proxy 0 instance, 1 jobs = 0 renders
  diego_brain 2 instance, 15 jobs = 30 renders
  router 22 instance, 8 jobs = 176 renders
  tcp_router 6 instance, 8 jobs = 48 renders
  mysql_monitor 1 instance, 7 jobs = 7 renders
  diego_cell 470 instance, 25 jobs = 11750 renders
  loggregator_trafficcontroller 16 instance, 10 jobs = 160 renders
  log_cache 32 instance, 11 jobs = 352 renders
  clock_global 2 instance, 32 jobs = 64 renders
  doppler 32 instance, 10 jobs = 320 renders
  credhub 2 instance, 8 jobs = 16 renders
  Total renders: 13663
Prepare Time: 19m 42s (1182 seconds)
Per-template-time: 1182/13663 = 0.08651101515 seconds (86.5 ms)
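
To see which instance groups dominate the render count, the per-group lines shown above can be post-processed. The following is a minimal sketch, assuming input lines in the same "name N instance, M jobs = R renders" shape; the function name rank_render_groups is hypothetical.

```shell
# Reads lines such as " diego_cell 470 instance, 25 jobs = 11750 renders"
# on stdin and prints "renders name share%" per group, largest first.
# Lines that do not end in "= N renders" (prompts, totals) are ignored.
rank_render_groups() {
    awk '$NF == "renders" && $(NF-2) == "=" {
             count[$1] = $(NF-1)
             total += $(NF-1)
         }
         END {
             for (g in count)
                 printf "%d %s %.1f%%\n", count[g], g,
                        (total ? 100 * count[g] / total : 0)
         }' | sort -k1,1nr
}

# Small made-up input to show the output shape:
printf ' a 1 instance, 2 jobs = 2 renders\n b 3 instance, 2 jobs = 6 renders\n' \
    | rank_render_groups
# 6 b 75.0%
# 2 a 25.0%
```

Piping the output of analyze_manifest_renders into this helper for the cf manifest would put diego_cell first, at roughly 86% of all renders (11750 of 13663).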
 
So, as you can see, both deployments have approximately the same per-template render time, roughly 87 ms (a little under 1/10th of a second). The cf deployment simply renders almost 600 times as many templates (13663 vs. 23), largely because of the number of diego cells being deployed.
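
The same arithmetic can be scripted, both to check the per-render time of a deployment and to project a prepare time from a render count. This is a sketch only: the function names are hypothetical, and the projection assumes the per-render time stays near the ~87 ms observed here, which may not hold in other environments.

```shell
# per_render_ms SECONDS RENDERS -> average milliseconds per template render
per_render_ms() {
    awk -v secs="$1" -v renders="$2" 'BEGIN {
        if (renders == 0) { print "no renders" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", secs * 1000 / renders
    }'
}

# estimate_prepare_secs RENDERS MS_PER_RENDER -> projected seconds
estimate_prepare_secs() {
    awk -v renders="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", renders * ms / 1000 }'
}

per_render_ms 2 23              # wavefront: prints 87.0
per_render_ms 1182 13663        # cf: prints 86.5
estimate_prepare_secs 13663 87  # prints 1189 (about 19m 49s)
```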

Resolution

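This is normal behaviour. The Preparing Deployment time scales with the total number of job template renders (jobs multiplied by instances, summed over all instance groups), not with the number of VMs alone.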