Isolation segment upgrade failed due to hung NFS connections


Article ID: 298174


Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The customer performed a successful upgrade of the Isolation Segment tile from v4.0.14 to v5.0.4 in a test environment, but in production the same upgrade failed with the following error:
 
Task 150528 | 19:15:17 | Preparing deployment: Preparing deployment (00:00:04)
Task 150528 | 19:15:21 | Preparing deployment: Rendering templates (00:00:16)
Task 150528 | 19:15:37 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 150528 | 19:15:39 | Updating instance isolated_diego_cell: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:15:41 | L executing pre-stop: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:15:41 | L executing drain: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:16:15 | L stopping jobs: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:16:37 | L executing post-stop: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:16:41 | L installing packages: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:16:47 | L configuring jobs: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:16:47 | L executing pre-start: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:17:44 | L starting jobs: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)
Task 150528 | 19:18:29 | L executing post-start: isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary) (01:03:46)
                         L Error: Action Failed get_task: Task 5fe2d31d-92af-4488-456f-376537cff8a6 result: 1 of 6 post-start scripts failed. Failed Jobs: rep. Successful Jobs: silk-daemon, route_emitter, vxlan-policy-agent, bosh-dns, garden.
Task 150528 | 20:19:25 | Error: Action Failed get_task: Task 5fe2d31d-92af-4488-456f-376537cff8a6 result: 1 of 6 post-start scripts failed. Failed Jobs: rep. Successful Jobs: silk-daemon, route_emitter, vxlan-policy-agent, bosh-dns, garden.

Task 150528 Started Tue Jan 30 19:15:17 UTC 2024
Task 150528 Finished Tue Jan 30 20:19:25 UTC 2024
Task 150528 Duration 01:04:08
Task 150528 error

Updating deployment: Expected task '150528' to succeed but state is 'error'

Exit code 1

===== 2024-01-30 20:19:02 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=172.17.0.197 --deployment=p-isolation-segment-a32bf28b6d25f6fc89a3 deploy --no-redact /var/tempest/workspaces/default/deployments/p-isolation-segment-a32bf28b6d25f6fc89a3.yml"; Duration: 3862s; Exit Status: 1
Exited with 1.

--- bosh debug task output:
{"time":1706645965,"stage":"Updating instance","tags":["isolated_diego_cell"],"total":8,"task":"isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a (10) (canary)","index":1,"state":"failed","progress":100,"data":{"error":"Action Failed get_task: Task 5fe2d31d-92af-4488-456f-376537cff8a6 result: 1 of 6 post-start scripts failed. Failed Jobs: rep. Successful Jobs: silk-daemon, route_emitter, vxlan-policy-agent, bosh-dns, garden."}}
{"time":1706645965,"error":{"code":450001,"message":"Action Failed get_task: Task 5fe2d31d-92af-4488-456f-376537cff8a6 result: 1 of 6 post-start scripts failed. Failed Jobs: rep. Successful Jobs: silk-daemon, route_emitter, vxlan-policy-agent, bosh-dns, garden."}} ', "result_output" = '', "context_id" = '' WHERE ("id" = 150528)
D, [2024-01-30T20:19:25.445832 #1106102] [task:150528] DEBUG -- DirectorJobRunner: (0.001253s) (conn: 42940) COMMIT
I, [2024-01-30T20:19:25.445921 #1106102] [] INFO -- DirectorJobRunner: Task took 1 hour 4 minutes 9.771178410000175 seconds to process.
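
To examine the failing rep post-start script in more detail, the job logs can be pulled from the affected cell. A minimal sketch, using the deployment name and instance GUID taken from the task output above:

bosh -d p-isolation-segment-a32bf28b6d25f6fc89a3 logs isolated_diego_cell/4409ffa1-af28-4d85-b678-5ae29ef3160a --job=rep

The post-start output (typically post-start.stdout.log and post-start.stderr.log under /var/vcap/sys/log/rep/) shows why the rep job never reported healthy within its timeout.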


Environment

Product Version: 4.0

Resolution

Based on another, similar case, we believe that one or more containers were running applications with interrupted NFS service connections and could not be terminated. The only workaround was to perform a bosh recreate of the affected isolated Diego cell (command below).
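
Before recreating, the hung NFS I/O can usually be confirmed from the cell itself; processes blocked on an unreachable NFS server sit in uninterruptible sleep (state D) and cannot be killed. A rough check, assuming standard bosh ssh access to the cell (placeholders as in the command below):

bosh -d <deployment> ssh isolated_diego_cell/GUID
# processes stuck in uninterruptible sleep, often a sign of hung NFS I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# NFS mounts still present on the cell
mount -t nfs,nfs4

If such processes are found and the mounts cannot be cleared, recreate the affected cell: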
 
bosh -d <deployment> recreate isolated_diego_cell/GUID --no-converge --fix
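
Here, <deployment> is the isolation segment deployment name and GUID is the failing cell's instance ID from the task output (in this case p-isolation-segment-a32bf28b6d25f6fc89a3 and 4409ffa1-af28-4d85-b678-5ae29ef3160a). The --no-converge flag limits the operation to the specified instance instead of converging the whole deployment, and --fix lets BOSH recreate the VM even if its agent has become unresponsive.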