Import of vCenter with more than 400 hosts into SDDC Manager fails in vcf_brownfield.py script with an inventory import error or timeout

Products

VMware SDDC Manager

Issue/Introduction

When importing a vSphere environment into SDDC Manager with the VCF import tool, the operation fails if the environment contains more than 400 hosts, because the inventory payload exceeds the request limits configured in the SDDC Manager NGINX (HTTP status 413). The import script's output/vcf_brownfield.log file shows the following failure:

[2024-09-16 16:48:48,949] [ERROR] request_helper:28: Result status code from inventory import: 413
[2024-09-16 16:48:48,949] [CRITICAL] import_domain:203: Could not import the information for the new domain in SDDC Manager: Could not import inventory into SDDC Manager. Please review '/var/log/vmware/vcf/commonsvcs/vcf-commonsvcs.log' for further details.
[2024-09-16 16:48:48,949] [INFO] delete_domain:59: Locating domain information to roll back the inventory changes for domain with ID: {domain-id}
[2024-09-16 16:48:48,949] [INFO] sddc_manager_helper:540: Retrieving domain inventory for domain with id {domain-id}
[2024-09-16 16:48:48,949] [INFO] sddc_manager_helper:403: Using cached SDDC Manager token header
[2024-09-16 16:48:48,963] [ERROR] request_helper:28: Result status code from Get domain inventory for domain with id {domain-id}: 500
[2024-09-16 16:48:48,963] [WARNING] delete_domain:64: Could not find inventory information to clean up domain with ID: {domain-id}. Skipping roll back.
[2024-09-16 16:48:48,963] [INFO] vcf_brownfield:463: Could not complete import domain operation for vCenter: bion-vi-vc.example.com
[2024-09-16 16:48:48,964] [INFO] vcf_brownfield:353: Operation import completed on target: bion-vi-vc.example.com with status: FAIL in 2238.49s
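
To check whether a failed run matches this signature, the log can be scanned for request lines that report an error status code. The following is a minimal Python sketch and is not part of the VCF import toolset. The function name find_http_errors and both regular expressions are illustrative assumptions derived only from the log format shown above.

```python
import re

# Matches ERROR lines written by request_helper, for example:
# [2024-09-16 16:48:48,949] [ERROR] request_helper:28: Result status code from inventory import: 413
STATUS_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ERROR\] request_helper:\d+: "
    r"Result status code from (?P<op>.+): (?P<code>\d{3})"
)

# Terminal colour fragments such as "[91m" or "[00m" that may be left in the raw log.
ANSI = re.compile(r"\x1b?\[\d{1,2}m")


def find_http_errors(lines):
    """Return (timestamp, operation, status_code) for each failed request line."""
    hits = []
    for line in lines:
        clean = ANSI.sub("", line.strip())
        match = STATUS_LINE.match(clean)
        if match:
            hits.append((match["ts"], match["op"], int(match["code"])))
    return hits
```

Passing the lines of output/vcf_brownfield.log to find_http_errors and looking for an "inventory import" entry with status 413 indicates the request was rejected for its size rather than for another reason.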

Additionally, if the VCF inventory import stage passes, a subsequent step that seeds the vLCM cluster images from each cluster into SDDC Manager LCM can time out with the following error:

[2024-09-18 06:31:38,747] [DEBUG] utils:121: Retrying poll_sddc_manager_config_task_status in 20 seconds because of: <class 'Exception'>
[2024-09-18 06:31:58,768] [INFO] sddc_manager_helper:662: Polling status of task with URL: /v1/tasks/{task_id}
[2024-09-18 06:31:58,768] [INFO] sddc_manager_helper:403: Using cached SDDC Manager token header
[2024-09-18 06:31:58,792] [INFO] request_helper:22: Response status from polling SDDC Manager brownfield initialization workflow progress: 200
[2024-09-18 06:31:58,792] [DEBUG] sddc_manager_helper:674: Task status is in progress
[2024-09-18 06:31:58,792] [INFO] sddc_manager_helper:685: Task is still in progress, polling...
[2024-09-18 06:31:58,793] [INFO] utils:75: Phase '5. Brownfield initialization of SDDC Manager' completed with 1 warnings:
[2024-09-18 06:31:58,793] [INFO] utils:78: Could not monitor progress of task: Failed to monitor task due to: Task is still in progress, polling.... Please review '/var/log/vmware/vcf/domainmanager/domainmanager.log' for further details.
[2024-09-18 06:31:58,793] [INFO] vcf_brownfield:482: Could not complete import domain operation for vCenter: bion-vi-vc.example.com
[2024-09-18 06:31:58,793] [INFO] vcf_brownfield:353: Operation import completed on target: bion-vi-vc.w2-qes-001.rainpole.local with status: FAIL in 5898.93s

Environment

VCF 5.2.1

Cause

During the VCF import operation, the vcf_brownfield.py Python script attempts to import an inventory payload that exceeds the default NGINX limits, so NGINX blocks the API call that populates the SDDC Manager inventory. If that API call succeeds, the subsequent initial domain configuration step, which seeds vLCM cluster images into the SDDC Manager LCM service, can time out when the vCenter contains many vLCM clusters (more than 20).
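
The size check behind the 413 response can be illustrated with a short Python sketch. It assumes the upstream NGINX behaviour for client_max_body_size: a request body larger than the limit is rejected with 413, the documented upstream default is 1m, and a value of 0 disables the check. The limit actually in effect on a given SDDC Manager and the real size of the inventory payload are not stated in this article, so the 5 MiB figure in the usage note is purely hypothetical. The helper names are illustrative.

```python
# Multipliers for the size suffixes NGINX accepts (case-insensitive).
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_nginx_size(value):
    """Convert an NGINX size such as '1m' or '1g' into a number of bytes."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def would_be_rejected(payload_bytes, client_max_body_size):
    """Return True if NGINX would answer 413 for a request body of this size."""
    limit = parse_nginx_size(client_max_body_size)
    # A limit of 0 disables the body-size check entirely.
    return limit != 0 and payload_bytes > limit
```

For example, would_be_rejected(5 * 1024 ** 2, "1m") is True for a hypothetical 5 MiB payload, while the same payload against "1g" returns False.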

Resolution

Increase the NGINX limits for inventory API operations by editing /etc/nginx/nginx.conf.

SSH to SDDC Manager as the vcf user.
Switch to the root user with `su`.
Run the command below to edit the nginx.conf file:
vi /etc/nginx/nginx.conf
Add the line `client_max_body_size 1g;` to the following location block:
location ~ ^/v1/(tasks|network-pools|pscs|vxrail-managers|vcenters|sddc-managers|nsx-managers|vcf-services|users|roles|tokens|sso-domains|sddc-manager|securitySettings|identity-providers|resource-warnings|resource-functionalities|current-user|compliance|resource-locks|inventory|css)(.*) {
# auth_basic "closed site";
# auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://127.0.0.1:7100/v1/$1$2$is_args$args;
}

So that it becomes:

location ~ ^/v1/(tasks|network-pools|pscs|vxrail-managers|vcenters|sddc-managers|nsx-managers|vcf-services|users|roles|tokens|sso-domains|sddc-manager|securitySettings|identity-providers|resource-warnings|resource-functionalities|current-user|compliance|resource-locks|inventory|css)(.*) {
# auth_basic "closed site";
# auth_basic_user_file /etc/nginx/.htpasswd;
client_max_body_size 1g;
proxy_pass http://127.0.0.1:7100/v1/$1$2$is_args$args;
}
Restart the NGINX service:
systemctl restart nginx
Edit the Python file `api/sddc_manager/sddc_manager_helper.py`. The path is relative to the directory where the VCF import toolset is stored.
Change the values in the @Utils.retry line of this block from:
@Utils.retry(attempts=180, delay=20, should_log_attempts=True)
@RequestHelper.ignore_ssl_warnings
def poll_sddc_manager_config_task_status(self, task_url, warnings):

To:

@Utils.retry(attempts=1800, delay=20, should_log_attempts=True)
@RequestHelper.ignore_ssl_warnings
def poll_sddc_manager_config_task_status(self, task_url, warnings):
Retry the failed operation (convert or import) with the same CLI arguments.
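
The effect of the retry change can be estimated with simple arithmetic. The sketch below assumes that the retry decorator waits `delay` seconds between attempts and gives up after `attempts` tries, which is consistent with the "Retrying ... in 20 seconds" log line but is not confirmed here from the toolset source. The time spent on each request is ignored, and the function name is illustrative.

```python
def polling_window_seconds(attempts, delay):
    """Approximate upper bound on how long the retry loop keeps polling."""
    return attempts * delay


default_window = polling_window_seconds(attempts=180, delay=20)
extended_window = polling_window_seconds(attempts=1800, delay=20)

print(f"default:  {default_window} s ({default_window / 3600:.0f} h)")
print(f"extended: {extended_window} s ({extended_window / 3600:.0f} h)")
```

Under these assumptions the default settings allow roughly one hour of polling and the modified settings roughly ten hours. The delay between polls stays the same, so only the total waiting time grows.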