"exception: (vim.fault.GuestOperationsUnavailable)" and "Error executing script in node nfsd-xxxx: process not found" errors when creating Native Kubernetes Clusters with NFS nodes in VMware Cloud Director Container Service Extension 3.X

Article ID: 325547

Products

VMware Cloud Director

Issue/Introduction

Symptoms:
  • Creating Native Kubernetes Clusters with an NFS node in VMware Cloud Director Container Service Extension 3.X fails.
  • The native cluster fails to deploy successfully due to errors related to the NFS node.
  • The cse-server-debug.log on the CSE Server shows errors of the form:
| cluster_service_2_x:2877 - _execute_script_in_nodes | DEBUG :: about to execute script on nfsd-xxxx (vm='vim.VirtualMachine:vm-123'), wait=True
| cluster_service_2_x:2823 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1699 on vm 'vim.VirtualMachine:vm-123' to finish (1)
| cluster_service_2_x:2823 - _wait_for_guest_execution_callback | DEBUG :: exception, will retry in a few seconds, vm 'vim.VirtualMachine:vm-123'
| cluster_service_2_x:2825 - _wait_for_guest_execution_callback | ERROR :: exception: (vim.fault.GuestOperationsUnavailable) {
dynamicType = <unset>,
dynamicProperty = (vmodl.DynamicProperty) [],
msg = 'The guest operations agent could not be contacted.',
faultCause = <unset>,
faultMessage = (vmodl.LocalizableMessage) []
}
| cluster_service_2_x:2906 - _execute_script_in_nodes | ERROR :: Error executing script in node nfsd-xxxx: process not found (pid=1699) (vm='vim.VirtualMachine:vm-123')
Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py", line 2889, in _execute_script_in_nodes
callback=_wait_for_guest_execution_callback)
File "/usr/local/lib/python3.7/site-packages/vsphere_guest_run/vsphere.py", line 216, in execute_script_in_guest
callback=callback)
File "/usr/local/lib/python3.7/site-packages/vsphere_guest_run/vsphere.py", line 123, in execute_program_in_guest
raise e
File "/usr/local/lib/python3.7/site-packages/vsphere_guest_run/vsphere.py", line 89, in execute_program_in_guest
(pid, vm))
Exception: process not found (pid=1699) (vm='vim.VirtualMachine:vm-123')
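To check whether a given cse-server-debug.log shows both error signatures listed above, a small scan along the following lines may help. This is a minimal sketch: the function names `scan_log` and `matches_this_issue` are hypothetical helpers, not part of CSE, and the patterns are taken only from the log excerpt in this article.

```python
import re

# Patterns copied from the two error messages quoted in this article.
SIGNATURES = {
    "guest_ops_unavailable": re.compile(r"vim\.fault\.GuestOperationsUnavailable"),
    "process_not_found": re.compile(
        r"Error executing script in node \S+: process not found"
    ),
}


def scan_log(lines):
    """Return {signature name: [1-based line numbers where it appears]}."""
    hits = {name: [] for name in SIGNATURES}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits


def matches_this_issue(hits):
    """True only if both signatures appear at least once in the log."""
    return all(hits.values())
```

To use it, open the log file on the CSE Server (its location depends on the installation) and pass the file object to `scan_log`; the returned line numbers show where each error occurs.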


Environment

VMware Cloud Director 10.x

Cause

This issue can occur when VMware Guest Tools on the NFS VM are not yet ready at the point where Cloud Director Container Service Extension 3.X attempts to run its setup steps on that VM.
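The timing problem can be pictured as a readiness poll with a deadline that is too short. The sketch below is purely illustrative and is not CSE's implementation: `wait_until_ready`, `is_ready`, `timeout_s` and `interval_s` are hypothetical names chosen for this example.

```python
import time


def wait_until_ready(is_ready, timeout_s, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s seconds elapse.

    Returns True if the guest became ready in time, False otherwise.
    `sleep` and `clock` are injectable so the loop can be exercised
    without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If the guest agent needs five seconds to start, a two-second deadline returns False while a ten-second deadline returns True. The workaround in this article has the same effect of giving the NFS VM more time.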

Resolution

To resolve this issue, Cloud Director Container Service Extension 3.X can be reconfigured to wait longer for VMware Guest Tools to be ready on the NFS VM.
Follow the steps in the Workaround section below to apply this change.

Workaround:
  1. Log in over SSH to the Linux server where CSE 3.X is installed.
  2. Locate the file cluster_service_2_x.py on the CSE Server.
    The file's location depends on the individual CSE installation. It can be found with a find command, for example:

        find / -iname cluster_service_2_x.py
        
    The path returned should be similar to the following example, where <python_virtual_env_home> is the location of the Python installation used:

        <python_virtual_env_home>/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py
        
    In these steps we will use the following example install path:

        /root/cse-venv/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py
     
  3. Back up this existing file before making any changes:

        cp /root/cse-venv/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py /root/cluster_service_2_x.py.bak
     
  4. Open cluster_service_2_x.py for editing, for example with vi.
    Use the path found on your CSE Server in step 2 rather than the example path shown here:

        vi /root/cse-venv/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py
     
  5. Locate lines 2666-2668 in cluster_service_2_x.py, which contain the following original code:

       exec_results = _execute_script_in_nodes(
                            sysadmin_client, vapp=vapp, node_names=[vm_name],
                            script=script)

     
  6. Update the code as follows to pass an extra parameter, template_os=template.get('os').
    This brings the NFS node deployment more in line with that of the Control and Worker nodes:

       exec_results = _execute_script_in_nodes(
                            sysadmin_client, vapp=vapp, node_names=[vm_name],
                            script=script, template_os=template.get('os'))

     
  7. Save the file after making the changes.
  8. Restart the CSE service on the CSE Server. For example, if CSE is configured as a service:

        systemctl restart cse
     
  9. Create a new Native Kubernetes Cluster with an NFS node and confirm that the issue is resolved.
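
Steps 2, 3 and 6 can also be double-checked with a short script. The following is a minimal Python sketch under stated assumptions: `find_candidates`, `backup` and `calls_missing_template_os` are hypothetical helper names, not part of CSE, and the parenthesis matching is deliberately simple (it ignores parentheses inside strings or comments).

```python
import shutil
from pathlib import Path

CALL = "_execute_script_in_nodes("


def find_candidates(root, name="cluster_service_2_x.py"):
    """Return every file called `name` below `root` (case-sensitive)."""
    return sorted(Path(root).rglob(name))


def backup(path, backup_dir):
    """Copy `path` to <backup_dir>/<file name>.bak without overwriting."""
    dest = Path(backup_dir) / (Path(path).name + ".bak")
    if dest.exists():
        raise FileExistsError(dest)
    shutil.copy2(path, dest)
    return dest


def calls_missing_template_os(source):
    """Return 1-based line numbers of _execute_script_in_nodes(...) calls
    whose argument list does not contain `template_os=`."""
    missing = []
    start = source.find(CALL)
    while start != -1:
        # Skip the function definition itself ("def _execute_script_in_nodes(").
        line_start = source.rfind("\n", 0, start) + 1
        is_def = source[line_start:start].split() == ["def"]
        # Walk forward to the parenthesis that closes this call.
        depth, i = 1, start + len(CALL)
        while i < len(source) and depth:
            depth += {"(": 1, ")": -1}.get(source[i], 0)
            i += 1
        args = source[start + len(CALL):i - 1]
        if not is_def and "template_os=" not in args:
            missing.append(source.count("\n", 0, start) + 1)
        start = source.find(CALL, i)
    return missing
```

Pointing `find_candidates` at the Python virtual environment directory is usually quicker than searching from `/`. After the edit in step 6, reading the file with `Path(...).read_text()` and passing the text to `calls_missing_template_os` lists any remaining calls without the parameter; compare the reported line numbers with the call identified in step 5, since other calls in the file may legitimately differ.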