Live Patching of user world daemons

Article ID: 375947

Products

VMware vSphere ESXi

Issue/Introduction

With the ESX 9.0 release, the operating system supports updating certain user world entities without rebooting the host. This type of update is termed a Live Update (live patch). Because it is applied to a running system without a reboot, the update process is faster and more efficient. Live patching is currently supported for the following user world entities: user world daemons, application binaries, and security policy files.

The remediation process has two stages:

Scan: In this stage the host is scanned to ensure that it is live patchable and has the necessary resources to carry out the remediation.
Apply: In this stage the host is patched with the updated contents from the patch payload. Entities involved in the patching process may be restarted as required.

Remediation of a live patch that updates the above user world entities on a running ESX host can fail for multiple reasons. This article outlines the failure scenarios and the corresponding mitigations for restoring the stability of the ESX host.

Patching User World Daemons:
These are applications that run in the background and provide services in the ESX operating system. Patches to these daemons are applied by overlaying the existing executable file with a newer one, and patching a daemon binary requires a restart of the daemon/service. The remediation process updates the daemon application binary and patches the running daemon with the following steps:

Stop the daemon:
- Stop the currently running instance of the daemon that needs to be patched.
- Failure to stop the daemon will result in patching failure and the remediation process will exit without patching further components that are a part of the patch payload.
- The following log message appears in /var/run/log/syslog.log when this error occurs:
  
  syslog.log error message
  
  YYYY-MM-DDTHH:MM:SS Db(15) daemon_apply_published.py[1000081850]: restartDaemon: <DAEMON_NAME> Stop failed
- The following message appears in the vCenter UI when the remediation process encounters this error:
  
  Stop failure
  
  Live Patch - Daemon patching failed. Failure occurred during the stop call for <DAEMON_NAME>. Manual remediation recommended. Please refer to KB article: KB 375947.

- Recommended action: Refer to the Daemon stop failure section in the Resolution below.

Start the patched version of the daemon:

The patched version of the daemon is launched using the new init script, which may be updated as part of the remediation process.

The init script verifies the health of the newly patched daemon to ensure that it started successfully and is able to provide its services. However, the remediation process could encounter the following errors at this stage:

Failure to start the patched daemon binary: If the patched daemon fails to start, the patching process falls back to a recovery state in which it tries to restart the unpatched version of the daemon that was present at system boot.

The following messages appear in /var/run/log/syslog.log when this error occurs:

Failure to start the patched daemon

YYYY-MM-DDTHH:MM:SS Db(15) daemon_apply_published.py[1000081775]: restartDaemon: <DAEMON_NAME> Start failed.

YYYY-MM-DDTHH:MM:SS Db(15) daemon_apply_published.py[1000081775]: rollbackDaemon: Previous version of <DAEMON_NAME> started.

The following message will appear on the vCenter UI when this error is encountered:

Patched daemon start failure

Live Patch - Daemon patching failed. Failure occurred during the start of patched version of <DAEMON_NAME>. Restarted unpatched version. Please refer to KB article: KB 375947.
In case of this error, the remediation process attempts to restore the system to the previous stable state by launching the unpatched version of the binary. Even if the unpatched version of the binary is launched successfully, this is still considered a patch failure.
Recommended action: Refer to the Failure to start the patched daemon binary section in the Resolution below.

Failure to launch the unpatched daemon binary:
- If this error occurs, the system is in a degraded state. Without further manual intervention it may not be possible to stop or migrate VMs.
- The following message appears in /var/run/log/syslog.log when this error occurs:
  
  Rollback failure
  
  YYYY-MM-DDTHH:MM:SS Db(15) daemon_apply_published.py[1000081356]: rollbackDaemon: <DAEMON_NAME> rollback failed.
- The following message will appear on the vCenter UI when this error is encountered:
  
  Failure to start unpatched daemon
  
  Live Patch - Daemon patching failed. Failure occurred during restarting unpatched version of <DAEMON_NAME>. Manual remediation recommended. Please refer to KB article: KB 375947.
- Recommended action: Refer to the Failure to start the unpatched daemon binary section in the Resolution below.
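
The stop, start and rollback sequence described above can be summarized as a small decision flow. The following Python sketch is only an illustrative model of that sequence and is not the code of daemon_apply_published.py. The names `Outcome` and `patch_daemon`, and the three callables passed in, are hypothetical and exist only for this example.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    PATCHED = "patched daemon running"
    STOP_FAILED = "stop failed, remediation exits"
    ROLLED_BACK = "patched start failed, unpatched daemon restarted"
    ROLLBACK_FAILED = "rollback failed, host degraded"


def patch_daemon(stop: Callable[[], bool],
                 start_patched: Callable[[], bool],
                 start_unpatched: Callable[[], bool]) -> Outcome:
    """Model the daemon patching steps; each callable returns True on success."""
    if not stop():
        # The remediation process exits without patching further components.
        return Outcome.STOP_FAILED
    if start_patched():
        return Outcome.PATCHED
    # The patched binary did not come up: fall back to the unpatched version.
    if start_unpatched():
        # The host is stable again, but this still counts as a patch failure.
        return Outcome.ROLLED_BACK
    # Neither version is running: manual intervention is required.
    return Outcome.ROLLBACK_FAILED
```

Only the first outcome is a successful remediation. The other three correspond to the Stop failure, Patched daemon start failure and Failure to start unpatched daemon messages shown above.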

Patching Applications:
Besides daemons, any application can be patched. In contrast to daemons, applications are usually short-running and are launched by a daemon or by a user. While the patch is applied, the functionality of the patched application is verified, and a failure is raised if the new version cannot be launched.

If an instance of the patched binary is already running, that instance is not restarted. For example, if there is an open ssh session, the session must be closed to bring the host into a compliant state. The following errors could be encountered while patching an application:

Application verification failure:

Occurs when the verification command for the patched application returns a failure.
The following message appears in /var/run/log/syslog.log when this error occurs:

Application binary patch failure - syslog

YYYY-MM-DDTHH:MM:SS Db(15) daemon_helper_apply_published.py[1000343537]: verifyDaemonHelper: <DAEMON_HELPER_APPLICATION_NAME> verification failed

The following message will appear on the vCenter UI when this error is encountered:

Helper Daemon verify failure

Live Patch - Daemon helper patching failed. Verification of the helper binary <HELPER_DAEMON_NAME> failed (command '<VERIFY_CMD>'). Manual remediation recommended. Please refer to KB article: KB 375947.

Recommended action: Refer to the Failure to verify patched daemon helper section in the Resolution below.

If patching an application that a specific daemon depends on fails, a restart of the related daemon may be required. If this restart fails:
- The following message appears in /var/run/log/syslog.log when this error occurs:
  
  Helper application related Daemon restart failure
  
  YYYY-MM-DDTHH:MM:SS Db(15) daemon_helper_apply_published.py[1000343586]: rollbackDaemon: <DAEMON_NAME> rollback failed.
- The following message will be shown on the vCenter UI:
  
  Dependent daemon - unpatched version launch failure
  
  Live Patch - Daemon patching failed. Failure occurred during restarting unpatched version of <DAEMON_NAME>. Manual remediation recommended. Please refer to KB article: KB 375947.

- Recommended action: Refer to the Failure to launch unpatched dependent daemon section in the Resolution below.

Host reported as "Non-compliant" after the patch was successfully applied:
- This indicates that an instance of the unpatched version is still running.
- One such case is an ssh session that was opened before the patch was applied.
- The following message will appear on the vCenter UI:
  
  Host reporting Non-Compliant
  
  Following daemon helpers are not compliant, as unpatched instance(s) are still running: <APPLICATION_LIST>. Please refer to KB article: KB 375947.
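
The syslog messages quoted in the daemon and application sections above follow a small number of fixed patterns, so a log can be triaged with a short script. The following Python sketch matches only the message texts quoted in this article. The function names `classify` and `scan`, and the labels such as `stop_failed`, are hypothetical, and the sketch assumes that daemon and helper names contain no whitespace.

```python
import re

# Message patterns copied from the syslog examples in this article.
# Assumption: <DAEMON_NAME> contains no whitespace.
PATTERNS = [
    ("stop_failed", re.compile(r"restartDaemon: (\S+) Stop failed")),
    ("start_failed", re.compile(r"restartDaemon: (\S+) Start failed")),
    ("rollback_ok", re.compile(r"rollbackDaemon: Previous version of (\S+) started")),
    ("rollback_failed", re.compile(r"rollbackDaemon: (\S+) rollback failed")),
    ("verify_failed", re.compile(r"verifyDaemonHelper: (\S+) verification failed")),
]


def classify(line):
    """Return (label, name) for a known live patch message, else None."""
    for label, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return label, match.group(1)
    return None


def scan(lines):
    """Collect all known live patch messages from an iterable of log lines."""
    return [hit for hit in (classify(line) for line in lines) if hit]
```

As a usage example, `scan(open("/var/run/log/syslog.log", errors="replace"))` would list every matching message in the log. A `rollback_failed` entry is the one that indicates a degraded host.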

Patching Security Policy files:
Access domains specify the rules that extend or restrict access permissions to certain system components and services of the user world. These rules are provided in an access domain file, which can be live patched. Live patching an access domain file triggers a reload of all system-wide access permissions, thereby applying the newly patched security policies.

When live patching the VMK access domain fails, the following message will appear on the vCenter UI:

VMK access Live patch failure message

Loading default security policies has failed.

This failure can lead to a degraded state and impact certain operations (including VM stop and migration). The resolution is to restart the host, because the earlier security policies (from the unpatched system) cannot be reloaded.

Recommended action: Refer to the Failure to load patched security policies section in the Resolution below.

Other general errors/exceptions encountered during remediation process:

Exceptions encountered during remediation process:
- If an exception is encountered during the scan or apply stage of the remediation process, the following message will appear on the vCenter UI:
  
  Encountering exception during patching process
  
  Live Patch - Daemon patching failed. Failure occurred due to an unexpected exception:<EXCEPTION_TYPE>. Manual remediation recommended. Please refer to KB article: KB 375947.
  
  Recommended action:
  - If the exception was encountered during the scan stage, refer to the Handling failures during scanning stage section below.
  - If the exception was encountered during the apply stage, refer to the Failure due to exception during apply section below.
- The cluster upgrade stops.
- The compliance check reports the host as Non-compliant.
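
Each vCenter UI message quoted above points to one or two sections of the Resolution. As a triage aid, that correspondence can be written as a lookup table. The following Python sketch copies the message fragments and section titles from this article. The names `RESOLUTION_BY_FRAGMENT` and `resolution_sections` are hypothetical.

```python
# Fragment of the vCenter UI message -> Resolution section(s) to read.
# Fragments and section titles are copied from this article.
RESOLUTION_BY_FRAGMENT = [
    ("Failure occurred during the stop call for",
     ["Daemon stop failure"]),
    ("Failure occurred during the start of patched version of",
     ["Failure to start the patched daemon binary"]),
    # The same UI text is shown for a daemon and for a dependent daemon.
    ("Failure occurred during restarting unpatched version of",
     ["Failure to start the unpatched daemon binary",
      "Failure to launch unpatched dependent daemon"]),
    ("Verification of the helper binary",
     ["Failure to verify patched daemon helper"]),
    ("Loading default security policies has failed",
     ["Failure to load patched security policies"]),
    # Which section applies depends on the stage (scan or apply).
    ("Failure occurred due to an unexpected exception",
     ["Handling failures during scanning stage",
      "Failure due to exception during apply"]),
]


def resolution_sections(ui_message):
    """Return the Resolution section titles that match a vCenter UI message."""
    for fragment, sections in RESOLUTION_BY_FRAGMENT:
        if fragment in ui_message:
            return sections
    return []
```

Two entries return more than one section because the UI text alone does not identify the case. For those, the syslog messages or the remediation stage decide which section applies.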

Environment

VMware vSphere ESX 9.x

Resolution

This section provides details on how to recover the ESX host from incomplete upgrades.

General recommendation for manual intervention:

Before evacuating the VMs and rebooting the host, log in to the host and manually stop and restart the daemons.
- If the remediation process failed to start the unpatched binaries (daemon/application) during the rollback stage, contact support to obtain the steps to manually restart the unpatched binaries.
Evacuate the VMs and reboot the host:
- Manually vMotion/suspend/reboot the VMs.
- Use Maintenance Mode and reboot the host.
Perform a compliance scan after the reboot to check that the intervention succeeded.
If the ESX host cannot be rebooted, a manual rollback should be attempted.
- Refer to the following KB article on how to revert to a previous version of ESX: https://knowledge.broadcom.com/external/article/316592/

Handling failures during scanning stage:

If any errors/exceptions are encountered during the scan stage of the remediation process, the following actions can be performed to recover the host.
Recommended action: Perform a traditional host update:
- Evacuate the VMs manually using vMotion.
- Apply the patch using the traditional host upgrade process.

Handling failures during patch apply stage:

The following section details the recovery steps for errors/exceptions encountered during the apply stage of the remediation process.

Daemon stop failure:
- Manual intervention is needed to restart the problematic daemon.
- First, try to manually stop and start the failing daemon
  - Check if the daemon is running, by executing the following command in the local CLI:
    - /etc/init.d/<DAEMON_NAME> status
    - If the daemon is running, the following result from the above command can be observed:
      - <DAEMON_NAME> is running.
    - If the daemon is not running the following will be the result of the above command:
      - <DAEMON_NAME> is not running
  - If the daemon is not running, use the following command to start it.
    - /etc/init.d/<DAEMON_NAME> start

- If this attempt fails, contact support to obtain the steps to manually restart the unpatched daemon(s).
- Note that retrying the patch will not succeed, because the earlier attempt to apply it resulted in a failure.
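
The status check and start commands above can be combined into one helper. The following Python sketch is a minimal example under stated assumptions: it assumes a Python 3.7 or later interpreter, that the status text is written to standard output in the form quoted above, and that the start command exits with code 0 on success. The function names `is_running` and `ensure_started` are hypothetical.

```python
import subprocess


def is_running(status_output):
    """Interpret the output of '/etc/init.d/<DAEMON_NAME> status'."""
    # Check the negative form first, since both forms end in 'running'.
    if "is not running" in status_output:
        return False
    return "is running" in status_output


def ensure_started(daemon):
    """Start the daemon if it is not running; return True if it runs afterwards."""
    script = f"/etc/init.d/{daemon}"
    status = subprocess.run([script, "status"], capture_output=True, text=True)
    if is_running(status.stdout):
        return True
    # Assumption: a non-zero exit code from 'start' means the start failed.
    start = subprocess.run([script, "start"], capture_output=True, text=True)
    if start.returncode != 0:
        return False
    status = subprocess.run([script, "status"], capture_output=True, text=True)
    return is_running(status.stdout)
```

If `ensure_started` returns False, the manual start did not succeed and the next step is to contact support, as described above.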

Failure to start the patched daemon binary: This indicates that the daemon was not patched (remediation failure) and that the remediation process rolled the host back to the earlier state by successfully launching the unpatched version of the daemon. In this case the recommended action is:
- Migrate the existing workloads using vMotion or stop all VMs on the host.
- Perform a full system upgrade through the standard upgrade process.
Failure to start the unpatched daemon binary: In this case manual intervention is required to restart the problematic daemon.
- Check if there is a running instance of the given daemon, stop it as needed and start the patched version manually.
  - Check if the daemon is running, by executing the following command in the local CLI:
    - /etc/init.d/<DAEMON_NAME> status
    - If the daemon is running, the following result from the above command can be observed:
      - <DAEMON_NAME> is running.
    - If the daemon is not running the following will be the result of the above command:
      - <DAEMON_NAME> is not running
  - If the daemon is not running, use the following command to start it.
    - /etc/init.d/<DAEMON_NAME> start
- If this attempt fails, contact support to obtain the steps to manually restart the unpatched daemon.
- Once the daemon is restarted, migrate the existing VMs/workloads using vMotion.
- Reboot the server. This ensures that the newly patched binaries are launched.
- Check if the daemon is running (see the steps above).
- If the daemon does not come up after a full reboot, proceed with a rollback; see General recommendation for manual intervention.
Failure due to exception during apply:
- Check the logs for any of the above error cases and handle those failure cases first.
- Then, migrate the existing VMs/workloads using vMotion.
- Reboot the server - This will ensure that the newly patched binaries are launched.
Failure to verify patched daemon helper:
- Check the logs for any of the above error cases and handle those failure cases first.
- Migrate the existing VMs/workloads using vMotion.
- Reboot the server - This will ensure that the newly patched binaries are launched.
Failure to launch unpatched dependent daemon:
- Manual intervention is needed to restart the problematic dependent daemon.
  - Refer to /var/run/log/syslog.log to find the dependent daemon that failed to launch.
  - The log message should indicate the dependent daemon that should be manually restarted.
- As a first attempt, you can start the daemon by:
  - Check if the daemon is running, by executing the following command in the local CLI:
    - /etc/init.d/<DAEMON_NAME> status
    - If the daemon is running, the following result from the above command can be observed:
      - <DAEMON_NAME> is running
    - If the daemon is not running the following will be the result of the above command:
      - <DAEMON_NAME> is not running
  - If the daemon is not running, use the following command to start it.
    - /etc/init.d/<DAEMON_NAME> start
- Restarting the daemon without further arguments may still leave the system in a state with reduced functionality.
- If this impacts the ability to evacuate the host and perform a full host restart, contact support to obtain the steps to manually restart the unpatched dependent daemon.
- Once the daemon is restarted, migrate the existing VMs/workloads using vMotion.
- Reboot the server
Failure to load patched security policies:
- Migrate the existing VMs/workloads using vMotion.
- Reboot the system.
- After the reboot, the access policies are loaded from the newly patched security policy files.
