Initiating an Aria Automation upgrade or patch from Aria Suite Lifecycle fails with the error "LCMVRAVACONFIG90030" at the "upgrading Aria Automation" stage.
Checking the status of the upgrade with the command "vracli upgrade status --details" shows:
Duration: 1 minutes
Result: Preparation Error
Description: Preparation for upgrade has discovered problems. Review to error report below to correct the problems and try again. The services remained in working order.
The log file /var/log/vmware/prelude/upgrade-noop.log shows errors similar to the following:
[ERROR][<timestamp>][<node>][Exit Code: 255] Attempt failed to run command: /opt/scripts/upgrade/ssh-noop.sh.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to VMware Aria Automation Appliance 8.18.1
root@1083: Permission denied (publickey,password).
[ERROR][<timestamp>][<node>] Remote command failed: /opt/scripts/upgrade/ssh-noop.sh at host: <node>
[ERROR][<timestamp>][<node>] Remote command failed: /opt/scripts/upgrade/ssh-noop.sh at one or more nodes
The upgrade still fails after the resolution steps stated in KB-312221 have been attempted.
Environment
Aria Automation 8.x
Cause
This issue may be observed in any of the following scenarios:
The SSH configurations have incorrect permissions:
Expected permission is '700'.
The SSH configurations mentioned in /etc/ssh/sshd_config_effective and/or /etc/ssh/sshd_config_desired are corrupted.
Files may be empty.
The SSH configurations contain keys that follow the ordering of an incorrect version (for example, v1 keys on a version that expects v2 keys).
Steps to identify a version mismatch:
The /etc/ssh/sshd_config_effective and/or /etc/ssh/sshd_config_desired files contain SSH keys in the v1 order (the version used by earlier releases of Aria Automation), whereas the v2 order is expected (the version used by 8.16.x and later releases of Aria Automation).
Note: The versions of SSH keys available on each node are stored under /etc/ssh/keys/v1 and /etc/ssh/keys/v2 and can be reviewed using the command: vracli cluster exec -- bash -c "current_node; ls -laR /etc/ssh/"
This may be caused by an incomplete upgrade attempt in the past, leading to failure in updating the SSHD configurations.
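As a rough illustration of the "files may be empty" scenario above, the following sketch reports whether a given configuration file is missing or empty. The function name check_config_file is hypothetical and is not part of Aria Automation; it only uses standard shell file tests.

```shell
# Hypothetical helper (not shipped with Aria Automation):
# reports whether a configuration file is missing, empty, or non-empty.
check_config_file() {
    local file="$1"
    if [ ! -e "$file" ]; then
        echo "MISSING: $file"
        return 2
    elif [ ! -s "$file" ]; then
        # -s is true only when the file exists and has a size greater than zero
        echo "EMPTY: $file"
        return 1
    fi
    echo "OK: $file"
}
```

On a node, it could be called as check_config_file /etc/ssh/sshd_config_effective and check_config_file /etc/ssh/sshd_config_desired. A non-empty file can still be corrupted, so this only rules out the simplest case.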
Resolution
To resolve this issue, recreate the expected sshd configurations and re-run the upgrade:
On each of the Aria Automation nodes, perform the steps below to create the ideal configuration and determine which sshd configuration file needs to be replaced.
Identify deviation in configuration:
Connect to the node using SSH with the root user credentials.
Use the below command to generate a temporary file - 'sshd_ideal' with the ideal configuration expected to be held on this version of Aria Automation (8.17 and later should hold v2 version of ssh algorithms).
Compare the sshd_config_effective and sshd_config_desired files with the ideal configuration to identify any modification:
diff /tmp/sshd_ideal /etc/ssh/sshd_config_effective
diff /tmp/sshd_ideal /etc/ssh/sshd_config_desired
Alternatively, compare the checksums of these configurations.
Note: This can be run from a single node once the ideal configuration has been created on all the nodes.
Generate md5 checksum for the ideal sshd configuration: md5sum /tmp/sshd_ideal
Generate md5 checksum for the effective sshd configuration: vracli cluster exec -- bash -c "current_node; md5sum /etc/ssh/sshd_config_effective"
Generate md5 checksum for the desired sshd configuration: vracli cluster exec -- bash -c "current_node; md5sum /etc/ssh/sshd_config_desired"
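The checksum comparison above can also be scripted. The sketch below compares the MD5 checksum of one configuration file against the ideal file and prints whether they match. The function name compare_to_ideal is hypothetical, and the sketch assumes md5sum and awk are available on the node.

```shell
# Hypothetical helper: compare a configuration file with the ideal one by MD5 checksum.
# Usage: compare_to_ideal <ideal_file> <candidate_file>
compare_to_ideal() {
    local ideal="$1" candidate="$2" ideal_sum candidate_sum
    if [ ! -r "$ideal" ] || [ ! -r "$candidate" ]; then
        echo "UNREADABLE: $ideal or $candidate"
        return 2
    fi
    # Read from stdin so the output contains only the hash and "-", then keep the hash
    ideal_sum=$(md5sum < "$ideal" | awk '{print $1}')
    candidate_sum=$(md5sum < "$candidate" | awk '{print $1}')
    if [ "$ideal_sum" = "$candidate_sum" ]; then
        echo "MATCH: $candidate"
    else
        echo "DEVIATION: $candidate"
        return 1
    fi
}
```

For example, compare_to_ideal /tmp/sshd_ideal /etc/ssh/sshd_config_effective would print DEVIATION if the effective configuration differs from the ideal one. A matching checksum means the file contents are identical; it says nothing about file permissions.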
Remediate the observed deviation:
If either configuration deviates from the ideal configuration, it needs to be replaced with the ideal configuration:
Back up the existing effective and/or desired sshd configuration (the invalid one, either still using v1 keys or corrupted).
Verify that /home/root and /home/root/.ssh have the expected permissions (0700 / drwx------):
vracli cluster exec -- bash -c 'current_node; ls -la /home | grep -E "root root.*root"; ls -la /home/root | grep -E "\.ssh"'
Remediation, if the output does not match the expected result:
vracli cluster exec -- bash -c 'current_node; chmod 700 /home/root; chmod 700 /home/root/.ssh'
Exit the SSH session
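The permission check in the remediation steps above reads the mode from the ls output by eye. As a minimal sketch, the same check can be expressed with numeric modes. The function name check_dir_mode is hypothetical, and the sketch assumes the GNU coreutils version of stat (the -c '%a' format option) is present on the node.

```shell
# Hypothetical helper: verify that a directory has the expected octal mode (default 700).
# Assumes GNU stat, which prints the octal permission bits for the %a format.
check_dir_mode() {
    local dir="$1" expected="${2:-700}" actual
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 2
    fi
    actual=$(stat -c '%a' "$dir")
    if [ "$actual" = "$expected" ]; then
        echo "OK: $dir ($actual)"
    else
        echo "MISMATCH: $dir has $actual, expected $expected"
        return 1
    fi
}
```

Running check_dir_mode /home/root and check_dir_mode /home/root/.ssh on a node would report a MISMATCH for any directory that still needs chmod 700.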
Validate the configuration corrections performed:
Open a new SSH session.
Verify that the correct v2 version of the keys is in use: sshd -T
NOTE: The steps below are only for upgrade prechecks and can be skipped if you do not intend to upgrade:
Prepare the upgrade runtime directory:
/opt/scripts/upgrade/kube-check-health.sh "nodes, pods" "prep"
Prepare the SSH channel:
/opt/scripts/upgrade/ssh-config-nodes.sh && echo "Completed successfully" || echo "Failed"
Expected result: the last line of output is "Verification that nodes are able to connect to one another and to this node succeeded."