After executing a bosh deployment, some virtual machines may appear hung. Users will not be able to login via SSH or direct console. This issue is commonly observed during a new deployment of the ha_proxy job from Pivotal RabbitMQ tile. Users may also observe this hung state after a service restart or reboot of a virtual machine which uses the Metron agent.
Common observable symptoms include:
Error 450002: Timed out sending `get_task'
ubuntu@pivotal-ops-manager:~$ bosh ssh rabbitmq-haproxy/0 Setting up ssh artifacts Director task 1193 Error 450001: Action Failed ssh: Creating user: Shelling out to useradd: Running command: 'useradd -m -b /var/vcap/bosh_ssh -s /bin/bash bosh_60tkmg3s4', stdout: '', stderr: 'useradd: cannot lock /etc/passwd; try again later. ': exit status 1
There is a known race condition that will cause rsyslog daemon to hang on start which blocks any application that attempts to write to /dev/log. For more detailed information see the GitHub issue #1188. Versions 8.21-8.22 are affected by this bug. Users can run the following command to see what version of rsyslog is installed
~$ rsyslogd -v rsyslogd 8.22.0, compiled with: PLATFORM: x86_64-pc-linux-gnu PLATFORM (lsb_release -d): FEATURE_REGEXP: Yes GSSAPI Kerberos 5 support: No FEATURE_DEBUG (debug build, slow code): No 32bit Atomic operations supported: Yes 64bit Atomic operations supported: Yes memory allocator: system default Runtime Instrumentation (slow code): No uuid support: Yes Number of Bits in RainerScript integers: 64 See http://www.rsyslog.com for more information.
Workaround
When stuck in the hung state a hard reboot of the virtual machine should restore access
~$bosh vms --details Deployment 'p-rabbitmq-5f35e0b36d7b817ae30d' Director task 1232 Task 1232 done +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+ | VM | State | AZ | VM Type | IPs | CID | Agent ID | Resurrection | Ignore | +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+ | rabbitmq-haproxy/0 (47828356-55fc-43fd-a98c-645b9e33c66b) | failing | az1 | small | 10.193.90.104 | vm-6ab1c8c9-8f93-4aeb-8837-cfde43a4d63a | 0016905a-04f9-41bc-afc9-9945f595e755 | paused | false | +-----------------------------------------------------------+---------+-----+---------+---------------+-----------------------------------------+--------------------------------------+--------------+--------+
Attempting the deployment again may result in a successful deployment. If that happens, users can apply this directive to the /etc.”./rsyslog.conf and restart rsyslog with command “service rsyslog resta.” Please note any redeployment of the modified VM will result in omission of the param change and could return the hang condition.
global(processInternalMessages="on")
Fix