Salt master overloaded after upgrade to 3005 branch

Article ID: 368788

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

A number of performance issues were detected in the affected Salt versions (see Environment), causing Salt to be unstable in certain situations. Symptoms include:

  • More connections than expected from the Salt minion to the Salt master
  • High memory usage on the Salt master
  • Timeouts when attempting to publish commands from the Salt master CLI
  • Monitoring network traffic with tcpdump may show that some minions are repeatedly reconnecting to the Salt master
  • Running a "test.ping" function results in a lot of [Not Connected] results from minions
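
To gauge how widespread the last symptom is, the output of a test.ping run can be tallied. The following is a minimal Python sketch, not an official tool. It assumes the default text output, where each minion ID sits on its own unindented line ending in a colon and the result follows on an indented line, and it matches the "Not connected" marker case-insensitively. The function name summarize_ping is hypothetical.

```python
def summarize_ping(output: str) -> dict:
    """Tally test.ping results from the (assumed) default text output."""
    results = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        # An unindented line ending in ':' is taken to be a minion ID.
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
        elif current is not None:
            # The first indented line after an ID is treated as its result.
            results[current] = line.strip()
            current = None
    not_connected = sorted(
        minion
        for minion, result in results.items()
        if "not connected" in result.lower()
    )
    return {"total": len(results), "not_connected": not_connected}


if __name__ == "__main__":
    import sys

    # Example: salt '*' test.ping | python3 summarize_ping.py
    summary = summarize_ping(sys.stdin.read())
    print(f"{len(summary['not_connected'])} of {summary['total']} not connected")
```

A steadily growing count across repeated runs would be consistent with the reconnect behavior described above, although it does not by itself confirm the cause.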

Environment

Salt versions greater than 3005.1 but less than 3006.8
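
When auditing many hosts, the range above can be checked programmatically. This is a minimal Python sketch that treats both bounds as exclusive, as stated above. It assumes version strings of the form "3006.1" (optionally preceded by a program name, as in "salt 3006.1"); the helper names are hypothetical.

```python
import re

# Bounds taken from the Environment statement above (both exclusive).
LOWER = (3005, 1)
UPPER = (3006, 8)


def parse_salt_version(text: str) -> tuple:
    """Extract (major, minor) from a string such as '3006.1' or 'salt 3006.1'."""
    match = re.search(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized Salt version: {text!r}")
    # A missing minor component (e.g. '3005') is treated as 0.
    return int(match.group(1)), int(match.group(2) or 0)


def is_affected(text: str) -> bool:
    """Return True if the version lies strictly between 3005.1 and 3006.8."""
    return LOWER < parse_salt_version(text) < UPPER
```

For example, is_affected("3006.7") returns True, while is_affected("3005.1") and is_affected("3006.8") both return False.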

Cause

A myriad of performance issues, including Pillar-related issues, cause extra load on the Salt master, leading to timeouts when processing commands, eventually

Resolution

Customers who experience any of these symptoms should upgrade to a more recent, supported version of Salt as soon as possible, upgrading the Salt masters first and then the minions. See the Salt Project documentation for upgrade instructions: https://docs.saltproject.io/salt/install-guide/en/latest/topics/upgrade.html

If you cannot upgrade immediately, the following steps may help mitigate the issues:

  • Confirm that your Salt master OS has been tuned to handle the scale you are trying to achieve
    • See SaltProject documentation, https://docs.saltproject.io/en/latest/ref/configuration/master.html#master-large-scale-tuning-settings
  • Confirm that you have properly configured your minions for scale
  • Check disk space on your Salt masters
  • Try restarting the Salt master
    • It may take time for minions to reconnect to the Salt master post-restart, so wait 10 minutes before re-testing
  • Try restarting all of the Salt minions
    • A command like salt \* cmd.run 'sleep 30 && salt-call --local service.restart salt-minion && sleep 30' bg=True usually works, but test in a dev environment first as this command may leave minions disconnected from the Salt master. Depending on your environment, you may need to include the full path to the salt-call command, or on Windows the full path to the salt-call.bat command. 
    • Wait 10 to 20 minutes for minions to reconnect
    • Another possible method to restart minions is to place a script on the minion to restart the minion daemon, and then use cron or the Windows task scheduler to run the script in the near future. 
  • If you have recently upgraded from a Salt version earlier than 3006 to version 3006 or later, you may run into an issue where the default user changed from "root" to "salt". Try manually setting the user in your configuration to "root" to overcome any permissions issues that may be causing some of your symptoms.
  • On the Salt master, try running salt-run state.event to see what events your Salt master is trying to process and help get a better idea of your Salt master workload.
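
The event stream from salt-run state.event can be hard to read by eye on a busy master, so a rough tally by tag prefix may help. The following Python sketch assumes the non-pretty output format, with one event per line consisting of the tag, whitespace, and then a JSON payload. The function name tally_event_tags and the grouping depth are illustrative choices, not part of Salt.

```python
import sys
from collections import Counter


def tally_event_tags(lines, depth: int = 2) -> Counter:
    """Count events by the first `depth` segments of each event tag."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The tag is assumed to be the first whitespace-delimited field.
        tag = line.split(None, 1)[0]
        counts["/".join(tag.split("/")[:depth])] += 1
    return counts


if __name__ == "__main__":
    # Example: capture events to a file for a few minutes, then run
    #   python3 tally_events.py < events.txt
    for prefix, count in tally_event_tags(sys.stdin).most_common():
        print(f"{count:8d}  {prefix}")
```

A tally dominated by authentication-related tags rather than job tags could point toward the minion reconnect symptom, but treat this only as a starting point for investigation.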

Again, the above actions may only mitigate the symptoms until you are able to upgrade, and they may need to be repeated if the symptoms recur.
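
Because these checks may need to be repeated, the disk space check from the list above can be scripted. This is a minimal Python sketch using only the standard library. The paths shown (/var/cache/salt/master and /var/log/salt) and the 10% threshold are assumptions; adjust them to match your master configuration.

```python
import os
import shutil

# Assumed locations; substitute the cachedir and log paths from your config.
DEFAULT_PATHS = ["/var/cache/salt/master", "/var/log/salt"]


def low_disk_paths(paths, min_free_fraction: float = 0.10) -> list:
    """Return (path, free_fraction) pairs for paths below the free-space threshold."""
    flagged = []
    for path in paths:
        if not os.path.exists(path):
            continue  # skip paths that are not present on this host
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # avoid dividing by zero on pseudo filesystems
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            flagged.append((path, round(free_fraction, 3)))
    return flagged


if __name__ == "__main__":
    for path, free in low_disk_paths(DEFAULT_PATHS):
        print(f"LOW DISK: {path} has {free:.1%} free")
```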