When a Patroni cluster is reinitialized, the replica sync process is repeatedly interrupted and becomes stuck in a loop after partially copying data.
The Patroni logs show many error messages like the following:
ERROR: failed to update leader lock
INFO: demoted self because failed to update leader lock in DCS
These are accompanied by a call trace:
2024-06-15 04:05:03,915 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 271, in api_execute
    raise ex
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 255, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 232, in _do_http_request
    raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
The etcd logs show a number of heartbeats that were not sent on time:
etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.061946ms)
There are also many occurrences of:
etcd: server is likely overloaded.
The etcd server is overloaded, so Patroni cannot update the leader lock in the DCS in time. The leader therefore demotes itself and is promoted again, over and over, during the reinitialization phase. This is most likely the typical issue described in the article below, which indicates that the etcd server is not performing well enough:
https://www.crunchydata.com/blog/patroni-etcd-in-high-availability-environments
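
To gauge how often the etcd warnings quoted above occur, the etcd log can be scanned for them. The following is a minimal sketch, not part of Patroni or etcd: the function name summarize_etcd_log and the returned keys are made up for illustration, and it assumes the log lines have the same wording as the two messages shown above, with durations in Go's short form (for example 100ms or 5.061946ms).

```python
import re

# Matches the heartbeat warning quoted above. The unit list is an assumption
# based on Go's duration formatting (ns, µs/us, ms, s).
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)(?P<t_unit>ns|µs|us|ms|s) timeout "
    r"for (?P<over>[\d.]+)(?P<o_unit>ns|µs|us|ms|s)\)"
)
OVERLOAD_MARKER = "server is likely overloaded"

# Conversion factors from each unit to milliseconds.
UNIT_TO_MS = {"ns": 1e-6, "µs": 1e-3, "us": 1e-3, "ms": 1.0, "s": 1000.0}


def summarize_etcd_log(lines):
    """Count heartbeat and overload warnings in an iterable of log lines.

    Hypothetical helper: returns the number of each warning and the largest
    reported heartbeat overrun, converted to milliseconds.
    """
    overruns_ms = []
    overloaded = 0
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            value = float(match.group("over"))
            overruns_ms.append(value * UNIT_TO_MS[match.group("o_unit")])
        if OVERLOAD_MARKER in line:
            overloaded += 1
    return {
        "heartbeat_warnings": len(overruns_ms),
        "overloaded_warnings": overloaded,
        "max_overrun_ms": max(overruns_ms, default=0.0),
    }
```

The function accepts any iterable of strings, so an open log file can be passed directly. A steadily growing count during the reinitialization window would be consistent with the overload diagnosis above, although it does not by itself identify the cause (disk latency, CPU, or network).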