When a replica in a Patroni cluster is reinitialized, the replica sync process is interrupted and becomes stuck in a loop after partially copying data.
The Patroni logs show many error messages such as the following:
ERROR: failed to update leader lock
INFO: demoted self because failed to update leader lock in DCS
These are accompanied by a traceback:
2024-06-15 04:05:03,915 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 271, in api_execute
    raise ex
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 255, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 232, in _do_http_request
    raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
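To gauge how often the leader lock is being lost, the Patroni log can be tallied with a short script. The following is a minimal sketch, assuming the log has already been read into a list of lines. The function name `count_leader_lock_events` and the dictionary keys are hypothetical; the matched substrings are copied from the messages quoted above.

```python
from collections import Counter

# Substrings copied from the Patroni messages quoted above.
# The keys are illustrative labels, not Patroni terminology.
PATTERNS = {
    "lock_update_failed": "ERROR: failed to update leader lock",
    "self_demoted": "demoted self because failed to update leader lock in DCS",
    "etcd_unreachable": "EtcdConnectionFailed: No more machines in the cluster",
}


def count_leader_lock_events(lines):
    """Count how many log lines contain each known symptom substring."""
    counts = Counter({key: 0 for key in PATTERNS})
    for line in lines:
        for key, needle in PATTERNS.items():
            if needle in line:
                counts[key] += 1
    return counts
```

A high `self_demoted` count over a short period would be consistent with the repeated demotion described here, though it does not by itself identify the cause.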
The etcd logs show a number of failed heartbeats:
etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.061946ms)
There are also many messages like this:
etcd: server is likely overloaded.
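The two etcd warnings can be summarized the same way to see how far the heartbeats overran their timeout. This is a rough sketch, assuming the heartbeat message always has the shape shown above, with the timeout expressed in milliseconds and the overrun in µs, ms, or s. The function name `summarize_etcd_warnings` and the result keys are hypothetical.

```python
import re

# Matches the heartbeat warning quoted above. Assumption: the timeout is
# always printed in ms, and the overrun in one of µs, ms, or s.
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)ms timeout "
    r"for (?P<overrun>[\d.]+)(?P<unit>µs|ms|s)\)"
)
UNIT_TO_MS = {"µs": 0.001, "ms": 1.0, "s": 1000.0}


def summarize_etcd_warnings(lines):
    """Count heartbeat failures and overload warnings in etcd log lines."""
    overruns_ms = []
    overloaded = 0
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            value = float(match.group("overrun"))
            overruns_ms.append(value * UNIT_TO_MS[match.group("unit")])
        if "server is likely overloaded" in line:
            overloaded += 1
    return {
        "heartbeat_failures": len(overruns_ms),
        "max_overrun_ms": max(overruns_ms, default=0.0),
        "overloaded_warnings": overloaded,
    }
```

Comparing these counts with the timestamps of the reinitialization can help confirm whether the warnings cluster around the data copy.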
The etcd server is overloaded, so Patroni cannot update the leader lock in time and the leader is repeatedly demoted and promoted during the reinitialization phase. This behavior most likely matches the typical issue described in the article below, which points to inadequate etcd server performance:
https://www.crunchydata.com/blog/patroni-etcd-in-high-availability-environments