Interruptions in Patroni reinit

Article ID: 371463

Updated On: 07-04-2024

Products

VMware Tanzu SQL

Issue/Introduction

When a Patroni cluster is reinitialized (reinit), the replica sync process is repeatedly interrupted and becomes stuck in a loop after copying only part of the data.

The Patroni logs show many error messages such as the following:

ERROR: failed to update leader lock

INFO: demoted self because failed to update leader lock in DCS

These errors are accompanied by a traceback:

2024-06-15 04:05:03,915 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 271, in api_execute
    raise ex
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 255, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 232, in _do_http_request
    raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster

The etcd logs show that a number of heartbeats were not sent on time:

etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.061946ms)

The etcd logs also contain many occurrences of:

etcd: server is likely overloaded.
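
To confirm that a given environment shows the same pattern, the messages quoted above can be counted in collected log files. The following is a minimal sketch, not part of Patroni or etcd: the function name count_symptoms and the labels in SYMPTOMS are made up for illustration, and the search strings are copied from the log excerpts in this article.

```python
from collections import Counter

# Substrings copied from the log excerpts in this article.
# The keys are illustrative labels, not Patroni or etcd identifiers.
SYMPTOMS = {
    "leader_lock_update_failed": "ERROR: failed to update leader lock",
    "demoted_self": "demoted self because failed to update leader lock",
    "etcd_connection_failed": "EtcdConnectionFailed: No more machines in the cluster",
    "heartbeat_late": "failed to send out heartbeat on time",
    "server_overloaded": "server is likely overloaded",
}


def count_symptoms(lines):
    """Count how many lines contain each known symptom message."""
    counts = Counter({name: 0 for name in SYMPTOMS})
    for line in lines:
        for name, needle in SYMPTOMS.items():
            if needle in line:
                counts[name] += 1
    return dict(counts)
```

The function accepts any iterable of lines, so an open log file can be passed directly, for example count_symptoms(open(path, errors="replace")) where path points to a saved Patroni or etcd log. High counts for heartbeat_late and server_overloaded alongside the Patroni errors suggest the situation described in this article.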

Resolution

The etcd server is overloaded, which causes Patroni to fail to update the leader lock and to repeatedly demote and promote the leader during the reinit phase. This behavior most likely matches the typical issue described in the article below, and it indicates that the etcd server is not performing adequately:

https://www.crunchydata.com/blog/patroni-etcd-in-high-availability-environments
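
Because the "No more machines in the cluster" error means Patroni could not reach any etcd member, it can help to probe each member while a reinit is running. The sketch below assumes the etcd members expose the standard /health endpoint on their client URLs; the helper names parse_health and check_endpoint are hypothetical, and the endpoint URLs must be supplied for the environment in question.

```python
import json
import urllib.request


def parse_health(body):
    """Interpret the JSON body of etcd's /health endpoint, e.g. {"health": "true"}."""
    try:
        value = json.loads(body).get("health")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False
    return value in (True, "true")


def check_endpoint(client_url, timeout=2.0):
    """Return True if the etcd member at client_url reports itself healthy.

    Connection errors, timeouts and HTTP error statuses all count as unhealthy.
    """
    url = client_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except OSError:
        return False
```

Calling check_endpoint in a loop for each member during the reinit, with a short timeout, shows whether members become unreachable or unhealthy at the moments Patroni loses the leader lock.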
