Interruptions in Patroni reinit

Article ID: 371463

Updated On: 07-04-2024

Products

VMware Tanzu SQL

Issue/Introduction

When a Patroni cluster is reinitialized, the replica sync process is repeatedly interrupted and becomes stuck in a loop after copying only part of the data.

The Patroni logs show many error messages like the following:

ERROR: failed to update leader lock

INFO: demoted self because failed to update leader lock in DCS

These are accompanied by a traceback:

2024-06-15 04:05:03,915 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 271, in api_execute
    raise ex
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 255, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 232, in _do_http_request
    raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster

The etcd logs show that a number of heartbeats were not sent on time:

etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.061946ms)

The etcd logs also contain many occurrences of:

etcd:   server is likely overloaded.
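
To gauge how often these symptoms occur, the etcd log can be scanned for the two messages quoted above. The following is a minimal sketch, not a supported tool: the function name `summarize_etcd_log` and the counter keys are hypothetical, and the pattern assumes the overage is reported in milliseconds exactly as in the sample line (other units would need additional handling).

```python
import re
from collections import Counter

# Matches the heartbeat warning quoted in this article and captures
# the configured timeout and the amount by which it was exceeded.
# Assumption: both values are printed in milliseconds.
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)ms timeout for (?P<over>[\d.]+)ms\)"
)
OVERLOAD_TEXT = "server is likely overloaded"


def summarize_etcd_log(lines):
    """Count overload symptoms and collect heartbeat overages in ms."""
    counts = Counter()
    overages_ms = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts["heartbeat_late"] += 1
            overages_ms.append(float(match.group("over")))
        if OVERLOAD_TEXT in line:
            counts["overloaded"] += 1
    return counts, overages_ms


if __name__ == "__main__":
    # Sample lines copied from this article; replace with real log lines.
    sample = [
        "etcd: failed to send out heartbeat on time "
        "(exceeded the 100ms timeout for 5.061946ms)",
        "etcd:   server is likely overloaded.",
    ]
    counts, overages = summarize_etcd_log(sample)
    print(dict(counts), max(overages, default=0.0))
```

A high count of either message during the reinit window is consistent with the overload described under Resolution.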

Resolution

The etcd server is overloaded, which causes the Patroni node to be demoted and promoted continually during the reinit phase. This behavior most likely matches a typical issue described in the article below, which indicates that the etcd server is not performing adequately:

https://www.crunchydata.com/blog/patroni-etcd-in-high-availability-environments