Deploying, scaling or upgrading DB clusters fails with etcd issues

Article ID: 372980

Updated On:

Products

VMware Data Services Manager

Issue/Introduction

Symptoms:


When deploying, scaling, or upgrading DB clusters (Postgres or MySQL) with DSM, the new workload cluster nodes sometimes fail to be added to the cluster.

This manifests as the following status condition on the DB cluster:

"internal error creating Kubernetes cluster: number of ready replicas differ: expected=3, actual=2: error provisioning Kubernetes cluster"

 

When looking at the VM console, you may see an error message like:

"etcdserver: re-configuration failed due to not enough started members"

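To confirm the status condition from the command line, you can inspect the DB cluster resource on the DSM provider VM, as sketched below. The resource kind, cluster name, and namespace reuse the MySQL example from the Resolution section (mysqlcluster mysql-01 in mysql-default); substitute the values for your own DB cluster.

  • List the DB clusters and their status:
    • # kubectl get mysqlcluster -A
  • Show the full status conditions of a specific cluster:
    • # kubectl describe mysqlcluster mysql-01 -n mysql-default
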
Environment

VCF and Data Services Manager (DSM) 2.1

Cause

This occurs when the second member added to the etcd cluster has not fully started by the time the third member is added.

Resolution

To remediate this, take the following steps to delete the problematic node; a replacement node will be created automatically. A consolidated command sketch follows the list.

  • SSH to the provider VM as root and list the addresses of the workload cluster nodes:
    • # kubectl get ipaddress -A
  • SSH to one of the healthy workload cluster nodes:
    • ssh -i /opt/vmware/tdm-provider/provisioner/sshkey capv@<ip-of-node>
  • Become root on the node:
    • sudo su -
  • List the containers running on this node and find the etcd container ID:
    • # crictl ps | grep etcd
  • Exec into this container. Note that bash is not available, only sh:
    • # crictl exec -it <container-id> sh
  • List the etcd members:
    • # etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list
    • You will see an "unstarted" member in the status column; note its member ID.
  • Remove the unstarted member:
    • # etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member remove <bad-member-id>
  • Exit the container and the node to return to the DSM provider shell.
  • Delete the bad Machine from the Kubernetes service:
    • # kubectl delete machine -n mysql-default <name-of-bad-machine>
  • A new VM will be created and added to the cluster. To make the DSM provisioner reconcile the DB cluster immediately, make a small edit to it, such as changing the description:
    • # kubectl edit mysqlcluster mysql-01
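
For reference, the whole procedure is sketched below as one consolidated command sequence. Treat it as an outline rather than a script: the node IP, etcd container ID, member ID, and Machine name are placeholders you must look up as you go, and the namespace and cluster name (mysql-default, mysql-01) follow the MySQL example above, so adjust them for your own DB cluster. In this sketch, lines starting with "#" are comments, not prompts.

  # On the DSM provider VM (as root): find the workload cluster node addresses
  kubectl get ipaddress -A

  # SSH to a healthy workload cluster node and become root
  ssh -i /opt/vmware/tdm-provider/provisioner/sshkey capv@<ip-of-healthy-node>
  sudo su -

  # Find the etcd container and open a shell inside it (only sh is available)
  crictl ps | grep etcd
  crictl exec -it <etcd-container-id> sh

  # Inside the etcd container: list the members, note the ID of the "unstarted" one, then remove it
  etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key member list
  etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key member remove <bad-member-id>

  # Back on the DSM provider VM: delete the corresponding Machine so a replacement is created,
  # then make a small edit to the DB cluster to trigger an immediate reconcile
  kubectl delete machine -n mysql-default <name-of-bad-machine>
  kubectl edit mysqlcluster mysql-01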

Additional Information

This issue is scheduled to be fixed in DSM version 2.1.1.