Replication broken in Multi-Site Replication service instance

search cancel

Replication broken in Multi-Site Replication service instance

book

Article ID: 293313

calendar_today

Updated On:

Products

VMware Tanzu SQL

Issue/Introduction

The secondary database in my Multi-Site Replication service instance has fallen behind the primary and no longer syncs. The /var/vcap/sys/log/pxc-mysql/mysql.err.log on the primary node has the following warning:

2021-04-01T19:39:09.784236Z 12 [Warning] Cannot replicate to server with server_uuid='0382797a-931e-11eb-8511-005056b75a4d' because the present server has purged required binary logs. The connecting server needs to replicate the missing transactions from elsewhere, or be replaced by a new server created from a more recent backup. To prevent this error in the future, consider increasing the binary log expiration period on the present server. The missing transactions are '40237815-931b-11eb-bed7-005056b76ba0:23-25'.

The /var/vcap/sys/log/pxc-mysql/mysql.err.log on the secondary node contains a section similar to the following:

2021-04-01T19:39:09.784757Z 1 [ERROR] Error reading packet from server for channel '': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '40237815-931b-11eb-bed7-005056b76ba0:1-22', and the missing transactions are '40237815-931b-11eb-bed7-005056b76ba0:23-25'. (server_errno=1236)
2021-04-01T19:39:09.784781Z 1 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '40237815-931b-11eb-bed7-005056b76ba0:1-22', and the missing transactions are '40237815-931b-11eb-bed7-005056b76ba0:23-25'.', Error_code: 1236
2021-04-01T19:39:09.784789Z 1 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.000004', position 5995

This article covers how to get the primary and secondary nodes back in sync.

Environment

Product Version: 10.0

Resolution

All of the following operations are done on the secondary node to get it back in sync with the primary node.

1. Stop the galera-init monit job, confirm it is stopped, and save the current data directory.

Note: This assumes the persistent disk is large enough for two copies of the database. If it isn't large enough, saving to some external location is recommended. This backup can be deleted following the successful syncing of the secondary.

# monit stop galera-init
# watch monit summary
# ps auxwww | grep mysqld
root     18076  0.0  0.0  12944  1088 pts/0    S+   20:15   0:00 grep --color=auto mysqld
# mv /var/vcap/store/pxc-mysql /var/vcap/store/pxc-mysql-BAK

2. The next step involves creating a new, empty database on the secondary, then populating the database via a jumpstart backup. This is done by running the pxc-mysql pre-start script.

# /var/vcap/jobs/pxc-mysql/bin/pre-start 
2021-04-01T20:24:13.408343266Z ----- waiting for bosh_dns
[wait] 2021-04-01T20:24:13.412031900Z INFO - using nameserver 169.254.0.2:53
[wait] 2021-04-01T20:24:13.412122168Z INFO - resolving upcheck.bosh-dns.
[wait] 2021-04-01T20:24:13.414198951Z DEBUG - [127.0.0.1 ::1]
[wait] 2021-04-01T20:24:13.414230044Z INFO - success
[wait] 2021-04-01T20:24:13.417648462Z INFO - using default resolver
[wait] 2021-04-01T20:24:13.417789122Z INFO - resolving upcheck.bosh-dns.
[wait] 2021-04-01T20:24:13.420345358Z DEBUG - [127.0.0.1 ::1]
[wait] 2021-04-01T20:24:13.420364510Z INFO - success
2021-04-01T20:24:13.427049125Z ----- pre-start setup script: set up ENV and logging
2021-04-01T20:24:13.445009101Z Total memory in bytes: 8364535808
2021-04-01T20:24:13.445047008Z Total disk in kilobytes: 20509308
2021-04-01T20:24:13.465653160Z ----- pre-start setup script: making /var/vcap/store/pxc-mysql and running /var/vcap/packages/pxc/bin/mysqld
2021-04-01T20:25:16.574085765Z ----- pre-start: galera-init started successfully

3. Start galera-init after prestart

monit start galera-init

4. Edit the /var/vcap/jobs/replication-setup/bin/post-start script to include access to monit. by default your process namespace will not have networking access to monit and jumpstart command will return this error when communicating to monit-actual process.

Error performing jumpstart Failed to monit stop galera-init: failed to make stop request for galera-init: Post "http://127.0.0.1:2822/galera-init": dial tcp 127.0.0.1:2822: i/o timeout

Insert two lines into the post-start script to workaround the timeout. See the two lines between "####_INSERT_MONIT_TIMEOUT_WORKAROUND_####" below

#!/bin/bash

set -euxo pipefail


export PATH="/var/vcap/packages/percona-xtrabackup-8.0/bin:${PATH}"

# Prepare permission for xtrabackup prepare
chown vcap:vcap /var/vcap/data/tmp

####_INSERT_MONIT_TIMEOUT_WORKAROUND_####
source /var/vcap/bosh/etc/monit-access-helper.sh 
permit_monit_access        
####_INSERT_MONIT_TIMEOUT_WORKAROUND_####

chpst -u vcap /var/vcap/packages/replication-setup/bin/jumpstart /var/vcap/jobs/replication-setup/config/jumpstart.yml

/var/vcap/packages/replication-setup/bin/replication-setup /var/vcap/jobs/replication-setup/config/replication.yml

5. run post-start

# /var/vcap/jobs/replication-setup/bin/post-start
+ chown vcap:vcap /var/vcap/data/tmp
+ su vcap -c '/var/vcap/packages/replication-setup/bin/jumpstart /var/vcap/jobs/replication-setup/config/jumpstart.yml'
2021/04/01 20:25:36 Starting jumpstart...
2021/04/01 20:25:36 Stopping galera-init...
2021/04/01 20:25:41 Clean up galera data directory...
2021/04/01 20:25:43 Restoring from jumpstart backup...
{"timestamp":"1617308743.437360287","source":"streaming-mysql-backup-client","message":"streaming-mysql-backup-client.Starting to take backup","log_level":1,"data":{"url":"https://3e687873-acda-4aaa-ae96-034b793f89a9.mysql.internal:8081/backup"}}
2021/04/01 20:25:45 Xtrabackup preparing...
2021/04/01 20:26:01 Starting galera-init...
2021/04/01 20:26:13 Galera-init started
2021/04/01 20:26:13 Waiting for db to be ready...
2021/04/01 20:26:43 Applying gtid purged...
2021/04/01 20:26:43 Done
+ /var/vcap/jobs/mysql-agent-user/bin/post-start
+ /var/vcap/packages/replication-setup/bin/replication-setup /var/vcap/jobs/replication-setup/config/replication.yml
+ /var/vcap/packages/pxc/bin/mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf -e 'SET SESSION SQL_LOG_BIN = 0;   Grant replication client on *.* to '\''mysql-metrics'\'';   SET SESSION SQL_LOG_BIN = 1;'

6. Following the successful execution of both scripts, the secondary node should be in sync with the primary node. To connect to the database to verify this, on the secondary run sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf then at the "mysql> " prompt:

mysql> use service_instance_db;
[...]
Database changed
mysql> show tables;

Additional queries can be run on the tables to verify the data.

Additionally, the /var/vcap/sys/log/pxc-mysql/mysql.err.log will show that replication has been started:

2021-04-01T20:26:43.417236Z 8 [Note] @@GLOBAL.GTID_PURGED was changed from '' to '40237815-931b-11eb-bed7-005056b76ba0:1-26'.
2021-04-01T20:26:43.417264Z 8 [Note] @@GLOBAL.GTID_EXECUTED was changed from '' to '40237815-931b-11eb-bed7-005056b76ba0:1-26'.
2021-04-01T20:26:43.490309Z 10 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='3e687873-acda-4aaa-ae96-034b793f89a9.mysql.internal', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2021-04-01T20:26:43.546800Z 11 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2021-04-01T20:26:43.555275Z 11 [Note] Slave I/O thread for channel '': connected to master 'f8a8ce35ceeb4e099ce603772a4a131a@3e687873-acda-4aaa-ae96-034b793f89a9.mysql.internal:3306',replication started in log 'FIRST' at position 4
2021-04-01T20:26:43.558603Z 12 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './mysql-relay.000001' position: 4

7. Remove the two lines you added in step 4 from the post-start script.

Feedback

thumb_up Yes

thumb_down No