How To Restart SQLFire Cluster When Failed Initial Replay For DDL?

Products

Pivotal GemFire XD

Environment

Additional Information

Applies To:

SQLFire 1.0.x to 1.1.x

Purpose:

This document contains a method for restarting a SQLFire cluster when the SQLFire servers have failed to restart due to failure on initial replay of DDL due to a dropped table.

Symptom:

A SQLFire server fails to restart when initializing replay of DDL with the following like exception:

-------------------
[severe 2014/05/08 08:40:19.707 JST <CacheServerLauncher#serverConnector> tid=0xe] FabricDatabase: failed initial replay for DDL [create table sampleschema.sampletable1
as select * from sampleschema.sampletable1_parent
WITH NO DATA
REPLICATE
PERSISTENT ASYNCHRONOUS
SERVER GROUPS(ServerGroup1)] due to exception with severity=20000
java.sql.SQLSyntaxErrorException(42X05): Table/View ’SAMPLESCHEMA.SAMPLETABLE1_PARENT’ does not exist.

-------------------

Solution:

The configuration variable 'config-scripts' can be used to execute any sql script before the DDL replay, so, using it, we can recreate a table that was dropped by a user.

Note: an important precondition before following these steps is that redundancy is enabled on any partitioned tables, or that all the tables are replicated. Otherwise, there is a possibility of data loss.

Step1
Shutdown sqlf cluster using 'shut-down-all' command.
For example:
>sqlf shut-down-all -locators=samplehost:10334

Step2
Create new sql script with both a drop and create statement for sampletable1 table.
For example:
>cat myscript.sql
DROP TABLE IF EXISTS sampletable1;
CREATE TABLE sampletable1 ( user_id VARCHAR(20) PRIMARY KEY, create_date DATE );

Step3
Start server1 using -config-scripts parameter.
For example:
>sqlf server start -locators=samplehost[10334] -bind-address=samplehost -client-port=1527 -dir=server1 -config-scripts=/samplepath/sqlf-cluster/myscript.sql
Starting SQLFire Server using locators for peer discovery: samplehost[10334]
Starting network server for SQLFire Server at address samplehost/10.10.10.13[1527]
Logs generated in /samplepath/sqlf-cluster/server1/sqlfserver.log
SQLFire Server pid: 3003 status: waiting
Waiting for table sampleschema.sampletable1_parent (DiskId: e5a22163-c1d1-4899-92f7-0ebd15532862, Location: /samplepath/sqlf-cluster/server1/.) on:
[DiskId: 64b5941c-f327-49cf-b015-62cb2c8a1bda, Location: /samplepath/sqlf-cluster/server2/.]

Step4
Revoke other server's disk store with revoke-missing-disk-store command. The disk UUID is the one on which server1 is waiting while restarting.
For example:
>sqlf revoke-missing-disk-store 64b5941c-f327-49cf-b015-62cb2c8a1bda -locators=samplehost:10334
Connecting to distributed system: locators=samplehost[10334]
revocation was successful and no disk stores are now missing

Step5
The server1 should now be in a running state.
For example:
>sqlf server status -dir=server1
SQLFire Server pid: 3003 status: running
Distributed system now has 2 members.
Other members: samplehost(2375:locator):17582

Step6
Start the other server as a new member in a separate, clean directory.
For example:
>mkdir server2-new
>sqlf server start -locators=samplehost[10334] -bind-address=samplehost -client-port=1528 -dir=server2-new -config-scripts=/samplepath/sqlf-cluster/myscript.sql
Starting SQLFire Server using locators for peer discovery: samplehost[10334]
Starting network server for SQLFire Server at address samplehost/10.112.204.13[1528]
Logs generated in /samplepath/sqlf-cluster/server2/sqlfserver.log
SQLFire Server pid: 5564 status: running
Distributed system now has 3 members.
Other members: samplehost(4866:datastore)<v6>:9352, samplehost(4293:locator):39805