Upgrade from PKS 1.5.1 to 1.6 fails while running flyway migration due to a deadlock
search cancel

Upgrade from PKS 1.5.1 to 1.6 fails while running flyway migration due to a deadlock

book

Article ID: 298729

calendar_today

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

When upgrading the Pivotal Container Service (PKS) tile from PKS 1.5.1 to PKS 1.6, the task fails while running post start scripts as seen below.
Task 735 | 19:11:04 | Preparing deployment: Preparing deployment (00:00:13)
Task 735 | 19:11:17 | Preparing deployment: Rendering templates (00:00:09)
Task 735 | 19:11:27 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 735 | 19:11:27 | Compiling packages: golang-1-linux/a58647561918778c52fb932074e6099497a9063bce3a1531e5c60ba643e1f4df (00:03:11)
Task 735 | 19:14:38 | Compiling packages: bosh-dns/58c7e157ad34bbd80c0d13dc6d1e3e073c8f0daf071bba8b4ca2ce0f10a3dbac (00:00:59)
Task 735 | 19:15:57 | Updating instance pivotal-container-service: pivotal-container-service/f7d3b2ff-8218-4ef0-a592-a12be7362d2d (0) (canary) (00:19:09)
                   L Error: Action Failed get_task: Task e707da8a-e82f-4636-5666-830647b5690a result: 1 of 7 post-start scripts failed. Failed Jobs: pks-api. Successful Jobs: galera-agent, broker, telemetry-server, bosh-dns, send-api-started-event, uaa.
Task 735 | 19:35:06 | Error: Action Failed get_task: Task e707da8a-e82f-4636-5666-830647b5690a result: 1 of 7 post-start scripts failed. Failed Jobs: pks-api. Successful Jobs: galera-agent, broker, telemetry-server, bosh-dns, send-api-started-event, uaa.

Task 735 Started  Thu Oct 31 19:11:04 UTC 2019
Task 735 Finished Thu Oct 31 19:35:06 UTC 2019
Task 735 Duration 00:24:02
Task 735 error
The post start logs fails with the following errors:
Waiting for PKS to start. Connection timeout
Waiting for PKS to start. Connection timeout
Waiting for PKS to start. Connection timeout
Waiting for PKS to start. Connection timeout
The PKS API process fails to start as the migration has failed midway. The below errors can be seen in /var/vcap/sys/log/pks-api/pks-api.log:
2019-10-31 19:25:31.953  INFO 15615 --- [           main] io.pivotal.pks.CleanupMigration1_13      : Try to clean up migration 1.13
2019-10-31 19:25:31.954  INFO 15615 --- [           main] io.pivotal.pks.CleanupMigration1_13      : Migration 1.13 not found
2019-10-31 19:25:31.957  WARN 15615 --- [           main] o.f.c.i.database.mysql.MySQLDatabase     : You are connected to a MySQL 5.7 database using the MariaDB driver. This is known to cause issues. An upgrade to Oracle's MySQL JDBC driver is highly recommended.
2019-10-31 19:25:32.039  INFO 15615 --- [           main] o.f.core.internal.command.DbValidate     : Successfully validated 28 migrations (execution time 00:00.065s)
2019-10-31 19:25:32.048  INFO 15615 --- [           main] o.f.core.internal.command.DbMigrate      : Current version of schema `pks`: 1.22
2019-10-31 19:25:32.053  INFO 15615 --- [           main] o.f.core.internal.command.DbMigrate      : Migrating schema `pks` to version 1.23 - add cluster limit to quota
2019-10-31 19:25:32.398  INFO 15615 --- [           main] o.f.core.internal.command.DbMigrate      : Migrating schema `pks` to version 1.24 - upgrade task initator
2019-10-31 19:25:33.003  INFO 15615 --- [           main] o.f.core.internal.command.DbMigrate      : Migrating schema `pks` to version 1.25 - add cluster last completed action columns
2019-10-31 19:25:34.293  WARN 15615 --- [           main] com.zaxxer.hikari.pool.ProxyConnection   : HikariPool-1 - Connection org.mariadb.jdbc.MariaDbConnection@4e517165 marked as broken because of SQLSTATE(08000), ErrorCode(1220)

java.sql.SQLNonTransientConnectionException: (conn=9) Connection is closed
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:240) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:171) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbStatement.executeExceptionEpilogue(MariaDbStatement.java:248) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:338) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbStatement.execute(MariaDbStatement.java:389) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbConnection.rollback(MariaDbConnection.java:773) ~[mariadb-java-client-2.4.3.jar!/:na]
        at com.zaxxer.hikari.pool.ProxyConnection.rollback(ProxyConnection.java:370) ~[HikariCP-3.2.0.jar!/:na]
        at com.zaxxer.hikari.pool.HikariProxyConnection.rollback(HikariProxyConnection.java) [HikariCP-3.2.0.jar!/:na
Due to the failure, the transaction never gets rolled back leaving the database in an inconsistent state.
2019-10-31 19:25:34.296 ERROR 15615 --- [           main] o.f.c.internal.jdbc.TransactionTemplate  : Unable to rollback transaction

java.sql.SQLNonTransientConnectionException: (conn=9) Connection is closed
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:240) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:171) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbStatement.executeExceptionEpilogue(MariaDbStatement.java:248) ~[mariadb-java-client-2.4.3.jar!/:na]
        at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:338) ~[mariadb-java-client-2.4.3.jar!/:na]

and later

2019-10-31 19:25:34.302 ERROR 15615 --- [           main] o.f.c.internal.jdbc.TransactionTemplate  : Unable to restore autocommit to original value for connection

java.sql.SQLException: Connection is closed
        at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:489) ~[HikariCP-3.2.0.jar!/:na]
        at com.sun.proxy.$Proxy119.setAutoCommit(Unknown Source) ~[na:na]
        at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:388) ~[HikariCP-3.2.0.jar!/:na]
        at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java) ~[HikariCP-3.2.0.jar!/:na]
This sequence of events put the PKS API and DB in an inconsistent state where the API process is not able to restart and repeatedly fails while applying migration. API startup fails with the message Duplicate column name 'last_completed_action':
2019-10-31 19:26:21.021  INFO 16136 --- [           main] o.f.core.internal.command.DbMigrate      : Current version of schema `pks`: 1.24
2019-10-31 19:26:21.023  INFO 16136 --- [           main] o.f.core.internal.command.DbMigrate      : Migrating schema `pks` to version 1.25 - add cluster last completed action columns
2019-10-31 19:26:21.050 ERROR 16136 --- [           main] o.f.core.internal.command.DbMigrate      : Migration of schema `pks` to version 1.25 - add cluster last completed action columns failed! Please restore backups and roll back database and code!
2019-10-31 19:26:21.097  WARN 16136 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'flywayInitializer' defined in class path resource [org/springframework/boot/autoconfigure/flyway/FlywayAutoConfiguration$FlywayConfiguration.class]: Invocation of init method failed; nested exception is org.flywaydb.core.internal.command.DbMigrate$FlywayMigrateException:
Migration V1.25__add_cluster_last_completed_action_columns.sql failed
---------------------------------------------------------------------
SQL State  : 42S21
Error Code : 1060
Message    : (conn=19) Duplicate column name 'last_completed_action'
Location   : db/migration/V1.25__add_cluster_last_completed_action_columns.sql (/var/vcap/data/packages/pks-api/175cdd42c4a69a99ca97c4a5921fe33096e008aa/file:/var/vcap/data/packages/pks-api/175cdd42c4a69a99ca97c4a5921fe33096e008aa/pks-api.jar!/BOOT-INF/classes!/db/migration/V1.25__add_cluster_last_completed_action_columns.sql)
Line       : 1
Statement  : ALTER TABLE cluster
  ADD COLUMN last_completed_action varchar(100)
This issue renders the PKS API unhealthy and unusable:
pks login -a pks.corp.local -u pksadmin -p VMware1! --skip-ssl-validation
Error: Unable to send a request to API: Head https://pks.corp.local:9021/actuator/info: dial tcp 10.40.14.4:9021: connect: connection refused


Environment

Product Version: 1.5
OS: Ubuntu

Resolution

To recover from this scenario, the following actions are needed.

Note: Before proceeding make sure the error messages and the migration number it failed on are same as described in the logs above.

1. bosh ssh to the PKS API VM, stop the pks-api process and connect to the MySQL database.
monit stop pks-api

mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf
2. Verify the success column of the flyway_schema_history table for migration version 1.25 is set to 0.
mysql> select * from flyway_schema_history where version='1.25';
+----------------+---------+-------------------------------------------+------+------------------------------------------------------+-------------+--------------+---------------------+----------------+---------+
| installed_rank | version | description                               | type | script                                               | checksum    | installed_by | installed_on        | execution_time | success |
+----------------+---------+-------------------------------------------+------+------------------------------------------------------+-------------+--------------+---------------------+----------------+---------+
|             25 | 1.25    | add cluster last completed action columns | SQL  | V1.25__add_cluster_last_completed_action_columns.sql | -1441600681 | pks          | 2019-10-31 19:26:21 |             27 |       0 |
+----------------+---------+-------------------------------------------+------+------------------------------------------------------+-------------+--------------+---------------------+----------------+---------+
1 row in set (0.00 sec)
3. Update the success column's value from 0 to 1 and verify before committing the transaction.
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update flyway_schema_history set success=1 where version='1.25';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select version,description,script,success from flyway_schema_history where version='1.25';
+---------+-------------------------------------------+------------------------------------------------------+---------+
| version | description                               | script                                               | success |
+---------+-------------------------------------------+------------------------------------------------------+---------+
| 1.25    | add cluster last completed action columns | V1.25__add_cluster_last_completed_action_columns.sql |       1 |
+---------+-------------------------------------------+------------------------------------------------------+---------+
1 row in set (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.02 sec)
4. Restart all processes, monit restart all, and continue with the upgrade.