Duplicate key value violation while applying the compute profile with TKGI clusters

Article ID: 345639

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:

  • The 'tkgi update-cluster $cluster --compute-profile $profile' command fails, and subsequent retries or bosh deploys fail with a database error about a duplicate key violation.

  • bosh task shows an error similar to:

    Task 6840762 | 07:48:56 | Error: PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "link_providers_constraint"
    DETAIL:  Key (deployment_id, instance_group, name, type)=(32, worker-large-dev, docker, job) already exists.

     

Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Cause

If the initial apply of the compute profile fails for any reason (for example, an unrelated timeout, an unresponsive agent, or an infrastructure problem), subsequent deploys fail with the duplicate key violation database error above until the violation is addressed.

It appears that the initial, failed apply updates the BOSH Director database even though the bosh deployment itself fails. The retry then fails because it does not expect those records to already exist in the database.
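
In practice, this shows up as repeated errored deploy tasks for the cluster's service-instance deployment. The following is a rough sketch of how you might confirm that and capture the failing task ID used in the workaround below; the deployment name pattern and task ID are only placeholders taken from the example above.

    # Recent tasks for TKGI deployments; repeated errored 'create deployment'
    # tasks for the same deployment point to this issue.
    bosh tasks --recent=30 | grep service-instance_

    # Confirm that a given task failed with the duplicate key violation
    # (6840762 is the example task ID from the Symptoms section).
    bosh task 6840762 --debug | grep 'duplicate key value'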

Resolution

This is a known issue, and there is currently no permanent fix.


Workaround:

  1. Find the deployment ID in the logs (it is included in the error message):

    {"time":1632124136,"error":{"code":100,"message":"PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint \"link_providers_constraint\"\nDETAIL:  Key (deployment_id, instance_group, name, type)=(32, worker-large-dev, docker, job) already exists.\n"}}', "result_output" = '', "context_id" = '225d2ac9-9415-4de4-ba7e-99696ce23484' WHERE ("id" = 6840762)


    From the above "Key" values, (deployment_id, instance_group, name, type)=(32, worker-large-dev, docker, job), the deployment_id is 32.
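
    As an optional shortcut, the same Key values can be pulled straight from the failed task's debug output; a rough sketch, using the example task ID 6840762 from the Symptoms section:

    # Print the offending key, including the deployment_id, from the failed task's debug log.
    bosh task 6840762 --debug | grep -o 'Key (deployment_id, instance_group, name, type)=([^)]*)' | head -1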

  2. Next, SSH into the BOSH Director VM and become root.

  3. Start the director console: 

    /var/vcap/jobs/director/bin/console
  4. Execute the following commands in the console, replacing 32 with the actual deployment ID found in step 1. These delete the link provider and consumer records for the deployment; they are safe to delete because the subsequent bosh deployment recreates them.

    Bosh::Director::Models::Links::LinkProvider.where(deployment_id: 32).map(&:delete)
    
    Bosh::Director::Models::Links::LinkConsumer.where(deployment_id: 32).map(&:delete)
    
    exit

     

  5. Get a copy of the manifest used in the last failed deploy.

    bosh task {task_id_of_the_failed_deploy} --debug | awk '/Manifest:/{f=1;next} /D,|I,/{f=0} f' | bosh int - > manifest.yml
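
    The deployment name needed for the next step can be read from the extracted manifest; a quick sanity check (assuming the manifest.yml produced above):

    # Confirm the manifest parses and note its top-level deployment name for step 6.
    bosh int manifest.yml | grep -m1 '^name:'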

     

  6. Perform a normal bosh deploy with the retrieved manifest. 

    bosh -d $service-instance deploy manifest.yml --fix
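
    Before moving on, it may be worth confirming that the deployment converged; a minimal check:

    # All instances should report 'running' after a successful deploy.
    bosh -d $service-instance instances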

     

  7. Once the 'bosh deploy' command completes successfully, run the original 'tkgi update-cluster $cluster --compute-profile $compute-profile-name' command again, as this will run the necessary errands on the cluster.
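
    For example, with hypothetical cluster and profile names, the final step and a follow-up status check could look like this:

    # Re-apply the compute profile (cluster and profile names are placeholders).
    tkgi update-cluster my-cluster --compute-profile my-compute-profile

    # Verify the cluster's last action completed successfully.
    tkgi cluster my-cluster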