False alerts triggered in TMC SM 1.3 (agent_gateway_liveness_critical and api_gateway_liveness_critical)



Article ID: 375160


Updated On:

Products

VMware Tanzu Mission Control - SM

Issue/Introduction

After TMC SM is deployed, the alerts agent_gateway_liveness_critical and api_gateway_liveness_critical fire continuously. These are false alerts, and customers want to stop them from firing.

Environment

TMC SM 1.3 

Cause

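The alert rule for agent_gateway_liveness_critical is sum by (namespace) ((avg_over_time(olympus_build_info{service="agent-gateway-service"}[5m])) or up * 0 ) <= 1. In a TMC SM environment, the Prometheus up metric does not carry a namespace label, so the series added by the or up * 0 fallback are summed into a group with no namespace whose value is 0. That value satisfies the <= 1 condition, so the alert fires even though the service is healthy.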

Resolution

Fix:

The fix will be included in TMC SM 1.4.

 

Workaround:

1. Create a ytt overlay Secret that adds an annotation to the PackageInstall tmc-local-stack

a. kubectl apply -f stack-overlay.yaml

 

stack-overlay.yaml

apiVersion: v1
kind: Secret
metadata:
  name: stack-overlay
  namespace: tmc-local
stringData:
  tmc-pkgi-overlay.yml: |
    #@ load("@ytt:overlay", "overlay")
    #@overlay/match by=overlay.subset({"apiVersion":"packaging.carvel.dev/v1alpha1", "kind":"PackageInstall", "metadata": {"name": "tmc-local-stack"}}),expects="1+"
    ---
    metadata:
      #@overlay/match missing_ok=True
      annotations:
        #@overlay/match missing_ok=True
        ext.packaging.carvel.dev/ytt-paths-from-secret-name.0: alert-overlay

 

 

2. Create a ytt overlay Secret that updates the Prometheus alert rules

a. kubectl apply -f alert-overlay.yaml

 

alert-overlay.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alert-overlay
  namespace: tmc-local
stringData:
  alert-overlay.yml: |
    #@ load("@ytt:overlay", "overlay")
    #@ load("@ytt:yaml", "yaml")
 
    #@ def remove_or_up(expr):
    #@   return expr.replace(' or up * 0', '')
    #@ end
 
    #@overlay/match by=overlay.subset({"kind":"ConfigMap", "metadata": {"name": "prometheus-alerts"}}),expects="1+"
    ---
    data:
      #@overlay/replace via=lambda left, _: remove_or_up(left)
      core-alerts-api-gateway-agent-alerts.yaml:
      #@overlay/replace via=lambda left, _: remove_or_up(left)
      core-alerts-api-gateway-user-alerts.yaml:
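To see what the remove_or_up function in this overlay does to a rule, the substitution can be reproduced locally. The following is a minimal sketch, not part of the workaround itself: it applies the same literal string removal with sed to the expression quoted in the Cause section. The variable names expr and patched are illustrative only.

```shell
# Alert expression as quoted in the Cause section of this article.
expr='sum by (namespace) ((avg_over_time(olympus_build_info{service="agent-gateway-service"}[5m])) or up * 0 ) <= 1'

# Same effect as remove_or_up: delete every literal " or up * 0".
# The asterisk is escaped so sed matches it literally instead of as a repetition operator.
patched=$(printf '%s\n' "$expr" | sed 's/ or up \* 0//g')

printf '%s\n' "$patched"
```

The printed expression no longer contains the or up * 0 clause, so the rule no longer sums in series that lack a namespace label.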

 

 

 

3. Patch PackageInstall tanzu-mission-control with the extension annotation:

 

a. kubectl patch pkgi tanzu-mission-control --type='merge' -p '{"metadata": {"annotations": {"ext.packaging.carvel.dev/ytt-paths-from-secret-name.0": "stack-overlay"}}}' -n tmc-local

 

 

4. Kick both PackageInstalls to trigger reconciliation

a. tanzu package installed kick tanzu-mission-control -y

 

tanzu package installed kick tanzu-mission-control -y
 
Triggering reconciliation for package install 'tanzu-mission-control' in namespace 'tmc-local'
 
7:14:15AM: Pausing reconciliation for package installation 'tanzu-mission-control' in namespace 'tmc-local'
7:14:17AM: Starting reconciliation for package install 'tanzu-mission-control' in namespace 'tmc-local'
7:14:17AM: Waiting for PackageInstall reconciliation for 'tanzu-mission-control'
7:14:18AM: Waiting for generation 6 to be observed
7:14:18AM: Fetch started
7:14:18AM: Fetching
        | apiVersion: vendir.k14s.io/v1alpha1
        | directories:
        | - contents:
        |   - imgpkgBundle:
        |       image: harbor.tanzu.io:8443/tmc/package-repository@sha256:2e89ebe16a771480d3770b402c2a3273a70be44049653007b2381f1ac9b1cd00
        |     path: .
        |   path: "0"
        | kind: LockConfig
        |
7:14:18AM: Fetch succeeded
7:14:19AM: Template succeeded
7:14:19AM: Deploy started (2s ago)
7:14:21AM: Deploying
        | Target cluster 'https://100.64.0.1:443' (nodes: wc-tmc-4f7ss-szvll, 3+)
        | Changes
        | Namespace  Name             Kind            Age  Op  Op st.  Wait to    Rs       Ri
        | tmc-local  tmc-local-stack  PackageInstall  1d   -   -       reconcile  ongoing  Reconciling
        | Op:      0 create, 0 delete, 0 update, 1 noop, 0 exists
        | Wait to: 1 reconcile, 0 delete, 0 noop
        | 7:14:23AM: ---- applying 1 changes [0/1 done] ----
        | 7:14:23AM: noop packageinstall/tmc-local-stack (packaging.carvel.dev/v1alpha1) namespace: tmc-local
        | 7:14:23AM: ---- waiting on 1 changes [0/1 done] ----
        | 7:14:23AM: ongoing: reconcile packageinstall/tmc-local-stack (packaging.carvel.dev/v1alpha1) namespace: tmc-local
        | 7:14:23AM:  ^ Reconciling
        | 7:14:32AM: ok: reconcile packageinstall/tmc-local-stack (packaging.carvel.dev/v1alpha1) namespace: tmc-local
        | 7:14:32AM: ---- applying complete [1/1 done] ----
        | 7:14:32AM: ---- waiting complete [1/1 done] ----
        | Succeeded
7:14:32AM: Deploy succeeded

 

b. tanzu package installed kick tmc-local-stack -y

 

 

tanzu package installed kick tmc-local-stack -y
 
Triggering reconciliation for package install 'tmc-local-stack' in namespace 'tmc-local'
 
7:14:39AM: Pausing reconciliation for package installation 'tmc-local-stack' in namespace 'tmc-local'
7:14:41AM: Starting reconciliation for package install 'tmc-local-stack' in namespace 'tmc-local'
7:14:41AM: Waiting for PackageInstall reconciliation for 'tmc-local-stack'
7:14:41AM: Waiting for generation 4 to be observed
7:14:41AM: Fetch started
7:14:41AM: Fetching
        | apiVersion: vendir.k14s.io/v1alpha1
        | directories:
        | - contents:
        |   - imgpkgBundle:
        |       image: harbor.tanzu.io:8443/tmc/package-repository@sha256:6ed349cc2a7ac8b4d13700146f31fd206f33c932e6eac3f385cbe33f351eb02d
        |     path: .
        |   path: "0"
        | kind: LockConfig
        |
7:14:41AM: Fetch succeeded
7:14:44AM: Template succeeded
7:14:44AM: Deploy started (2s ago)
7:14:46AM: Deploying
        | Target cluster 'https://100.64.0.1:443' (nodes: wc-tmc-4f7ss-szvll, 3+)
        | Changes
        | Namespace  Name  Kind  Age  Op  Op st.  Wait to  Rs  Ri
        | Op:      0 create, 0 delete, 0 update, 0 noop, 0 exists
        | Wait to: 0 reconcile, 0 delete, 0 noop
        | Succeeded
7:14:53AM: Deploy succeeded

5. Delete the Prometheus pod so that it loads the updated prometheus-alerts ConfigMap

a. kubectl delete pod -n tmc-local prometheus-server-tmc-local-monitoring-tmc-local-0
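
After the pod restarts, it can be useful to confirm that the overlay actually changed the rules. The sketch below is an optional check, not part of the original workaround. The helper name check_or_up_removed is hypothetical, and the usage line assumes the prometheus-alerts ConfigMap lives in the tmc-local namespace, which this article does not state explicitly. Because the overlay patches only two keys, the check is best run against those keys rather than the whole ConfigMap.

```shell
# Hypothetical helper: reads rule text on stdin and reports whether the
# " or up * 0" fallback is still present.
check_or_up_removed() {
  if grep -q ' or up \* 0'; then
    echo "fallback still present"
    return 1
  fi
  echo "fallback removed"
}

# Usage against the cluster (namespace tmc-local is an assumption):
#   kubectl get configmap prometheus-alerts -n tmc-local \
#     -o jsonpath='{.data.core-alerts-api-gateway-agent-alerts\.yaml}' | check_or_up_removed
#   kubectl get configmap prometheus-alerts -n tmc-local \
#     -o jsonpath='{.data.core-alerts-api-gateway-user-alerts\.yaml}' | check_or_up_removed
```

If either key still reports the fallback, the overlay Secrets and the annotations from steps 1 to 3 are the first things to recheck before kicking the PackageInstalls again.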