Intelligence Visualization Flow Clustering Job Fails, Job Status is Error or ContainerStatusUnknown

Article ID: 393417

Updated On: 04-11-2025

Products

SSP Advanced Threat Protection Platform

Issue/Introduction

To cluster the Visualization Canvas by flows, we create a graph of network activity in the datacenter before running an ML algorithm on the graph. In high-scale environments we limit the number of edges in the graph so that it does not grow beyond what the system can handle. In some edge cases, such as environments where computes are members of many groups, the number of nodes in the graph can grow very large without the edge limit activating. The in-memory size of the graph can then grow to the point that the pod hits an OOM or runs out of ephemeral storage.

Environment

SSP 5.0

Cause

When we create a graph of the network communication in a customer environment, the graph usually has many more edges than nodes, because each compute typically talks to many other computes over the course of a thirty-day period. In some edge cases, however, the number of nodes in the graph can grow without the number of edges increasing. In those cases the graph can grow very large without tripping the guardrails we have in place to limit the number of edges.

To check the status of the pod:

Log in to SSPI via the CLI using root credentials and get the name of the pod stuck in ContainerStatusUnknown with the following command:

k -n nsxi-platform get pods -o wide | grep feature-service-flow-feature-creator 

feature-service-flow-feature-creator-xxxxxxx

Then check the pod events with the following command:

k -n nsxi-platform describe pod feature-service-flow-feature-creator-xxxxxxx

In the ephemeral storage case, the pod events might look something like this:


Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Pulled     14m    kubelet            Container image "sspi101.ans.local/clustering/third-party/wait-for@sha256:feacef52ef1a9b1654d680c53af00b8461be50b1db29c3e4115b439a0ec03008" already present on machine
  Normal   Created    14m    kubelet            Created container wait-for-postgresql-ha-pgpool
  Normal   Started    14m    kubelet            Started container wait-for-postgresql-ha-pgpool
  Normal   Scheduled  14m    default-scheduler  Successfully assigned nsxi-platform/feature-service-flow-feature-creator-29052280-n2tzs to longevity-test-md-0-v5cdq-649zv
  Normal   Started    14m    kubelet            Started container wait-for-postgresql-ha-postgresql
  Normal   Created    14m    kubelet            Created container wait-for-postgresql-ha-postgresql
  Normal   Pulled     14m    kubelet            Container image "sspi101.ans.local/clustering/third-party/wait-for@sha256:feacef52ef1a9b1654d680c53af00b8461be50b1db29c3e4115b439a0ec03008" already present on machine
  Normal   Pulled     14m    kubelet            Container image "sspi101.ans.local/clustering/third-party/wait-for@sha256:feacef52ef1a9b1654d680c53af00b8461be50b1db29c3e4115b439a0ec03008" already present on machine
  Normal   Created    14m    kubelet            Created container wait-for-feature-service-s3-provisioning
  Normal   Started    14m    kubelet            Started container wait-for-feature-service-s3-provisioning
  Normal   Started    14m    kubelet            Started container feature-service-data-service
  Normal   Created    14m    kubelet            Created container feature-service-data-service
  Normal   Pulled     14m    kubelet            Container image "sspi101.ans.local/clustering/feature-service@sha256:4eb6535d28d22f2019989148ccaaed8429e91dc0a6b0a4259d0acab8c0066aea" already present on machine
  Normal   Pulled     12m    kubelet            Container image "sspi101.ans.local/clustering/feature-service@sha256:4eb6535d28d22f2019989148ccaaed8429e91dc0a6b0a4259d0acab8c0066aea" already present on machine
  Normal   Created    12m    kubelet            Created container feature-service
  Normal   Started    12m    kubelet            Started container feature-service
  Normal   Pulled     11m    kubelet            Container image "sspi101.ans.local/clustering/visualization@sha256:65df6746a941093c84a552dcec7e62834764a154f8262f2c331222ea7e1ef3fa" already present on machine
  Normal   Created    11m    kubelet            Created container feature-service-clustering-service
  Normal   Started    11m    kubelet            Started container feature-service-clustering-service
  Warning  Evicted    4m23s  kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
  Normal   Killing    4m23s  kubelet            Stopping container feature-service-clustering-service
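
In the OOM case there may be no eviction event; instead, the clustering container is typically killed when it exceeds its memory limit. One way to check for this (a general kubectl check, not specific to this product) is to print the last terminated reason of each container in the pod:

k -n nsxi-platform get pod feature-service-flow-feature-creator-xxxxxxx -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

A container that ran out of memory normally reports the reason OOMKilled.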

Resolution

The issue can be addressed by lowering the edge guardrail in the configmap to a smaller value. Once the new guardrail is applied, the existing flow feature objects need to be cleaned out of minio and regenerated by the flow feature service.
Steps to perform the resolution (a consolidated command sketch follows the list):
  1. SSH into the SSPI VM to gain access to kubectl.
  2. Run k -n nsxi-platform edit cm feature-service-flow-feature-creator-feature-service-config-map
  3. Add the configuration to the configmap: under the key feature, add flowFeatureUniqueEdgeLimit: 500000, then save and exit.
  4. Get the minio pods with k -n nsxi-platform get pods | grep minio and note down the pods named minio-<number>.
  5. For each minio pod listed above, exec into the pod with k -n nsxi-platform exec -it <pod name> -- bash and run rm -rf /data/minio/feature-service/FLOW*
  6. Clean up failing jobs:
    1. Get the failing jobs with k -n nsxi-platform get jobs | grep flow-feature
    2. Delete any job listed as 0/1 complete with k -n nsxi-platform delete job <job name>
  7. Wait for the job to regenerate the objects and perform clustering:
    1. If you want to let the system recover naturally, simply wait 1-2 hours for the job to rerun on its own.
    2. Alternatively, trigger a one-time job manually instead of waiting for the cron schedule: k -n nsxi-platform create job manual-flow-feature-creator --from cronjob/feature-service-flow-feature-creator
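
The commands from steps 4 through 7 can be strung together into a single shell sketch. This is only a sketch: it assumes the k alias used throughout this article points at kubectl, that the minio pods follow the minio-<number> naming noted in step 4, and that the configmap edit from steps 2-3 has already been done interactively.

NS=nsxi-platform

# Steps 4-5: remove the existing flow feature objects from every minio pod
for pod in $(k -n "$NS" get pods -o name | grep -E 'pod/minio-[0-9]+$'); do
  echo "Cleaning ${pod}"
  k -n "$NS" exec "${pod#pod/}" -- bash -c 'rm -rf /data/minio/feature-service/FLOW*'
done

# Step 6: list the flow-feature jobs, then delete any shown as 0/1 complete
k -n "$NS" get jobs | grep flow-feature
# k -n "$NS" delete job <job name>

# Step 7, option 2: trigger a one-time run instead of waiting for the cron schedule
k -n "$NS" create job manual-flow-feature-creator --from cronjob/feature-service-flow-feature-creator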