
Diego Cell showing in red in Ops Manager


Article ID: 423916


Products

VMware Tanzu Application Service

Issue/Introduction

  • In TAS/Elastic Application Runtime deployments, one or more Diego Cells are showing in red in the Ops Manager > Status view.
  • The Diego Cell is within normal ranges for CPU, memory, and persistent disk consumption.
  • Using the BOSH CLI from the Ops Manager VM, the Diego Cell is running and reports no errors (see the example commands after this list).
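
The BOSH CLI commands below, run from the Ops Manager VM after logging in to the BOSH Director, are a minimal sketch of how this healthy-looking state can be confirmed; the deployment name "pas-windows-example" is a placeholder and will differ in your environment.

    # Show VM health and resource vitals (CPU, memory, disk) for the deployment
    bosh -d pas-windows-example vms --vitals

    # Show per-process state on each instance, including the Diego Cells
    bosh -d pas-windows-example instances --ps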

Environment

This was observed in TAS 6.0.x; however, this condition might present itself in any version.

Cause

Intermittent service failures on the Diego Cell can cause the VM to be highlighted in red on the Ops Manager Status page for the TAS tile. Deeper investigation is required to identify the underlying cause.

Resolution

  1. Gather a log bundle from Ops Manager for review.
  2. Gather logging from the problem Diego Cell node (example BOSH CLI commands are shown after this list).
  3. From the Ops Manager log bundle, review the "current" log for Linux Diego Cells, or the "service_wrapper.#.err.log" for Windows Diego Cells.
    • Search for "pid failed" and "job_state":"failing" to identify any failing jobs on the problem node (example search commands are shown after this list).
    • Example from a Windows Diego Cell:

      [NATS Handler] 2025/12/17 06:14:35 INFO - Sending hm message 'alert'
      [NATS Handler] 2025/12/17 06:14:35 DEBUG - Message Payload
      ********************
      {"id":"syslog_forwarder_windows","severity":2,"title":"syslog_forwarder_windows (<IP ADDRESS>) - pid failed - Start","summary":"exited with code 1","created_at":1765952075}
      ********************

      ********************
      {"state":"failing"}
      ********************

      [attemptRetryStrategy] 2025/12/17 06:14:36 DEBUG - Making attempt #0 for *retrystrategy.retryable
      [agent] 2025/12/17 06:14:36 INFO - Attempting to send Heartbeat
      [NATS Handler] 2025/12/17 06:14:36 INFO - Sending hm message 'heartbeat'
      [NATS Handler] 2025/12/17 06:14:36 DEBUG - Message Payload
      ********************
      {"deployment":"pas-windows-############","job":"windows_diego_cell","index":7,"job_state":"failing","vitals":{"cpu":{"sys":"8.8","user":"4.8","wait":"0.0"},"disk":{"ephemeral":{"inode_percent":"0","percent":"32"},"system":{"inode_percent":"0","percent":"19"}},"load":[""],"mem":{"kb":"10110320","percent":"20"},"swap":{"kb":"0","percent":"0"},"uptime":{"secs":106300}},"node_id":"########-####-####-####-############"}
      ********************

  4. The "current" or "service_wrapper.#.err.log" log will help identify the individual job that is failing. Deeper investigation into the failing job's logs is required to identify exactly why the service is restarting. In the above example, review of the syslog_forwarder_windows/job-service-wrapper.err.log identified connections being forcibly closed by the remote host:

    2025/12/16 06:11:15 Starting to tail file: c:\var\vcap\sys\log\windows2019fs\pre-start.stdout.log
    2025/12/16 06:11:16 Error connecting on attempt 1: EOF. Will retry in 2 seconds.

    2025/12/16 06:13:29 Error connecting on attempt 1: read tcp <LOCAL_IP_ADDRESS>:55791-><REMOTE_SYSLOG_IP_ADDRESS>:443: wsarecv: An existing connection was forcibly closed by the remote host.. Will retry in 2 seconds.

  5. Investigation into the physical network was required to identify the remote disconnects (a simple reachability check is sketched below).
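
As referenced in steps 2 and 3, the commands below are a hedged sketch of collecting the Diego Cell logs with the BOSH CLI and searching the extracted bundle for the failure signatures; the deployment name, instance group, and GUID are placeholders, and the exact file layout inside the downloaded archive can vary by TAS version and stemcell.

    # Download a tarball of job logs from the affected Diego Cell
    bosh -d pas-windows-example logs windows_diego_cell/<instance-guid>

    # After extracting the archive, search for failing-job indicators
    grep -r "pid failed" .
    grep -r '"job_state":"failing"' .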
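
For step 5, a simple TCP reachability probe toward the remote syslog destination can help confirm whether the connection resets are network-related; this is only an illustrative check, and the address and port are placeholders taken from the example log above.

    # Test TCP connectivity to the remote syslog endpoint (port 443 in the example)
    nc -zv <REMOTE_SYSLOG_IP_ADDRESS> 443

    # Repeat the probe to catch intermittent drops
    for i in $(seq 1 30); do nc -zv <REMOTE_SYSLOG_IP_ADDRESS> 443; sleep 10; done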