NSX Edge get logical-router command fails with "The dataplane service is in error state, has failed or is disabled"

Article ID: 441192

Products

VMware NSX

Issue/Introduction

  • On an NSX Edge node, the dataplane service fails and CLI commands such as get logical-router return the following error:

    The dataplane service is in error state, has failed or is disabled


  • However, when logged in as admin on the faulty Edge node, the service status is reported as running:

    get service dataplane
    Service State: running
  • /var/log/syslog on the Edge node shows the following:

    [nsx@6876 comp="nsx-edge" subcomp="edge-appctl" s2comp="unixctl" level="WARN"] failed to connect to /var/run/vmware/edge/dpd.ctl edge-appctl: cannot connect to "/var/run/vmware/edge/dpd.ctl" (Permission denied)
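
The syslog symptom above can be checked with a small helper. This is a minimal sketch, assuming a root shell on the Edge node; the function name find_dpd_ctl_errors is hypothetical, and the search string is taken from the log line quoted in this article.

```shell
# Print log lines that record edge-appctl failing to reach the dpd control socket.
# Defaults to /var/log/syslog; pass another path to search a copy of the log.
find_dpd_ctl_errors() {
    log_file="${1:-/var/log/syslog}"
    # -F treats the pattern as a fixed string, so the quotes and dots match literally.
    grep -F 'cannot connect to "/var/run/vmware/edge/dpd.ctl"' "$log_file"
}

# Example (on the Edge node, as root):
#   find_dpd_ctl_errors | tail -n 5
```

Any output that ends in (Permission denied) matches the symptom described here; no output suggests a different cause.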

Environment

VMware NSX

Cause

  • The discrepancy occurs because the edge-appctl utility cannot communicate with dpd (the Data Plane Daemon) through its control socket, /var/run/vmware/edge/dpd.ctl.
  • The /etc/group file on the affected Edge node is corrupted or incomplete, so group configuration has been lost.
  • Specifically, Group ID (GID) <##> is missing, along with the group memberships of core management components such as nsxa, exporter, and mpa.
  • Without these group mappings, the components lack the privileges needed to connect to the dpd.ctl socket.
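
To see which of the accounts named above have lost their memberships, the group file can be scanned directly. The following is a sketch under assumptions: the function name users_without_group_membership is hypothetical, and it assumes nsxa, exporter, and mpa appear as account names in the member field (the fourth colon-separated field) of a healthy /etc/group.

```shell
# Print each given user that is not listed as a member of any group in the file.
# Usage: users_without_group_membership GROUP_FILE USER...
users_without_group_membership() {
    group_file="$1"
    shift
    for user in "$@"; do
        # The fourth field of each entry is a comma-separated member list.
        if ! awk -F: -v u="$user" '
            { n = split($4, m, ","); for (i = 1; i <= n; i++) if (m[i] == u) found = 1 }
            END { exit !found }' "$group_file"; then
            printf '%s\n' "$user"
        fi
    done
}

# Example (on the Edge node, as root):
#   users_without_group_membership /etc/group nsxa exporter mpa
#   ls -l /var/run/vmware/edge/dpd.ctl   # shows the group that owns the socket
```

Running the same check on a healthy Edge node gives a baseline: a user printed on the faulty node but not on the healthy one points to a lost membership.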

Resolution

  • Log in to the affected NSX Edge node as root. If root SSH login is not enabled, refer to "Enable ssh root access for NSX appliances".

  • Compare /etc/passwd and /etc/group on the faulty node with the same files from a known healthy Edge node in the same cluster, using diff and md5sum, to confirm the mismatch.

  • On the faulty Edge node, back up the existing corrupted file: cp /etc/group /etc/group.bak

  • Copy the /etc/group file from a healthy Edge node to the faulty node to restore missing GIDs and memberships.

  • Set the permissions and ownership of the newly restored file using the commands below:

    chmod 644 /etc/group

    chown root:root /etc/group

  • Reboot the affected Edge node to re-initialize services with the corrected group permissions: reboot

  • After the reboot, verify the CLI functionality by running: get logical-router
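
The compare, backup, and restore steps above can be sketched as two small helpers. This is an illustration, not a supported procedure: the function names and the path /tmp/group.healthy are hypothetical, and the healthy file is assumed to have already been copied to the faulty node (for example with scp). File paths are parameters so the helpers can be tried on copies first.

```shell
# Report whether two files are identical; if not, print checksums and a unified diff.
compare_files() {
    healthy="$1"
    faulty="$2"
    if cmp -s "$healthy" "$faulty"; then
        echo "identical: $faulty"
        return 0
    fi
    md5sum "$healthy" "$faulty"
    diff -u "$healthy" "$faulty"
    return 1
}

# Back up the target, install the healthy copy, then apply mode 644 and
# owner root:root as in the steps above.
restore_group_file() {
    healthy_copy="$1"
    target="$2"
    cp -p "$target" "$target.bak" || return 1
    cp "$healthy_copy" "$target" || return 1
    chmod 644 "$target" || return 1
    # chown to root requires root privileges; it is skipped when run unprivileged.
    if [ "$(id -u)" -eq 0 ]; then
        chown root:root "$target" || return 1
    fi
}

# Example (on the faulty Edge node, as root; /tmp/group.healthy is a hypothetical path):
#   compare_files /tmp/group.healthy /etc/group
#   restore_group_file /tmp/group.healthy /etc/group
```

The backup is written next to the target as /etc/group.bak, matching the manual step, so the original file can be put back if the restored copy turns out to be wrong.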