NSX Network Detection and Response - Data Node Elasticsearch service will not start; index and alias names need to be unique


Article ID: 330004


Updated On:

Products

VMware vDefend Network Detection and Response

Issue/Introduction

This article provides the steps to resolve the problem where the Elasticsearch service on a Data Node that is part of a cluster will not start because the index and alias names are not unique.


Symptoms:

1. The automatic Lastline Test Appliance component check reports an error condition with the message "Unable to get Elasticsearch cluster status" in the Data Node appliance Monitoring Logs.

In the other scenario ("DanglingIndicesState"), the Elasticsearch status is reported as yellow and the output of lastline_test_appliance contains:
output: > SOFTWARE:
output: >  WARNING: The Elasticsearch cluster status is yellow. All primary shards are active, but not all replica shards are active: performance and reliability may be degraded.

This is observed in the monitoring log as well.


2. The Elasticsearch log file /var/log/elasticsearch/lldns/lldns.log contains entries like the following (the entries are long; scroll horizontally to see them in full):

java.lang.IllegalStateException: index and alias names need to be unique, but the following duplicates were found [.kibana (alias of [.kibana_2/fD7MMzp-RhupZGr1nq3P-g])]
    at org.elasticsearch.cluster.metadata.MetaData$Builder.build(MetaData.java:1118) ~[elasticsearch-6.8.9.jar:6.8.9]

or

[WARN ][o.e.g.DanglingIndicesState] [sderums7632-lldns] [[.kibana-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6/rEtNOvZFRIWCgqtxpQPm7Q]] can not be imported as a dangling index, as index with same name already exists in cluster metadata 


Note: The Data Node itself may still show an OK (green) state on the Appliance Overview page in the Portal.
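
As an additional optional check, the Elasticsearch cluster health can be queried directly on the Data Node. This is not part of the automatic component check; it assumes Elasticsearch is listening on the default local port 9200, as in the commands used in the Resolution section below:

Execute the command: curl -s 'localhost:9200/_cluster/health?pretty'

If the service will not start at all, this command fails to connect; in the DanglingIndicesState scenario it typically reports "status" : "yellow".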

 

Cause

When two or more Data Nodes form a cluster, intermittent cases have been observed where the Elasticsearch service on one Data Node is unresponsive during cluster formation; as a result, the same index and alias names are created where they should be unique.

Resolution

A.   Data Node Appliances

Execute the following steps on the Data Node you suspect to be affected by this issue, as the root user (via sudo su).

       1. Determine if the system is in this state by searching for the error string in the Elasticsearch log file.

Execute the command: grep -i "index and alias names need to be unique" /var/log/elasticsearch/lldns/lldns.log
 
Sample output:
java.lang.IllegalStateException: index and alias names need to be unique, but the following duplicates were found [.kibana (alias of [.kibana_2/fD7MMzp-RhupZGr1nq3P-g])]

Alternatively, if the issue is due to DanglingIndicesState, /var/log/elasticsearch/lldns/lldns.log contains entries with 'index with same name already exists':
Sample log:
[WARN ][o.e.g.DanglingIndicesState] [sderums7632-lldns] [[.kibana-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6/rEtNOvZFRIWCgqtxpQPm7Q]] can not be imported as a dangling index, as index with same name already exists in cluster metadata,  

If no output is produced, the node is not affected by this issue and you should stop here.
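
Optionally, both error signatures can be checked in a single pass; this simply combines the two patterns already shown above and assumes the same log file location:

Execute the command: grep -iE "index and alias names need to be unique|can not be imported as a dangling index" /var/log/elasticsearch/lldns/lldns.log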
 

       2. In the output from the step above, take note of the string following the "/" character
          (in our case, fD7MMzp-RhupZGr1nq3P-g): it identifies a directory in the Elasticsearch storage.
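
If you prefer to extract the identifier automatically instead of copying it by hand, a one-liner along the following lines can be used. This is only a sketch that assumes the log line has the same format as the sample above; always verify the result against the actual log entry:

# prints the directory identifier(s), e.g. fD7MMzp-RhupZGr1nq3P-g
grep -i "index and alias names need to be unique" /var/log/elasticsearch/lldns/lldns.log | sed -n 's/.*\/\([^]]*\)\].*/\1/p' | sort -u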

       3. Move the directory identified in step 2 from the Elasticsearch storage to the /tmp directory:

Execute the command: mv /data/elasticsearch/lldns/nodes/0/indices/fD7MMzp-RhupZGr1nq3P-g /tmp
Note: The actual directory name will be different from case to case; make sure to copy and paste the actual value from the output of step 2.
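
Before moving the directory, you may want to confirm that it exists under the expected path (the path below uses the identifier from this article; substitute your own value):

Execute the command: ls -ld /data/elasticsearch/lldns/nodes/0/indices/fD7MMzp-RhupZGr1nq3P-g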
 
       4. Restart Elasticsearch
Execute command: service-lastline elasticsearch restart
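
The service may take a short while to become responsive again after the restart. A simple way to wait for it, assuming the default local port 9200, is:

# loops until Elasticsearch answers on port 9200
until curl -s localhost:9200 >/dev/null; do sleep 5; done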

       5. Identify the kibana index/indices

Execute command: curl -s localhost:9200/_cat/indices | grep .kibana
Sample output: green open .kibana_1 035CoB-RTuyh9E82eqtLrw 1 0 130 0 422.5kb 422.5kb
In this case, the index name is .kibana_1

Sample output for the DanglingIndicesState / duplicate indices scenario:
green  open .kibana-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6
 
       6. Delete the kibana indices identified in step 5
 
Execute command: curl -s -XDELETE localhost:9200/.kibana_1
The actual index name may be different from case to case: make sure to copy and paste the actual value from the output of step 5 (.kibana_1 in this case).
 
Execute command for the DanglingIndicesState / duplicate indices scenario (sample): curl -s -XDELETE localhost:9200/.kibana-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6-reindexed-v6
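
After the deletion (which typically returns {"acknowledged":true}), you can re-run the listing from step 5 to confirm that the deleted kibana index is no longer present:

Execute command: curl -s localhost:9200/_cat/indices | grep .kibana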
 
        7. Reconfigure the Data Node/Hunter from the Manager UI
                   1. Log in to the Manager UI
                   2. Click the "Admin" tab at the top
                   3. Click "Appliances" on the left
                   4. Locate the Data Node appliance UUID and name where the above steps were performed.
                            Click "Quick Links" (under the Action column) and select "Re-Trigger Configuration"
      The Data Node status will change to "In progress"; wait for this to finish and for the appliance status to change back to "OK" (this may take 5-15 minutes or more).
        8. Check that the cluster is now correctly formed:

Execute command on the Data Node:  curl -s localhost:9200/_cat/nodes

Sample output:

10.31.44.01 15 86 58 2.57 2.60 2.43 mdi * datanode01
10.31.44.02 15 86 58 2.57 2.60 2.43 mdi * datanode02
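
If you find the header-less output hard to read, the _cat APIs accept a v parameter that adds column headers; this is optional and purely cosmetic:

Execute command: curl -s 'localhost:9200/_cat/nodes?v'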

        9. The Elasticsearch status should now be green, and lastline_test_appliance on the Data Nodes should report all 'OK' statuses.
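
As a final check on each Data Node, the cluster health can be queried directly (same assumption as above: Elasticsearch listening on the default local port 9200); a correctly recovered cluster reports "status" : "green":

Execute command: curl -s 'localhost:9200/_cluster/health?pretty'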

 

B.  Manager Appliance

Execute the following steps on the Manager that controls the affected Data Nodes, as the root user (via sudo su).

If you have more than one Manager appliance in your account, you first need to identify which Manager the Data Nodes belong to. Copy the license of the Data Node and paste it into the Quick Search box above the Appliances page in your Portal. The Manager and the two Data Nodes associated with that Manager will appear in the list (assuming you have the minimum two Data Node cluster configured). This is the Manager on which you should execute the following steps.

         1. Restart Kibana
Execute command: service kibana restart
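To confirm that Kibana came back up, you can check the service status; depending on the init scripts on the Manager this may or may not be supported, in which case validating the Kibana UI in the next step is sufficient:
Execute command: service kibana status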
         2. Validate that the Kibana interface loads correctly.

Visit the Investigation » Network Explorer page on the Manager.
You may be presented with a page that says "In order to visualize and explore data in Kibana, you'll need to create an index pattern to retrieve data from Elasticsearch". In this case, click on the "all" link and then on the star icon to make the "all" index pattern the default one.



Workaround: None


Additional Information

Note: This article is applicable to the standalone NSX Network Detection and Response product (formerly Lastline) and is not intended to be applied to the NSX NDR feature of NSX-T.

Impact/Risks:

Without the Elasticsearch service running, data will not be indexed by this Data Node.