NSX Network Detection and Response - Troubleshooting Kibana On-Premises


Article ID: 323946


When Kibana cannot access Elasticsearch, it displays an error page reporting that its status is red ("Status: Red").

The syslog may contain entries like the following:

2019-10-14 14:35:21,644: output: Oct 14 13:16:10 lastline-manager kibana[31380]: {"type":"log","@timestamp":"2019-10-14T13:16:10Z","tags":["status","plugin:[email protected]","error"],"pid":31380,"state":"red","message":"Status changed from red to red - Request Timeout after 60000ms","prevState":"red","prevMsg":"[es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@7bc9508a on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@41f721f7[Running, pool size = 37, active threads = 37, queued tasks = 1000, completed tasks = 5463227]]"}
2019-10-14 14:35:21,644: output: Oct 14 13:16:43 lastline-manager kibana[31380]: {"type":"log","@timestamp":"2019-10-14T13:16:43Z","tags":["error","elasticsearch","admin"],"pid":31380,"message":"Request error, retrying\nPOST http://canonical.lastline-datanode.lastline.local:9200/.kibana/config/_search => socket hang up"}

In this case, the error message indicates that the query queue is full and cannot accept additional requests (queue capacity = 1000 ... queued tasks = 1000). Other error messages may indicate that Kibana cannot contact Elasticsearch on the Data Node.



Possible causes

So far, we have encountered two scenarios in which this can occur:

  1. The Elasticsearch data retention (30 days) was not running correctly. As a consequence, the cluster was storing more data than it was supposed to index and, as a result, was completely overloaded.
  2. The Data Node is not reachable from the Manager. This can have many causes, which the troubleshooting steps below help narrow down.



Here are some troubleshooting steps that may help identify the actual problem.

Checking basic connectivity to the cluster from the Manager

On the Manager, run:

$ curl -s http://canonical.lastline-datanode.lastline.local:9200/

In normal conditions, this API call will return basic information about the cluster.

The call may fail completely if:

  • The canonical.lastline-datanode.lastline.local hostname does not resolve (NXDOMAIN).
    This is a special hostname we add to /etc/hosts to identify one of the Data Nodes that appear to be online. It's set by the hunterkeeper service, which assesses the cluster status (and potentially updates this value) every 10 minutes or so.
    If the hostname cannot be resolved, it may be that the hunterkeeper service has not run yet (give it 10 minutes), or it may mean that no Data Node is online.
  • The connection times out or is refused.
    Check that the Data Node accepts connections from the Manager on port 9200 (via ufw status).
    The ufw rules are updated by the hunterkeeper service (as above).
  • The API call returns an error message.
    This indicates that connectivity between the Manager and the Data Node works, but there's a problem with the Elasticsearch cluster (for example, it's overloaded).
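The checks above can be sketched as a small script to run on the Manager. This is a hedged example: the hostname and port come from this article, but the availability of getent and nc on the Manager is an assumption.

```shell
# Sketch of the connectivity checks above; getent/nc availability is an assumption.
HOST=canonical.lastline-datanode.lastline.local
PORT=9200

# 1. Does the special hostname resolve? (hunterkeeper writes it to /etc/hosts)
if getent hosts "$HOST" >/dev/null 2>&1; then
    echo "$HOST resolves"
    # 2. Does the Data Node accept connections on port 9200?
    if nc -z -w 5 "$HOST" "$PORT" 2>/dev/null; then
        echo "port $PORT reachable; next, try: curl -s http://$HOST:$PORT/"
    else
        echo "port $PORT unreachable: check 'ufw status' on the Data Node"
    fi
else
    echo "NXDOMAIN: wait ~10 minutes for hunterkeeper, or no Data Node is online"
fi
```

If the script reports NXDOMAIN or an unreachable port, follow the corresponding bullet above before looking at the Elasticsearch cluster itself.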

Checking the cluster status

A general good first step is to check the status of the Elasticsearch cluster. Run the following on the Manager:

$ curl -s http://canonical.lastline-datanode.lastline.local:9200/_cluster/health
{
"active_primary_shards": 874,
"active_shards": 874,
"active_shards_percent_as_number": 100.0,
"cluster_name": "lldns",
"delayed_unassigned_shards": 0,
"initializing_shards": 0,
"number_of_data_nodes": 1,
"number_of_in_flight_fetch": 0,
"number_of_nodes": 1,
"number_of_pending_tasks": 0,
"relocating_shards": 0,
"status": "green",
"task_max_waiting_in_queue_millis": 0,
"timed_out": false,
"unassigned_shards": 0
}

Look for the expected number of nodes (number_of_nodes and number_of_data_nodes); the status should be green; there should be no unassigned shards (unassigned_shards).
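These checks can also be done mechanically on a saved copy of the health document. The snippet below is a sketch: the field names are the ones shown above, but the sample values are illustrative, and on a live system you would fetch health.json with curl first.

```shell
# On a live Manager, fetch the document first:
#   curl -s http://canonical.lastline-datanode.lastline.local:9200/_cluster/health > health.json
# Illustrative sample so the check can be shown end to end:
cat > health.json <<'EOF'
{"status": "green", "number_of_nodes": 1, "number_of_data_nodes": 1, "unassigned_shards": 0}
EOF

python3 - <<'EOF'
import json

h = json.load(open("health.json"))
# The conditions described above: green status and no unassigned shards.
problems = []
if h["status"] != "green":
    problems.append("cluster status is %s" % h["status"])
if h["unassigned_shards"] != 0:
    problems.append("%d unassigned shards" % h["unassigned_shards"])
print("cluster health OK" if not problems else "; ".join(problems))
EOF
```

Compare number_of_nodes and number_of_data_nodes against the number of Data Nodes you expect in your deployment.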

Move on to checking the indexes:

$ curl -s http://canonical.lastline-datanode.lastline.local:9200/_cat/indices
green open pdns-20190824       R7PI0VVaQI-xLuHaGgWYuA 3 0   705455 0  99.9mb  99.9mb
green open netflow-20190731    QgsKtijtSgau9sEp4eTfDw 3 0  5037959 0 420.5mb 420.5mb
green open webrequest-20190921 RRn10Rc5QQOz48LjUHt7XQ 3 0  8535661 0   4.4gb   4.4gb
green open webrequest-20190928 eoCGcuwOTwWFoOcQFbKqjQ 3 0  7356420 0   4.1gb   4.1gb
green open pdns-20190829       htxK3pE8Sm2RmWVexWHsZA 3 0   704027 0 110.6mb 110.6mb

We store data in daily indices by record type (for example, netflow-20190731 stores the netflow records for 2019-07-31). If you see fewer indices than expected, indexing may not be working; if you see more than expected (beyond 30 days' worth of data), data retention may not be working. The relevant log files are on the Data Node:

  • /var/log/elasticsearch/lldns/curator.log: data retention log 
  • /var/log/pdns/*_timeline_worker_[0-9].log: indexing workers log
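To spot a retention problem quickly, you can compare the YYYYMMDD suffix of each index name against the 30-day window. This is a hedged sketch: it assumes GNU date, and the sample lines are the ones shown above (on a live system, fetch indices.txt with curl first).

```shell
# On a live Manager, fetch the listing first:
#   curl -s http://canonical.lastline-datanode.lastline.local:9200/_cat/indices > indices.txt
# Illustrative sample taken from the output above:
cat > indices.txt <<'EOF'
green open pdns-20190824       R7PI0VVaQI-xLuHaGgWYuA 3 0   705455 0  99.9mb  99.9mb
green open netflow-20190731    QgsKtijtSgau9sEp4eTfDw 3 0  5037959 0 420.5mb 420.5mb
EOF

RETENTION_DAYS=30
CUTOFF=$(date -d "-${RETENTION_DAYS} days" +%Y%m%d)   # GNU date assumed

# Index names end in a YYYYMMDD suffix; anything older than the cutoff
# should have been deleted by the retention job (see curator.log above).
awk '{print $3}' indices.txt | while read -r idx; do
    d=${idx##*-}
    if [ "$d" -lt "$CUTOFF" ]; then
        echo "stale index (retention not running?): $idx"
    fi
done
```

Any index flagged as stale is a pointer to check curator.log for retention errors.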