When Kibana cannot access Elasticsearch, it displays an error page in the browser. The syslog on the Manager may contain entries like the following:
2019-10-14 14:35:21,644: output: Oct 14 13:16:10 lastline-manager kibana[31380]: {"type":"log","@timestamp":"2019-10-14T13:16:10Z","tags":["status","plugin:[email protected]","error"],"pid":31380,"state":"red","message":"Status changed from red to red - Request Timeout after 60000ms","prevState":"red","prevMsg":"[es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@7bc9508a on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@41f721f7[Running, pool size = 37, active threads = 37, queued tasks = 1000, completed tasks = 5463227]]"}
2019-10-14 14:35:21,644: output: Oct 14 13:16:43 lastline-manager kibana[31380]: {"type":"log","@timestamp":"2019-10-14T13:16:43Z","tags":["error","elasticsearch","admin"],"pid":31380,"message":"Request error, retrying\nPOST http://canonical.lastline-datanode.lastline.local:9200/.kibana/config/_search => socket hang up"}
In this case, the error message indicates that the Elasticsearch search queue is full and cannot accept additional requests (queue capacity = 1000 ... queued tasks = 1000). Other error messages may indicate that Kibana cannot contact Elasticsearch on the Data Node.
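If the messages point to a saturated search queue, the Elasticsearch thread pool statistics can confirm it. As a sketch (the _cat/thread_pool API is standard Elasticsearch; run it from the Manager, assuming the Data Node is reachable):
$ curl -s 'http://canonical.lastline-datanode.lastline.local:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'
A queue value pinned at its capacity (1000 in the example above) together with a growing rejected count matches the error in the syslog.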
So far, we have encountered two scenarios in which this can occur: Elasticsearch is overloaded and rejecting requests, or it is unreachable from the Manager.
Here are some troubleshooting steps that may help identify the actual problem.
On the Manager, run:
$ curl -s http://canonical.lastline-datanode.lastline.local:9200/
Under normal conditions, this API call returns basic information about the cluster.
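A quick way to tell "no response at all" apart from "responding but unhealthy" is to look only at the HTTP status code; a minimal sketch using standard curl options:
$ curl -s -o /dev/null -w '%{http_code}\n' http://canonical.lastline-datanode.lastline.local:9200/
A 200 means Elasticsearch answered; 000 (or a curl error) means the request never completed, which points to the name resolution or firewall issues below.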
The call may fail completely if:
- The canonical.lastline-datanode.lastline.local hostname does not resolve (NXDOMAIN). This name is set in /etc/hosts to identify one of the Data Nodes that appear to be online. It's set by the hunterkeeper service, which assesses the cluster status (and potentially updates this value) every 10 minutes or so. If the name does not resolve, it may mean that the hunterkeeper service has not run yet (give it 10 minutes) or that there's no Data Node online.
- The connection is blocked by the firewall (check ufw status). The ufw rules are updated by the hunterkeeper service (as above).
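Both conditions can be verified with standard system tools; a sketch (run the first two on the Manager, and ufw status on the appliance whose firewall is in question):
$ getent hosts canonical.lastline-datanode.lastline.local
$ grep lastline-datanode /etc/hosts
$ sudo ufw status
If the hostname does not resolve or the firewall is blocking traffic to the Data Node, the hunterkeeper service either has not run yet or has found no Data Node online, as described above.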
A good general first step is to check the status of the Elasticsearch cluster. Run the following on the Manager:
$ curl -s http://canonical.lastline-datanode.lastline.local:9200/_cluster/health
{
  "active_primary_shards": 874,
  "active_shards": 874,
  "active_shards_percent_as_number": 100.0,
  "cluster_name": "lldns",
  "delayed_unassigned_shards": 0,
  "initializing_shards": 0,
  "number_of_data_nodes": 1,
  "number_of_in_flight_fetch": 0,
  "number_of_nodes": 1,
  "number_of_pending_tasks": 0,
  "relocating_shards": 0,
  "status": "green",
  "task_max_waiting_in_queue_millis": 0,
  "timed_out": false,
  "unassigned_shards": 0
}
Look for the expected number of nodes (number_of_nodes and number_of_data_nodes); the status should be green; there should be no unassigned shards (unassigned_shards).
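If the status is not green or unassigned_shards is non-zero, listing the shards that are not started can help narrow the problem down; a sketch using the standard _cat/shards API:
$ curl -s 'http://canonical.lastline-datanode.lastline.local:9200/_cat/shards?v' | grep -v STARTED
Only the header line should remain when all shards are allocated.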
Move on to checking the indices:
$ curl -s http://canonical.lastline-datanode.lastline.local:9200/_cat/indices
2019-10-14 14:51:16,435: output: green open pdns-20190824 R7PI0VVaQI-xLuHaGgWYuA 3 0 705455 0 99.9mb 99.9mb
2019-10-14 14:51:16,435: output: green open netflow-20190731 QgsKtijtSgau9sEp4eTfDw 3 0 5037959 0 420.5mb 420.5mb
2019-10-14 14:51:16,435: output: green open webrequest-20190921 RRn10Rc5QQOz48LjUHt7XQ 3 0 8535661 0 4.4gb 4.4gb
2019-10-14 14:51:16,435: output: green open webrequest-20190928 eoCGcuwOTwWFoOcQFbKqjQ 3 0 7356420 0 4.1gb 4.1gb
2019-10-14 14:51:16,435: output: green open pdns-20190829 htxK3pE8Sm2RmWVexWHsZA 3 0 704027 0 110.6mb 110.6mb
...
We store data in daily indices by record type (for example, netflow-20190731 stores the netflow records for 2019-07-31). If you see more or fewer indices than expected (up to 30 days' worth of data), it may indicate that data retention or indexing, respectively, is not working; a quick way to count them is sketched after the list below. The relevant log files are on the Data Node:
/var/log/elasticsearch/lldns/curator.log: data retention log
/var/log/pdns/*_timeline_worker_[0-9].log: indexing worker logs
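As a quick sanity check on how many daily indices exist per record type, the _cat/indices API accepts an index pattern; a sketch using the naming convention shown above:
$ curl -s 'http://canonical.lastline-datanode.lastline.local:9200/_cat/indices/netflow-*' | wc -l
$ curl -s 'http://canonical.lastline-datanode.lastline.local:9200/_cat/indices/webrequest-*' | wc -l
Each count should be close to 30: noticeably more suggests data retention (curator) is not pruning old indices, noticeably fewer suggests indexing is falling behind.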