Critical YARN alerts in Ambari

Article ID: 294937

Products

Services Suite

Issue/Introduction

Symptoms:

Ambari sends critical YARN alerts, each followed by an OK alert a few minutes later.


The Ambari notification email reports the alert with the following message:

Alert Summary: <ClusterName> - OK[0], Warning[0], Critical[1], Unknown[0]

Services Reporting Alerts
http://AmbariServer:8080/#/main/dashboard/metrics
CRITICAL [YARN]
YARN
CRITICAL App Timeline Web UI 
Connection failed to http://AppTimelineServer:8188 

Environment


Cause

This is a known Ambari issue: the App Timeline Server cannot respond in a timely manner when it has a large timeline database to manage, so the alert's connection check to the Web UI times out.

Resolution

This issue will be fixed with a new release of Ambari 2.2.x.


Workarounds 

There are two approaches to resolving this issue. 

1. Clean up or relocate the App Timeline Database. The Timeline Database will be recreated once the timeline server restarts:
a. Stop App Timeline Server from Ambari.
b. Clean up the App Timeline Database. Its path is set by the property "yarn.timeline-service.leveldb-timeline-store.path" in yarn-site.xml.
c. Start App Timeline Server from Ambari.
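
Step b can be partly scripted. The following is a minimal sketch, not an official procedure: it only reads the store path out of yarn-site.xml so that you can inspect it before cleaning up. The function name get_hadoop_property is hypothetical, and the file location /etc/hadoop/conf/yarn-site.xml in the usage comment is an assumption that may differ on your cluster.

```shell
#!/bin/sh
# Print the value of one property from a Hadoop-style XML config file.
# Assumption: each <value> element follows its <name> element, either on
# the same line or on a later line, as in a typical yarn-site.xml.
get_hadoop_property() {
    # $1 = config file, $2 = property name
    awk -v prop="$2" '
        index($0, "<name>" prop "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the value
            sub(/<\/value>.*/, "")    # drop everything after the value
            print
            exit
        }
    ' "$1"
}

# Example usage (the config path is an assumption, adjust as needed):
#   store=$(get_hadoop_property /etc/hadoop/conf/yarn-site.xml \
#       yarn.timeline-service.leveldb-timeline-store.path)
#   du -sh "$store"    # check the size of the timeline database
```

Run the check while the App Timeline Server is stopped (step a), and clean up or relocate the reported directory only after confirming it is the expected path.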


2. Increase the associated timeout value using the method below:

Note: These steps need to be run from the Ambari server. 'hdp24a', in the example below, is the cluster name. Substitute it with your own cluster name.
 

a. Identify the alert ID of the App Timeline Web UI alert.
[root@admin ~]# curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions
{
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions",
"items" : [
:
:
{
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74",
"AlertDefinition" : {
"cluster_name" : "hdp24a",
"id" : 74, <<<!!! Note this id of App Timeline Web UI.
"label" : "App Timeline Web UI",
"name" : "yarn_app_timeline_server_webui"
}
},
:
:
]
} 
b. Retrieve the definition of the alert in JSON format.
[root@admin ~]# curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74 >alert.json 

c. Edit alert.json to increase "connection_timeout" from the default of 5 to 25.
[root@admin ~]# vi alert.json
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74", <<!! Remove this line.
:
:
"default_port" : 0.0, <<!! Remove this line
"connection_timeout" : 25.0 <<!! Change to 25 from default 5.
: 
d. Upload the edited JSON file back to Ambari.
[root@admin ~]# curl -X PUT -d @alert.json -i -u admin:admin -H 'X-Requested-By: ambari' http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
User: admin
Set-Cookie: AMBARISESSIONID=1g1rebkc8aziuciu8vi0jwgk;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Content-Length: 0
Server: Jetty(8.1.17.v20150415) 

e. Restart the Ambari server.

[root@admin ~]# ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Using python /usr/bin/python
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.
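
The lookup in step a and the manual edit in step c can also be scripted. The sketch below is only an illustration under stated assumptions: the function names find_alert_id and set_connection_timeout are hypothetical, and both helpers assume the line-oriented, pretty-printed JSON layout shown in the transcripts above (one key per line, "id" listed before "name"). Check the resulting file by eye before uploading it in step d.

```shell
#!/bin/sh
# Step a helper: read the alert_definitions listing on stdin and print the
# id of the definition whose "name" matches $1.
# Assumption: within each definition, the "id" line precedes the "name" line.
find_alert_id() {
    awk -v name="$1" '
        /"id"[ ]*:/ { id = $0; gsub(/[^0-9]/, "", id) }   # remember last id
        /"name"[ ]*:/ && index($0, "\"" name "\"") { print id; exit }
    '
}

# Step c helper: print an edited copy of the alert definition in file $1.
# Removes the "href" and "default_port" lines and sets "connection_timeout"
# to $2. Assumption: each of these keys sits on its own line.
set_connection_timeout() {
    sed -e '/"href"[ ]*:/d' \
        -e '/"default_port"[ ]*:/d' \
        -e 's/"connection_timeout"[ ]*:[ ]*[0-9.]*/"connection_timeout" : '"$2"'/' \
        "$1"
}

# Example usage with the commands from steps a to c ('hdp24a' is the
# example cluster name and 74 the example id from this article):
#   curl -H "X-Requested-By: ambari" -X GET -u admin:admin \
#       http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions \
#       | find_alert_id yarn_app_timeline_server_webui
#   set_connection_timeout alert.json 25.0 > alert.edited.json
#   mv alert.edited.json alert.json
```

Writing the edited copy to a separate file first keeps the original alert.json available for comparison until the move.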