[SOLVED] Unable to Connect to ambari-metrics-collector Issues

I was having some issues with the ambari-metrics family of services on a ‘pseudo-distributed’ cluster installed on my workstation.

The symptoms were:

1.  Ambari indicated the following CRITICAL errors in the Ambari Dashboard under the Ambari Metrics section

Connection failed: [Errno 111] Connection refused to rchapin-wrkstn:6188

2.  After attempting to restart the ambari-metrics-collector, either via the Ambari Dashboard or from the command line (# ambari-metrics-collector [stop|start]), you will see messages similar to the following in the ambari-metrics-collector.log:

2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/lib/ams-hbase/lib/hadoop-native -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-201609021215
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=3.10.0-327.10.1.el7.x86_64
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=ams
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/ams
2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/ams
2016-09-02 12:15:37,506 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=rchapin-wrkstn:2181 sessionTimeout=120000 watcher=hconnection-0x5524cca10x0, quorum=rchapin-wrkstn:2181, baseZNode=/ams-hbase-unsecure
2016-09-02 12:15:37,523 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server rchapin-wrkstn/172.19.64.2:2181. Will not attempt to authenticate using SASL (unknown error)
2016-09-02 12:15:37,531 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to rchapin-wrkstn/172.19.64.2:2181, initiating session
2016-09-02 12:15:37,536 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server rchapin-wrkstn/172.19.64.2:2181, sessionid = 0x155eefbab613617, negotiated timeout = 40000
2016-09-02 12:15:37,582 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
2016-09-02 12:15:38,044 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-09-02 12:15:38,089 INFO org.apache.phoenix.metrics.Metrics: Initializing metrics system: phoenix
2016-09-02 12:15:38,112 WARN org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-phoenix.properties,hadoop-metrics2.properties
2016-09-02 12:15:38,167 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-09-02 12:15:38,168 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system started
2016-09-02 12:15:38,362 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
2016-09-02 12:15:38,494 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
2016-09-02 12:15:38,627 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x48d7ad8b connecting to ZooKeeper ensemble=rchapin-wrkstn:2181
2016-09-02 12:15:38,627 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=rchapin-wrkstn:2181 sessionTimeout=120000 watcher=hconnection-0x48d7ad8b0x0, quorum=rchapin-wrkstn:2181, baseZNode=/ams-hbase-unsecure
2016-09-02 12:15:38,629 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server rchapin-wrkstn/172.19.64.2:2181. Will not attempt to authenticate using SASL (unknown error)
2016-09-02 12:15:38,629 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to rchapin-wrkstn/172.19.64.2:2181, initiating session
2016-09-02 12:15:38,633 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server rchapin-wrkstn/172.19.64.2:2181, sessionid = 0x155eefbab613618, negotiated timeout = 40000
2016-09-02 12:16:27,001 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=10, retries=35, started=48362 ms ago, cancelled=false, msg=
2016-09-02 12:16:47,033 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=11, retries=35, started=68394 ms ago, cancelled=false, msg=
2016-09-02 12:17:07,148 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=12, retries=35, started=88509 ms ago, cancelled=false, msg=
2016-09-02 12:17:27,257 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=13, retries=35, started=108618 ms ago, cancelled=false, msg=
2016-09-02 12:17:47,276 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=14, retries=35, started=128637 ms ago, cancelled=false, msg=
2016-09-02 12:18:07,333 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=15, retries=35, started=148694 ms ago, cancelled=false, msg=
2016-09-02 12:18:27,397 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=16, retries=35, started=168758 ms ago, cancelled=false, msg=
2016-09-02 12:18:47,410 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=17, retries=35, started=188771 ms ago, cancelled=false, msg=
2016-09-02 12:19:07,555 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=18, retries=35, started=208916 ms ago, cancelled=false, msg=
2016-09-02 12:19:27,732 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=19, retries=35, started=229093 ms ago, cancelled=false, msg=
2016-09-02 12:19:47,923 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=20, retries=35, started=249284 ms ago, cancelled=false, msg=
2016-09-02 12:20:07,950 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=21, retries=35, started=269311 ms ago, cancelled=false, msg=
2016-09-02 12:20:28,051 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=22, retries=35, started=289412 ms ago, cancelled=false, msg=
2016-09-02 12:20:48,230 WARN org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Checking master connection
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to rchapin-wrkstn/172.19.64.2:61300 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to rchapin-wrkstn/172.19.64.2:61300 is closing. Call id=46, waitTime=1
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:55824)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(ConnectionManager.java:1444)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(ConnectionManager.java:2085)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1708)
        at org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4082)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:527)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:549)
        at org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:810)

3.  You will see WARNings similar to the following in the hbase-ams-regionserver-<hostname>.log file:

2016-09-02 12:52:56,589 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=236.59 KB, freeSize=222.74 MB, max=222.98 MB, blockCount=1, accesses=4, hits=3, hitRatio=75.00%, , cachingAccesses=4, cachingHits=3, cachingHitsRatio=75.00%, evictions=119, evicted=0, evictedPerRun=0.0
2016-09-02 12:53:52,091 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:53:52,092 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:53:52,094 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:53:52,096 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:53:52,097 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:54:52,091 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:55:52,092 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
2016-09-02 12:55:52,093 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
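
A quick way to confirm the first symptom independently of Ambari is to probe the collector port directly; this is just a sanity check against the same endpoint that shows up in the sink WARNings. If the collector is down you will get a connection-refused error rather than any HTTP response.

# curl -v http://rchapin-wrkstn:6188/ws/v1/timeline/metrics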

It turned out that the ambari-metrics HBase instance had gotten into an unstable state, and the fix was to delete all of the metrics HBase data and restart all of the services.

Making that happen was not as straightforward as you might imagine.

1. Shut down all of the AMS (Ambari Metrics System) components.  Do so in the following order:

1. In the Ambari Dashboard, go to the ‘Ambari Metrics’ section and under the ‘Service Actions’ dropdown click ‘Stop’.  If you want, you can click the checkbox to turn on Maintenance Mode to suppress alerts.

2. After the background operation has completed, confirm on the Ambari host where you are having issues that all of the AMS components have stopped:

# ambari-metrics-monitor status
psutil build directory is not empty, continuing...
ambari-metrics-monitor currently not running
Usage: /usr/sbin/ambari-metrics-monitor {start|stop|restart|status}

# ambari-metrics-collector status
AMS is running as process 1062.

If either of the processes still reports as running, issue a stop command.  In my case, I did the following:

# ambari-metrics-collector stop
WARNING: ambari-metrics-collector did not stop gracefully after 5 seconds: Trying to kill with kill -9
ERROR: Unable to kill 1062
Stopping HBase master
no master to stop because no pid file /var/run/ambari-metrics-collector//hbase-root-master.pid

I then confirmed that PID 1062 was indeed dead:

# ps -ef | fgrep 1062
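
If ps does still show the process at this point, kill it by hand and re-check before moving on (1062 is the PID from my output; substitute whatever PID the status command reported on your host):

# kill -9 1062
# ps -ef | fgrep 1062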

2. Delete the AMS HBase data

In the Ambari Dashboard, under the Ambari Metrics section, do a search for the hbase.rootdir configuration value.

In my case, I have AMS configured to write to HDFS, so the value of this config was: hdfs://rchapin-wrkstn:8020/user/ams/hbase
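
Before deleting anything it does not hurt to list the directory so you can see what you are about to remove. This is just a sanity check, using the path from my hbase.rootdir; substitute your own:

[hdfs@rchapin-wrkstn ~]$ hdfs dfs -ls /user/ams/hbase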

Then search for the hbase.tmp.dir configuration value.

In my case, the setting was: /var/lib/ambari-metrics-collector/hbase-tmp
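
If you are curious how much data has accumulated there, a quick optional check before removing it:

# du -sh /var/lib/ambari-metrics-collector/hbase-tmp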

If your hbase.rootdir points to HDFS, delete all of the data as follows (as the hdfs user):

[hdfs@rchapin-wrkstn ~]$ hdfs dfs -rm -r -skipTrash /user/ams/hbase/*

If it is on your local file system, simply rm -rf that data, leaving the parent directory in place.
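
For example, if your hbase.rootdir were a local path such as file:///var/lib/ambari-metrics-collector/hbase (a hypothetical value, not the one from my setup), the equivalent would be:

[root@rchapin-wrkstn ~]# rm -rf /var/lib/ambari-metrics-collector/hbase/*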

Delete the data in the hbase.tmp.dir (as the root user):

[root@rchapin-wrkstn ~]# rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/*

3. Now restart the AMS services

First, via the Ambari Dashboard, go to the ‘Ambari Metrics’ section and under the ‘Service Actions’ dropdown click ‘Start’.  If you turned on Maintenance Mode when stopping the services, you can now turn it back off.

Then confirm that both the ambari-metrics-monitor and ambari-metrics-collector services have started:

[root@rchapin-wrkstn ~]# ambari-metrics-collector status
AMS is running as process 12229.

[root@rchapin-wrkstn ~]# ambari-metrics-monitor status
psutil build directory is not empty, continuing...
Found ambari-metrics-monitor PID: 12916
ambari-metrics-monitor running.
Monitor PID at: /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid
Monitor out at: /var/log/ambari-metrics-monitor/ambari-metrics-monitor.out
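
As an optional sanity check, you can also hit the collector endpoint that the sinks were failing to reach earlier. Getting any HTTP response at all, as opposed to a connection-refused error, tells you the collector is once again listening on port 6188:

[root@rchapin-wrkstn ~]# curl -i http://rchapin-wrkstn:6188/ws/v1/timeline/metrics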

If you see errors similar to the following in the ambari-metrics-collector.log, it means that there was a problem re-creating the HBase tables for the metrics-collector and that the AMS HBase tables are missing:

2016-09-02 13:35:36,200 WARN org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.query.DefaultPhoenixDataSource: Unable to connect to HBase store using Phoenix.
org.apache.phoenix.exception.PhoenixIOException: SYSTEM.CATALOG
        at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:108)
        at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1039)
        at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1002)
        at org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1208)
        at org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:112)
        at org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1974)
....
Caused by: org.apache.hadoop.hbase.TableNotFoundException: SYSTEM.CATALOG
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1264)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1103)

To fix this, delete the node in zookeeper and restart the AMS services.

Check the current value of the Ambari Metrics config zookeeper.znode.parent.  In my case the value was: /ams-hbase-unsecure
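
If you prefer to check from the command line, the same value should also be present in the AMS HBase site config on disk. The path below is where it typically lives on an HDP install; adjust it if yours differs:

# grep -A1 zookeeper.znode.parent /etc/ams-hbase/conf/hbase-site.xml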

Stop all of the AMS services as described above.

Then, using the zookeeper CLI, connect to your zookeeper cluster and delete the node:

1. Start up the zookeeper client

# /usr/hdp/current/zookeeper-client/bin/zkCli.sh

2. Connect to the zookeeper cluster, and list the root nodes for the cluster

[zk: localhost:2181(CONNECTED) 1] connect 127.0.0.1:2181
[zk: 127.0.0.1:2181(CONNECTED) 2] ls /
[hiveserver2, zookeeper, hbase-unsecure, ams-hbase-unsecure, rmstore]

3. Delete the ams-hbase-unsecure node

[zk: 127.0.0.1:2181(CONNECTED) 3] rmr /ams-hbase-unsecure

4. Restart the AMS services as described above and then confirm that the ams-hbase-unsecure node was recreated in zookeeper.

[zk: 127.0.0.1:2181(CONNECTED) 4] ls /
[hiveserver2, zookeeper, hbase-unsecure, ams-hbase-unsecure, rmstore]
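
Finally, it is worth tailing the collector log for a few minutes to make sure the Phoenix/HBase errors above are gone and that no new ones show up as metrics start flowing again:

# tail -f /var/log/ambari-metrics-collector/ambari-metrics-collector.log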

    
