Welcome to my website. I am always posting links to photo albums, art, technology and other creations. Everything that you will see on my numerous personal sites is powered by the formVista™ Website Management Engine.


  • [SOLVED] Ambari "There are no DataNodes to do rolling restarts" when there are DataNodes that do need a restart
    09/16/2016 11:17AM

    When maintaining a Hadoop cluster, you will need to restart various services from time to time when/if you update Hadoop configurations.

    I ran into a problem today with Ambari where I wanted to do a rolling restart of all of my DataNodes, but when I clicked on the "Restart DataNodes" entry in the "Restart" drop-down the dialog indicated "There are no DataNodes to do rolling restarts".

    [Screenshot: ambari_qa_incorrect_dn_restart_dialog.png, the rolling restart dialog reporting that there are no DataNodes to restart]

    This was clearly incorrect.

    [Screenshot: ambari_qa_datanode_restart_alert.png, Ambari indicating that DataNodes do require a restart]

    It did not take me too long to figure out that I had already put HDFS into Maintenance Mode.  As a result, Ambari does not see any DataNodes as needing a restart.

    Taking HDFS out of Maintenance mode allowed me to then execute a rolling restart through Ambari.
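
    For reference, Maintenance Mode can also be toggled from the command line through Ambari's REST API instead of the dashboard.  The call below is only a sketch: <ambari-host>, <cluster-name>, and the admin:admin credentials are placeholders for your own environment.

        # curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
            -d '{"RequestInfo":{"context":"Turn off Maintenance Mode for HDFS"},"Body":{"ServiceInfo":{"maintenance_state":"OFF"}}}' \
            http://<ambari-host>:8080/api/v1/clusters/<cluster-name>/services/HDFS

    Setting maintenance_state back to "ON" once the rolling restart completes will suppress the HDFS alerts again.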

  • [SOLVED] Unable to Connect to ambari-metrics-collector Issues
    09/02/2016 11:28AM
    I was having some issues with the ambari-metrics family of services on a 'pseudo-distributed' cluster that I have installed on my workstation.

    The symptoms were:

    1.  Ambari indicated the following CRITICAL errors in the Ambari Dashboard under the Ambari Metrics section:
    Connection failed: [Errno 111] Connection refused to rchapin-wrkstn:6188

    2.  After attempting to restart the ambari-metrics-collector via either the Ambari Dashboard or the command line (# ambari-metrics-collector [stop|start]), you see messages similar to the following in the ambari-metrics-collector.log:

    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/lib/ams-hbase/lib/hadoop-native -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-201609021215
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=3.10.0-327.10.1.el7.x86_64
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=ams
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/ams
    2016-09-02 12:15:37,505 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/ams
    2016-09-02 12:15:37,506 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=rchapin-wrkstn:2181 sessionTimeout=120000 watcher=hconnection-0x5524cca10x0, quorum=rchapin-wrkstn:2181, baseZNode=/ams-hbase-unsecure
    2016-09-02 12:15:37,523 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server rchapin-wrkstn/172.19.64.2:2181. Will not attempt to authenticate using SASL (unknown error)
    2016-09-02 12:15:37,531 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to rchapin-wrkstn/172.19.64.2:2181, initiating session
    2016-09-02 12:15:37,536 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server rchapin-wrkstn/172.19.64.2:2181, sessionid = 0x155eefbab613617, negotiated timeout = 40000
    2016-09-02 12:15:37,582 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
    2016-09-02 12:15:38,044 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
    2016-09-02 12:15:38,089 INFO org.apache.phoenix.metrics.Metrics: Initializing metrics system: phoenix
    2016-09-02 12:15:38,112 WARN org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-phoenix.properties,hadoop-metrics2.properties
    2016-09-02 12:15:38,167 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
    2016-09-02 12:15:38,168 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system started
    2016-09-02 12:15:38,362 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
    2016-09-02 12:15:38,494 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
    2016-09-02 12:15:38,627 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x48d7ad8b connecting to ZooKeeper ensemble=rchapin-wrkstn:2181
    2016-09-02 12:15:38,627 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=rchapin-wrkstn:2181 sessionTimeout=120000 watcher=hconnection-0x48d7ad8b0x0, quorum=rchapin-wrkstn:2181, baseZNode=/ams-hbase-unsecure
    2016-09-02 12:15:38,629 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server rchapin-wrkstn/172.19.64.2:2181. Will not attempt to authenticate using SASL (unknown error)
    2016-09-02 12:15:38,629 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to rchapin-wrkstn/172.19.64.2:2181, initiating session
    2016-09-02 12:15:38,633 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server rchapin-wrkstn/172.19.64.2:2181, sessionid = 0x155eefbab613618, negotiated timeout = 40000
    2016-09-02 12:16:27,001 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=10, retries=35, started=48362 ms ago, cancelled=false, msg=
    2016-09-02 12:16:47,033 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=11, retries=35, started=68394 ms ago, cancelled=false, msg=
    2016-09-02 12:17:07,148 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=12, retries=35, started=88509 ms ago, cancelled=false, msg=
    2016-09-02 12:17:27,257 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=13, retries=35, started=108618 ms ago, cancelled=false, msg=
    2016-09-02 12:17:47,276 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=14, retries=35, started=128637 ms ago, cancelled=false, msg=
    2016-09-02 12:18:07,333 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=15, retries=35, started=148694 ms ago, cancelled=false, msg=
    2016-09-02 12:18:27,397 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=16, retries=35, started=168758 ms ago, cancelled=false, msg=
    2016-09-02 12:18:47,410 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=17, retries=35, started=188771 ms ago, cancelled=false, msg=
    2016-09-02 12:19:07,555 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=18, retries=35, started=208916 ms ago, cancelled=false, msg=
    2016-09-02 12:19:27,732 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=19, retries=35, started=229093 ms ago, cancelled=false, msg=
    2016-09-02 12:19:47,923 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=20, retries=35, started=249284 ms ago, cancelled=false, msg=
    2016-09-02 12:20:07,950 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=21, retries=35, started=269311 ms ago, cancelled=false, msg=
    2016-09-02 12:20:28,051 INFO org.apache.hadoop.hbase.client.RpcRetryingCaller: Call exception, tries=22, retries=35, started=289412 ms ago, cancelled=false, msg=
    2016-09-02 12:20:48,230 WARN org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Checking master connection
    com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to rchapin-wrkstn/172.19.64.2:61300 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to rchapin-wrkstn/172.19.64.2:61300 is closing. Call id=46, waitTime=1
            at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
            at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
            at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:55824)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(ConnectionManager.java:1444)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(ConnectionManager.java:2085)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1708)
            at org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
            at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
            at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4082)
            at org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:527)
            at org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:549)
            at org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:810)

    3.  You will see WARNings similar to the following in the hbase-ams-regionserver-<hostname>.log file:

    2016-09-02 12:52:56,589 INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=236.59 KB, freeSize=222.74 MB, max=222.98 MB, blockCount=1, accesses=4, hits=3, hitRatio=75.00%, , cachingAccesses=4, cachingHits=3, cachingHitsRatio=75.00%, evictions=119, evicted=0, evictedPerRun=0.0
    2016-09-02 12:53:52,091 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:53:52,092 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:53:52,094 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:53:52,096 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:53:52,097 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:54:52,091 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:55:52,092 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics
    2016-09-02 12:55:52,093 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://rchapin-wrkstn:6188/ws/v1/timeline/metrics



    It turned out that the ambari-metrics HBase instance had gotten into an unstable state, and the fix was to delete all of the metrics HBase data and restart all of the services.

    Making that happen was not as straightforward as you might imagine.

    The first thing that needs to happen is to shut down all of the AMS (Ambari Metrics System) components.  Do so in the following order:

    1. In the Ambari Dashboard, go to the 'Ambari Metrics' section and under the 'Service Actions' dropdown click 'Stop'.  If you want, you can click the checkbox to turn on Maintenance Mode to suppress alerts.
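
    (If you would rather script this step, the same stop can be issued through Ambari's REST API.  The call below is only a sketch: <ambari-host>, <cluster-name>, and the admin:admin credentials are placeholders, and a ServiceInfo state of "INSTALLED" is how Ambari represents a stopped service.)

        # curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
            -d '{"RequestInfo":{"context":"Stop Ambari Metrics"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
            http://<ambari-host>:8080/api/v1/clusters/<cluster-name>/services/AMBARI_METRICS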

    2. After the background operation has completed, confirm on the Ambari host where you are having issues that all of the AMS components have stopped:

    # ambari-metrics-monitor status
    psutil build directory is not empty, continuing...
    ambari-metrics-monitor currently not running
    Usage: /usr/sbin/ambari-metrics-monitor {start|stop|restart|status}

    # ambari-metrics-collector status
    AMS is running as process 1062.

    If either of the processes still reports as running, issue a stop command.  In my case, I did as follows:

    # ambari-metrics-collector stop
    WARNING: ambari-metrics-collector did not stop gracefully after 5 seconds: Trying to kill with kill -9
    ERROR: Unable to kill 1062
    Stopping HBase master
    no master to stop because no pid file /var/run/ambari-metrics-collector//hbase-root-master.pid

    I then confirmed that PID 1062 was indeed dead:

    # ps -ef | fgrep 1062
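
    Had the process still been running at that point, the next step would have been to kill it by hand and then re-check, along these lines:

        # kill -9 1062
        # ps -ef | fgrep 1062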

    3. Delete the AMS HBase data:

    In the Ambari Dashboard, under the Ambari Metrics section, do a search for the following configuration values:

        hbase.rootdir

        In my case, I have AMS configured to write to HDFS so the value of this config was: hdfs://rchapin-wrkstn:8020/user/ams/hbase

        hbase.tmp.dir

        In my case, the setting was: /var/lib/ambari-metrics-collector/hbase-tmp
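
    If you would rather pull these values from the command line than dig through the dashboard, the Ambari server host typically ships a small helper script that can dump a whole config type.  The call below is only a sketch: it assumes the script exists at that path, that the Ambari server is local, and that <cluster-name> is a placeholder for your cluster's name (both hbase.rootdir and hbase.tmp.dir live in the ams-hbase-site config type).

        # /var/lib/ambari-server/resources/scripts/configs.sh get localhost <cluster-name> ams-hbase-site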

    If your hbase.rootdir points to HDFS, delete all of the data as follows (as the hdfs user):

        [hdfs@rchapin-wrkstn ~]$ hdfs dfs -rm -r -skipTrash /user/ams/hbase/*

    If it is on your local file system, simply rm -rf that data, leaving the parent directory in place.

    Delete the data in the hbase.tmp.dir (as the root user):

        [root@rchapin-wrkstn ~]# rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/*


    4. Now restart the AMS services

    First, via the Ambari Dashboard, go to the 'Ambari Metrics' section and under the 'Service Actions' dropdown click 'Start'.  If you turned on Maintenance Mode when stopping the services, you can click the checkbox to turn it back off.

    Then confirm that both the ambari-metrics-monitor and ambari-metrics-collector services have started:

    [root@rchapin-wrkstn ~]# ambari-metrics-collector status
    AMS is running as process 12229.

    [root@rchapin-wrkstn ~]# ambari-metrics-monitor status
    psutil build directory is not empty, continuing...
    Found ambari-metrics-monitor PID: 12916
    ambari-metrics-monitor running.
    Monitor PID at: /var/run/ambari-metrics-monitor/ambari-metrics-monitor.pid
    Monitor out at: /var/log/ambari-metrics-monitor/ambari-metrics-monitor.out
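
    As an extra sanity check (not something Ambari requires), you can also verify that the collector is answering again on the same 6188 timeline endpoint that was refusing connections earlier:

        # ss -tlnp | fgrep 6188
        # curl -sS http://rchapin-wrkstn:6188/ws/v1/timeline/metrics > /dev/null && echo "collector is answering"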


    If you see errors in the ambari-metrics-collector.log similar to the following, it means that there was a problem re-creating the HBase tables for the metrics-collector and that the AMS HBase tables are missing:

    2016-09-02 13:35:36,200 WARN org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.query.DefaultPhoenixDataSource: Unable to connect to HBase store using Phoenix.
    org.apache.phoenix.exception.PhoenixIOException: SYSTEM.CATALOG
            at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:108)
            at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1039)
            at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1002)
            at org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1208)
            at org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:112)
            at org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1974)
    ....
    Caused by: org.apache.hadoop.hbase.TableNotFoundException: SYSTEM.CATALOG
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1264)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146)
            at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1103)

    To fix this, delete the AMS node in zookeeper and restart the AMS services.

    Check the current value of the Ambari Metrics config: zookeeper.znode.parent.  In my case the value was: /ams-hbase-unsecure

    Stop all of the AMS services as described above.

    Then, using the zookeeper CLI, connect to your zookeeper cluster and delete the node:

    1. Start up the zookeeper client

        # /usr/hdp/current/zookeeper-client/bin/zkCli.sh

    2. Connect to the zookeeper cluster, and list the root nodes for the cluster

        [zk: localhost:2181(CONNECTED) 1] connect 127.0.0.1:2181
        [zk: 127.0.0.1:2181(CONNECTED) 2] ls /
        [hiveserver2, zookeeper, hbase-unsecure, ams-hbase-unsecure, rmstore]

    3. Delete the ams-hbase-unsecure node

        [zk: 127.0.0.1:2181(CONNECTED) 3] rmr /ams-hbase-unsecure

    4. Restart the AMS services as described above and then confirm that the ams-hbase-unsecure node was recreated in zookeeper.

        [zk: 127.0.0.1:2181(CONNECTED) 4] ls /
        [hiveserver2, zookeeper, hbase-unsecure, ams-hbase-unsecure, rmstore]

  • Dynamically Instantiating Classes in Python
    08/11/2016 5:52PM
    A common design pattern is to define a class or list of classes in configuration such that, at runtime, the classes can be dynamically instantiated.

    I've done this before in many other languages and needed to do so today in Python (2.7.11).

    It seems as though the clean way to do so is by using the Python importlib module, which enables you to cleanly import submodules dynamically.

    Following is an example:

        import importlib

        # Module and class names as they might appear in a configuration file
        klass_1_module = 'some.module.name'
        klass_1_classname = 'KlassNameOne'

        klass_2_module = 'klass2module'
        klass_2_classname = 'KlassNameTwo'

        # Import the module, then pull the class object out of it by name
        Klass1 = getattr(importlib.import_module(klass_1_module), klass_1_classname)
        klass1_instance = Klass1()

        Klass2 = getattr(importlib.import_module(klass_2_module), klass_2_classname)
        klass2_instance = Klass2('some_arg')
  • Java PowerMock Could not reconfigure JMX java.lang.LinkageError Solution
    08/09/2016 9:00AM

    If you are using Mockito and PowerMock to build mocks for your Java tests and you run into the following error:

    2016-05-05 17:31:20,204 main ERROR Could not reconfigure JMX java.lang.LinkageError: loader constraint violation: loader (instance of org/powermock/core/classloader/MockClassLoader) previously initiated loading for a different type with name "javax/management/MBeanServer"
            at java.lang.ClassLoader.defineClass1(Native Method)
            at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
            at org.powermock.core.classloader.MockClassLoader.loadUnmockedClass(MockClassLoader.java:238)
            at org.powermock.core.classloader.MockClassLoader.loadModifiedClass(MockClassLoader.java:182)
            at org.powermock.core.classloader.DeferSupportingClassLoader.loadClass(DeferSupportingClassLoader.java:70)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)


    Simply add the following as a class-level annotation to your test class:

    @PowerMockIgnore({"javax.management.*"})

  • Blacklisting Kernel Modules
    06/12/2016 12:14PM

    Following is a walkthrough on how to blacklist a kernel module.  The specific example is blacklisting the nouveau driver so that I can install the OEM Nvidia driver.

    1. First, blacklist the nouveau driver: add a line to the text file /etc/modprobe.d/nouveau-blacklist.conf that contains the keyword 'blacklist' and the name of the driver:

    blacklist nouveau
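
    For example, a quick way to create that file as root:

        # echo "blacklist nouveau" > /etc/modprobe.d/nouveau-blacklist.conf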

    2. Rebuild the initramfs image file.  First, back up the existing initramfs (the image files live in /boot):


    # mv initramfs-3.10.0-327.18.2.el7.x86_64.img initramfs-3.10.0-327.18.2.el7.x86_64.img_2016-06-09.bak

    Build the new initramfs:
    # dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
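
    Before rebooting, one way to sanity-check the new image is to confirm that the blacklist file was picked up into the initramfs (lsinitrd ships with dracut); the file should show up in the listing:

        # lsinitrd /boot/initramfs-$(uname -r).img | fgrep nouveau-blacklist.conf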

    3. Reboot the system and confirm that the driver in question is not loaded.
    # lsmod | fgrep nouveau
