Big Data – Ryan Chapin's Website

[SOLVED] Upgrading Apache Kafka 2.7 to Java 11 Changes authenticationID sent to ZooKeeper Enabling Only 1 Kafka Broker to r/w znodes

Posted on May 17, 2021 (updated May 19, 2021) by rchapin

The title of this post is a bit of a mouthful and requires some more explanation.

I am running a pure open-source version of Kafka (currently 2.7) with SASL/GSSAPI connections between all of the brokers and ZooKeeper. The whole system, including ZooKeeper, currently runs on Java 8, and an upgrade to Java 11 is long overdue.

Upgrading Kafka to Java 11 causes the server to send an incorrect authenticationID String to ZooKeeper, which results in the … → Continue reading

[SOLVED] Ambari There are no DataNodes to do rolling restarts when there are DataNodes that do need a restart

Posted on September 16, 2016 (updated January 28, 2021) by rchapin

When maintaining a Hadoop cluster, you will need to restart various services from time to time whenever you update Hadoop configurations.

I ran into a problem today with Ambari where I wanted to do a rolling restart of all of my DataNodes, but when I clicked on the “Restart DataNodes” entry in the “Restart” drop-down, the dialog indicated “There are no DataNodes to do rolling restarts”.

This was clearly incorrect.

It did not take me too long to figure out that … → Continue reading

[SOLVED] Unable to Connect to ambari-metrics-collector Issues

Posted on September 2, 2016 (updated January 28, 2021) by rchapin

I was having some issues with the ambari-metrics family of services on a ‘pseudo-distributed’ cluster that I have installed on my workstation.

The symptoms were:

1. Ambari indicated the following CRITICAL error in the Ambari Dashboard under the Ambari Metrics section:

Connection failed: [Errno 111] Connection refused to rchapin-wrkstn:6188

2. After attempting to restart the ambari-metrics-collector via either the Ambari Dashboard or the command line (# ambari-metrics-collector [stop|start]), messages similar to the following appeared in ambari-metrics-collector.log:

2016-09-02 12:15:37,505 INFO

→ Continue reading

How to Return Hive Query Results Similarly to MySQL \G in One Vertical Column

Posted on June 9, 2016 (updated January 28, 2021) by rchapin

When looking at data in a database with really wide rows, even the output of selecting a single row is nearly impossible to read when that row wraps 7 or 8 times.

MySQL offers the ‘\G’ option to display the output in a single column.

The corresponding method in Hive is to execute the following set command:

!set outputformat vertical
SELECT something FROM some_table;

→ Continue reading

[SOLVED] java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter When Using Avro Data with MapReduce

Posted on January 14, 2016 (updated January 30, 2021) by rchapin

I am working on a project and have decided to use Avro for the data serialization format.

I encountered the following error when trying to set up the unit test to test the mapper implementation through Eclipse:

java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
    at org.apache.avro.hadoop.io.AvroSerialization.getSerializer(AvroSerialization.java:114)
    at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:82)
    at org.apache.hadoop.mrunit.internal.io.Serialization.copy(Serialization.java:67)
    at org.apache.hadoop.mrunit.internal.io.Serialization.copy(Serialization.java:98)
    at org.apache.hadoop.mrunit.internal.io.Serialization.copyWithConf(Serialization.java:111)
    at org.apache.hadoop.mrunit.TestDriver.copy(TestDriver.java:676)
    at org.apache.hadoop.mrunit.TestDriver.copyPair(TestDriver.java:680)
    at org.apache.hadoop.mrunit.MapDriverBase.addInput(MapDriverBase.java:120)
    at org.apache.hadoop.mrunit.MapDriverBase.addInput(MapDriverBase.java:130)
    at org.apache.hadoop.mrunit.MapDriverBase.addAll(MapDriverBase.java:141)
    at org.apache.hadoop.mrunit.MapDriverBase.withAll(MapDriverBase.java:247)
    at com.ryanchapin.hadoop.mapreduce.mrunit.UserDataSortTest.testMapper(UserDataSortTest.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)

→ Continue reading

Hadoop Cluster Sizing Wizard by Hortonworks

Posted on July 13, 2015 (updated February 3, 2021) by rchapin

Anyone who does any Hadoop development or systems engineering arrives at the “how should I size my cluster” question.

Hortonworks has a very nice cluster sizing calculator that takes into account the basic use cases and data profile to help get you started with your hardware requirements. → Continue reading

Debugging MapReduce MRv2 Code in Eclipse

Posted on March 24, 2015 (updated March 27, 2021) by rchapin

The following describes how to set up your environment so that you can set breakpoints, step through, and debug your MapReduce code in Eclipse.

All of this was done on a machine running Linux, but it should work just fine on any *nix machine, and perhaps on Windows running Cygwin (assuming that you can get Hadoop and its native libraries compiled under Windows).

This also assumes that you are building your project with Maven.

Install a pseudo-distributed Hadoop cluster on your development box. (Yes, … → Continue reading

Passing an Array as an Argument to a Bash Function

Posted on October 20, 2014 (updated March 27, 2021) by rchapin

If you want to pass an array of items to a bash function, the simple answer is that you need to pass the expanded values. That means that you can pass the data as a quoted value, assuming that the elements are whitespace delimited, or you can pass it as a string and then split it using an updated IFS (Internal Field Separator) inside the function.
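The two approaches described above can be sketched roughly as follows. This is an illustrative sketch, not the example from the full post: the function names print_items and print_csv_items, the sample data, and the comma delimiter are all assumptions chosen for the demonstration.

```shell
#!/usr/bin/env bash

# Approach 1: expand the array at the call site. Each element arrives as
# its own positional parameter, so "$@" rebuilds the array in the function.
# (print_items is a hypothetical name used only for this sketch.)
print_items() {
  local items=("$@")
  local item
  for item in "${items[@]}"; do
    printf 'item: %s\n' "$item"
  done
}

# Approach 2: pass one delimited string and split it inside the function.
# IFS is set only for the duration of the read command, so the caller's
# IFS is left untouched. The comma delimiter is an assumption.
print_csv_items() {
  local joined=$1
  local items item
  IFS=',' read -r -a items <<< "$joined"
  for item in "${items[@]}"; do
    printf 'item: %s\n' "$item"
  done
}

fruits=("apple" "banana" "cherry pie")
print_items "${fruits[@]}"
print_csv_items "apple,banana,cherry pie"
```

Both calls print the same three lines. Quoting the expansion as "${fruits[@]}" keeps an element that contains a space, such as "cherry pie", in one piece. Splitting on a delimiter only works if the delimiter never appears inside an element.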

Following is an example of taking the output of a Hive query (a single … → Continue reading

Restarting Individual Services or the Entire HDP Stack in the Hortonworks Virtual Sandbox

Posted on October 13, 2014 (updated March 27, 2021) by rchapin

I’m using the Hortonworks Virtual Sandbox for development and testing and wanted to restart the HDP stack without (of course) having to restart the VM.

It took me a little while to figure out how to go about it as Internet searches on the topic revealed very little.

It turns out that Hortonworks have set up their own service on the box, startup_script.

If you take a look at /etc/init.d/startup_script you will see that it calls a number of other … → Continue reading

Running Dynamically Generated Hive Queries From a Shell Script

Posted on March 26, 2014 (updated March 28, 2021) by rchapin

If you want to write an HQL (Hive) query and run it multiple times from a shell script, each time passing it different data for the query, here is a quick example that should get you started.

The first thing to know is that specifying any number of -hivevar key-value pairs when invoking hive on the command line allows you to pass that data into the hive process.
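As a rough sketch of that idea (not the script from the full post), the snippet below builds a hive command line with one -hivevar pair per variable and reuses the same query file for each run. The helper build_hive_args, the file name daily_report.hql, and the variable names run_date and table_name are all hypothetical. The command is only printed, so the sketch runs without a Hive installation.

```shell
#!/usr/bin/env bash

# Build the argument list for one hive invocation from key=value pairs.
# The result is stored in the global array HIVE_ARGS.
# (build_hive_args is a hypothetical helper used only for this sketch.)
build_hive_args() {
  local script=$1
  shift
  HIVE_ARGS=()
  local pair
  for pair in "$@"; do
    HIVE_ARGS+=(-hivevar "$pair")
  done
  HIVE_ARGS+=(-f "$script")
}

# Hypothetical inputs: one run per date, each using the same query file.
for run_date in 2014-03-24 2014-03-25; do
  build_hive_args "daily_report.hql" "run_date=${run_date}" "table_name=events"
  # Dry run: print the command instead of executing it.
  # Remove the leading `echo` to actually invoke hive.
  echo hive "${HIVE_ARGS[@]}"
done
```

Inside the query file, a value passed this way would typically be referenced with Hive's variable substitution syntax, for example ${hivevar:run_date}. Keeping the arguments in an array rather than a single string means values containing spaces survive the trip to the hive process intact.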

For example, if you do the following

$ hive

→ Continue reading