Running Dynamically Generated Hive Queries From a Shell Script

If you want to write a HQL hive query and run it mulitple times from a shell script, each time passing it different data for the query, here is a quick example that should get you started.

The first thing to know is that by specifying n number of -hivevar key value pairs when invoking hive on the command line will allow you to pass that data into the hive process.

For example, if you do the following

$ hive -e 'SELECT * FROM some_table' -hivevar FOO=blah

You will have passed in a key of FOO with the value of ‘blah’ to the hive process.

A more practical example would be wanting to run the same hive query over multiple data partitions.

In this example, I’ve got a hive database that has a ‘packets’ table partitioned by hours which looks like 2014032601.

The hive query file (dest_ip_hive.sql) would look like:

SELECT packets.sourceip FROM packets
WHERE packets.destip = "${hivevar:DEST_IP}"
AND packets.hour = ${hivevar:HOUR}
GROUP BY packets.sourceip

And a shell script that would dynamically set those values for each invocation of hive would look like:

#!/bin/bash

#
# Destination IP that we are using to determine which
# packets we will examine.
#
DEST_IP="10.0.1.10"

for HOUR in 2014032209 2014032210 2014032211 2014032212
do

   echo "Running hive query for HOUR $HOUR"

   # Run a hive query from the command line setting variables that will be
   # expaned in the .sql file.
   hive -hivevar HOUR=$HOUR -hivevar DEST_IP=$DEST_IP \
   -f dest_ip_hive.sql > ${DEST_IP}-{$HOUR}.out

done

For each hour defined in the for loop, we will execute a hive command telling it to run the query contained in the file dest_ip_hive.sql.  The DEST_IP and HOUR variables that will be expanded in the query are passed to hive via the

-hivevar HOUR=$HOUR -hivevar DEST_IP=$DEST_IP

part of the hive command.  And the output for each query will be written to a different file for each query.

Leave a Reply