Enabling JMX Monitoring for Hadoop And Hive

September 21, 2012

Hadoop's NameNode and JobTracker expose interesting metrics and statistics over JMX. Hive does not seem to expose anything interesting, but it can still be useful to monitor its JVM or to do simple profiling/sampling on it. Let's see how to enable JMX and how to access it securely over SSH.

Background: We run NameNode, JobTracker and Hive on the same server. Monitoring of TaskTrackers and DataNodes isn't that interesting but might still be useful to have.

Configuration

/etc/hadoop/hadoop-env.sh


diff --git a/etc/hadoop/hadoop-env.sh b/etc/hadoop/hadoop-env.sh
index 69a13b1..e8ca596 100644
--- a/etc/hadoop/hadoop-env.sh
+++ b/etc/hadoop/hadoop-env.sh
@@ -14,7 +14,8 @@ export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
 #export HADOOP_NAMENODE_INIT_HEAPSIZE=""

 # Extra Java runtime options. Empty by default.
-export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS"
+# Added $HIVE_OPTS that is set by hive-env.sh when starting hiveserver
+export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS $HIVE_OPTS"

 # Command specific options appended to HADOOP_OPTS when specified
 export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT $HADOOP_NAMENODE_OPTS"
@@ -43,3 +44,16 @@ export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop

 # A string representing this instance of hadoop. $USER by default.
 export HADOOP_IDENT_STRING=$USER
+
+### JMX settings
+export JMX_OPTS=" -Dcom.sun.management.jmxremote.authenticate=false \
+    -Dcom.sun.management.jmxremote.ssl=false \
+    -Dcom.sun.management.jmxremote.port"
+#    -Dcom.sun.management.jmxremote.password.file=$HADOOP_HOME/conf/jmxremote.password \
+#    -Dcom.sun.management.jmxremote.access.file=$HADOOP_HOME/conf/jmxremote.access"
+export HADOOP_NAMENODE_OPTS="$JMX_OPTS=8006 $HADOOP_NAMENODE_OPTS"
+export HADOOP_SECONDARYNAMENODE_OPTS="$HADOOP_SECONDARYNAMENODE_OPTS"
+export HADOOP_DATANODE_OPTS="$JMX_OPTS=8006 $HADOOP_DATANODE_OPTS"
+export HADOOP_BALANCER_OPTS="$HADOOP_BALANCER_OPTS"
+export HADOOP_JOBTRACKER_OPTS="$JMX_OPTS=8007 $HADOOP_JOBTRACKER_OPTS"
+export HADOOP_TASKTRACKER_OPTS="$JMX_OPTS=8007 $HADOOP_TASKTRACKER_OPTS"

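The JMX_OPTS settings are used for Hadoop's daemons, while $HIVE_OPTS was added for Hive. Note that JMX_OPTS deliberately ends with -Dcom.sun.management.jmxremote.port without a value, so that each daemon line can append its own =<port> to it.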
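To see why the daemon lines write $JMX_OPTS=8006, here is a minimal sketch of the expansion. It uses a shortened JMX_OPTS and a made-up previous value of HADOOP_NAMENODE_OPTS, both simplifications for illustration. Since "=" cannot be part of a variable name, the shell expands $JMX_OPTS and then glues =8006 onto its last option.

```shell
# Shortened for illustration; the real JMX_OPTS carries more flags.
# The last option deliberately has no value yet.
JMX_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port"

# Stand-in for whatever the variable held before (an assumption for this demo).
HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,DRFAS"

# "=" ends the variable name, so "=8006" is appended to the expanded string.
HADOOP_NAMENODE_OPTS="$JMX_OPTS=8006 $HADOOP_NAMENODE_OPTS"

echo "$HADOOP_NAMENODE_OPTS"
```

The printed string contains -Dcom.sun.management.jmxremote.port=8006, which is what the JVM finally sees.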

<hive home>/conf/hive-env.sh

Enable JMX only when running the Hive thrift server. We don't want it for the command-line client etc., since it's pointless there and we would otherwise have to make sure that each client gets a unique port:


if [ "$SERVICE" = "hiveserver" ]; then
  JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8008"
  export HIVE_OPTS="$HIVE_OPTS $JMX_OPTS"
fi

Pitfalls

When you start the Hive server via hive --service hiveserver, it actually executes "hadoop jar ...", so to be able to pass options from hive-env.sh to the JVM we had to add $HIVE_OPTS to HADOOP_OPTS in hadoop-env.sh. (I haven't found a cleaner way to do it.)
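The hand-off between the two files can be simulated in a few lines. This is only a sketch: it sets SERVICE by hand, starts from empty HIVE_OPTS and HADOOP_CLIENT_OPTS (assumptions for the demo), uses a shortened option list, and prints the resulting options instead of starting a JVM.

```shell
# Set by hand for this simulation; the check in hive-env.sh relies on it
# already being set when the file is read.
SERVICE="hiveserver"
HIVE_OPTS=""
HADOOP_CLIENT_OPTS=""

# Step 1: what hive-env.sh does, JMX only for the thrift server.
if [ "$SERVICE" = "hiveserver" ]; then
  JMX_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8008"
  HIVE_OPTS="$HIVE_OPTS $JMX_OPTS"
fi

# Step 2: what the patched hadoop-env.sh does once "hadoop jar" is invoked.
HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS $HIVE_OPTS"

echo "$HADOOP_OPTS"
```

With SERVICE set to anything else, HIVE_OPTS stays empty in this simulation and HADOOP_OPTS contains no JMX port.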

Effects

When we now start Hive or any of the Hadoop daemons, they will expose their metrics at their respective ports (NameNode - 8006, JobTracker - 8007, Hive - 8008).

(The configuration above gives DataNode the same port as NameNode and TaskTracker the same port as JobTracker, so if you run them on the same machine you'll need to change their ports to be unique.)
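For such a single-machine setup, the daemon lines could look like the sketch below. The ports 8016 and 8017 are arbitrary picks for illustration, not values from our configuration, JMX_OPTS is shortened, and the final pipeline is just a quick sanity check that no port was assigned twice.

```shell
JMX_OPTS="-Dcom.sun.management.jmxremote.port"   # shortened for illustration

HADOOP_NAMENODE_OPTS="$JMX_OPTS=8006"
HADOOP_DATANODE_OPTS="$JMX_OPTS=8016"      # hypothetical port, was 8006
HADOOP_JOBTRACKER_OPTS="$JMX_OPTS=8007"
HADOOP_TASKTRACKER_OPTS="$JMX_OPTS=8017"   # hypothetical port, was 8007

# Extract every configured JMX port and keep only those that occur more than once.
duplicate_ports=$(printf '%s\n' "$HADOOP_NAMENODE_OPTS" "$HADOOP_DATANODE_OPTS" \
    "$HADOOP_JOBTRACKER_OPTS" "$HADOOP_TASKTRACKER_OPTS" \
  | sed -n 's/.*jmxremote\.port=\([0-9][0-9]*\).*/\1/p' | sort | uniq -d)

echo "duplicate JMX ports: ${duplicate_ports:-none}"
```

If two daemons were given the same port, that port number would be printed instead of "none".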

Secure Connection Over SSH

Read the post VisualVM: Monitoring Remote JVM Over SSH (JMX Or Not) to find out how to connect securely to the JMX ports over SSH, e.g. with VisualVM (spoiler: ssh -D 9696 hostname; use proxy at localhost:9696).

Note: Accessing the Metrics Via The Hadoop JMX JSON Servlet

You can also get the metrics without JMX, through the NameNode/JobTracker's web interface (JMXJsonServlet; at least in Hadoop 1.0.1):

curl -i http://localhost:50070/jmx

This returns a lot of information about both Hadoop and the JVM as JSON:


...
{
    "name" : "Hadoop:service=NameNode,name=NameNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
    "Used" : 1648071774208,
    "UpgradeFinalized" : true,
    "Free" : 613441536,
    "Safemode" : "",
    "NonDfsUsedSpace" : 92258828288,
    "PercentUsed" : 94.665405,
    "PercentRemaining" : 0.035236143,
    "TotalBlocks" : 35009,
    "TotalFiles" : 98441,
    "LiveNodes" : "{\"staging-data01.example.com\":{\"usedSpace\":399837089792,\"lastContact\":0},\"staging-data02.example.com\":{\"usedSpace\":148883341312,\"lastContact\":0}}",
    "DeadNodes" : "{}",
    "DecomNodes" : "{}",
    "Threads" : 35,
    "HostName" : "staging-name.example.com",
    "Version" : "1.0.1, r1243785",
    "Total" : 1740944044032
  },
...

You can fetch only a particular key with the qry parameter:

curl -i 'http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'

And the expected response:


HTTP/1.1 200 OK
Content-Type: application/json; charset=utf8
Content-Length: 1417
Server: Jetty(6.1.26)

{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
    ...
  } ]
}

The examples above use the NameNode's port 50070. Change it to the JobTracker's 50030 to get information about MapReduce.
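Since only the port and the bean name change between the two daemons, the URL can be assembled by a small helper. The sketch below is one way to do it; jmx_url is a hypothetical function name of my own, not something Hadoop ships, and the block only prints the URL so that it runs without a cluster.

```shell
# Build a JMXJsonServlet URL from host, web UI port and an optional bean name.
# jmx_url is a hypothetical helper, not part of Hadoop.
jmx_url() {
  if [ -n "$3" ]; then
    printf 'http://%s:%s/jmx?qry=%s\n' "$1" "$2" "$3"
  else
    printf 'http://%s:%s/jmx\n' "$1" "$2"
  fi
}

# Print the URL for the JobTracker metrics bean.
jmx_url localhost 50030 'Hadoop:service=JobTracker,name=JobTrackerMetrics'

# Against a running JobTracker, the actual request would be:
# curl -s "$(jmx_url localhost 50030 'Hadoop:service=JobTracker,name=JobTrackerMetrics')"
```

Wrapping the call in double quotes, as in the commented curl line, keeps the shell from treating the "?" in the URL as a glob character.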

Some keys of interest:

NameNode
- Hadoop:service=NameNode,name=RpcActivityForPort8020
  - RpcQueueTime_avg_time, RpcProcessingTime_avg_time - is the latency increasing?
- Hadoop:service=NameNode,name=FSNamesystemState
  - CapacityTotal, CapacityUsed, CapacityRemaining, TotalLoad, UnderReplicatedBlocks, FSState
- Hadoop:service=NameNode,name=FSNamesystemMetrics
  - CorruptBlocks, MissingBlocks (not sure it is for the whole FS, though)
- Hadoop:service=NameNode,name=NameNodeInfo
  - LiveNodes (incl. usedSpace), DeadNodes, DecomNodes, PercentRemaining / PercentUsed

JobTracker
- Hadoop:service=JobTracker,name=RpcActivityForPort8021 - as for NameNode
- Hadoop:service=JobTracker,name=JobTrackerMetrics
  - jobs_submitted, jobs_completed, jobs_failed, jobs_killed, jobs_running
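
A numeric key like the ones listed above can be turned into a simple check with standard tools only. The following is a rough sketch, not a proper JSON parser: it assumes the servlet prints one key per line, as in the output shown earlier, and works on an abridged sample of that output instead of a live request. The helper name jmx_number and the threshold of 5 percent are made up for illustration.

```shell
# Abridged sample of the NameNodeInfo response shown earlier.
sample='{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeInfo",
    "PercentUsed" : 94.665405,
    "PercentRemaining" : 0.035236143,
    "Threads" : 35
  } ]
}'

# Print the numeric value of the key given as $1, reading JSON on stdin.
# Relies on the one-key-per-line layout; hypothetical helper, not a JSON parser.
jmx_number() {
  sed -n "s/^ *\"$1\" *: *\([0-9.][0-9.]*\),\{0,1\} *\$/\1/p"
}

percent_remaining=$(printf '%s\n' "$sample" | jmx_number PercentRemaining)

# awk does the floating-point comparison; 5 is an arbitrary example threshold.
if awk -v v="$percent_remaining" 'BEGIN { exit !(v < 5) }'; then
  echo "WARNING: only ${percent_remaining}% of DFS capacity remaining"
fi
```

To run it against a live NameNode, replace the printf of the sample with a curl -s call to the qry URL for the NameNodeInfo bean. String-valued keys such as LiveNodes hold JSON encoded inside a string and need a real JSON parser.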

Tags: monitoring DevOps data