Hadoop dashboard overview
When working properly, a Hadoop cluster can handle a truly massive amount of data — there are plenty of production clusters managing petabytes of data each. Monitoring each of Hadoop’s subcomponents — HDFS, MapReduce and YARN—is essential to keeping jobs running and the cluster humming.
Datadog’s comprehensive Hadoop dashboard displays key pieces of data to monitor for each subcomponent in a single pane of glass. This page breaks down the metrics featured on that dashboard to provide a starting point for anyone looking to monitor Hadoop.
What is Hadoop?
Apache Hadoop is an open source framework for distributed storage and processing of very large data sets on computer clusters.
Hadoop began as a project to implement Google’s MapReduce programming model, and has become synonymous with a rich ecosystem of related technologies, not limited to: Apache Pig, Apache Hive, Apache Spark, Apache HBase.
Set up real-time Hadoop monitoring in minutes with Datadog's out-of-the-box Hadoop dashboard.
Hadoop dashboard metrics breakdown
HDFS metrics
The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster. It provides scalable, fault-tolerant, rack-aware data storage designed to be deployed on commodity hardware.
Total nodes
The number of alive data nodes in your cluster. Ideally this number will be equal to the number of DataNodes you’ve provisioned for the cluster.
Dead nodes
The number of dead data nodes in your cluster. It is important to alert on the NumDeadDataNodes
metric because the death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.
Losing multiple DataNodes will start to be very taxing on cluster resources, and could result in data loss.
Volume failures
This monitors the number of failed volumes in your Hadoop cluster. Though a failed volume will not bring your cluster to a grinding halt, you most likely want to know when hardware failures occur, if only so that you can replace the failed hardware.
Total blocks
HDFS split large files into manageable pieces known as blocks. Tracking the total number of blocks across the cluster is essential to continued operation.
Under replicated blocks
The the number of blocks with an insufficient number of replicas. If you see a large, sudden spike in the number of under-replicated blocks, it is likely that a DataNode has died — this can be verified by correlating under-replicated block metric values with the status of DataNodes.
HDFS disk usage
This metric tracks the total disk usage across the entire HDFS cluster.
HDFS remaining/node
The disk space remaining for a particular DataNode. If left unrectified, a single DataNode running out of space could quickly cascade into failures across the entire cluster as data is written to an increasingly-shrinking pool of available DataNodes.
See your real-time Hadoop data in minutes with Datadog's out-of-the-box Hadoop dashboard.
YARN metrics
YARN (Yet Another Resource Negotiator) is the framework responsible for assigning computational resources for application execution. The YARN metrics below provide information on the execution of individual applications as well as the cluster and node level.
YARN application metrics
Progress
Progress gives you a real-time window into the execution of a YARN application. Because application execution can often be opaque when running hundreds of applications on thousands of nodes, tracking progress alongside other metrics can better help you to determine the cause of any performance degradation.
Tracking a particular execution status, such as submitted
, running
, done
, pending
, killed
and failed
, offers context that can help clarify progress metric values. Applications that go extended periods without making progress should be investigated.
Allocated memory/app
This is a high-level view of the amount of RAM allocated per application.
Allocated vCores/app
This is the number of virtual cores allocated per application.
YARN cluster metrics
Memory usage
By tracking the combination of the totalMB
and allocatedMB
metrics you can gain a high-level view of your cluster’s memory usage. Keep in mind that YARN may over-commit resources, which can occasionally translate to reported values of allocatedMB
which are higher than totalMB
.
Total vCores
The total number of virtual cores in the cluster.
Containers
The number of containers in the cluster, where containers represent a collection of physical resources—an abstraction used to bundle resources into distinct, allocatable units.
Virtual core usage
This metric tracks the virtual core usage across the cluster.
Container usage
The total number of containers allocated, aggregated by cluster.
YARN node metrics
Active
The number of currently active nodes. These are the normally operating nodes given by the activeNodes
metric.
Sick
This indicates the unhealthy nodes in the cluster. YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
to be unhealthy.
Rebooted
The number of nodes in the cluster that have rebooted.
Lost
If a NodeManager fails to maintain contact with the ResourceManager, it will eventually be marked as “lost” and its resources will become unavailable for allocation. This tracks the lostNodes
metric.
Avg containers/node
This tracks the average number of containers running per host.
See your real-time Hadoop data in minutes with Datadog's out-of-the-box Hadoop dashboard.
Memory usage
This monitors the memory in use broken down by node.
Virtual core usage
This metric measures the virtual core usage of each node.
MapReduce metrics
The MapReduce framework exposes a number of statistics to track on MapReduce job execution. These metrics can provide and invaluable mechanism that lets you see what is actually happening during a MapReduce job run.
Maps running/completed
The number of maps that are running and have successfully run on the cluster.
Pending maps/reduces
The number of maps and reduces that are queued for processing.
Reduces running/completed
The volume of reduce tasks that are running, and that have successfully run on the cluster.
Configure a Hadoop dashboard in minutes with Datadog
If you’d like to start visualizing your Hadoop metrics in our out-of-the-box dashboard, you can try Datadog for free. The Hadoop dashboard will be populated immediately after you set up the integration.
For a deep dive on Hadoop metrics and how to monitor them, check out our four-part How to Monitor Hadoop series.