Data Jobs Monitoring | Datadog
Data Jobs Monitoring

Observe, troubleshoot, and cost-optimize your Spark and Databricks jobs across data pipelines

Data Jobs Monitoring (DJM) helps data platform teams and data engineers detect problematic Spark and Databricks jobs anywhere in their data pipelines, remediate failed and long-running jobs faster, and proactively right-size overprovisioned compute resources to reduce costs. Unlike traditional infrastructure monitoring tools, vendor-native interfaces, and log analysis, DJM enables teams to drill down into job execution traces at the Spark stage and task level to resolve issues quickly, and to correlate their job telemetry with their cloud infrastructure, in context with the rest of their data stack.


Detect job failures and latency spikes anywhere in your data pipelines

  • Notify teams immediately when their jobs have failed or are still running beyond the expected completion time using out-of-the-box alerts
  • Visualize trends and anomalies in job performance to quickly analyze your data platform’s reliability and estimated costs
  • Prioritize job issue resolution more efficiently by using recommended filters to surface the important issues impacting job and cluster health, such as failures, latency, cost spikes, and more
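To make the out-of-the-box alerting workflow concrete, here is a hedged sketch of building a failed-job monitor definition for Datadog's Monitors API (`POST /api/v1/monitor` with a `"query alert"` type monitor, which are documented Datadog features). The metric name `databricks.jobs.failed`, the tag scheme, and the threshold are illustrative assumptions, not DJM's exact query syntax; consult the DJM docs for the real alert queries.

```python
import json

# Hedged sketch: a monitor definition that fires when a Databricks job
# reports failures. The endpoint and "query alert" monitor type are
# documented Datadog features; the metric name `databricks.jobs.failed`
# and the tag names below are illustrative assumptions.

def build_failed_job_monitor(job_name: str) -> dict:
    """Return a monitor definition alerting on any failure of `job_name`."""
    query = (
        "sum(last_5m):"
        f"sum:databricks.jobs.failed{{job_name:{job_name}}}.as_count() > 0"
    )
    return {
        "name": f"[DJM] {job_name} failed",
        "type": "query alert",
        "query": query,
        "message": f"Databricks job {job_name} failed. Notify @data-platform.",
        "tags": [f"job:{job_name}", "team:data-platform"],
        "options": {"thresholds": {"critical": 0}},
    }

if __name__ == "__main__":
    # POST this JSON to https://api.datadoghq.com/api/v1/monitor with
    # DD-API-KEY and DD-APPLICATION-KEY headers to create the monitor.
    print(json.dumps(build_failed_job_monitor("nightly_etl"), indent=2))
```

Creating the monitor once per critical job (or templating it over a list of jobs) gives each team a real-time failure alert without waiting for someone to check the Databricks UI.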

Pinpoint and resolve failed and long-running jobs faster

  • Drill down into a detailed trace view of a job to see the full execution flow (i.e., job, stages, and tasks) and where it failed for full troubleshooting context
  • Uncover the root cause of slow jobs by identifying inefficient Spark stages or SQL queries that could be impacted by data skew, disk spill, or other common factors
  • Compare recent runs of a job to expedite root cause analysis, surfacing trends and changes in run duration, Spark performance metrics, cluster utilization, and configuration

Reduce costs by optimizing misallocated clusters and inefficient jobs

  • Lower compute costs by identifying overprovisioned clusters and changing the number of worker nodes and instance types
  • Increase job run efficiency at the application level by using Spark execution metrics to determine improvements in the code or configuration
  • Surface the largest savings opportunities by viewing idle compute for the largest jobs and cluster utilization over time, segmented by data team or environment, to see which are incorrectly provisioned
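To make "idle compute" concrete, here is a small illustrative calculation (not a Datadog API) of the kind of provisioned-but-unused spend surfaced per team or environment; the node rate, hours, and utilization figure are made-up example values.

```python
# Illustrative only: estimate the cost of compute that was provisioned
# but sat idle over a period, given average cluster utilization.
def idle_compute_cost(hourly_rate: float, hours: float, avg_utilization: float) -> float:
    """Cost of provisioned-but-unused compute over a period."""
    return hourly_rate * hours * (1.0 - avg_utilization)

# e.g. a 10-node cluster at $2.50/node-hour ($25/h total), averaging
# 30% utilization over a 720-hour month:
cost = idle_compute_cost(hourly_rate=25.0, hours=720.0, avg_utilization=0.30)
print(round(cost, 2))  # 12600.0 — roughly $12.6k/month of idle spend
```

Numbers like this, tracked per team, point directly at the clusters where reducing worker counts or switching instance types pays off first.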

Centralize data pipeline visibility with the rest of your cloud infrastructure

  • Gain complete data pipeline visibility in a unified dashboard, viewing data storage, warehouse, and orchestrator metrics from other key technologies such as Snowflake and Airflow in the same place as your job telemetry
  • Pivot seamlessly between key data pipeline signals, such as infrastructure metrics, Spark metrics, logs, and configuration, to understand what influenced job failures or latency spikes
  • Accelerate incident response and debugging with flexible tagging that routes alerts for data pipeline issues to the right teams

Technologies and Platforms Supported

Databricks, Spark, Amazon EMR, Kubernetes

Customer Testimonials

Data Jobs Monitoring enables my organization to centralize our data workloads in a single place, with the rest of our applications and infrastructure, which has dramatically improved our confidence in the platform we are scaling. As a result, my team is able to resolve our Databricks job failures 20% faster with DJM because of how easy it is to set up real-time alerting and find the root cause of the failing job.
Matt Camilli

Head of Engineering at Rhythm Energy

Resources

OFFICIAL DOCS

Data Jobs Monitoring

BLOG

Hadoop & Spark monitoring with Datadog (/blog/monitoring-spark)

BLOG

Monitor Databricks with Datadog (/blog/databricks-monitoring-datadog)

BLOG

Troubleshoot and optimize data processing workloads with Data Jobs Monitoring (/blog/data-jobs-monitoring)