When you have a complex IT environment with many disparate tools, data sources, and teams, alert noise becomes overwhelming. This can delay incident response and cause missed alerts, ultimately leading to critical incidents and outages. Datadog Event Management’s Event Correlation groups and deduplicates events and alerts, reducing noise and helping response teams act on alerts faster. Traditionally, setting up and maintaining effective correlations required advanced knowledge of every service and team in your organization, along with an understanding of which alerts might relate to the same issue. To help with this, Event Management provides an AIOps-driven solution that automates the setup and management of alert grouping and continuously improves existing correlations.
In this post, we’ll look at how Datadog Event Management:
- Automatically creates event correlations and enhances pattern-based groupings
- Provides deep context around the scope and impact of incidents across teams, services, and infrastructure to reduce response time
Automatically group alerts across your environment
Correlation rules are labor-intensive to manage manually and hard to keep up to date in evolving environments. Central Ops teams must account for cross-service dependencies, which often requires deep system knowledge. Event Management helps reduce this overhead by using AI for correlations in two ways:
Automated Intelligent Correlation
Intelligent Correlation uses machine learning algorithms to detect correlation patterns in your environment and automatically group Datadog alerts together into cases. To create more precise correlations, Datadog factors in the topology of your services and looks at key telemetry. For example, Datadog might notice error rate alerts from two services that are connected and identify that they are related. Intelligent Correlations can also be created based on other heuristics, such as events sharing tags, attributes, triggering time, and other associated resources like dashboards.
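To make the idea concrete, here is a minimal sketch of topology-aware grouping. It is not Datadog's algorithm, which this post does not describe in detail. The `Alert` type, the `related` and `correlate` functions, the dependency map, and the five-minute window are all assumptions chosen for illustration. Two alerts are linked when they fire close together in time and their services are the same, directly connected, or share a tag; linked alerts are then merged into one case with a union-find structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape for illustration, not a Datadog API object.
    id: str
    service: str
    triggered_at: float  # seconds since epoch
    tags: frozenset = frozenset()


def related(a, b, topology, window):
    """Heuristic link: close in time AND (same or connected services OR a shared tag)."""
    if abs(a.triggered_at - b.triggered_at) > window:
        return False
    connected = (
        a.service == b.service
        or b.service in topology.get(a.service, set())
        or a.service in topology.get(b.service, set())
    )
    return connected or bool(a.tags & b.tags)


def correlate(alerts, topology, window=300.0):
    """Group alerts into cases by merging every related pair (union-find)."""
    parent = {a.id: a.id for a in alerts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for i, a in enumerate(alerts):
        for b in alerts[i + 1:]:
            if related(a, b, topology, window):
                parent[find(a.id)] = find(b.id)

    cases = {}
    for a in alerts:
        cases.setdefault(find(a.id), []).append(a.id)
    return sorted(sorted(ids) for ids in cases.values())


# Assumed topology: checkout calls payments; search is unrelated.
topology = {"checkout": {"payments"}}
alerts = [
    Alert("a1", "checkout", 1000.0),
    Alert("a2", "payments", 1060.0),
    Alert("a3", "search", 1030.0),
    Alert("a4", "payments", 9000.0),
]
print(correlate(alerts, topology))  # [['a1', 'a2'], ['a3'], ['a4']]
```

In this toy data, the checkout and payments alerts land in one case because the services are connected and the alerts fire a minute apart, while the later payments alert falls outside the window and stays separate. A production system would weigh many more signals than this pairwise rule does.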
Enriched pattern-based correlations
Event Management also uses AI to enrich pattern-based correlations: it detects events that may be related to a pattern you’ve created and adds them, improving correlations over time. Enriched alerts are indicated with the binocular icon, as seen below.
Machine learning enables Datadog Event Management to continuously refine its understanding of event patterns and suggest additional events that may be related. This means progressively fewer false alerts and more relevant incidents reaching your team’s attention, enabling better focus and reduced alert fatigue. Customers can also manually remove alerts from intelligently correlated cases to provide feedback for model retraining.
Quickly triage and prioritize incident response
Intelligent Correlation helps you reduce MTTR by cutting duplicated effort and enabling better prioritization of incidents. Event Management allows you to create specific views based on team, data center, or however an NOC team divides its work. Once a team has curated a view that matches its responsibilities, it gets a bird’s-eye view of all the issues happening at one time and can prioritize which correlations to investigate or escalate first by looking at the alert count and priority level of each case.
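The ordering logic behind such a view can be sketched in a few lines. This is an illustration only: the `Case` fields and the `triage_order` function are hypothetical, and the sketch assumes that a lower priority number means a more urgent case (P1 before P2), with larger alert counts breaking ties.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical case summary; field names are assumptions for illustration.
    title: str
    team: str
    priority: int     # assumed: 1 is most urgent
    alert_count: int


def triage_order(cases, team):
    """Return one team's cases, most urgent first, larger cases breaking ties."""
    mine = [c for c in cases if c.team == team]
    return sorted(mine, key=lambda c: (c.priority, -c.alert_count))


cases = [
    Case("Checkout latency", "payments", 2, 14),
    Case("Disk pressure", "infra", 1, 3),
    Case("Payment errors", "payments", 1, 6),
    Case("Retry storm", "payments", 2, 40),
]
for c in triage_order(cases, "payments"):
    print(f"P{c.priority} {c.alert_count:>3} alerts  {c.title}")
```

For the payments team, this lists "Payment errors" first, then "Retry storm", then "Checkout latency".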
Once you click into a correlated case, Intelligent Correlation gives you immediate visibility into the full scope and impact of the issue so you can quickly determine whether it needs to be escalated. By understanding dependencies across services, infrastructure, and telemetry, teams can quickly pinpoint root causes and implement fixes, cutting down on the time spent gathering and cross-referencing data. For example, the event below shows a waterfall structure in correlated error rate alerts across service dependencies, highlighting where the symptoms started and indicating where to begin your investigation. A visualization of the topology of the relevant services further shows you how this correlation was created and the relationships between the group of alerts.
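A waterfall like the one described above amounts to ordering a case's alerts by trigger time. The sketch below shows that idea under stated assumptions: the `waterfall` function, the service names, and the timestamps are all made up for illustration, and the earliest alert is only a hint about where to start looking, not a confirmed root cause.

```python
def waterfall(alerts):
    """Order (service, triggered_at) pairs by time; return offsets from the first alert."""
    if not alerts:
        return []
    ordered = sorted(alerts, key=lambda a: a[1])
    start = ordered[0][1]
    return [(service, t - start) for service, t in ordered]


# Hypothetical error rate alerts from one correlated case (seconds since epoch).
case_alerts = [
    ("web", 1090.0),
    ("payments-db", 1000.0),
    ("payments", 1030.0),
    ("checkout", 1060.0),
]
steps = waterfall(case_alerts)
for service, offset in steps:
    print(f"+{offset:>4.0f}s  {service}")
print("start investigating at:", steps[0][0])
```

Here the symptoms begin at `payments-db` and spread to its dependents over the next 90 seconds, so that service is the suggested starting point.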
With deeper insights into how services and systems interconnect, Intelligent Correlation helps teams investigate and respond in less time, reducing delays from unnecessary back-and-forth.
Accelerate issue response with automated AI correlation
By automatically correlating events across multiple systems and services, Datadog Event Management provides a comprehensive view of an incident’s impact, adding context that may not have been clear from a single alert. This helps Central Ops and NOC teams create more complete incidents for the proper teams to triage, and reduces duplicated effort among responders. Users can also enable automated routing and remediation through workflow automation, or leave those steps to NOC teams, facilitating smoother cross-team coordination and faster, more unified responses.
See our documentation for more information on Datadog Event Management’s Intelligent Correlation. If you’re not already a customer, sign up for a free 14-day trial.