Go1 transforms workflows and significantly reduces service outages with Datadog observability | Datadog

case study

About Go1

Go1 empowers millions of learners with expertly curated content from diverse, industry-leading providers. Delivered through seamless integrations with learning tools that customers already use, Go1 enables learners to take charge of their own learning journeys.

Education and Training
603 Employees
Brisbane, Australia
“Datadog has become embedded into our way of working; it’s the norm at Go1. Teams are proactively notified about any potential issues before they escalate and interrupt services.”
Jon Ducrou, Executive Vice President of Engineering and Data, Go1

Why Datadog?

  • Creates visibility across all back-end services
  • Drives new ways of working, including a culture of ownership and accountability across teams
  • Fits seamlessly into Go1's own ways of working, letting the company express its unique business problems with customized dashboards

Challenge

Debugging failures and identifying issues across a vast number of test runs was time-consuming and complex.

Use Case

Application Performance Monitoring

Key results

↑75% monthly active users

By reducing outages from weekly to twice in six months

↓92 to 19 days

Decreased time to resolution for all bugs

↓28%

Reduced infrastructure costs

Outages restricted teams’ ability to build new features

Go1 connects millions of professionals with more than 80,000 curated courses from over 250 content partners, delivered through more than 75 learning integrations, to aid their ongoing learning and development. An average of three courses are completed on Go1’s platform every second. Uninterrupted application performance is crucial to support asynchronous learning at this scale.

When Jon Ducrou, executive vice president of Engineering and Data, joined Go1 in early 2021, the company commonly experienced what he calls ‘high severity events’—weekly outages that prevented users from completing courses.

These incidents also left Go1’s teams fighting uphill battles to identify causes and determine the best way to fix them. In addition, there was no system in place to effectively support the team's work when employees went on holiday. Key-person risks were built into the way teams worked, with some features or processes known only to a particular person. This made it difficult to keep employees on target, and teams often spent hours focusing on what was broken rather than building features for customers.

Ducrou set out to develop a roadmap to operational excellence, supported by new workflows based on ownership and accountability among teams. Datadog was a vital part of this program to improve application performance and service uptime and to enhance the customer experience.

Reducing outages and critical incidents

Go1 introduced Datadog’s Observability Platform prior to Ducrou’s tenure. The solution was initially deployed to establish visibility into logs. Go1 faced issues where containers would die, and the logs would be lost—Datadog’s first use case was to capture those logs.

Priority quickly shifted to application performance monitoring (APM) and the ability to view and analyze metrics and spans to inform engineering investigations and trigger alerts before issues escalated. Once teams were assigned ownership over their services, they were encouraged to think about application health and what constitutes a happy customer.

“Datadog’s APM was the start of getting away from a world of hurt,” says Ducrou. “It has given us observability as we scale, aggregating data by looking at our back-end infrastructure stack, and flagging when something is going wrong through APM tracing across all services simultaneously.”

Three years ago, Go1 experienced one major outage per week. With the help of Datadog, it has seen only two outages in six months. In addition, Go1 has gone from 28 to 17 bugs per developer per year, with critical incidents down from 0.8 to 0.15 per developer per year. Crucially, the average time to resolution for all bugs has plummeted from 92 to 19 days.
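To put those before-and-after figures on a common scale, the relative reductions can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_reduction` and the dictionary labels are assumptions made for this example, and the only inputs are the numbers quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# (before, after) pairs taken from the figures quoted in the case study.
# The labels are hypothetical names chosen for this example.
figures = {
    "bugs per developer per year": (28, 17),
    "critical incidents per developer per year": (0.8, 0.15),
    "average days to resolve a bug": (92, 19),
}

# Percentage reduction for each metric.
reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in figures.items()
}

for name, pct in reductions.items():
    print(f"{name}: about {pct:.0f}% lower")
```

On these inputs, bugs per developer fall by roughly 39 percent, critical incidents by roughly 81 percent, and average time to resolution by roughly 79 percent.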

“We are now finding issues before customers are impacted, as part of either launch testing or just not having issues in the first place,” says Ducrou. “This also reduces team interruptions, allowing developers to focus on value-creating development rather than swapping back-and-forth to maintenance tasks.”

Nurturing operational excellence

Datadog has enabled Go1’s new workflows by helping establish ownership, accountability, and visibility as the company scales.

“You can’t just create an operationally excellent culture. It’s something you need to build and nurture—something that Datadog underscores,” says Ducrou. “It’s less about features and more about how the technology shapes how we work, what we do, and how we deliver it to ensure customers continue to come back and find positive outcomes from our services.”

Ducrou adds that Datadog has become a way of working and the norm at Go1. “Teams are proactively notified about any potential issues before they escalate and interrupt services, allowing employees to quickly identify the correct data to resolve an issue and engage other teams when needed—even before something is broken,” he says.

The combination of an ambitious strategy to reach operational excellence and the Datadog Observability Platform has increased monthly active users by 75 percent and reduced infrastructure costs by 28 percent.

Ultimately, Go1’s transformation has given it the foundation to effectively integrate a recent acquisition, Blinkist, and its 30 million users into the business and replicate the same APM standards to mitigate outages.

Resources

guide

2024 Gartner® Magic Quadrant™ for Digital Experience Monitoring

guide

2024 Gartner® Magic Quadrant™ for Observability Platforms

official docs

Getting started with Datadog APM

blog

Datadog Distributed Tracing: live-query all ingested traces, retain only the ones you need