Outages restricted teams’ ability to build new features
Go1 connects millions of professionals with more than 80,000 curated courses from over 250 content partners, along with more than 75 learning integrations, to aid their ongoing learning and development. An average of three courses are completed on Go1’s platform every second. Uninterrupted application performance is crucial to support asynchronous learning at this scale.
When Jon Ducrou, executive vice president of Engineering and Data, joined Go1 in early 2021, the company commonly experienced what he calls ‘high severity events’—weekly outages that prevented users from completing courses.
These incidents also left Go1’s teams fighting uphill battles to identify causes and determine the best way to fix them. In addition, no system was in place to cover the work of employees who went on holiday. Go1’s way of working carried key person risks: some features or processes were known to only one person. This made it difficult to keep employees on target, and teams often spent hours focusing on what was broken rather than building features for customers.
Ducrou set out to develop a roadmap to operational excellence, supported by new workflows based on ownership and accountability among teams. Datadog was a vital part of this program to improve application performance and service uptime and to enhance the customer experience.
Reducing outages and critical incidents
Go1 introduced Datadog’s Observability Platform before Ducrou’s tenure. The solution was initially deployed to establish visibility into logs: when containers died, their logs were lost, so Datadog’s first use case was to capture those logs.
Priority quickly shifted to application performance monitoring (APM) and the ability to view and analyze metrics and spans to inform engineering investigations and trigger alerts before issues escalated. Once teams were assigned ownership over their services, they were encouraged to think about application health and what constitutes a happy customer.
“Datadog’s APM was the start of getting away from a world of hurt,” says Ducrou. “It has given us observability as we scale, aggregating data by looking at our back-end infrastructure stack, and flagging when something is going wrong through APM tracing across all services simultaneously.”
Three years ago, Go1 experienced one major outage per week. With the help of Datadog, it has seen only two outages in six months. In addition, bugs have fallen from 28 to 17 per developer per year, and critical incidents from 0.8 to 0.15 per developer per year. Crucially, the average time to resolution for all bugs has dropped from 92 to 19 days.
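To put the figures above on a common scale, the short sketch below converts each before-and-after pair into a relative reduction. It is an illustration only: the helper name `percent_reduction` and the dictionary labels are hypothetical, and the percentages are derived here from the numbers quoted in this story rather than reported by Go1 or Datadog.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Before/after pairs taken from the paragraph above.
# The labels are illustrative, not Go1's or Datadog's metric names.
metrics = {
    "bugs per developer per year": (28, 17),
    "critical incidents per developer per year": (0.8, 0.15),
    "average days to resolve a bug": (92, 19),
}

# Round to whole percentages for readability.
reductions = {
    name: round(percent_reduction(before, after))
    for name, (before, after) in metrics.items()
}

for name, pct in reductions.items():
    print(f"{name}: about {pct}% lower")
```

On these inputs, the bug rate works out to roughly 39 percent lower, critical incidents to roughly 81 percent lower, and average resolution time to roughly 79 percent lower.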
“We are now finding issues before customers are impacted, as part of either launch testing or just not having issues in the first place,” says Ducrou. “This also reduces team interruptions, allowing developers to focus on value-creating development rather than swapping back-and-forth to maintenance tasks.”
Nurturing operational excellence
Datadog has enabled Go1’s new workflows by helping establish ownership, accountability, and visibility as the company scales.
“You can’t just create an operationally excellent culture. It’s something you need to build and nurture—something that Datadog underscores,” says Ducrou. “It’s less about features and more about how the technology shapes how we work, what we do, and how we deliver it to ensure customers continue to come back and find positive outcomes from our services.”
Ducrou adds that Datadog has become a way of working and the norm at Go1. “Teams are proactively notified about any potential issues before they escalate and interrupt services, allowing employees to quickly identify the correct data to resolve an issue and engage other teams when needed—even before something is broken,” he says.
The combination of an ambitious strategy to reach operational excellence and the Datadog Observability Platform has increased monthly active users by 75 percent and reduced infrastructure costs by 28 percent.
Ultimately, Go1’s transformation has given it the foundation to effectively integrate a recent acquisition, Blinkist, and its 30 million users into the business and replicate the same APM standards to mitigate outages.