Identifying issues across a vast number of test runs proves complex
Sonos is committed to delivering “sound as it should sound” to 15 million customers around the world. As the inventor of multi-room wireless home audio, Sonos helps the world listen better by giving people access to the content they love and allowing them to control it however and wherever they choose. The Sonos app is key to enabling customers to use their Sonos systems to stream music, radio, and audiobooks from a variety of services, including Spotify, Amazon Music, and Apple Music.
To help it deliver unparalleled service to its customers, Sonos manages approximately 2,000 Jenkins jobs daily across a complex build, test, and deploy cycle for cross-platform native applications. This was a daunting task for engineers, as debugging failures and identifying issues across such a vast number of test runs was both time-consuming and complex. CI pipelines typically ran at night in the US, which was daytime for Sonos’s EU team. This often meant that the EU team had to wait an entire day to get feedback from CI. Replicating the CI infrastructure in the EU was an option, but that would have undermined the company’s goal of having one unified pipeline for the entire organization.
The company experimented with various tools and manual processes, including tracking CI executions and results in spreadsheets, but found it difficult to pinpoint the exact code changes or features that may have led to a failure. Sesha Gudimella, senior manager for software development at Sonos, wanted to empower the development and quality assurance teams to monitor and receive alerts tailored to their specific needs. “Previously, we relied on a monthly tracker to monitor job progress and analyze details such as infrastructure versus code failure rates,” says Gudimella. “However, this process became cumbersome due to the vast amount of data and limited time for retrospection.”
Gudimella also sought to optimize infrastructure costs and unblock teams. “Streamlining CI and testing was essential for enhancing productivity and ensuring timely response to any issues that may arise,” he says.
Real-time data enables team to pinpoint failures instantly
Sonos selected Datadog CI Pipeline Visibility because it delivers a unified platform for monitoring and visualization that integrates seamlessly with existing systems and tools, providing a single interface for monitoring processes. With the implementation of CI Pipeline Visibility, Gudimella and his team are now gradually shifting away from traditional methods of documenting or tracking job progress. Today, Sonos engineers have real-time data that allows them to pinpoint failures instantly. This has enabled them to establish Service Level Objectives (SLOs) on a daily basis, a significant improvement over monthly assessments.
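To make the idea of a daily SLO concrete, here is a minimal sketch in Python of how a day's job results could be rolled up into a success rate, split into infrastructure and code failures, and compared against a target. It is an illustration only: the `JobRun` record shape, the `daily_slo_report` function, and the 95 percent target are hypothetical assumptions, not Sonos's actual process or a Datadog API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class JobRun:
    """One CI job execution (hypothetical record shape, assumed for illustration)."""
    day: date
    succeeded: bool
    failure_kind: Optional[str] = None  # "infra" or "code" when the run failed


def daily_slo_report(runs, target=0.95):
    """Group runs by day and report success rate, failure split, and SLO status.

    The default target of 0.95 is an assumed value, not one taken from the article.
    """
    by_day = defaultdict(list)
    for run in runs:
        by_day[run.day].append(run)

    report = {}
    for day, day_runs in sorted(by_day.items()):
        total = len(day_runs)
        passed = sum(1 for r in day_runs if r.succeeded)
        failed = [r for r in day_runs if not r.succeeded]
        rate = passed / total
        report[day] = {
            "success_rate": rate,
            "infra_failures": sum(1 for r in failed if r.failure_kind == "infra"),
            "code_failures": sum(1 for r in failed if r.failure_kind == "code"),
            "slo_met": rate >= target,
        }
    return report
```

Evaluating the rollup per day rather than per month is what lets a team react to a dip in the success rate while the responsible changes are still fresh.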
Datadog has also helped the Sonos engineering team improve visibility into the builds and jobs they perform, leading to a better ability to debug performance and reliability issues in CI. Previously, identifying the root cause of failures required investigating multiple sources. Now, it’s easier to identify and address failures promptly. “The CI Pipeline Visibility tool pulls data from all of our Jenkins jobs across build, test, and deploy, which makes it easy to have a deeper view from Datadog instead of looking at multiple sources or platforms,” says Gudimella. “We get better visibility on Jenkins runs and day-to-day operations along with success/failure values and reasons for improvement per pipeline.”
Reducing CI costs by 50 percent
Consolidating multiple tools into a single monitoring tool across platforms has reduced Sonos’s CI-related engineering costs by approximately 50 percent. “Previously, we were spending hundreds of dollars per day on failure-related expenses,” says Gudimella. “Now, this expense is redirected to Datadog and maintenance costs have significantly decreased. This shift toward proactive monitoring is indicative of our move towards a ‘shift-left’ approach. We now have daily insights into execution time, queue time, and blocked time, which aids in better infrastructure management and tooling.”
Time to results has also improved. Rather than replicate its CI infrastructure in the EU, Sonos used Datadog to optimize that infrastructure so that pipeline runs complete faster. Developers in both the US and EU now get near real-time feedback on their code commits, builds, and tests instead of delaying important work due to slow CI. “We now see results immediately, so we can reduce the time required to fix the problems,” adds Gudimella.
Today, Sonos has built out an extensive infrastructure upgrade roadmap, including a modernization plan using CI Pipeline Visibility. The upgrade will enable Sonos to continue to deliver “sound as it should sound” to millions of audio enthusiasts around the globe. Building on this success, Sonos is now expanding the adoption of Datadog across more teams and integrating additional tools into the platform. “This expansion is proving beneficial for us,” says Gudimella. “It’s also a learning process as we navigate the increased volume of data compared to what we’ve dealt with in the past.”