Ensuring a Highly Available and Highly Scalable Platform | Datadog
CASE STUDY

Ensuring a Highly Available and Highly Scalable Platform

Learn how Datadog enabled Braze’s customer-facing teams to shorten time spent resolving incidents and accelerate customer communication

About Braze

Braze is an American cloud-based company that develops customer relationship management software to help businesses with multichannel marketing. Its software is used by 2.5 billion users and processes approximately 1.5 billion messages per month.


Key Results

90%

Reduction in processing time on Braze's email rendering service.

1 solution

All eight product teams can now use a single, unified monitoring platform with Datadog.


Challenge

The teams at Braze had multiple tools for evaluating application performance and scaling their systems. Many of these tools were not effective; for instance, tickets submitted to the technical customer support team were often escalated directly to the Product and DevOps engineering teams, which led to an increase in MTTR and impacted customer experience.


Why Datadog?

Datadog’s unified observability platform doesn't require extensive onboarding or knowledge of a specialized query language, which enabled Braze to standardize their process across teams of all technical levels. The technical support team was able to use Datadog's out-of-the-box integrations with Slack and PagerDuty to stay on top of tickets and rapidly resolve client-facing issues. Meanwhile, product teams were able to utilize dashboards to both forecast future needs and investigate performance-related issues.
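
The Slack and PagerDuty routing mentioned above can be sketched as a monitor definition. This is a minimal illustration, not Braze's actual configuration: the metric name, threshold, Slack channel, and PagerDuty service are hypothetical, and the dictionary only mirrors the general shape of a Datadog metric monitor (a query, a message with @-notification handles, and thresholds). It builds the payload without sending it anywhere.

```python
import json


def build_monitor(metric, threshold, slack_channel, pagerduty_service):
    """Return a dict shaped like a Datadog metric-monitor definition.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "name": f"High {metric}",
        "type": "metric alert",
        # Trigger when the 5-minute average of the metric exceeds the threshold.
        "query": f"avg(last_5m):avg:{metric}{{*}} > {threshold}",
        # @-handles in the message route the alert to the Slack and PagerDuty integrations.
        "message": (
            f"{metric} is above {threshold}. "
            f"@slack-{slack_channel} @pagerduty-{pagerduty_service}"
        ),
        # The critical threshold repeats the value used in the query.
        "options": {"thresholds": {"critical": threshold}},
    }


# Hypothetical example values, for illustration only.
monitor = build_monitor(
    "braze.example.queue_latency", 500, "support-alerts", "SupportOnCall"
)
print(json.dumps(monitor, indent=2))
```

Keeping the definition as plain data means it can be reviewed and versioned like any other configuration before it is submitted through whichever client a team already uses.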


Twin challenges: Scaling observability & empowering customer support

Every day, Braze processes more than 8 billion API requests and sends more than 2.7 billion messages to a network of over 2.4 billion monthly active users. Jamie Doheny, Chief of Staff for Braze’s engineering organization, notes, “We needed a solution that helped us overcome the challenge of scaling our organization, including debugging incidents, and rapidly resolving customer support tickets.” To put this in perspective, Braze grew its employee headcount by 57% in 2019, ending the year with almost 500 employees. The company also has more than 900 customers and is one of the few companies that continue to grow their employee headcount and customer base despite the pandemic.

To maintain performance at this scale, the team needed a way to handle both this traffic and the volume of inbound customer requests.

  • Scaling observability: With a growing engineering organization, teams had different techniques and tools for evaluating the performance of their applications and scaling their systems. The organization needed a uniform way of determining whether provisioned infrastructure was appropriate for the traffic, forecasting future needs, and investigating performance-related issues.

  • Enabling customer support: Historically, many technical customer support tickets were escalated directly to the Product and DevOps engineering teams, effectively bypassing the Global Services and Support team when the questions were about performance, uptime, or throughput. This distracted engineering and deprived Support and Success of the ability to quickly resolve customer tickets. Braze’s award-winning Global Services and Support team needed a solution to better visualize and more rapidly resolve client-facing issues.

Solution: A unified observability platform

Braze was looking for a unified observability platform to deploy throughout the organization. By adopting one observability platform, Braze could standardize its engineering teams’ scaling process and enable the support team to solve most customer tickets. Braze chose to deploy Datadog and saw the platform as the path to achieving maximum observability, delivering a stellar user experience, and establishing the proper processes to scale confidently. Datadog is accessible to users of all technical levels, without the need for extensive onboarding or specialized query languages.

Business outcomes

  • Establishing processes to address scaling challenges
    Cloud-native from the beginning, Braze’s DevOps team has over the years built an enterprise-grade, multi-cloud infrastructure that is designed to scale. Datadog has empowered Braze’s engineers to observe the state of that system, and to rapidly identify and respond to state changes, via dashboards, monitors, and integrations with tools like Slack and PagerDuty. Another top priority is ensuring that best practices are in place around incident management. On-call engineers can quickly detect and diagnose problems, helping the organization meet its service level objectives (SLOs).

    Each of the eight product teams has the same standardized dashboards and monitors, which is important as Braze’s engineering organization continues to grow rapidly. Datadog APM is essential for understanding each service called upon in a request. For example, the team found that in one customer’s case, 90% of the time taken to process a particular request was spent on email rendering. The team was then able to dig into the code to optimize this process. Further, the ability to dive into granular aspects of the code allows Braze to understand how its users are interacting with the product and to ensure that the company makes, and will continue to make, all interactions with the product as seamless as possible.

  • Providing data and information to improve customer support
    The engineering team gave key customer-facing teams (Customer Success, and Global Services and Support) access to Datadog in order to quickly resolve customer problems. To do this, the team created dashboards that displayed metrics around incidents and performance. They were able to scale this process using template variables, which allowed the teams to filter the widgets on each board at a glance. Some variables, such as company ID, allow Customer Success Managers (CSMs) to quickly check these dashboards and relay technical information to their customers, all without having to go through additional channels. Then, when discussing the metrics with each customer, the CSMs can compare them with those of other customers that have similar configurations in place and use that information to shorten the time spent on diagnostics. CSMs now have the tools they need to solve front-line problems and deliver a best-in-class product to their customers.
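
The template-variable approach described above can be sketched as a dashboard definition. This is an illustrative sketch under stated assumptions, not Braze's real dashboard: the metric names are hypothetical, and the dictionary only follows the general shape of a Datadog dashboard definition, in which a template variable such as `company_id` is referenced as `$company_id` inside each widget query so that one selection filters every widget on the board.

```python
import json


def build_dashboard(title, metrics, variable="company_id"):
    """Return a dict shaped like a Datadog dashboard with one template variable.

    Metric names passed in are hypothetical placeholders.
    """
    widgets = [
        {
            "definition": {
                "type": "timeseries",
                "title": metric,
                # "$company_id" is replaced by whatever value the viewer selects.
                "requests": [{"q": "avg:" + metric + "{$" + variable + "}"}],
            }
        }
        for metric in metrics
    ]
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": widgets,
        # One variable drives every widget; "*" shows all companies by default.
        "template_variables": [
            {"name": variable, "prefix": variable, "default": "*"}
        ],
    }


# Hypothetical metrics, for illustration only.
dashboard = build_dashboard(
    "Customer health (example)",
    ["braze.example.messages_sent", "braze.example.api_latency"],
)
print(json.dumps(dashboard, indent=2))
```

Because every widget query shares the same variable, a CSM who selects a single company ID sees all charts scoped to that customer without editing any individual query.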

Moving forward with confidence

Braze’s engineering department continually drives SRE and DevOps best practices, with an emphasis on breaking down the silos between and within departments—and promoting individual and team accountability. By enabling so many different teams to be directly involved in triaging, Datadog has empowered Braze to resolve incidents immediately and adhere to SLAs, thus ensuring high customer satisfaction.

“Over the past few years, Braze has built a best-in-class customer engagement platform that is used by the world’s leading brands. As a company, we will continue scaling our platform and expanding our organization to support rapidly increasing market demand, which presents new and exciting challenges. Datadog has enabled us to rally around one platform that will help us scale over the near and long-term.”

Jamie Doheny
Chief of Staff, Braze
