Twin challenges: Scaling observability & empowering customer support
Every day, Braze processes more than 8 billion API requests and sends more than 2.7 billion messages to a network of over 2.4 billion monthly active users. Jamie Doheny, Chief of Staff for Braze’s engineering organization, notes, “We needed a solution that helped us overcome the challenge of scaling our organization, including debugging incidents, and rapidly resolving customer support tickets.” To put this in perspective, Braze grew its employee headcount by 57% in 2019, ending the year with almost 500 employees. The company also has more than 900 customers and is one of the few companies that continue to grow both employee headcount and customer base despite the pandemic.
To maintain performance at this scale, the team needed a way to handle both the traffic and the volume of inbound customer requests.
Scaling observability:
As the engineering organization grew, teams adopted different techniques and tools for evaluating the performance of their applications and scaling their systems. The organization needed a uniform way of determining whether provisioned infrastructure was appropriate for the traffic, forecasting future needs, and investigating performance-related issues.
Enabling customer support:
Historically, many technical customer support tickets were escalated directly to the Product and DevOps engineering teams, effectively bypassing the Global Services and Support team when the questions were about performance, uptime, or throughput. This distracted engineering and deprived Support and Success of the ability to quickly resolve customer tickets. Braze’s award-winning Global Services and Support team needed a solution to better visualize and more rapidly resolve client-facing issues.
Braze set out to find a unified observability platform to deploy throughout the organization. By adopting one observability platform, Braze could standardize its engineering teams’ scaling process and enable the support team to solve most customer tickets. Braze chose to deploy Datadog and saw the platform as the path to achieving maximum observability, delivering a stellar user experience, and establishing the proper processes to scale confidently. Datadog is accessible to users of all technical levels, without the need for extensive onboarding or specialized query languages.
Business outcomes
Establishing processes to address scaling challenges
Braze has been cloud-native from the beginning, and over the years its DevOps team has built an enterprise-grade, multi-cloud infrastructure that is designed to scale. Datadog has empowered Braze’s engineers to observe the state of that system, and to rapidly identify and respond to state changes, via dashboards, monitors, and integrations with tools like Slack and PagerDuty. Another top priority is ensuring that best practices are in place around incident management: on-call engineers can quickly detect and diagnose problems, helping the organization meet its service level objectives (SLOs).
Each of the eight product teams has the same standardized dashboards and monitors, which is important as Braze’s engineering organization continues to grow rapidly. Datadog APM is essential for understanding each service called upon in a request. For example, the team found that in one customer’s case, 90% of the time taken to process a particular request was spent on email rendering. The team was then able to dig into the code to optimize this process. Further, the ability to dive into granular aspects of the code allows Braze to understand how its users are interacting with the product and to ensure that the company makes, and will continue to make, every interaction with the product as seamless as possible.
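The kind of breakdown described above can be sketched in a few lines. This is only an illustration, not Braze's or Datadog's implementation: the span records, the operation names, and the `self_time_ms` field are hypothetical, and the numbers are made up to mirror the 90% figure. It assumes each span already reports its exclusive (self) time, so that durations can be summed without double-counting nested calls.

```python
from collections import defaultdict


def time_share_by_operation(spans):
    """Return the fraction of total self-time spent in each operation.

    `spans` is a list of dicts with hypothetical keys:
      - "operation": a label for the work done (e.g. "email.render")
      - "self_time_ms": time spent in that span, excluding child spans
    """
    totals = defaultdict(float)
    for span in spans:
        totals[span["operation"]] += span["self_time_ms"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {op: t / grand_total for op, t in totals.items()}


# Illustrative data only: one request broken into three operations.
example_spans = [
    {"operation": "api.auth", "self_time_ms": 20.0},
    {"operation": "db.query", "self_time_ms": 30.0},
    {"operation": "email.render", "self_time_ms": 450.0},
]

shares = time_share_by_operation(example_spans)
slowest = max(shares, key=shares.get)

for op, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op}: {share:.0%}")
```

Sorting operations by their share of the request makes the dominant cost stand out, which is the signal that tells a team where to start optimizing.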
Providing data and information to improve customer support
The engineering team gave key customer-facing teams (Customer Success, and Global Services and Support) access to Datadog so that they could quickly resolve customer problems. To do this, the team created dashboards that display metrics around incidents and performance. They scaled this process using template variables, which let the teams filter every widget on a board at a glance. Some variables, such as company ID, allow Customer Success Managers (CSMs) to check these dashboards and relay technical information to their customers without having to go through additional channels. When discussing the metrics with a customer, CSMs can also compare against other customers that have similar configurations in place and use that information to short-circuit time spent on diagnostics. CSMs now have the tools they need to solve front-line problems and deliver a best-in-class product to their customers.
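As a rough sketch of how a single template variable can drive every widget on a board, the snippet below builds a dashboard definition as a plain Python dictionary. The field names (`template_variables`, `widgets`, `requests`, `q`) follow the Datadog dashboard JSON format as commonly documented, but should be checked against the current API reference. The metric names, the `company_id` tag key, and the dashboard title are all hypothetical, and nothing here is sent to Datadog.

```python
import json


def build_support_dashboard(metric_names, tag_key="company_id"):
    """Build a dashboard definition whose widgets all filter on one variable.

    `metric_names` and `tag_key` are illustrative assumptions; real metric
    and tag names depend on how an application is instrumented.
    """
    widgets = []
    for metric in metric_names:
        widgets.append({
            "definition": {
                "type": "timeseries",
                "title": metric,
                # "$company_id" is replaced by whatever value the viewer
                # selects in the dashboard's template-variable dropdown.
                "requests": [{"q": f"avg:{metric}{{${tag_key}}}"}],
            }
        })

    return {
        "title": "Customer health (example)",
        "layout_type": "ordered",
        "template_variables": [
            # "*" as the default shows all companies until one is selected.
            {"name": tag_key, "prefix": tag_key, "default": "*"}
        ],
        "widgets": widgets,
    }


# Hypothetical metrics a support team might want to see per customer.
dashboard = build_support_dashboard(
    ["app.api.latency", "app.messages.sent", "app.api.errors"]
)
print(json.dumps(dashboard, indent=2))
```

Because every query references the same variable, a CSM changes one dropdown value and all widgets re-scope to that customer, so one board serves every account instead of one board per customer.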
Moving forward with confidence
Braze’s engineering department continually drives SRE and DevOps best practices, with an emphasis on breaking down the silos between and within departments and on promoting individual and team accountability. By enabling so many different teams to be directly involved in triaging, Datadog has empowered Braze to resolve incidents immediately and adhere to SLAs, thus ensuring high customer satisfaction.
“Over the past few years, Braze has built a best-in-class customer engagement platform that is used by the world’s leading brands. As a company, we will continue scaling our platform and expanding our organization to support rapidly increasing market demand, which presents new and exciting challenges. Datadog has enabled us to rally around one platform that will help us scale over the near and long term.”
Jamie Doheny
Chief of Staff, Braze