Achieving Observability Excellence: Key Central Responsibilities
November 14, 2024
Over the course of nearly five years, my role as a Technical Account Manager at Datadog has immersed me in the realm of observability, allowing me to support, advise, and engage with various organizations on their observability journeys. Throughout this period, I’ve not only witnessed the evolution of observability practices but also identified key responsibilities that underpin successful observability initiatives. In this comprehensive exploration, I’ll delve into these responsibilities, ranging from foundational to advanced, shedding light on their significance and impact on organizational maturity.
Foundational Responsibilities
At the heart of any central observability team lies a set of core responsibilities that form the bedrock of effective observability practices. As organizations increasingly prioritize their core business functions, the adoption of observability solutions from Software as a Service (SaaS) providers has become commonplace. Consequently, one of the primary responsibilities of central teams is to define requirements, evaluate vendor offerings, and negotiate terms to ensure alignment with organizational needs and budget constraints.
This task entails a thorough examination of various factors, including the scalability, flexibility, and cost-effectiveness of each solution. For instance, consider a scenario where a company seeks an observability platform capable of handling the diverse needs of its microservices architecture as well as various technologies. In such cases, the central team must assess the scalability of each vendor’s offering to accommodate the dynamic nature of microservices environments.
Once a suitable observability solution is selected and deployed, another critical responsibility emerges: providing comprehensive training for organizational teams. From introductory sessions to advanced workshops, the central team must equip users with the knowledge and skills needed to use the observability platform effectively. This includes making use of the vendor’s free resources as well as identifying services that help increase knowledge and the application of best practices. Moreover, ongoing training and support must be planned to accommodate new hires and evolving platform features. In some cases, access to the platform is restricted to users who have completed the training.
Elevating Maturity
As organizations progress along their observability journey, the focus shifts towards advancing maturity levels through additional responsibilities and initiatives. Among these is the imperative to uphold security compliance standards across the observability stack. With security becoming more and more important, central teams must rigorously test each version of the observability solution to identify vulnerabilities and ensure adherence to internal security policies. In those situations, the central team will often investigate new CVEs and enforce upgrades if necessary.
Furthermore, cost management emerges as a crucial aspect of observability maturity. Beyond monitoring expenses, central teams must implement mechanisms for accurately allocating costs to individual teams and cost centers. This granular approach not only promotes accountability but also facilitates optimization efforts by identifying areas of overspending or inefficiency. The most advanced teams compare observability cost not only to its perceived value but also to cloud cost as a reference. Even though such a ratio is not an absolute measure, it gives a good indication of which services lack observability and which leverage more capabilities for more value.
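To make the comparison concrete, here is a minimal sketch of how a central team might allocate observability spend per team and compute the observability-to-cloud cost ratio per service. The `ServiceCost` record, the function names, and the 2% and 30% thresholds are all assumptions chosen for illustration; they are not Datadog features or recommended values, and each organization would pick its own.

```python
from dataclasses import dataclass


@dataclass
class ServiceCost:
    """Hypothetical monthly cost record for one service (amounts in USD)."""
    service: str
    team: str
    cloud_cost: float
    observability_cost: float


def observability_ratio(cost: ServiceCost) -> float:
    """Observability spend as a fraction of cloud spend for one service."""
    if cost.cloud_cost <= 0:
        raise ValueError(f"{cost.service}: cloud cost must be positive")
    return cost.observability_cost / cost.cloud_cost


def cost_by_team(costs: list[ServiceCost]) -> dict[str, float]:
    """Sum observability spend per team so it can be charged to cost centers."""
    totals: dict[str, float] = {}
    for c in costs:
        totals[c.team] = totals.get(c.team, 0.0) + c.observability_cost
    return totals


def flag_outliers(
    costs: list[ServiceCost], low: float = 0.02, high: float = 0.30
) -> tuple[list[str], list[str]]:
    """Split out services below `low` or above `high` (illustrative thresholds).

    Returns (possibly under-observed services, heavy observability users).
    """
    under = [c.service for c in costs if observability_ratio(c) < low]
    heavy = [c.service for c in costs if observability_ratio(c) > high]
    return under, heavy
```

A low ratio is only a prompt for a conversation with the owning team, not a verdict: a small, stable service may legitimately need very little telemetry.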
In parallel, the establishment of Role-Based Access Control (RBAC) policies plays a pivotal role in enhancing platform security and governance. By defining roles and permissions based on user responsibilities and organizational hierarchies, central teams can mitigate the risk of unauthorized access, misuse of observability data, or simple mistakes. For instance, it is common to review API usage with Audit Trail, advocate for Service Accounts, or simply provide a Terraform repository that creates assets protected via RBAC.
Advanced Initiatives
As organizations reach advanced stages of observability maturity, they embark on initiatives that go beyond basic operational requirements. One such initiative involves standardizing data collection methods and defining Service Level Indicators (SLIs) to ensure consistency and reliability across diverse service offerings. SLIs are critical in identifying what a good experience looks like for users, both internal and external. They eventually shape Service Level Objectives (SLOs) as well as error budgets.
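The link between an SLO and its error budget is simple arithmetic, sketched below under the assumption of an event-based SLI (good events divided by total events over a window). The function names and the figures in the usage note are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_events)
    bad_events = total_events - good_events
    return 1.0 - bad_events / budget
```

For example, with a 99% target and 100,000 requests in the window, the budget is roughly 1,000 failed requests; if 500 requests have failed so far, about half of the budget remains.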
Consider the example of a company operating a complex ecosystem of interconnected services. By implementing a standardized approach to data collection and SLI definition, teams can effectively monitor and measure performance metrics, enabling proactive issue resolution and continuous improvement. For instance, when the frontend team aims for a p90 latency below 300 ms, the services it depends on must also measure their own p90 to ensure compliance and effectiveness throughout the app.
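As an illustration of measuring every service the same way, the sketch below computes p90 with the nearest-rank method and reports the services that miss their target. The nearest-rank definition is one of several percentile conventions and is my choice here, not necessarily what any given platform uses; the service names and targets in the usage note are hypothetical.

```python
import math


def percentile_nearest_rank(samples: list[float], percentile: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `percentile` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def services_over_target(
    latencies_ms: dict[str, list[float]], targets_ms: dict[str, float]
) -> dict[str, float]:
    """Return each service whose p90 is not below its target, with that p90."""
    over: dict[str, float] = {}
    for service, samples in latencies_ms.items():
        p90 = percentile_nearest_rank(samples, 90)
        if p90 >= targets_ms[service]:
            over[service] = p90
    return over
```

Calling `services_over_target` with a 300 ms target for a hypothetical `frontend` and a tighter target for each of its dependencies gives one consistent report across teams. Keep in mind that percentiles of chained calls do not simply add up, so the per-dependency targets need to be agreed on rather than derived mechanically.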
Another advanced responsibility entails the enforcement of tagging standards and the monitoring of outliers within the observability platform. This ensures correlation is automated and the navigation through multiple data sources is seamless. By maintaining consistency in tagging practices and promptly identifying deviations from established norms, central teams can ensure data integrity and facilitate accurate analysis across the organization. This ultimately ensures that the platform delivers the highest value to all users.
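A tagging standard can be checked mechanically. The sketch below assumes tags are `key:value` strings and that the organization requires `env`, `service`, and `team` on every asset; that required set and the asset names in the usage note are assumptions for illustration, and how the tags are fetched from the platform is left out.

```python
REQUIRED_TAG_KEYS = frozenset({"env", "service", "team"})  # assumed standard


def tag_keys(tags: list[str]) -> set[str]:
    """Extract the key part of every well-formed `key:value` tag."""
    return {tag.split(":", 1)[0] for tag in tags if ":" in tag}


def missing_tag_keys(
    tags: list[str], required: frozenset[str] = REQUIRED_TAG_KEYS
) -> list[str]:
    """Required tag keys that are absent from an asset, sorted for stable output."""
    return sorted(required - tag_keys(tags))


def find_noncompliant(assets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each non-compliant asset name to the tag keys it is missing."""
    report: dict[str, list[str]] = {}
    for name, tags in assets.items():
        missing = missing_tag_keys(tags)
        if missing:
            report[name] = missing
    return report
```

Run on a schedule against monitors, dashboards, or hosts, a report like this turns the tagging standard into a short, actionable list for each team.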
Evaluation and Expansion
At this stage, observability is seen more as an investment than a cost, offering considerable benefits to the business by revealing areas for expansion, exposing inefficiencies, and supplying user analytics.
During this pursuit of observability maturity, central teams face the challenge of evaluating the readiness and effectiveness of individual teams. This task involves assessing not only the adoption of observability practices but also the proficiency and alignment of teams with organizational objectives.
Tools like Datadog offer valuable insights into team maturity levels through metrics such as product adoption rates, analytics, and custom rules that can be applied on various assets. With this data, central teams can tailor expansion efforts and training initiatives to address areas of low adoption or proficiency effectively.
Moreover, central teams can empower teams with lower maturity levels by providing templates and best practices for observability instrumentation. Ready-made solutions for common monitoring and alerting scenarios let these teams expedite the setup process and focus on deriving value from observability data.
Fostering Community
As organizations continue to invest in observability, the formation of internal communities emerges as a strong catalyst for innovation and knowledge sharing. These communities serve as incubators for ideas, enabling cross-sharing of insights and best practices across teams and departments.
Consider the example of a central observability team facilitating the exchange of success stories and use cases within the organization through community, standardization, and tool building. By showcasing the tangible benefits of observability initiatives, such as improved system reliability or enhanced customer experiences, central teams can inspire and motivate teams to embrace observability practices. One of my customers offers a clear example: a tool built by a single team to optimize Kubernetes clusters, with clear cost savings, has been shared with the rest of the organization.
In conclusion, the journey towards observability maturity is characterized by a series of evolving responsibilities and initiatives undertaken by central teams. From foundational tasks such as vendor evaluation and training provision to advanced initiatives like SLI standardization and community building, each step contributes to the overall effectiveness and impact of observability practices within organizations. The TAM (Technical Account Manager) service offering at Datadog supports this evolution of maturity through continuous guidance, shared experience, and strategy building. If any of the above sounds interesting to you, we’re looking for talented candidates.