Attaining rapid and tangible results
Within 75 days of the project start date, the ECCO Select team completed an automated deployment of the Datadog Agent to 4,800 hosts running in both on-prem data centers and cloud-hosted environments. This gave them monitoring coverage of over 95 percent of their infrastructure, including around 4,000 containers, network and storage devices, and databases. They also seamlessly transitioned over 1,100 monitor templates, covering infrastructure services, logs, and synthetic tests, from the legacy monitoring system, validating the migration through a carefully phased approach to ensure accuracy and reliability.
“We now have a comprehensive solution that not only speeds up root cause analysis when there’s an issue, but continuously provides the visibility we need to keep our systems secure and resilient.”
ECCO Select’s infrastructure support team realized the value of Datadog almost immediately when it helped them detect and resolve a recurring memory issue that had been plaguing them for months. The issue had generated more than 700 service desk tickets, each requiring manual investigation. The longer the problem persisted, the more time and resources were needlessly spent at all levels of the Enterprise Service Desk team.
The legacy monitoring tool had left them in the dark, offering only "high" or "low" thresholds for memory utilization. Datadog, by contrast, immediately revealed a problematic process caught in a loop, triggered by a change in a disaster recovery location that had seemed harmless at the time. The change disrupted replication efforts, causing a widespread memory issue that affected systems enterprise-wide.
Armed with this information, the team used insights from Datadog to correlate standard high/low memory alerts with process log information, then implemented a straightforward fix. The problematic process, which had been undetectable with the older monitoring tools, was permanently resolved.
“Reflecting on our journey, we started with limited visibility—unable to see beyond top-level alerts, facing spotty access, and lacking meaningful log data,” says Condon.