The ops team at this tech startup monitors thousands of servers and the services running on them.
Each server and service was programmed to send multiple notifications every day with status updates, warnings and alerts.
The email- and SMS-based “fire-hose” of ops notifications was frustrating the team. They were inundated with alerts, making it hard to find the signal in the noise.
Users outside the operations team wanted to receive high-priority alerts (e.g. website down), but they also received far too many irrelevant notifications.
Escalation processes were difficult to automate. There was no efficient way to update all stakeholders during fire-fighting. This was particularly challenging when users were not in front of the online dashboard.
The company needed a better way to manage ops notifications.
The company deployed Teamchat for its ops team and integrated it with their Nagios monitoring server. The Nagios “bot” (now a Teamchat user) would send alerts via Teamchat. All alerts were posted as “smart” messages with additional context such as host, service and severity.
Messages related to the same host or service were threaded and aggregated, dramatically reducing clutter. Messages were color-coded by severity, making them easier to track.
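As an illustration of the threading and color-coding described here, the following is a minimal sketch, not Teamchat's actual implementation. The `Alert` class, the `thread_alerts` and `thread_color` functions, and the color mapping are all hypothetical names chosen for this example; only the severity labels follow the standard Nagios service states.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed mapping from Nagios service states to display colors.
# The colors actually used in the deployment are not stated in the source.
SEVERITY_COLORS = {
    "OK": "green",
    "WARNING": "yellow",
    "CRITICAL": "red",
    "UNKNOWN": "gray",
}


@dataclass
class Alert:
    """Hypothetical alert record carrying the context fields named above."""
    host: str
    service: str
    severity: str
    message: str


def thread_alerts(alerts):
    """Group alerts into threads keyed by (host, service), in arrival order."""
    threads = defaultdict(list)
    for alert in alerts:
        threads[(alert.host, alert.service)].append(alert)
    return dict(threads)


def thread_color(thread):
    """Color a thread by the severity of its most recent alert."""
    return SEVERITY_COLORS.get(thread[-1].severity, "gray")
```

With this grouping, two alerts for the same host and service collapse into a single thread, which then takes the color of its most recent alert.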
Messages with specific priorities were sent to relevant people; a failure to respond triggered the escalation process automatically.
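One way to picture the automatic escalation is a chain of recipients with an acknowledgement timeout. The sketch below is a simplified model under stated assumptions: the chain, the five-minute timeout, and the names `Incident` and `current_recipient` are invented for illustration and are not taken from the company's configuration or from any Teamchat or Nagios API.

```python
from dataclasses import dataclass

# Assumed escalation chain and acknowledgement timeout (illustrative only).
ESCALATION_CHAIN = ["on_call_admin", "ops_lead", "ops_manager"]
ACK_TIMEOUT_SECONDS = 300


@dataclass
class Incident:
    """Hypothetical record of an alert that is waiting for a response."""
    host: str
    service: str
    raised_at: float  # seconds since some reference time
    acknowledged: bool = False


def current_recipient(incident, now, chain=ESCALATION_CHAIN,
                      timeout=ACK_TIMEOUT_SECONDS):
    """Return who should be notified at time `now`, or None once acknowledged.

    Each full timeout that passes without a response moves the alert one
    step up the chain; it stays with the last person once the chain ends.
    """
    if incident.acknowledged:
        return None
    level = int((now - incident.raised_at) // timeout)
    return chain[min(level, len(chain) - 1)]
```

In this model an unanswered alert moves from the on-call administrator to the ops lead after five minutes, and acknowledging it at any point stops further escalation.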
The ops team set up workflows for handling escalation paths, status updates and more.
In fire-fighting mode, the engineers only had to post updates in one place; all stakeholders were auto-updated via Teamchat.
The ops team became much more productive as ops notifications became more manageable.
High-severity alerts were not missed. Low-severity alerts were handled promptly because they were routed straight to the right person.
Even when the right systems administrator was occasionally unavailable, the escalation process kicked in automatically.
Notifications also came with a set of pre-defined corrective actions that the ops manager could initiate, leading to faster resolution of issues.
The ops team did not have to be at the dashboard constantly because they could get alerts on their phones.
Other stakeholders outside the ops team got full visibility too: they received the relevant high-severity alerts, as well as real-time updates on corrective measures being taken. This eliminated the communication overhead of responding to concerned stakeholders.
Connect with Teamchat to learn how it can benefit your business.