Pelle, Infrastructure Specialist at Sinch, writes about how we monitor the Sinch backend!
Here at Sinch, we naturally take infrastructure monitoring seriously. If our system comes to a halt, we lose customers, partners, and reputation. To keep our uptime, we monitor a lot of things, and some examples are:
Monitoring and graphing everything is valuable, but it is of no use unless we get alerted when something goes wrong, and we all know that in this business, things do go wrong. There are bugs, errors, hardware failures, network outages, and so on. When something happens, we must be notified as soon as possible so we can take appropriate action to keep our systems going, whether that means re-routing traffic, fixing a broken server, or notifying our providers about upstream/downstream problems.
To achieve this, we use an arsenal of infrastructure monitoring tools: home-brewed, open source, and commercial. All of these are tied together in our main monitoring and alerting tool – op5 Monitor.
In op5 Monitor, we monitor our servers and operating systems, which is what it is built for, but we also have our other systems report into op5 Monitor through various channels. We use database queries in custom shell scripts to get a “live view” of what’s going on in different parts of the telephony system (PSTN, VoIP, payments, fraud). We also use another tool from op5 – op5 Trapper. op5 Trapper is an SNMP trap receiver in which you can create custom handlers (in Lua) that execute actions in op5 Monitor based on the OID of the trap that arrives.
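To make the database-backed checks described above a bit more concrete, here is a minimal sketch of a Nagios-compatible check, which is the plugin convention op5 Monitor understands: one status line on standard output, optional performance data after a `|`, and an exit code of 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. It is written in Python rather than shell purely for illustration. The check name `FAILED_CALLS`, the thresholds, and the `fetch_failed_calls` function are hypothetical stand-ins for a real query and are not taken from our setup.

```python
# Standard Nagios-compatible plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin state (higher values are worse)."""
    if value is None:
        return UNKNOWN  # the query returned nothing usable
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, warn, crit):
    """Return (exit_code, status_line); perfdata follows the '|' separator."""
    state = evaluate(value, warn, crit)
    if value is None:
        return state, f"{name} {LABELS[state]} - no data returned"
    perfdata = f"value={value};{warn};{crit}"
    return state, f"{name} {LABELS[state]} - value={value} | {perfdata}"


def fetch_failed_calls():
    """Hypothetical placeholder for the real database query."""
    return 7


def main():
    """Print the status line and return the exit code for the scheduler."""
    state, line = format_output("FAILED_CALLS", fetch_failed_calls(), warn=10, crit=25)
    print(line)
    return state
```

A real plugin would end with `raise SystemExit(main())` so that the monitoring system sees the exit code. Keeping the threshold logic in `evaluate` makes it easy to test without touching the database.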
We have developed our own monitoring system that keeps track of almost everything that goes on in all parts of the system. It has counters and alerts, which we use in a variety of ways. We use it to graph everything from the number of active calls and how many SIP responses of type YYY a certain provider and/or country has right now, to network latency in our signaling system and RTP streams on our RTP servers.
This system also constantly sends SNMP traps to op5 Trapper. If a monitored service is OK, it sends OK traps, and if things go fubar, it sends critical traps for the service, and op5 Monitor handles the notification part.
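The "always report, OK or critical" pattern described above can be sketched roughly as follows. This is an illustrative Python outline and not our actual implementation: the `Trap` record, the `report_all` function, and the `send` callable are hypothetical names, and the real SNMP encoding (OIDs, transport) is deliberately left out. The point it shows is that every service produces a trap on every cycle, and a check that itself breaks is reported as critical instead of silently going quiet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Trap:
    """Hypothetical in-memory stand-in for an SNMP trap."""
    service: str
    severity: str  # "OK" or "CRITICAL"
    detail: str = ""


def report_all(checks: Dict[str, Callable[[], bool]],
               send: Callable[[Trap], None]) -> List[Trap]:
    """Run every check once and send exactly one trap per service."""
    sent = []
    for service, check in checks.items():
        try:
            healthy = check()
            detail = ""
        except Exception as exc:
            # A check that cannot run is treated as a failure of the service.
            healthy = False
            detail = f"check failed: {exc}"
        trap = Trap(service, "OK" if healthy else "CRITICAL", detail)
        send(trap)  # in a real system this would emit the SNMP trap
        sent.append(trap)
    return sent
```

Calling `report_all` from a scheduler on a fixed interval gives the receiving side a steady stream of states per service, with the notification logic left entirely to the monitoring tool.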
When something does go wrong, our 24×7 NOC personnel are notified through SMS, email, and Pebble smartwatch, and they respond to the alert within the timeframes that are set up, depending on the affected customer SLAs, the type of problem, and its severity.
These alerts and responses are reported as incidents, which we follow up on the next business day to make sure that the problem won’t occur again, if it is within our power to prevent it.
We are constantly improving and developing our infrastructure monitoring environment with new tools, scripts, and solutions to give our customers and partners even better service uptime.
In the next part, I will write a bit about how we use SMS and Pebble in our monitoring setup.
If you have any questions or want more information about infrastructure monitoring, please contact email@example.com