Inbound SMS not delivered via SMPP
Incident Report for TSG Global
Postmortem

Again, we sincerely apologize for the recent outage you may have experienced: inbound SMS traffic to SMPP customers stopped after a faulty application version was deployed. Below is the post-mortem:

Overview
Inbound SMS traffic to SMPP customers stopped after a faulty application version was deployed.

What Happened?
Because of the faulty deploy, all inbound SMS traffic to SMPP customers stopped from ~5 AM EST on 1/30/23 until ~2 AM EST on 1/31/23. The release had been tested on staging with new unit tests and test data, but errors in those new tests meant the issue was not caught.

Resolution
As soon as the issue was identified, the previous version was restored and the traffic resumed.

Root Causes
This issue was caused by a faulty application release whose defects were not caught on staging. Several factors kept the issue from being noticed earlier:

  1. one large client had SMPP connection issues over the weekend before the faulty release was deployed to staging
  2. one of our upstream vendor binds had an issue at the same time
  3. a joint metric monitored both HTTP API and SMPP clients; because HTTP API traffic resumed or was unaffected, the combined figure did not drop to 0, and so alarms were not triggered
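The masking effect in the third point can be illustrated with a small sketch. The channel names, metric name, and counts below are hypothetical and are not taken from our monitoring stack; the only point is that a zero-traffic check on a summed metric stays silent while one channel is down.

```python
from typing import Dict, List


def zero_traffic_alarms(delivered: Dict[str, int]) -> List[str]:
    """Return the metric names whose delivered-message count for the window is zero."""
    return [name for name, count in delivered.items() if count == 0]


# Hypothetical counts for one monitoring window during the incident.
per_channel = {"http_api": 1200, "smpp": 0}

# A joint metric sums both channels, so it stays above zero while HTTP flows.
joint = {"inbound_sms_total": sum(per_channel.values())}

joint_alarms = zero_traffic_alarms(joint)        # nothing flagged
split_alarms = zero_traffic_alarms(per_channel)  # flags the SMPP channel
```

With these illustrative numbers the joint check flags nothing, while the per-channel check flags `smpp`.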

Impact
All SMPP customers' inbound SMS traffic was delayed by a few hours, queued, and delivered in a large batch once connections resumed.

What went well?

  • As soon as the issue was correctly detected, reverting applications to the previous version quickly resolved the issue.

What didn't go so well?

  • Alarms did not go off, and our team was not immediately alerted to the issue via Slack or PagerDuty
  • Staging testing did not catch the issue (again, this was newer test data used for traffic mocking on staging)

Action items

  • Additional metrics will be added to SMSC and API endpoints so that each is monitored separately, and alarms will be added accordingly
  • Additional metrics will report client bind restarts, and an alarm will fire if the restart rate exceeds a reasonable threshold
  • Staging mocking apps will be improved (work in progress, partially done) to catch errors like these
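As a rough illustration of the bind-restart alarm described above, here is a minimal sliding-window sketch. The class name, window length, and threshold are assumptions chosen for the example; they do not reflect the values or code that will be deployed.

```python
from collections import deque
from typing import Deque


class BindRestartMonitor:
    """Hypothetical monitor: alarm when restarts in a time window exceed a limit."""

    def __init__(self, window_seconds: float, max_restarts: int) -> None:
        self.window_seconds = window_seconds
        self.max_restarts = max_restarts
        self._restarts: Deque[float] = deque()

    def record_restart(self, timestamp: float) -> bool:
        """Record one bind restart and return True if the alarm should fire."""
        self._restarts.append(timestamp)
        # Drop restarts that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._restarts and self._restarts[0] <= cutoff:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts


# Assumed settings: more than 3 restarts within 60 seconds raises the alarm.
monitor = BindRestartMonitor(window_seconds=60, max_restarts=3)
results = [monitor.record_restart(t) for t in (0, 10, 20, 30)]
later = monitor.record_restart(200)
```

In this example the fourth restart inside the window trips the alarm, and a single restart much later does not, because the earlier ones have aged out.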
Posted Jan 31, 2023 - 14:01 EST

Resolved
We sincerely apologize for the recent outage you may have experienced, in which inbound SMS traffic to SMPP customers stopped after a faulty application version was deployed. Please review the post-mortem for what happened and what we learned from the situation.
Posted Jan 30, 2023 - 05:00 EST