Inbound SMS Messaging Not Being Delivered
Incident Report for TSG Global
Postmortem

We apologize for the issues you may have encountered with inbound SMS delivery yesterday morning. Please find the RFO/postmortem below. If you have any additional questions, please email us at: support@tsgglobal.com

SMS Inbound Partial Outage

Overview

In the early AM, inbound SMS traffic to customers was reduced due to a message queuing system configuration error. Once the issue was identified around 9:45 AM PST, it was resolved by deleting and recreating the queues by 11:05 AM PST.

What Happened

Due to a known bug in the previous version of our message queuing engine, we had to perform an emergency update early in the AM to avoid a potential full-disk issue (which we were alerted to over the weekend). In short, some queue snapshot cleanups did not complete correctly, which was causing disk usage to increase rapidly. The version update was performed according to the official manual, and was supposed to cause zero impact or downtime due to built-in redundancy. After the update was performed, one “corrupt” queue had to be recreated to force the deletion of old snapshots and restore disk space. Once the queue was recreated, everything seemed normal: traffic was flowing and no alarms were triggered.

We were notified later in the morning by some TSG Global clients that some (but not all) inbound messages were being delayed or not received at all, while others were unaffected. The TSG Global Response Team hopped on a call within 5 minutes of the first report and began to diagnose the partial outage.

Resolution

The TSG Global team investigated both our aggregator partner's systems (they also had an unplanned extended maintenance window) and our internal systems. After some diagnosis, we determined that it was our message queuing system that was dropping/misrouting messages. To fix the issue, we reset the entire configuration by deleting and recreating the routing rules for our queues, which restored expected functionality and normal traffic flow.

Root Causes

There are two possible root causes:

  1. Deleting the “corrupt” queue, which is part of a “queues cluster”, caused routing issues.
  2. The queuing engine version upgrade itself had undocumented issues.

We are leaning towards the first root cause as the culprit due to the behaviors exhibited. The upgrade documentation does not mention this issue, it never occurred when we performed upgrades in the past, and it did not affect other queues with similar or identical configurations.

Impact

Some customers experienced delayed inbound deliveries or no inbound delivery at all for some part of their traffic.

What Went Well

  • Quick Response: the Team responded promptly to reports from customers, confirmed that more DLRs than SMS messages were being routed to customers, identified the queue causing the issue, and recreated it to restore normal operation.

What Didn’t Go So Well

  • We had no specific alerts in place to notice this partial outage (e.g. that some, but not all, queues were not receiving traffic, or that some unroutable errors were being received, or that our overall outbound traffic processing had deviated by a large percentage vs. a typical weekday morning).
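
As a rough illustration of the kinds of checks described above, the Python sketch below flags queues that received no traffic while sibling queues did, and flags overall volume that falls well below a baseline. The function names, queue names, and the 50% threshold are hypothetical and chosen only for illustration; this is not a description of TSG Global's actual monitoring.

```python
from typing import Dict, List


def silent_queues(counts: Dict[str, int]) -> List[str]:
    """Return queues that received nothing while at least one sibling did.

    `counts` maps a (hypothetical) queue name to the number of messages
    it received in the current time window.
    """
    if not any(counts.values()):
        # Nothing arrived anywhere: that is a full outage, not a partial one.
        return []
    return sorted(name for name, n in counts.items() if n == 0)


def volume_dropped(current: int, baseline: float, max_drop: float = 0.5) -> bool:
    """True when `current` is at least `max_drop` (as a fraction) below `baseline`.

    `baseline` would be something like the message count for the same
    window on a typical weekday morning (an assumption for this sketch).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline >= max_drop


if __name__ == "__main__":
    window = {"inbound-0": 120, "inbound-1": 0, "inbound-2": 95}
    print("silent queues:", silent_queues(window))
    print("volume dropped:", volume_dropped(current=400, baseline=1000))
```

Either condition could feed an alerting system; checking per-queue counts separately from total volume matters here because a partial outage may leave the total high enough to look healthy.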

Action Items For Our Team

  • PagerDuty alerts should be raised when a significant number of messages are not being routed properly, or when thresholds for normal business operations are unexpectedly not being met.
  • When using consistent hash exchange routing, we should investigate/build a “dead letter queue” to catch all messages that cannot be routed to our round-robin queues (it can be any of the existing queues, probably the one with the lowest index). We will add this to our roadmap.
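
The fallback idea in the last action item can be sketched in plain Python. This is a simplified model, not our queuing engine's API and not a reconstruction of the incident: it assumes a hash ring that may still point at a queue that no longer exists, and it sends such messages to the first live queue (the one with the lowest index). The class name, queue names, and parameters are all hypothetical.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _point(key: str) -> int:
    """Map a string to a position on the hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class FallbackHashRouter:
    """Consistent-hash routing with a fallback for unroutable messages."""

    def __init__(self, queues: List[str], live: Iterable[str], points_per_queue: int = 16):
        self.live = set(live)
        # Fallback is the lowest-index queue that is still live.
        candidates = [q for q in queues if q in self.live]
        if not candidates:
            raise ValueError("at least one live queue is required")
        self.fallback = candidates[0]
        # The ring covers every configured queue, including ones that are
        # no longer live, to model a stale routing table.
        ring = sorted(
            (_point(f"{q}#{i}"), q) for q in queues for i in range(points_per_queue)
        )
        self._points = [p for p, _ in ring]
        self._owners = [q for _, q in ring]

    def route(self, routing_key: str) -> Tuple[str, bool]:
        """Return (queue, rerouted); rerouted is True when the fallback was used."""
        idx = bisect.bisect_right(self._points, _point(routing_key)) % len(self._points)
        owner = self._owners[idx]
        if owner in self.live:
            return owner, False
        return self.fallback, True


if __name__ == "__main__":
    router = FallbackHashRouter(
        queues=["inbound-0", "inbound-1", "inbound-2"],
        live=["inbound-0", "inbound-2"],  # inbound-1 is missing in this example
    )
    rerouted = sum(router.route(f"msg-{n}")[1] for n in range(1000))
    print("messages sent to fallback:", rerouted)
```

Counting the rerouted messages, as in the usage example, would also give a natural metric to alert on: any non-zero value means the ring and the set of live queues disagree.
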
Posted Aug 22, 2023 - 13:37 EDT

Resolved
This incident has been resolved.
Posted Aug 21, 2023 - 16:34 EDT
Monitoring
The root cause has been identified and, working with our peering partner, a fix has been implemented. Some messages may have been corrupted and not delivered to your endpoint. An official post-mortem will be available by COB Tuesday, 08/22/2023. We will continue to monitor to ensure message delivery continues as expected.
Posted Aug 21, 2023 - 14:22 EDT
Update
We are continuing to investigate this issue.
Posted Aug 21, 2023 - 13:30 EDT
Update
We are continuing to investigate this issue.
Posted Aug 21, 2023 - 13:29 EDT
Investigating
We are currently investigating an issue with inbound messages not being delivered. We will update once we have more information.
Posted Aug 21, 2023 - 13:15 EDT
This incident affected: Messaging Services (SMPP, Local Inbound/Outbound SMS, Toll-Free Inbound/Outbound SMS).