We apologize for the issues you may have encountered with inbound SMS delivery yesterday morning. Please find the RFO/postmortem below, and if you have any additional questions, please email us at: firstname.lastname@example.org
In the early AM, inbound SMS traffic to customers was reduced due to a message queuing system configuration error. Once the issue was identified around 9:45 AM PST, it was resolved by deleting and recreating the queues by 11:05 AM PST.
Due to a known bug in the previous version of our message queuing engine, we had to perform an emergency update early in the AM to avoid a potential full-disk issue (which we were alerted to over the weekend). In short, some queue snapshot cleanups did not complete correctly, which was causing disk usage to increase rapidly. The version update was performed according to the official manual and was supposed to cause ZERO impact or downtime due to built-in redundancy. After the update was performed, one “corrupt” queue had to be recreated to force the deletion of old snapshots and restore disk space. Once the queue was recreated, everything seemed normal: traffic was flowing and no alarms were triggered.
We were notified later in the morning by some TSG Global clients that some (but not all) inbound messages were being delayed or not received at all, while others were unaffected. The TSG Global Response Team hopped on a call within 5 minutes of the first report and began to diagnose the partial outage.
The TSG Global team investigated both our aggregator partners' systems (who also had an unplanned extended maintenance window) and our internal systems. After some diagnosis, we determined that it was our message queuing system that was dropping/misrouting messages. To fix the issue, we reset the entire configuration by deleting and recreating the routing rules for our queues, which restored expected functionality, and normal traffic resumed.
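For readers who want a concrete picture of what "deleting and recreating the routing rules" means, here is a minimal sketch. It uses a toy in-memory broker, not our production message queuing engine, and every name in it (ToyBroker, DESIRED_RULES, the routing keys and queue names) is hypothetical. The failure shown, a recreated queue that no longer has a routing rule pointing at it, is an illustrative assumption and not a confirmed root cause.

```python
from collections import defaultdict

# Hypothetical known-good routing definition: (routing key, queue name) pairs.
DESIRED_RULES = [
    ("sms.inbound.customer_a", "customer_a"),
    ("sms.inbound.customer_b", "customer_b"),
]


class ToyBroker:
    """Toy in-memory stand-in for a message broker (not a real engine's API)."""

    def __init__(self):
        self.queues = {}                # queue name -> list of delivered messages
        self.rules = defaultdict(list)  # routing key -> list of queue names

    def declare_queue(self, name):
        # Create the queue if it does not exist yet; keep its contents otherwise.
        self.queues.setdefault(name, [])

    def delete_queue(self, name):
        # Assumption: deleting a queue also drops every rule that targets it.
        self.queues.pop(name, None)
        for targets in self.rules.values():
            if name in targets:
                targets.remove(name)

    def add_rule(self, routing_key, queue):
        if queue not in self.rules[routing_key]:
            self.rules[routing_key].append(queue)

    def publish(self, routing_key, message):
        """Deliver to each existing queue bound to routing_key; return the count."""
        delivered = 0
        for queue in self.rules.get(routing_key, []):
            if queue in self.queues:
                self.queues[queue].append(message)
                delivered += 1
        return delivered  # 0 means the message was silently dropped

    def reset_routing(self, desired_rules):
        """Delete all routing rules, then recreate them from a known-good list."""
        self.rules.clear()
        for routing_key, queue in desired_rules:
            self.declare_queue(queue)
            self.add_rule(routing_key, queue)


def undelivered_keys(broker, desired_rules):
    """Publish one probe per routing key and list the keys that reach no queue."""
    return [key for key, _ in desired_rules if broker.publish(key, "probe") == 0]
```

In this toy model, recreating a single queue with `delete_queue` followed by `declare_queue` leaves that queue without a rule, so `undelivered_keys` reports its routing key while other customers' keys still deliver. Calling `reset_routing(DESIRED_RULES)` rebuilds every rule from one definition, which mirrors the partial impact and the full reset described above.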
There are two possible root causes:
We are leaning towards the first root cause as the culprit due to the behaviors exhibited. There is no mention of this issue in the upgrade documentation, and it never occurred during past upgrades, nor did it affect other queues with similar or identical configurations.
Some customers experienced delayed inbound deliveries or no inbound delivery at all for some part of their traffic.