Inbound/Outbound SMS & DLR Deliverability Outage
Incident Report for TSG Global
Postmortem

Hi there,

First and foremost, we apologize for the message deliverability issues experienced by SMS customers on September 29th, 2022, beginning around 3:15 PM PST. We understand message deliverability is an important part of many of our customers’ applications, and that every lost message, or minute of unavailability, impacts your business. Below is our postmortem of the Event, along with the Remediation Plan / Action Items we plan to take to prevent this particular issue from happening again.

If you have any questions about the below postmortem, or would like more details, please email the team at: support@tsgglobal.com

Overview

From about 3:10 PM PST to 5:51 PM PST on Thursday, September 29th, 2022, there was an outage in our SMS stack that impacted the transport of both inbound and outbound SMS messages, as well as delivery receipts (DLRs). The root cause was the crashing (and the unfortunate subsequent crash/reboot/crash cycles) of what we dub our “Nimbus” pipe, which handles all SMS messages. MMS messages and voice services are handled via a separate pipe/system, and were unaffected.

We use a Kubernetes-based architecture, and a crashed pod/application is usually able to recover without any human intervention, so we are investigating why that was not the case in this instance. After several attempts to reboot the service manually, and after disabling several other applications that may have been a catalyst for the Event, we discovered that this instance carried some legacy code relating to how the pod was mounted and configured. We implemented a quick fix by stripping these legacy boot requirements and deploying a new version, after which the pod was able to boot normally and resume traffic.

What Exactly Happened

At 3:10 PM PST, several automated alarms (Prometheus/Grafana feeding into PagerDuty/Slack) were triggered, notifying TSG Global staff that there was an issue with our SMS service (specifically, that a K8s pod was not healthy, followed by alarms that queues were growing and not emptying appropriately). TSG Global technical staff quickly began investigating. We attempted to reboot the pod and restore it to a healthy state, but it continued to crash and remain unavailable. We disabled several other applications associated with the Nimbus pipe that we suspected might be playing a role in the errors, ruled out any external/vendor issues, attempted to roll back to several older versions of the application, and rebooted the queue - and we were still experiencing crashes.

We then needed to dig into some legacy code associated with the application. Our “Nimbus” pipe is deployed as a singleton stateful set (STS) in Kubernetes (K8s), and the core issue was that it kept crashing and failing to boot. After additional review, we found that this particular STS was configured to mount its disk in 'ReadWriteOnce' mode, meaning only one K8s pod can access the disk at a time. Because the old pod (which crashed) did not release the disk correctly (which is still being investigated), the new pod that spawned was continuously unable to mount the disk and access the necessary data on it.
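For readers who want to see where this setting lives, below is a minimal sketch, written with the official Kubernetes Python client, of how a singleton StatefulSet’s volume claim template ends up requesting 'ReadWriteOnce'. The names and sizes are hypothetical and are not taken from our actual manifests.

    # For illustration only - a hypothetical volume claim template for a
    # singleton StatefulSet, built with the official Kubernetes Python client.
    from kubernetes import client

    volume_claim_template = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="nimbus-data"),   # hypothetical claim name
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],                  # only one pod can use the disk at a time
            resources=client.V1ResourceRequirements(
                requests={"storage": "10Gi"}                 # hypothetical size
            ),
        ),
    )

    # Attached to a StatefulSet with replicas=1, this means a replacement pod
    # cannot mount the disk until the crashed pod's claim is fully released.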

Resolution

Since the data read from the disk is not critical for normal operation, we published/deployed a quick fix that skips reading data from the disk on boot, which enabled the application to start successfully without accessing the disk. Once the “Nimbus” pipe booted properly, SMS messaging and DLRs resumed flowing as normal (and some queued messages were delivered in a “burst” of traffic immediately upon revival of the service).
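For the curious, the general shape of that quick fix is sketched below in Python. The function and path names are hypothetical and this is not the actual Nimbus code; it only illustrates the pattern of treating the boot-time disk read as best-effort.

    # For illustration only - everything below (names, paths) is hypothetical.
    import json
    import logging

    log = logging.getLogger("nimbus.boot")
    STATE_PATH = "/data/nimbus-state.json"   # hypothetical location of the non-critical data


    def load_optional_state(path: str = STATE_PATH) -> dict:
        """Best-effort read of cached state; boot continues with defaults on any failure."""
        try:
            with open(path) as fh:
                return json.load(fh)
        except (OSError, ValueError) as exc:
            # Covers the disk being absent/unmountable as well as corrupt contents.
            log.warning("Skipping non-critical boot state (%s); starting with defaults", exc)
            return {}


    def boot() -> None:
        state = load_optional_state()
        log.info("Booting with %d cached entries", len(state))
        # ... continue normal startup: bind connections, attach to queues, etc.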

Root Cause

The “Nimbus” application was unable to read data from the mounted disk because the disk was not properly released by the previous instance that crashed. Crashes can always happen, but resources should be released rather than left attached to a “zombie” instance that did not shut down completely - otherwise, a new healthy instance is prevented from spawning and booting appropriately.
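While that investigation continues, one way to spot this kind of lingering attachment from outside the application is to list the cluster’s VolumeAttachment objects and check which node still holds the volume. Below is a minimal sketch using the official Kubernetes Python client; nothing in it is specific to our cluster.

    # For illustration only - list VolumeAttachment objects to see which node
    # still holds a persistent volume after its pod has crashed.
    from kubernetes import client, config

    config.load_kube_config()          # or config.load_incluster_config() when run in-cluster
    storage = client.StorageV1Api()

    for attachment in storage.list_volume_attachment().items:
        pv_name = attachment.spec.source.persistent_volume_name
        attached = attachment.status.attached if attachment.status else None
        print(f"{attachment.metadata.name}: pv={pv_name} "
              f"node={attachment.spec.node_name} attached={attached}")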

Impact

  • Sending outbound SMS messages was delayed for all clients during the outage period
  • Receiving inbound SMS messages and DLRs was delayed for all clients during the outage period
  • While attempting to troubleshoot and resolve the issue, one group of inbound (MO) messages was intentionally dropped from the queues and was never delivered. However, clients can still access those messages in the TSG system via API, since they were stored in our databases.

What Went Well

  • Pre-programmed alerts in Slack/PagerDuty went off immediately as a result of the Event, and on-call staff promptly notified the relevant parties of the severity level so that we could work towards resolution.
  • Investigation and work towards resolving the issue started within minutes of the Event first being triggered.

What Did Not Go So Well

  • Determining the root cause took much longer than we wanted it to.

    • There was some misleading information in the logs, as well as a human factor in misreading some of them. It was also very, very early in the morning for the individuals handling this particular issue.
  • Once the root cause was identified, the team lost time waiting in the belief that the crashed instance would eventually release the resource and the new one would be able to boot.

  • The final resolution - disabling the problematic part of the boot sequence - could have been applied much earlier, since that particular step was not critical to the overall health of the pod or the function of our services.

Our Remediation Plan / Action Items

  • Review all existing settings and current architecture to prevent similar issues in the future.

    • We commit to immediately reviewing all other applications deployed to K8s and noting any inconsistencies in how those applications boot/deploy (a sketch of this kind of review follows this list). We also commit to longer-term architectural changes - first rigorously tested in our staging environment - including possible fail-over applications to provide better redundancy.
    • We will also investigate/scope what is needed to replay messages back to clients in the event a queue does need to be emptied for any reason, so that customer applications can continue to function as expected without manual intervention.
  • Clearer logging and less noise.

    • We have employed AWS CloudWatch, Bugsnag, and Prometheus/Grafana in various capacities to alert our team about issues and support investigation. We need to consolidate around one solution in both our staging and production environments to allow for better-quality logging and a simpler investigation process.
    • Consolidate alerts to our developer-managed email inbox.
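As an example of the kind of review described in the first action item above, the sketch below scans every StatefulSet in a cluster and flags volume claim templates that request 'ReadWriteOnce' - the configuration that allowed this Event to occur. It uses the official Kubernetes Python client and is an illustrative starting point, not our final tooling.

    # For illustration only - flag StatefulSet volume claim templates that
    # request 'ReadWriteOnce'. Nothing here is specific to our environment.
    from kubernetes import client, config

    config.load_kube_config()          # or config.load_incluster_config() when run in-cluster
    apps = client.AppsV1Api()

    for sts in apps.list_stateful_set_for_all_namespaces().items:
        for vct in sts.spec.volume_claim_templates or []:
            modes = vct.spec.access_modes or []
            if "ReadWriteOnce" in modes:
                print(f"{sts.metadata.namespace}/{sts.metadata.name} "
                      f"claim '{vct.metadata.name}' uses access modes {modes}")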
Posted Oct 03, 2022 - 18:15 EDT

Resolved
A fix has been implemented and messages should be flowing normally. We will continue monitoring for any additional issues.
Posted Sep 29, 2022 - 20:57 EDT
Identified
We have identified the core services affected and we are continuing to work on a fix. No ETA at this time. Stay tuned for updates.
Posted Sep 29, 2022 - 19:56 EDT
Investigating
We are currently investigating this issue.
Posted Sep 29, 2022 - 18:52 EDT
This incident affected: Messaging Services (SMPP, Local Inbound/Outbound SMS, Toll-Free Inbound/Outbound SMS, Local Inbound/Outbound MMS, Toll-Free Inbound/Outbound MMS).