Hi there,
First and foremost, we apologize for the message deliverability issues experienced by SMS customers on September 29th, 2022, beginning around 3:10 PM PST. We understand message deliverability is an important aspect of many of our customers’ applications, and that every lost message, or minute of unavailability, impacts your business. Below are our postmortem of the Event and the Remediation Plan / Action Items we plan to take in order to prevent this particular issue from happening again.
If you have any questions about the postmortem below, or would like more details, please email the team at support@tsgglobal.com
From about 3:10 PM PST to 5:51 PM PST on Thursday, September 29th, 2022, there was an outage related to our SMS stack that impacted the transport of both inbound and outbound SMS messages, as well as delivery receipts (DLRs). The root cause was the crash (and unfortunate subsequent crash/reboot/crash cycles) of what we dub our “Nimbus” pipe, which handles all SMS messages. MMS messages and voice services are handled via a separate pipe/system, and were unaffected.
We use a Kubernetes-based architecture, and usually a crashed pod/application is able to recover without any human intervention, so we need to investigate why that was not the case in this instance. After several attempts to reboot the service manually, and after disabling several other applications that may have been a catalyst for the Event, we discovered that this particular instance had legacy code relating to how the pod’s disk was mounted and configured. We implemented a quick fix by stripping these legacy boot requirements and deploying a new version, after which the pod was able to boot normally and resume traffic.
At 3:10 PM PST, several automated alarms (Prometheus/Grafana feeding into PagerDuty/Slack) were triggered, notifying TSG Global staff that there was an issue with our SMS service (specifically, that a K8s pod was not healthy, followed by alarms that queues were growing and not emptying appropriately). TSG Global technical staff quickly began investigating the issue.

We attempted to reboot the pod and return it to a healthy state, but it continued to crash and remain unavailable. We disabled several other applications associated with the Nimbus pipe that we suspected might be playing a role in the errors, ruled out any external/vendor issues, attempted to roll back to several older versions of the application, and rebooted the queue - and we were still experiencing crashes.

We then needed to dig into some legacy code associated with the application. Our “Nimbus” pipe is an application that is deployed as a singleton stateful set (STS) in Kubernetes (K8s). Again, the core issue was that it was crashing and continuously failing to boot. After additional review, we found that this particular STS was configured to mount its disk in 'ReadWriteOnce' mode, meaning only one K8s pod can access the disk at a time. As a result, because the old pod (which crashed) did not release the disk correctly (why is still being investigated), the new pod that spawned was continuously unable to mount the disk and access the necessary data on it.
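For readers less familiar with this failure mode, below is a minimal, illustrative sketch of that kind of configuration, built with the official Kubernetes Python client. This is not our actual manifest; the names, image, and storage size are placeholders. The key detail is the 'ReadWriteOnce' access mode on the volume claim, which is what prevents a replacement pod from attaching the disk while a crashed pod still holds it.

```python
# Illustrative sketch of a singleton StatefulSet whose volumeClaimTemplate uses
# the "ReadWriteOnce" access mode (placeholder names, not TSG Global's manifest).
from kubernetes import client

pvc_template = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="nimbus-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        # Only one pod/node may mount this volume read-write at a time, so a
        # replacement pod cannot attach it until the old pod releases it.
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

sts = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="nimbus"),
    spec=client.V1StatefulSetSpec(
        service_name="nimbus",
        replicas=1,  # deployed as a singleton
        selector=client.V1LabelSelector(match_labels={"app": "nimbus"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "nimbus"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="nimbus",
                        image="registry.example.com/nimbus:latest",  # placeholder
                        volume_mounts=[
                            client.V1VolumeMount(
                                name="nimbus-data", mount_path="/var/lib/nimbus"
                            )
                        ],
                    )
                ]
            ),
        ),
        volume_claim_templates=[pvc_template],
    ),
)

# The object could then be applied with, e.g.:
# client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=sts)
```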
Since the data read from the disk is not critical for normal operation, a quick fix was published/deployed that omitted reading data from the disk on boot, which enabled the application to boot successfully without accessing the disk. Once the “Nimbus” pipe was booted properly, SMS messaging and DLRs resumed flowing as normal (and some queued messages were delivered in a “burst” of traffic immediately upon revival of the service).
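To make the nature of the fix concrete, here is a hypothetical sketch in Python (the real “Nimbus” service is not necessarily written in Python, and the function and path names below are invented for illustration): the cached data on the mounted disk is treated as optional, so a failed read no longer blocks the boot sequence.

```python
# Hypothetical sketch of the quick fix: the boot-time disk read becomes optional.
import logging

CACHE_PATH = "/var/lib/nimbus/cache.db"  # placeholder path

def load_boot_cache(path: str = CACHE_PATH) -> bytes | None:
    """Try to read non-critical cached data; never fail the boot on error."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        # Disk unavailable (e.g. volume still attached to a crashed pod):
        # log it and continue booting without the cached data.
        logging.warning("Skipping boot cache at %s: %s", path, exc)
        return None

def boot() -> None:
    cache = load_boot_cache()
    if cache is None:
        logging.info("Booting without cached data; service remains functional.")
    # ... continue the rest of the boot sequence and start accepting traffic
```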
The “Nimbus” application was unable to read data from the mounted disk because the disk was not properly released by the previous instance that crashed. Crashing is something that can always happen, but resources should be released and not left attached to a “zombie” instance that did not shut down completely - thereby preventing a new healthy instance from spawning and booting appropriately.
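As part of the ongoing investigation, one way to confirm whether a disk is still held by a crashed instance is to inspect the cluster’s VolumeAttachment objects. The sketch below assumes the official Kubernetes Python client and valid cluster credentials; the PersistentVolume name is a placeholder.

```python
# Diagnostic sketch: list VolumeAttachments for a given PersistentVolume to see
# whether it is still attached to a node after its pod has crashed.
from kubernetes import client, config

def find_attachments(pv_name: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    storage = client.StorageV1Api()
    for va in storage.list_volume_attachment().items:
        if va.spec.source.persistent_volume_name == pv_name:
            attached = bool(va.status and va.status.attached)
            print(f"{va.metadata.name}: node={va.spec.node_name} attached={attached}")

# Example (placeholder PV name):
# find_attachments("pvc-1234abcd")
```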
Determining the root cause took much longer than we wanted it to.
Once the root cause was identified, the team lost time believing the crashed instance would eventually release the resource and that the new pod would then be able to boot.
The final resolution of disabling the problematic part of the code could have been applied much earlier, since that particular part of the boot sequence was not critical to the overall health of the pod or to the function of our services.
Review all existing settings and current architecture to prevent similar issues in the future.
Provide clearer logging and reduce noise.