Inbound SMS to webhook customers stopped
Incident Report for TSG Global
Inbound SMS traffic to webhook customers partially stopped due to database read replica lag.

What happened
An increase in our database read replica lag caused inbound traffic to webhook customers to stop. Our SMS application could not fetch messages from the database because, during the lag spike, those messages had not yet reached the read replica.

As soon as the issue was identified, the quickest resolution was to deploy a hotfix that reconfigured all applications to read from the writer instance as a temporary solution. Later, a second hotfix was implemented so that applications read from the writer instance as a fallback when a record is not found in the reader instance, in case the lag ever increases again.
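
To illustrate the fallback behavior described above, here is a minimal sketch in Python. It is not our production code: the function name `read_with_writer_fallback` and the two lookup callables are hypothetical stand-ins for whatever data-access layer an application uses, and the sketch assumes a lookup returns `None` when the record is not found.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_with_writer_fallback(
    message_id: str,
    read_from_reader: Callable[[str], Optional[T]],
    read_from_writer: Callable[[str], Optional[T]],
) -> Optional[T]:
    """Look up a record on the reader instance, falling back to the writer.

    Both callables are hypothetical: each takes a message ID and returns
    the record, or None if the record is not present on that instance.
    """
    # Normal path: serve the read from the reader instance.
    record = read_from_reader(message_id)
    if record is not None:
        return record
    # The record may exist but not yet be replicated (replica lag),
    # so retry the same lookup against the writer instance.
    return read_from_writer(message_id)


# Usage with in-memory dicts standing in for the two database instances.
reader_rows: dict = {}  # replica is lagging and has not received the row yet
writer_rows: dict = {"msg-1": {"body": "hello"}}

found = read_with_writer_fallback("msg-1", reader_rows.get, writer_rows.get)
missing = read_with_writer_fallback("msg-2", reader_rows.get, writer_rows.get)
```

In this sketch the writer instance only receives read traffic for records that the reader instance cannot find, so most reads stay on the reader instance while the lag is low.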

Root Causes
The root cause was an increase in database read replica lag. Applications were processing messages faster than records were propagated to the read replica. When applications tried to fetch messages that were not yet available, those messages went into the retry queue and were delivered with a long delay.

Some inbound HTTP webhook traffic was delayed during the evening and early morning hours (PST) between 6/15/23 and 6/16/23.

What did we learn?
Because the outage was only partial, our existing metrics and alarms did not catch the issue or escalate it appropriately. We have added metrics and new alarms for this kind of issue to help prevent it from occurring again. We will also perform database maintenance in the near future to address the root cause.
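
As one illustration of an alarm for this kind of issue, the sketch below fires when replica lag stays above a threshold for several consecutive samples. It is only an assumption about how such a check could look, not a description of the alarms we deployed: the function name `lag_alarm_should_fire`, the 5-second threshold, and the 3-sample window are all hypothetical.

```python
from typing import Sequence


def lag_alarm_should_fire(
    lag_samples_seconds: Sequence[float],
    threshold_seconds: float = 5.0,  # hypothetical threshold
    consecutive_breaches: int = 3,   # hypothetical window size
) -> bool:
    """Return True when the most recent samples all exceed the threshold.

    Requiring several consecutive breaches avoids paging on a single
    short spike while still catching a sustained increase in lag.
    """
    if consecutive_breaches <= 0:
        raise ValueError("consecutive_breaches must be positive")
    if len(lag_samples_seconds) < consecutive_breaches:
        # Not enough data yet to decide.
        return False
    recent = lag_samples_seconds[-consecutive_breaches:]
    return all(sample > threshold_seconds for sample in recent)


# Usage: one isolated spike does not fire, a sustained increase does.
single_spike = lag_alarm_should_fire([0.2, 0.3, 9.0])
sustained = lag_alarm_should_fire([0.2, 8.0, 9.0, 12.0])
```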
Posted Jun 15, 2023 - 23:00 EDT