Overview Inbound SMS traffic to webhook customers partially stopped because of database read replica lag.
What happened Due to an increase in our database read replica lag, inbound traffic to webhook customers stopped. Our SMS application could not fetch messages from the database because, during the lag spike, those messages had not yet been replicated to the read replica.
Resolution Once the issue was identified, the quickest resolution was to deploy a hotfix that reconfigured all applications to read from the writer instance as a temporary solution. A later hotfix changed the applications to read from the reader instance first and fall back to the writer instance only when the record is not found there, so that messages can still be fetched if the lag increases again.
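The fallback read described in the resolution can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name `fetch_with_writer_fallback`, the two fetch callables, and the in-memory dictionaries standing in for the reader and writer instances are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical types: a record is a plain dict, and a fetch function
# returns the record for a message ID or None when it is not found.
Record = dict
Fetch = Callable[[str], Optional[Record]]


def fetch_with_writer_fallback(
    message_id: str,
    read_from_reader: Fetch,
    read_from_writer: Fetch,
) -> Optional[Record]:
    """Try the reader instance first; fall back to the writer on a miss."""
    record = read_from_reader(message_id)
    if record is not None:
        return record
    # The record may exist but not be replicated yet, so ask the writer.
    return read_from_writer(message_id)


# Simulated lag: the writer already has the message, the reader does not.
writer_store = {"msg-1": {"body": "hello"}}
reader_store = {}

result = fetch_with_writer_fallback("msg-1", reader_store.get, writer_store.get)
missing = fetch_with_writer_fallback("msg-2", reader_store.get, writer_store.get)
```

In this sketch the writer is only queried on a miss, so normal reads stay on the reader instance and the writer takes extra load only while the replica is behind.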
Root Causes The root cause was an increase in database read replica lag. Applications were processing messages faster than the records were propagated to the read replica. When an application tried to fetch a message that was not yet available, the message went into the retry queue and was delivered after a long delay.
Impact Some HTTP webhook inbound traffic was delayed in the evening/early AM hours PST between 6/15/23 and 6/16/23.
What did we learn? Because the outage was only partial, our existing metrics and alarms did not catch the issue or escalate it appropriately. We have added metrics and new alarms that alert on this kind of issue to help prevent it from occurring again. We will also perform database maintenance in the near future to address the root cause.
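As an illustration of the kind of alarm that targets this failure mode, the sketch below fires when replica lag stays above a threshold for several consecutive samples. The metrics and alarms we actually added are not described here, so the function name `should_alarm`, the 5-second threshold, and the three-sample window are assumptions chosen only for the example.

```python
from typing import Sequence


def should_alarm(
    lag_samples_seconds: Sequence[float],
    threshold_seconds: float = 5.0,  # assumed threshold, not a real setting
    min_breaches: int = 3,           # assumed number of consecutive samples
) -> bool:
    """Return True when the most recent samples all exceed the threshold."""
    recent = list(lag_samples_seconds)[-min_breaches:]
    if len(recent) < min_breaches:
        # Not enough data yet to decide.
        return False
    return all(sample > threshold_seconds for sample in recent)


sustained_spike = should_alarm([0.1, 6.0, 7.0, 8.0])
brief_spike = should_alarm([6.0, 7.0, 0.2])
```

Requiring several consecutive breaches avoids paging on a single brief spike while still catching a sustained lag increase, even when only part of the traffic is affected.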