At work, I work on a platform where users create and run tasks that do some arbitrary computation, like ingesting from a database or calling a REST API. Users can subscribe to the state of these tasks, so that if a task fails (or succeeds) they get notified by email. At the time, our system sent an email as soon as an event was received in the backend, which worked fine, but as our platform scaled up we had an incident where hundreds of tasks failed at the same time due to a transient error, causing every subscribed user to receive a flood of emails in their inbox. At that point, a batching solution became an immediate priority to avoid this kind of thing in the future.
Design Overview
So I started tinkering. Our events were coming in from an Azure Service Bus queue, and when a message was received we would process it immediately. So the first thing I did was create a couple of new tables in our database.
Notification Table
notification_id | event_type | timestamp |
---|---|---|
uniqueidentifier | varchar(30) | datetimeoffset |
Notification Batch Table
notification_batch_id | batch_key | closes_at | processed_at |
---|---|---|---|
uniqueidentifier | varchar(30) | datetimeoffset | datetimeoffset |
Notification Batch Notification Table (Linking Table)
notification_id | notification_batch_id |
---|---|
uniqueidentifier | uniqueidentifier |
Now, instead of processing the message from the queue straight away, we pop it and create a notification record in the Notification table. Then we try to add the notification to a currently open batch; if no such batch exists, we create a new empty batch with a default expiry of 5 minutes. The batch_key is important, because we need some way to classify a batch as belonging to an org and an environment: notifications for the development environment of one org should not be mixed with notifications from the production environment of another org. There are then two cron jobs. The first one runs every 5 minutes, grabs each open batch, processes it, and marks it as processed by updating the processed_at column in the DB to the current timestamp. The second one cleans up the database records that have already been processed, which includes the notification records and the notification batches. Since notifications and notification batches are in a many-to-many relationship, we have a linking table with the composite key of notification_id and notification_batch_id.
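To make the flow concrete, here is a minimal sketch of the queue handler in Python. It's illustrative only: the in-memory dictionaries stand in for the Notification and NotificationBatch tables, and the function names and the "{org}:{environment}" key format are assumptions, not the actual implementation.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

BATCH_WINDOW = timedelta(minutes=5)

@dataclass
class Batch:
    batch_id: uuid.UUID
    batch_key: str                      # e.g. "{org_id}:{environment}" (assumed format)
    closes_at: datetime
    notification_ids: list = field(default_factory=list)

notifications: dict[uuid.UUID, dict] = {}   # stand-in for the Notification table
open_batches: dict[str, Batch] = {}         # stand-in for the NotificationBatch table

def handle_message(event_type: str, org_id: str, environment: str) -> uuid.UUID:
    """Called when a message is popped from the Service Bus queue."""
    now = datetime.now(timezone.utc)

    # Create the notification record first.
    notification_id = uuid.uuid4()
    notifications[notification_id] = {"event_type": event_type, "timestamp": now}

    # Reuse the open batch for this org/environment if its window hasn't closed yet,
    # otherwise open a new empty batch with a default expiry of 5 minutes.
    batch_key = f"{org_id}:{environment}"
    batch = open_batches.get(batch_key)
    if batch is None or batch.closes_at <= now:
        batch = Batch(uuid.uuid4(), batch_key, now + BATCH_WINDOW)
        open_batches[batch_key] = batch

    # Row in the linking table: (notification_id, notification_batch_id).
    batch.notification_ids.append(notification_id)
    return notification_id
```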
Sending emails
Let's talk a little more about the first background job. When it comes to actually sending emails, there is a little bit of computation we have to do. Let's say the batch we're about to process has the following notifications:
[
{ event_type = "Task Failed", recipients = ["bob@acme.com", "sarah@acme.com", "john@acme.com"], timestamp = 11231231 },
{ event_type = "Task Succeeded", recipients = ["bob@acme.com"], timestamp = 1234211222 },
{ event_type = "Task Failed", recipients = ["sarah@acme.com", "john@acme.com"], timestamp = 112313125 },
]
Now, how are we going to send the emails? Sending one email for each notification kind of defeats the purpose of this whole thing, so that's a no go. We could group the notifications by recipient, so we send three emails, to bob, sarah and john individually, each including the notifications they've subscribed to. So like this:
emails_to_send = [
{ [{ event_type = "Task Failed", timestamp = 11231231 }, { event_type = "Task Succeeded", timestamp = 1234211222 }], recipients = ["bob@acme.com"] },
{ [{ event_type = "Task Failed", timestamp = 11231231 }, { event_type = "Task Failed", timestamp = 112313125 }], recipients = ["sarah@acme.com"] },
{ [{ event_type = "Task Failed", timestamp = 11231231 }, { event_type = "Task Failed", timestamp = 112313125 }], recipients = ["john@acme.com"] },
]
This is a bit better: each recipient now gets multiple events in a single email, so there's some batching, but we're still sending three emails. Ideally, we want to minimise the number of emails we send, to reduce the load on our own servers too.
The solution I went with is to group the previous payload again, but this time on the notification events themselves, so we send one email to all the recipients that share the same set of events. This ends up looking like this:
emails_to_send = [
{ [{ event_type = "Task Failed", timestamp = 11231231 }, { event_type = "Task Succeeded", timestamp = 1234211222 }], recipients = ["bob@acme.com"] },
{ [{ event_type = "Task Failed", timestamp = 11231231 }, { event_type = "Task Failed", timestamp = 112313125 }], recipients = ["sarah@acme.com", "john@acme.com"] },
]
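The grouping itself boils down to two passes: collect the events per recipient, then merge recipients whose event lists are identical. Here is a rough Python sketch of that idea; the dictionary field names are made up for illustration and this isn't the production code.

```python
from collections import defaultdict

notifications = [
    {"event_type": "Task Failed", "recipients": ["bob@acme.com", "sarah@acme.com", "john@acme.com"], "timestamp": 11231231},
    {"event_type": "Task Succeeded", "recipients": ["bob@acme.com"], "timestamp": 1234211222},
    {"event_type": "Task Failed", "recipients": ["sarah@acme.com", "john@acme.com"], "timestamp": 112313125},
]

# Pass 1: group events by recipient.
events_by_recipient = defaultdict(list)
for n in notifications:
    for recipient in n["recipients"]:
        events_by_recipient[recipient].append((n["event_type"], n["timestamp"]))

# Pass 2: merge recipients that share the exact same list of events.
recipients_by_events = defaultdict(list)
for recipient, events in events_by_recipient.items():
    recipients_by_events[tuple(events)].append(recipient)

emails_to_send = [
    {"events": [{"event_type": e, "timestamp": t} for e, t in events], "recipients": recipients}
    for events, recipients in recipients_by_events.items()
]
# -> two emails: one for bob, one shared by sarah and john
```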
Optimisations
At this stage, the system works fine, but we can make some small tweaks to make things better.
The first one is the oldest trick in the book: caching. We cache the ID and key of each batch we open, and evict the entry when the batch window closes. This way we hit the database less when checking for an open batch: if one exists in the cache, we just use it.
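Conceptually the cache is just a keyed lookup that expires when the batch window closes. A hypothetical in-process version might look like the sketch below; in practice it could equally be something like Redis with a TTL.

```python
from datetime import datetime, timezone

# batch_key -> (batch_id, closes_at); entries are evicted once the window has closed
batch_cache: dict[str, tuple[str, datetime]] = {}

def get_open_batch_id(batch_key: str) -> str | None:
    entry = batch_cache.get(batch_key)
    if entry is None:
        return None                         # cache miss: fall back to the database
    batch_id, closes_at = entry
    if closes_at <= datetime.now(timezone.utc):
        del batch_cache[batch_key]          # window closed, evict the stale entry
        return None
    return batch_id
```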
The second one is more interesting. Let's say a batch window has just closed, but five more notifications come through with the same batch key, so now we have to open a new batch, close it in five minutes, and send emails again. That's fine, but in such situations we can infer that, for whatever reason, a lot of notifications are coming in. So we can extend the batch window by a fixed amount of time if a notification event arrives around the time a batch is about to close. We need to be careful here though: there is probably a storage limit (and also a UX factor) on how much we can actually send in one email, so we should put a cap on the number of notifications that can realistically end up in a single email.
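As a sketch, the check could look something like this; the extension amount, the "about to close" margin, and the cap are made-up numbers for illustration rather than the values we actually use.

```python
from datetime import datetime, timedelta, timezone

EXTENSION = timedelta(minutes=2)             # how much to push closes_at out by
CLOSE_MARGIN = timedelta(seconds=30)         # "about to close" threshold
MAX_NOTIFICATIONS_PER_BATCH = 100            # cap so a single email stays readable

def maybe_extend_window(closes_at: datetime, notification_count: int) -> datetime:
    """Extend the batch window if a notification arrives near the close, up to a cap."""
    now = datetime.now(timezone.utc)
    about_to_close = closes_at - now <= CLOSE_MARGIN
    if about_to_close and notification_count < MAX_NOTIFICATIONS_PER_BATCH:
        return closes_at + EXTENSION
    return closes_at
```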