Don’t Take Cover, Reduce the “Blast Radius” Instead

Eitan Yanovsky

December 15, 2021

At some point, things will go wrong in production. It’s inevitable, and you should plan for it to ensure that, when it does happen, you can contain the impact as much as possible. AWS coined the term Blast Radius as a mental model for the maximal impact that might be sustained in the event of a system failure.

When building a SaaS platform, especially a multi-tenant one, you should always think about the blast radius when making architecture decisions. You shouldn’t focus solely on making your architecture bullet-proof – that’s not practical. Underlying hardware will inevitably fail, bugs will be introduced, and you just can’t think of everything. Clearly, you should put emphasis on making the system robust to failures, but you should also think about how to minimize the collateral damage when something bad does happen.

You don’t want an architecture flaw – or a bug triggered by one or a few tenants – to choke the system for all the others. What happens if one tenant triggers an unexpected issue, causing a high load on the system? You want to build your platform foundations in a way that ensures other tenants are not affected by this, or at least that the impact is minimal.

In my previous article, I described our process of building an MVP for a mission-critical system, and how we de-risked a first release that had a comprehensive required feature set. Here, I walk through a few examples of the dilemmas and considerations you might face when weighing blast radius questions.

Let’s take one real-life example. In a B2B multi-tenant system, it is common to have a single database node shared by multiple tenants to utilize resources efficiently. But what happens if a bug is introduced in production that only manifests for one tenant? And what if that bug causes a high load on the database – many consecutive, poorly executed queries (for example, ones missing an index) that consume most of its CPU?

This design has a large blast radius since the impact is not contained to that tenant and effectively cripples the system for all users. Should we then have a database per tenant?

There is no right or wrong answer, as a database per tenant might be very cost-inefficient and harder to manage. But perhaps you can find the right balance between a single node and a database per tenant: use multiple shards and spread the tenants across them. Doing so contains such scenarios to a limited number of tenants instead of all of them, as in the sketch below.
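
Here is a minimal sketch of that middle ground, assuming a small, fixed pool of shards (the shard list, connection strings, and hashing scheme below are illustrative, not a prescription):

```typescript
// Deterministically map each tenant to one shard from a small pool.
// A runaway tenant can then saturate only its own shard, not the whole fleet.
const SHARDS = [
  "postgres://db-shard-0.internal/app",
  "postgres://db-shard-1.internal/app",
  "postgres://db-shard-2.internal/app",
  "postgres://db-shard-3.internal/app",
];

// Stable hash so a given tenant always lands on the same shard.
function hashTenantId(tenantId: string): number {
  let h = 0;
  for (const ch of tenantId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

export function shardFor(tenantId: string): string {
  return SHARDS[hashTenantId(tenantId) % SHARDS.length];
}

console.log(shardFor("acme")); // hypothetical tenant id
```

A static hash like this makes rebalancing harder, so in practice you might prefer an explicit tenant-to-shard lookup table; the point is only that a misbehaving tenant takes down one shard’s worth of tenants, not everyone.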

Another thing to consider – what happens if you need to do a full database restore for one tenant only, because its data got corrupted? It will be much easier if each tenant has its own database instance, instead of developing a partial recovery from snapshots that contain more than one tenant. A middle ground could be to give each tenant its own dedicated tables within a shared instance, as sketched below.
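
For example, if each tenant’s data lives in its own set of tables (the naming scheme below is purely hypothetical), a restore job can target one tenant’s tables without touching anyone else’s:

```typescript
// Minimal sketch of "dedicated tables per tenant": derive a tenant's table
// names from a shared list of logical tables plus a tenant prefix.
const BASE_TABLES = ["orders", "invoices", "audit_log"];

function tablesForTenant(tenantId: string): string[] {
  // e.g. "acme" -> ["tenant_acme_orders", "tenant_acme_invoices", ...]
  return BASE_TABLES.map((table) => `tenant_${tenantId}_${table}`);
}

// A restore of tenant "acme" only needs to rebuild these tables.
console.log(tablesForTenant("acme"));
```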

Another example: your backend server is Node.js running in a pod, and it has a code path that, given specific input, consumes a lot of CPU and holds the event loop for longer than you planned. While it’s not good practice to have this in your code, it sometimes happens accidentally, because the first time this type of input is encountered is in production, due to an unforeseen edge case. The Node server is now busy with one task and not serving the others, as the sketch below illustrates.
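
To make the failure mode concrete, here is a minimal sketch (the routes and the CPU-heavy loop are hypothetical stand-ins) of a synchronous code path holding the Node.js event loop:

```typescript
// A synchronous, CPU-bound code path blocks the single event loop,
// so every other request – from every other tenant – queues behind it.
import { createServer } from "node:http";

function expensiveSync(payload: string): number {
  let acc = 0;
  // Cost grows with the size of tenant-supplied input (the edge case).
  for (let i = 0; i < payload.length * 1_000_000; i++) {
    acc = (acc + i) % 997;
  }
  return acc;
}

createServer((req, res) => {
  if (req.url?.startsWith("/report")) {
    // While this runs, nothing else – not even /healthz – gets a response.
    res.end(String(expensiveSync("tenant-supplied payload")));
    return;
  }
  res.end("ok"); // e.g. /healthz
}).listen(3000);
```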

A possible solution would be to replace this service backend with a serverless function, such as AWS Lambda. This way, each invocation runs in its own container – and if one takes too long while more invocations are inbound, AWS will immediately spawn more Lambda instances, whose cold start is sub-second, with minimal effect on latency.

If you rely on Kubernetes to scale, by contrast, a cold start can take 20-30 seconds when a new node (server) is needed to host the pod. However, under constant high load, Lambda execution at some point becomes more expensive than dedicated pods – which needs to be taken into account as well.
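
A minimal sketch of the same endpoint as a Lambda handler, assuming the aws-lambda type definitions (the handler name and event wiring are illustrative):

```typescript
// Each concurrent invocation runs in its own execution environment, so a
// CPU-heavy request only slows down its own instance, not the other tenants.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

function expensiveSync(payload: string): number {
  let acc = 0;
  for (let i = 0; i < payload.length * 1_000_000; i++) {
    acc = (acc + i) % 997;
  }
  return acc;
}

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const result = expensiveSync(event.body ?? "");
  return { statusCode: 200, body: JSON.stringify({ result }) };
};
```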

These examples are just a taste of the challenges you might face – there are countless more. The important thing is to take this into consideration when planning or improving your architecture. Think about where you should put your “Fire Doors” and at what cost. There is no simple solution, and, as always, you need to weigh the pros and cons.

I consistently ask my team – what is the Blast Radius of this decision, and what can we do to minimize it?

 

Read more:

➤Building an MVP for a Nuclear Plant

➤To Open Source or Not To Open Source? That Is the Question

➤How to Make Smart Decisions When So Much Is Unknown

Topics: Engineering