Designing Building Systems - Part 2

Santosh Rangarajan
2 min read · Nov 7, 2020

This is the second article in a series on designing building systems. You can find the first part here

In this article, I would like to discuss resilience.

The source and inspiration for this article is the book Building Secure & Reliable Systems.

Resilience

When designing systems, the following points should be considered for resilience:

  • Use a layered architecture
  • Make each layer independently resilient
  • Prioritise features by value and cost, so that under load you know which ones to keep on and which to turn off when running in degraded mode. This switching should be automated, without human intervention
  • Compartmentalise systems with clear boundaries
  • Automate resilience measures
  • Maintain effectiveness with validations. If there are multiple instances of an application or service (a primary and an HA standby), make sure the HA instance is in sync, validated, tested, and ready to take over if the need arises.

Defence in Depth

Each layer in a system should provide defence. A classic example is the Trojan Horse. The entire attack could have been subverted had adequate protection been in place at various levels, for example:

  • Thorough monitoring and inspection of the horse, which was placed by an adversary, before bringing it inside the city walls.
  • Once inside, it could have been isolated and kept separate, to minimise any damage.
  • Once inside, the adversaries were able to open the gates easily for the rest of the army to penetrate the city. A more robust mechanism could have prevented this.

Translated into systems parlance, this can be interpreted as follows:

  • Threat modelling - monitor systems for port scans, application scans, and DNS registrations similar to yours; monitor your infrastructure
  • Deployment of the attack - monitor network traffic; use antivirus software, sandboxes, etc.
  • Execution of the attack - limit the blast radius

Controlling degradation

You should be able to select which features to keep on or off when running in degraded mode:

  • 2FA for a banking system might be too risky to turn off
  • Some backend jobs could be turned off and run later, once systems are back up
  • Switching to a different crypto algorithm can free up some resources
  • Monitoring and telemetry - perhaps they could be turned off?
  • Reports and similar features can be turned off
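As an illustration of the choices above, a degraded-mode policy can be sketched as a priority-ordered feature list that is trimmed automatically as load rises. The feature names, priorities, and load thresholds below are my own assumptions, not taken from the book:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    priority: int   # 1 = must stay on (e.g. 2FA); higher = safer to shed
    enabled: bool = True

# Hypothetical feature inventory for a banking system.
FEATURES = [
    Feature("2fa", priority=1),
    Feature("payments", priority=1),
    Feature("monitoring", priority=2),
    Feature("backend_batch_jobs", priority=3),
    Feature("reports", priority=4),
]

def apply_degraded_mode(features, load):
    """Disable features whose priority is too low for the current load.

    `load` is a 0.0-1.0 utilisation figure; the cutoffs are illustrative.
    """
    if load > 0.9:
        cutoff = 1      # keep only the essentials
    elif load > 0.7:
        cutoff = 2
    else:
        cutoff = 10     # normal operation: keep everything
    for f in features:
        f.enabled = f.priority <= cutoff
    return features

# At 95% load, only 2FA and payments stay on.
for f in apply_degraded_mode(FEATURES, load=0.95):
    print(f.name, f.enabled)
```

The key point is that the decision is data-driven and automatic: no human has to pick which features to sacrifice during an incident.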

The cost of failures should be evaluated in terms of:

  • User experience
  • Computing resources

Some techniques for controlling degradation:

  • Load Shedding
  • Throttling

The above measures should be automated and regulated.
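As a rough sketch of both techniques (my own illustration, not from the book), throttling can be implemented with a token bucket, and load shedding by rejecting new work outright once a queue depth limit is hit:

```python
import time

class TokenBucket:
    """Throttling: allow at most `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, then try to spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def shed_load(queue_depth, max_depth):
    """Load shedding: reject new work when the queue is already full."""
    return queue_depth >= max_depth

bucket = TokenBucket(rate=5, capacity=2)
print([bucket.allow() for _ in range(3)])  # burst of 2 allowed, third rejected
```

Throttling slows clients down gracefully, while load shedding drops work entirely; both keep the system inside its capacity envelope instead of letting it collapse.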

Controlling Blast Radius

Below are some techniques by which damage can be minimised:

  • Network Segmentation
  • Compartmentalising the event

Compartmentalisation can further be based on the following categories:

  • By location - isolating impact to one container, one rack, one DC, etc.
  • By role - isolating impact based on roles
  • By time - rotating keys after a fixed duration, so that a compromised key is only useful for that duration
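The time-based category can be sketched as periodic key derivation: a key is only valid for one time window, so a leaked key stops working when the window rolls over. The rotation period and derivation scheme below are illustrative assumptions:

```python
import hashlib
import hmac
import time

ROTATION_PERIOD = 24 * 3600  # rotate daily; illustrative choice

def current_key(master_secret, now=None):
    """Derive the key for the current time window from a master secret.

    If a derived key leaks, it is only useful until the window expires;
    the master secret itself never leaves the key-management layer.
    """
    if now is None:
        now = time.time()
    window = int(now // ROTATION_PERIOD)
    return hmac.new(master_secret, str(window).encode(), hashlib.sha256).digest()

secret = b"master-secret"
k_today = current_key(secret, now=0)
k_tomorrow = current_key(secret, now=ROTATION_PERIOD)
print(k_today != k_tomorrow)  # True: the key changes each window
```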

Controlling Redundancies

Automated failover should be implemented, keeping in mind that the HA standby stays in sync with the primary.
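A minimal sketch of such a failover loop, assuming a caller-supplied health check and an in-sync standby (all names here are hypothetical): if the primary fails its check, traffic is redirected to the standby, but only if the standby itself is healthy.

```python
class FailoverController:
    """Route to the primary while healthy; fail over to the HA standby otherwise.

    `is_healthy` is a caller-supplied health check (hypothetical); in practice
    it would probe the instance over the network with retries and timeouts.
    """
    def __init__(self, primary, standby, is_healthy):
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy
        self.active = primary

    def check_and_failover(self):
        if self.active is self.primary and not self.is_healthy(self.primary):
            # Only fail over to a standby that is itself healthy and in sync.
            if self.is_healthy(self.standby):
                self.active = self.standby
        return self.active

# Simulated health state: the primary goes down.
health = {"primary-db": False, "standby-db": True}
ctl = FailoverController("primary-db", "standby-db", lambda h: health[h])
print(ctl.check_and_failover())  # standby-db
```

This is where the earlier point about continuous validation pays off: failover is only safe if the standby has been kept in sync and repeatedly proven healthy.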

Continuous Validations

Validation is a continuous process and should be automated. Key points in this regard:

  • New failure modes should be identified and monitored
  • A validation should exist for each failure mode
  • Validators should be executed repeatedly
  • Validators should be phased out when the features they cover no longer exist
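The points above can be sketched as a small validator registry (my own illustration): validators are registered per failure mode, executed repeatedly, and retired when the feature they cover is removed. The failure-mode names and checks here are hypothetical.

```python
class ValidatorRegistry:
    def __init__(self):
        self.validators = {}  # failure mode -> check function returning bool

    def register(self, failure_mode, check):
        self.validators[failure_mode] = check

    def retire(self, failure_mode):
        # Phase out validators whose feature no longer exists.
        self.validators.pop(failure_mode, None)

    def run_all(self):
        """Execute every validator; return the failure modes whose check failed."""
        return [mode for mode, check in self.validators.items() if not check()]

reg = ValidatorRegistry()
# Hypothetical checks; real ones would probe the live system.
reg.register("ha_out_of_sync", lambda: True)   # HA replica is in sync
reg.register("cert_expired", lambda: False)    # certificate check fails
print(reg.run_all())  # ['cert_expired']
reg.retire("cert_expired")
print(reg.run_all())  # []
```

In production, `run_all` would be driven from a scheduler so validators execute continuously rather than once.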


Santosh Rangarajan

Software Engineer. Interests include distributed systems, data storage, and programming languages.