Designing Building Systems - Part 2

Santosh Rangarajan
2 min read · Nov 7, 2020

This is the second article in a series on designing building systems. You can find the first part here

In this article, I would like to discuss resilience.

The source and inspiration for this article is the book Building Secure & Reliable Systems.

Resilience

When designing systems, the following points should be considered for resilience:

  • Use a layered architecture
  • Make each layer independently resilient
  • Prioritise features by value and cost, so that under load you know which ones to keep on and which to turn off when running in degraded mode. This switching should be automated, without human intervention
  • Compartmentalise systems with clear boundaries
  • Automate resilience measures
  • Maintain effectiveness with validations. If there are multiple instances of an application or service (a primary and an HA standby), make sure the HA instance is in sync, validated, tested, and ready to take over if the need arises.

Defence in Depth

Each layer in a system should provide defence. A classic example is the Trojan Horse. The entire attack could have been subverted had adequate protection been in place at various levels, for example:

  • Thorough monitoring and inspection of the horse, which was placed by an adversary, before bringing it inside the city walls.
  • Once inside, it could have been isolated and kept separate, to minimise any damage.
  • Once inside, the adversaries were able to open the gates easily for the rest of the army to penetrate the city. A more robust mechanism could have prevented this.

Translated into systems parlance, this can be interpreted as follows:

  • Threat modelling - monitor systems for port scans, application scans, and DNS registrations similar to yours; monitor your infrastructure
  • Deployment of the attack - monitor network traffic; use antivirus software, sandboxes, etc.
  • Execution of the attack - limit the blast radius

Controlling degradation

You should be able to select which features to keep on or off when running in degraded mode:

  • 2FA for a banking system might be too risky to turn off
  • Some backend jobs could be turned off and run later, once systems are back up
  • Switching to a different crypto algorithm can free up some resources
  • Monitoring and telemetry - perhaps they could be turned off?
  • Reports and similar features can be turned off
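As an illustration of the choices above, a degraded-mode policy can be sketched as a priority-ordered feature list that is trimmed automatically as load rises. The feature names, priorities, and load thresholds below are my own assumptions, not taken from the book:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    priority: int   # 1 = must stay on (e.g. 2FA); higher = safer to shed
    enabled: bool = True

# Hypothetical feature inventory for a banking system.
FEATURES = [
    Feature("2fa", priority=1),
    Feature("payments", priority=1),
    Feature("monitoring", priority=2),
    Feature("backend_batch_jobs", priority=3),
    Feature("reports", priority=4),
]

def apply_degraded_mode(features, load):
    """Disable features whose priority is too low for the current load.

    `load` is a 0.0-1.0 utilisation figure; the cutoffs are illustrative.
    """
    if load > 0.9:
        cutoff = 1      # keep only the essentials
    elif load > 0.7:
        cutoff = 2
    else:
        cutoff = 10     # normal operation: keep everything
    for f in features:
        f.enabled = f.priority <= cutoff
    return features

# At 95% load, only 2FA and payments stay on.
for f in apply_degraded_mode(FEATURES, load=0.95):
    print(f.name, f.enabled)
```

The key point is that the decision is data-driven and automatic: no human has to pick which features to sacrifice during an incident.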

The cost of failures should be evaluated in terms of:

  • User experience
  • Computing resources

Some techniques for controlling degradation:

  • Load Shedding
  • Throttling

The above measures should be automated and regulated.
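As a rough sketch of both techniques (my own illustration, not from the book), throttling can be implemented with a token bucket, and load shedding by rejecting new work outright once a queue depth limit is hit:

```python
import time

class TokenBucket:
    """Throttling: allow at most `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, then try to spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def shed_load(queue_depth, max_depth):
    """Load shedding: reject new work when the queue is already full."""
    return queue_depth >= max_depth

bucket = TokenBucket(rate=5, capacity=2)
print([bucket.allow() for _ in range(3)])  # burst of 2 allowed, third rejected
```

Throttling slows clients down gracefully, while load shedding drops work entirely; both keep the system inside its capacity envelope instead of letting it collapse.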

Controlling Blast Radius

Below are some techniques by which damage can be minimised:

  • Network Segmentation
  • Compartmentalising the event

Compartmentalisation can further be based on the following categories:

  • By location - isolating impact to one container, one rack, one DC, etc.
  • By role - isolating impact based on roles
  • By time - rotating keys after a fixed duration, so that a compromised key is only useful for that duration
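The time-based category can be sketched as periodic key derivation: a key is only valid for one time window, so a leaked key stops working when the window rolls over. The rotation period and derivation scheme below are illustrative assumptions:

```python
import hashlib
import hmac
import time

ROTATION_PERIOD = 24 * 3600  # rotate daily; illustrative choice

def current_key(master_secret, now=None):
    """Derive the key for the current time window from a master secret.

    If a derived key leaks, it is only useful until the window expires;
    the master secret itself never leaves the key-management layer.
    """
    if now is None:
        now = time.time()
    window = int(now // ROTATION_PERIOD)
    return hmac.new(master_secret, str(window).encode(), hashlib.sha256).digest()

secret = b"master-secret"
k_today = current_key(secret, now=0)
k_tomorrow = current_key(secret, now=ROTATION_PERIOD)
print(k_today != k_tomorrow)  # True: the key changes each window
```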

Controlling Redundancies

Automated failover should be implemented, keeping in mind that the HA standby stays in sync with the primary.
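A minimal sketch of such a failover loop, assuming a caller-supplied health check and an in-sync standby (all names here are hypothetical): if the primary fails its check, traffic is redirected to the standby, but only if the standby itself is healthy.

```python
class FailoverController:
    """Route to the primary while healthy; fail over to the HA standby otherwise.

    `is_healthy` is a caller-supplied health check (hypothetical); in practice
    it would probe the instance over the network with retries and timeouts.
    """
    def __init__(self, primary, standby, is_healthy):
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy
        self.active = primary

    def check_and_failover(self):
        if self.active is self.primary and not self.is_healthy(self.primary):
            # Only fail over to a standby that is itself healthy and in sync.
            if self.is_healthy(self.standby):
                self.active = self.standby
        return self.active

# Simulated health state: the primary goes down.
health = {"primary-db": False, "standby-db": True}
ctl = FailoverController("primary-db", "standby-db", lambda h: health[h])
print(ctl.check_and_failover())  # standby-db
```

This is where the earlier point about continuous validation pays off: failover is only safe if the standby has been kept in sync and repeatedly proven healthy.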

Continuous Validations

Validation is a continuous process and should be automated. Key points in this regard:

  • New failure modes should be identified and monitored
  • A validation should exist for each failure mode
  • Validators should be executed repeatedly
  • Validators should be phased out when the features they cover no longer exist
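The points above can be sketched as a small validator registry (my own illustration): validators are registered per failure mode, executed repeatedly, and retired when the feature they cover is removed. The failure-mode names and checks here are hypothetical.

```python
class ValidatorRegistry:
    def __init__(self):
        self.validators = {}  # failure mode -> check function returning bool

    def register(self, failure_mode, check):
        self.validators[failure_mode] = check

    def retire(self, failure_mode):
        # Phase out validators whose feature no longer exists.
        self.validators.pop(failure_mode, None)

    def run_all(self):
        """Execute every validator; return the failure modes whose check failed."""
        return [mode for mode, check in self.validators.items() if not check()]

reg = ValidatorRegistry()
# Hypothetical checks; real ones would probe the live system.
reg.register("ha_out_of_sync", lambda: True)   # HA replica is in sync
reg.register("cert_expired", lambda: False)    # certificate check fails
print(reg.run_all())  # ['cert_expired']
reg.retire("cert_expired")
print(reg.run_all())  # []
```

In production, `run_all` would be driven from a scheduler so validators execute continuously rather than once.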


Santosh Rangarajan

Software Engineer. Interests include distributed systems, data storage, and programming languages.