Production Outage — Database deleted
One of the most dreadful incidents to hit one of our clients: a junior developer accidentally deleted the production database of a financial transaction system.
All data for that working day was lost. Historical data could be recovered from backup, but the current day's data was gone.
Someone had asked him to free up disk space, so he took the database offline and deleted the MDF file. He thought he was working on a UAT/backup system and wasn't aware that the database was the production instance.
There was a nightly backup, so we had data up to the previous night. However, about 12–14 hours of financial data was lost.
The delete happened in the evening, when business was about to close, so the immediate impact on customers was minimal per se. But there were other challenges:
- The system had to be up and running before the start of business the next day, at 7:00 in the morning
- Once recovered, the integrity of the system and its data had to be validated, which is the very essence of any financial system
How did we recover?
After following protocol, all systems were shut down to start the recovery process.
For the first 4–5 hours we tried (and purchased) various tools, available online and offline, to help recover the deleted data, but with no success.
Most tools gave the impression that data was being recovered, but after a couple of hours they would stop responding.
Only after those hours were spent did we realize we had to take a different path.
This was a banking system, and the client did have physical copies of receipts and payments. However, going this route would have meant shutting the business for a day or two, which was not viable.
Fortunately, just a couple of weeks before the incident we had released a Reliability feature, in which all queries executing in production were logged at the application level. This was, in essence, a transaction journal. You can read about it here.
So we had all the queries in log format; it was just a matter of processing and executing them.
In addition, there were channel-based transactions (which arrive through various file formats) that we could replay.
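To make the idea of an application-level transaction journal concrete, here is a minimal sketch in Python. It is not the product's actual implementation: the `JournaledConnection` class, the JSON-lines journal format, and the use of SQLite are all assumptions made for illustration.

```python
import json
import sqlite3
import time


class JournaledConnection:
    """Hypothetical wrapper: runs a statement, then appends it to a journal file."""

    def __init__(self, conn, journal_path):
        self.conn = conn
        self.journal_path = journal_path

    def execute(self, sql, params=()):
        cursor = self.conn.execute(sql, params)
        self.conn.commit()
        # Journal only after the commit, so the journal never contains
        # statements that the database rejected.
        entry = {"ts": time.time(), "sql": sql, "params": list(params)}
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(entry) + "\n")
        return cursor
```

Every statement that goes through `execute` ends up as one line in the journal file, with its timestamp and parameters. For the journal to be useful after a disaster, it would need to live on storage separate from the database files.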
So we followed the steps below:
- We restored the previous night's backup
- For non-channel-based transactions, we executed the queries from the transaction journal
- For channel-based transactions, we replayed the files
- More or less, we were able to recover the system to 98% of where it should have been
- All client employees were asked to tally the system against physical receipts the next day and to flag any discrepancies
- Business opened as usual at 7:00 AM the next day, with no visible impact whatsoever
However, those who worked through the night knew what they had come out of.
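The replay step of the recovery can be sketched roughly as follows. This is an illustration under assumptions, not the script we ran that night: it assumes a journal of JSON lines with hypothetical `ts`, `sql`, and `params` fields, uses SQLite as a stand-in database, and treats `since_ts` as the time the restored backup was taken.

```python
import json
import sqlite3


def replay_journal(conn, journal_lines, since_ts):
    """Re-execute journaled statements recorded after the backup was taken.

    Returns (applied, failed), where failed is a list of (entry, error message).
    """
    applied, failed = 0, []
    entries = [json.loads(line) for line in journal_lines if line.strip()]
    # Replay in original order; ordering matters for dependent statements.
    for entry in sorted(entries, key=lambda e: e["ts"]):
        if entry["ts"] <= since_ts:
            continue  # already contained in the restored backup
        try:
            conn.execute(entry["sql"], entry["params"])
            applied += 1
        except sqlite3.Error as exc:
            # Keep going, but record the failure for manual reconciliation.
            failed.append((entry, str(exc)))
    conn.commit()
    return applied, failed
```

Collecting failures instead of aborting on the first error is a deliberate choice in this sketch: with a hard deadline, it lets the bulk of the data come back while the failed entries are reconciled by hand, for example against physical receipts.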
As always after a major incident, some changes were made to avoid incidents like this. In essence:
- The transaction journal now ships out of the box with all installations of the product (all queries logged at the application level)
- Log shipping is enabled from day one to minimize data loss
- Stricter access policies were put in place, restricting developer access to the production database and clearly separating the production database from the test database
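Access policies are mostly a matter of permissions and process, but the "is this really not production?" check can also be enforced in tooling. The sketch below is purely hypothetical: the `check_destructive_action` function and the environment labels are assumptions for illustration, not part of our product or of any database tool.

```python
class ProductionGuardError(Exception):
    """Raised when a destructive action targets production without approval."""


def check_destructive_action(environment, action, confirmed_by=None):
    """Block destructive actions on production unless a second person approved."""
    is_production = environment.strip().lower() in {"prod", "production"}
    if is_production and not confirmed_by:
        raise ProductionGuardError(
            f"Refusing to {action} on {environment!r} without explicit approval"
        )
    return True
```

A maintenance script would call this before taking a database offline or deleting files, so that a mislabeled or misunderstood target fails loudly instead of silently proceeding.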