Santosh Rangarajan
3 min read · Oct 19, 2020


Systematically Troubleshooting Production Issues and Outages

“When you hear hoofbeats behind you, don’t expect to see a zebra”

If you Google the quote, it takes you to a Wikipedia article explaining that it is used in medical circles to teach novices to weigh symptoms against the evidence before jumping to conclusions about what is going on.

This quote applies well to troubleshooting in a production environment. I first came across it in the Google SRE Book, and it is very fitting.

Troubleshooting is a skill acquired through experience. Of course, architects and developers have insider information: they have the privilege (or curse) of having built the system, and they have insights that a maintainer may not be aware of. However, these insights are mostly useful in the early days. Over time, once the system stabilizes, everyone is on an equal footing: architect, developer, and maintainer.

Below are the usual suspects and the line of thinking we followed, in order of elimination, moving systematically across the stack to troubleshoot an issue.

Identifying the Problem ⇒ Recent Changes ⇒ Infrastructure ⇒ Database ⇒ Application ⇒ UI

1. Understanding the problem / Asking the right questions

  • What is the current behaviour, and what is the expected behaviour?
  • Users of a system tend to evaluate and suggest causes for problems, and these can be wrong. Be wary of getting lost or misdirected by them.
  • When was the first occurrence of the problem reported?
  • How frequently does it occur?
  • Is it an environment-specific issue? Can it be reproduced in QA?
  • Could load be a factor?
  • How serious is the problem? Can the resolution wait until close of business? How much damage and risk is there in continuing to run services in this state?

2. Any deployment, recent release, or other change made to production

  • If a deployment was made and the system behaved fine before it, there is roughly a 90% chance that the issue/incident is due to the release or change.
  • Was the deployment made correctly? Any missed configuration? Any missing library?
  • Can the problem be reasoned about, and is the fix easy to apply?
  • Can the new functionality/service be turned off via configuration or feature flags?
  • What is the cost of a rollback?
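
One way to check whether a release is the likely culprit is to compare error counts in equal time windows before and after the deployment. The sketch below is only an illustration: the function name, the one-hour window, and the sample timestamps are all assumptions, and in practice the error times would come from your own logs or monitoring.

```python
from datetime import datetime, timedelta


def errors_around_deploy(error_times, deploy_time, window=timedelta(hours=1)):
    """Count errors in equal windows just before and just after a deployment."""
    before = sum(1 for t in error_times if deploy_time - window <= t < deploy_time)
    after = sum(1 for t in error_times if deploy_time <= t < deploy_time + window)
    return before, after


# Hypothetical data: one error before the release, three after it.
deploy = datetime(2020, 10, 19, 12, 0)
errors = [
    datetime(2020, 10, 19, 11, 30),
    datetime(2020, 10, 19, 12, 5),
    datetime(2020, 10, 19, 12, 20),
    datetime(2020, 10, 19, 12, 45),
]

before, after = errors_around_deploy(errors, deploy)
print(f"errors before deploy: {before}, after deploy: {after}")
```

A sharp jump after the deployment does not prove the release is at fault, but it is a strong hint that rollback or a feature flag should be considered first.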

3. Infrastructure issues

  • Disk space: application servers, database servers, web servers
  • Memory: application servers, database servers, web servers
  • Is bandwidth a cause for concern?
  • Any firewall policy changes?
  • Any router or switch configuration changes?
  • Any upgrades or patches installed on the system?
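
As a rough sketch of the disk space check, the snippet below uses Python's standard `shutil.disk_usage` to report how full a filesystem is. The 90% threshold and the helper names are assumptions chosen for illustration; memory, bandwidth, and network checks would need other tools and are not covered here.

```python
import shutil


def used_percent(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def check_disk(path="/", threshold_percent=90.0):
    """Return (percent used, True if usage is at or above the threshold)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = used_percent(usage.total, usage.used)
    return pct, pct >= threshold_percent


pct, alarm = check_disk("/")
print(f"/ is {pct:.1f}% full" + (" (above threshold)" if alarm else ""))
```

Running the same check on each application, database, and web server quickly rules disk space in or out before moving further up the stack.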

4. Database

  • Any new tables added, or the structure of existing tables changed?
  • Any new stored procedures created?
  • Is any query getting stuck repeatedly?
  • Are multiple queries blocking each other?
  • If a new query is stuck, have appropriate indexes been created?
  • If an existing query is stuck, could it be due to fragmentation, data size, or a missing index?
  • Does the query plan provide any insights?
  • How quickly are queries completing?
  • Any anomaly in the size of the log files?
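
When multiple queries block each other, it helps to find the session at the head of each blocking chain rather than chasing every waiter. The sketch below assumes you have already collected "session X waits on session Y" pairs from your database's own monitoring views; how to obtain them depends on the database engine, and the session ids shown are made up.

```python
def root_blockers(blocked_by):
    """Group waiting sessions under the session at the head of their chain.

    blocked_by maps a waiting session id to the session id it is waiting on.
    Sessions caught in a cycle (a deadlock) have no head and are skipped.
    """
    roots = {}
    for waiter in blocked_by:
        seen = {waiter}
        current = waiter
        while current in blocked_by:
            current = blocked_by[current]
            if current in seen:  # walked in a circle: deadlock, no root
                current = None
                break
            seen.add(current)
        if current is not None:
            roots.setdefault(current, []).append(waiter)
    return roots


# Hypothetical snapshot: 53 waits on 52, which waits on 51; 60 also waits on 51.
waits = {52: 51, 53: 52, 60: 51, 71: 70}
print(root_blockers(waits))
```

In this made-up snapshot, sessions 51 and 70 are the ones worth inspecting first, since everything else is queued behind them.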

5. Application - Services

  • Review the logs for anything suspicious: timeouts, null pointers, errors, exceptions. Be cautious of false positives.
  • Can a heap dump be generated? Will it provide anything useful?
  • Is a specific event triggering the problem, such as clicking on a particular link or page?
  • Is a specific service or API causing the problem?
  • Is it a UI issue or a backend issue?
  • Is it due to a dependency on third-party services? Internal or external?
  • Is it due to a dependency on third-party libraries? Internal or external?
  • Can services be turned off? What is the cost?
  • Is it happening for all users or for a particular set of users?
  • Is it happening for all data points or for a particular data point?
  • Have the log files grown out of proportion?
  • Any configuration changes?
  • Could it be related to caching?

6. Application - UI

  • Any server-side or client-side JavaScript issue?
  • Any packages/modules causing problems?
  • Have the log files grown out of proportion?
  • Are exceptions/errors not being handled properly?
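
For the log review in the application checks above, a small script can give a first tally of suspicious lines before anyone reads the logs in detail. This is a minimal sketch: the patterns, the category names, and the sample log lines are assumptions, and real log formats will need their own patterns.

```python
import re
from collections import Counter

# Illustrative patterns only; adjust them to the log format in use.
PATTERNS = {
    "timeout": re.compile(r"timed?\s?out", re.IGNORECASE),
    "null_pointer": re.compile(r"NullPointerException"),
    "error": re.compile(r"\bERROR\b"),
    "exception": re.compile(r"Exception\b"),
}


def summarize(lines):
    """Count how many lines match each suspicious pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical log lines for demonstration.
sample = [
    "2020-10-19 12:05:01 ERROR OrderService - java.lang.NullPointerException",
    "2020-10-19 12:05:02 WARN  PaymentClient - request timed out after 30s",
    "2020-10-19 12:05:03 INFO  HealthCheck - ok",
    "2020-10-19 12:05:04 ERROR PaymentClient - java.net.SocketTimeoutException",
]

for name, count in summarize(sample).items():
    print(f"{name}: {count}")
```

A tally like this only points at where to look. A line can match a pattern and still be harmless, which is why the checklist warns about false positives.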

Again, these are only pointers and guidelines. Over time, one gets to understand how the system behaves, and maintainers can quickly diagnose an issue just by looking at the symptoms, UNLESS, OF COURSE, IT REALLY WAS A ZEBRA.

References

  1. Google SRE Book (Site Reliability Engineering), chapter "Effective Troubleshooting"


Santosh Rangarajan

Software Engineer. Interests include distributed systems, data storage, and programming languages.