Debugging unfamiliar systems

Guess what?

Your ‘in the know’ team members aren’t available right now, so it’s down to you to find the problem. And of course, you’ve been drafted in because the fix is top-priority and needs to be looked at right away. It’s time for your hero moment!

But… you feel like you don’t know anything. All you’ve got is a screenshot from someone showing a web page you’ve never seen, with some error text on it. Where do you start?

Don’t panic! Here are my top five tips to help you find the problem (and be the hero of the day):

1. Ask the question – when did it last work?

This is really important and could stop you falling down a rabbit hole. Why has it broken specifically now? If the reporter of the problem knows exactly when it started going wrong, this might let you pinpoint errors in the logs much more easily. Or, perhaps there’s something significant about the time it went wrong: was it on the stroke of midnight? Have the clocks just changed? Is it the first day of the month?
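To make the "is the time itself significant?" question concrete, here is a minimal Python sketch. The function name and the five-minute midnight window are my own assumptions for illustration, so adjust them to whatever your system cares about:

```python
from datetime import datetime, timedelta, timezone


def suspicious_boundaries(failed_at: datetime) -> list[str]:
    """Return reasons why the failure time itself might be significant."""
    reasons = []
    # Assumption: "the stroke of midnight" means the first five minutes of the day.
    if failed_at.hour == 0 and failed_at.minute < 5:
        reasons.append("just after midnight")
    if failed_at.day == 1:
        reasons.append("first day of the month")
    # If the UTC offset differs from 24 hours earlier, the clocks have changed.
    day_before = failed_at - timedelta(days=1)
    if failed_at.utcoffset() != day_before.utcoffset():
        reasons.append("clocks changed in the last 24 hours")
    return reasons


# Example with a made-up failure time.
print(suspicious_boundaries(datetime(2024, 3, 1, 0, 2, tzinfo=timezone.utc)))
```

The clock-change check only does anything useful if you pass a datetime carrying a real regional time zone (for example via the standard `zoneinfo` module); with UTC or a naive datetime it will never fire.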

2. Find the logs

Application logs can be an absolute gold mine and may tell you everything you need to know to get to the bottom of the situation. They can also be full of noise and distractions, so just be prepared to filter out the stuff you don’t need.
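As a rough illustration of cutting through the noise, the sketch below keeps only the lines logged close to a known time that also mention an error. The log format (an ISO timestamp, a space, then the message) and the keywords are assumptions; your logs will almost certainly look different:

```python
from datetime import datetime, timedelta


def lines_near(log_lines, when, window=timedelta(minutes=2),
               keywords=("ERROR", "Exception")):
    """Yield lines logged within `window` of `when` that contain a keyword."""
    for line in log_lines:
        # Assumed format: "<ISO timestamp> <rest of the line>"
        stamp, _, rest = line.partition(" ")
        try:
            logged_at = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no leading timestamp, e.g. a stack trace continuation line
        if abs(logged_at - when) <= window and any(k in rest for k in keywords):
            yield line
```

Once a line looks interesting, go back to the full log around it: this filter deliberately throws away untimestamped lines, such as stack traces, that you will want to read next.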

If you can’t find them at all then it’s pretty hard to confirm any theories! If you don’t know where the logs are, think creatively. If the application logs anything at all, then it must be configured to do so – somewhere. Try to find clues: connection strings, URLs, or anything in the source code or configuration that might relate to logging. Perhaps the platform hosting the app has its own logging feature, and it’s already been enabled.

Follow the path and see where it leads…

3. Have there been any recent deployments?

This should be an easy one to verify. If your app uses some kind of deployment pipeline, when did it last run? Does it coincide with the breakage? If so, consider whether you can roll back to the previous, known-good deployment. Of course, it isn’t always that straightforward, so you do need to consider whether it’s a suitable approach. Now is not the time to make things worse!
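If you can get the deployment history out of your pipeline, lining it up against the breakage takes only a few lines. This is a sketch that assumes you have already collected the history as (time, version) pairs; the function name and data shape are hypothetical and not tied to any particular pipeline tool:

```python
from datetime import datetime


def suspect_and_rollback(deploys, broke_at):
    """Return (suspect, rollback_to) for a breakage first seen at `broke_at`.

    `deploys` is a list of (deployed_at, version) tuples in any order.
    The suspect is the latest deployment at or before the breakage, and the
    rollback candidate is the one before that. Either may be None.
    """
    earlier = sorted(d for d in deploys if d[0] <= broke_at)
    suspect = earlier[-1] if earlier else None
    rollback_to = earlier[-2] if len(earlier) > 1 else None
    return suspect, rollback_to
```

A deployment that landed shortly before the first report is a strong lead, not proof. Check what actually changed in it before you roll anything back.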

4. Reproduce the problem yourself

Get your hands on it. Reach out to the reporter of the problem if necessary and ask for more details about how the problem is reproduced. If the user interface of the app is not accessible to you, then arrange for access to the system so that you can log in and reproduce the error for yourself. Just running through the steps on your own can be a great way to think more clearly about what’s going on.

Being able to reproduce the problem yourself might also help you find things in the logs more easily – make a note of the exact time you reproduce the problem, and then you can pinpoint the error in the logs.
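If you are reproducing the problem from a script or an interactive session, you can let the computer do the note-taking. A small sketch, with a made-up helper name, that records the start and end of the attempt in UTC so the times are easy to compare with log timestamps:

```python
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def timed_repro(label):
    """Print the UTC start and end time of a reproduction attempt."""
    started = datetime.now(timezone.utc)
    try:
        yield started
    finally:
        # Runs even if the reproduction raises, so the time is never lost.
        finished = datetime.now(timezone.utc)
        print(f"{label}: {started.isoformat()} -> {finished.isoformat()}")
```

Wrap your reproduction steps in `with timed_repro("report page error"):` and keep the printed line with your notes. Remember that the logs may be written in a different time zone from the one on your machine.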

5. Run it locally

Time to take matters into your own hands – if the logs don’t give you a good idea of what’s going on, then run the code on your local machine to dig further into the problem.

Pull the code from source control and get it running locally. Attach a debugger at an appropriate point and follow the flow of execution. Follow the exact same steps to reproduce as you tried on the failing system. Does it work locally? What does that tell you?

Sometimes your ideas just run dry and you feel like you’re out of luck. Simplify the problem if you can – what’s the shortest path to the error? Can you strip out any reproduction steps? Comment out parts of the code until it starts working. Reintroduce code in small portions until you find the culprit.
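Stripping out reproduction steps can be done quite mechanically. The sketch below is a simple greedy reduction (not a full delta-debugging algorithm): it drops one step at a time and keeps it out if the failure still happens. The `still_fails` callback is hypothetical; it stands for whatever you do to run the steps and check for the error:

```python
def minimise_steps(steps, still_fails):
    """Greedily drop steps one at a time while the failure still reproduces.

    `still_fails(candidate_steps)` must return True if running only those
    steps still triggers the error.
    """
    needed = list(steps)
    i = 0
    while i < len(needed):
        candidate = needed[:i] + needed[i + 1:]
        if still_fails(candidate):
            needed = candidate  # step i was not required, so leave it out
        else:
            i += 1  # step i is required, so keep it and move on
    return needed
```

The same idea works for code: remove a chunk, check whether the problem is still there, and only put the chunk back if it turns out to matter.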

Break it down and get hunting!