First step in troubleshooting complex issues: Define and scope your issue properly

September 9, 2009 11 minute read

Is it a plane, is it a bird, is it a UFO?

Before you can delve into any kind of troubleshooting of an issue you need to thoroughly define it. If you don’t, you’ll probably end up spending a lot of time randomly gathering and looking at data that is not even relevant to the issue at hand. More importantly, how do you even know that your problem is fixed if you don’t have a good definition of the problem?

This might sound like common sense, but having worked with troubleshooting the better part of the last 10 years I can tell you that it is extremely common that people start looking at data before really understanding the problem. More often than not this is because people are stressed and want to fix the problem fast, or maybe because the problem space is unknown to them so they have difficulties defining what information or details are important. Sometimes symptoms are even contradictory which makes it even more difficult to troubleshoot something, as you don’t know if you can trust the data.

I don’t have a silver bullet recipe for success, but I wanted to share how I approach an issue (in a pretty generic way). I welcome any and all input that you might have on the topic, as good scoping and problem definition is something that is often a topic of discussion at my workplace.

What I do know for a fact is that the issues where we don’t have a proper problem definition agreed between all parties tend to take a lot longer to resolve than the ones that do. It is similar to a software project where you haven’t spent enough time working with the customer on their requirements. And in the end, if you have a good agreement between all parties, you don’t end up resolving the wrong problem.

My 9 questions for a pretty thorough problem description

When I call up a customer to start working on an issue I am generally looking for them to answer 9 questions in their own words.

What is happening?
What did you expect to happen?
When is it happening?
When did it start happening?
How does this problem affect you?
What do you think the problem is, and what data are you basing this on?
What have you tried so far?
What is the expected resolution?
Is there anything that would prohibit certain troubleshooting steps or solutions?

All of these questions are pretty open-ended, and a lot of them can’t be answered on the first pass before gathering more data, but at least then we know what data we need to gather to continue.

Throughout the process of defining the issue, I usually also try to assess the reliability of the information I get so that I know what is a known fact and what is hearsay to avoid spending too much time on things that may not be relevant.
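If it helps to keep track of which questions are still open and which answers are facts rather than hearsay, the list can be captured in a small structure. The following is only a minimal sketch in Python; the names `QUESTIONS`, `Answer`, `unanswered` and `hearsay` are made up for illustration, and a simple verified/unverified flag is just one assumed way of recording reliability.

```python
from dataclasses import dataclass
from typing import Dict, List

# The nine questions, in the order they are listed above.
QUESTIONS: List[str] = [
    "What is happening?",
    "What did you expect to happen?",
    "When is it happening?",
    "When did it start happening?",
    "How does this problem affect you?",
    "What do you think the problem is, and what data are you basing this on?",
    "What have you tried so far?",
    "What is the expected resolution?",
    "Is there anything that would prohibit certain troubleshooting steps or solutions?",
]


@dataclass
class Answer:
    text: str
    verified: bool = False  # True = known fact, False = hearsay so far


def unanswered(answers: Dict[int, Answer]) -> List[str]:
    """Return the questions (keyed 1 to 9) that have no answer yet."""
    return [q for i, q in enumerate(QUESTIONS, start=1)
            if i not in answers or not answers[i].text.strip()]


def hearsay(answers: Dict[int, Answer]) -> List[str]:
    """Return the questions whose answers still need to be verified."""
    return [QUESTIONS[i - 1] for i, a in sorted(answers.items())
            if a.text.strip() and not a.verified]


# Hypothetical first pass after a call with the customer.
answers = {
    1: Answer("Login page times out once or twice a day", verified=True),
    6: Answer("We think it is a memory leak"),
}
print(len(unanswered(answers)))  # 7
print(hearsay(answers))
```

The open questions become the list of data to gather next, and the unverified answers are the ones to double-check before spending time on them.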

If you’re troubleshooting an issue by yourself, it might be worthwhile pairing up with someone to explain the problem since a lot of things tend to get clearer if you say them out loud and have to explain them.

What am I really looking for when asking these questions?

1. What is happening

Let’s say we have a problem where we are experiencing a “hang” in an ASP.NET application on a production server. This is what I am looking for, or not looking for, in my definition.

Bad answer: Occasionally our IIS server just hangs.

This is a pretty common answer, but in reality it says nothing about what is going on. The words hang, crash or memory leak for example mean a lot of different things to different people.

A crash for example might mean anything from “we just get blank pages”, user gets a “service unavailable”, there is a “stopped unexpectedly” log in the event log, user sees a “NullReferenceException” page in the browser, or really anything where the application is not doing what it is supposed to.

A hang could be anything from a “crash”, to a particular page responding in 5 seconds rather than 2 seconds, to pages that stop responding completely, leaving the user with a spinning cursor, a timeout, or a complete system freeze.

Memory leaks are my favorite ones. People tend to use “memory leak” as the bucket for cases where “we don’t know what is going on so we must have a memory leak”. Even in the cases where there are actually memory problems, there is a big variation in the symptoms, anything from “the process grows from 200 MB private bytes to 1 GB in 20 minutes” to OutOfMemoryExceptions to “watching the process in task manager I see some memory growth”.

In either case, the more details you can get, the better off you are, and if it is not detailed enough you need to set up an action to get more data before starting to troubleshoot.

Better answer: Once or twice a day over the last two weeks we have gotten reports from customers that the login page is not responding. Eventually if they wait long enough the page will time out. We have confirmed this behavior by logging in at the time of the failure, but all other pages seem to be working well during the problem period. The problem persists until we restart IIS. We have not seen any events in the event log matching the time of the failure.

Note: There are a couple of things worth mentioning here. The timing of once or twice a day is important, because once we resolve the issue, we can be fairly sure it is resolved if it doesn’t happen for two days, for example. The fact that it is verified by someone in-house makes the symptom pretty reliable. The fact that the problem persists until an IISReset is issued tells us a couple of things:

we have a window for getting memory dumps before recycling,
there is a good chance the contention exists in the app since it is not persistent after a recycle.

And finally, the fact that no other pages are affected gives us a very defined area to look at.

Of course this is just a random issue, but you get the idea of how much better off we are with this answer than the first answer.

2. What did you expect to happen

This is a question that is often overlooked because the answer is often assumed, but the reality is not as simple as that.

If we take a memory issue for example, to understand what is considered bad memory patterns, we need to understand what the memory usage is in testing for x number of users, what the baseline memory usage is and how much memory we are expecting to store in session/cache etc. per user.

For a performance issue we need to have a baseline response time. Even if the problem we are trying to resolve is that pages are timing out, it is important to know if the expected response time for the page is 5 seconds or a few milliseconds, and also if the application has been stress tested and shown those results during testing.

In a crash scenario where you are looking at error events in the event log, it is also useful to note down what is expected, for example that the process recycles every 24 hours for known recycling reasons, so that you know that event log entries relating to that can be discarded.

In the best of cases we would have a solid set of performance logs, event logs and IIS logs from normal operation to compare against the ones we gather for the faulty state.
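As a rough illustration of comparing the faulty state against a baseline, the sketch below contrasts the median response time from normal operation with the median during the problem period. The function name `deviation_from_baseline` and all the numbers are hypothetical, and the median is just one assumed choice of summary; real performance logs would need their own parsing.

```python
from statistics import median
from typing import Sequence


def deviation_from_baseline(baseline: Sequence[float], observed: Sequence[float]) -> float:
    """Return how many times larger the observed median is than the baseline median."""
    if not baseline or not observed:
        raise ValueError("both samples need at least one value")
    base = median(baseline)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return median(observed) / base


# Hypothetical response times for one page, in seconds.
normal = [1.8, 2.0, 2.1, 1.9, 2.2]     # from normal operation
faulty = [4.9, 5.2, 5.0, 30.0, 5.1]    # from the problem period
print(f"{deviation_from_baseline(normal, faulty):.2f}x the baseline")
```

Without the `normal` sample there is no way to say whether 5 seconds is a problem or simply how the page has always behaved.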

3. When is it happening?

A solid repro scenario is of course optimal, but with production type issues this is seldom the case.

Things that I am looking for here are:

How often the problem reproduces (e.g. once or twice a day in the above example)
Do we know what actions people take, i.e. it happens when they try to log in?
Do we know if it is confined to one server, one application, one page, one action?
Do we know if it is reproducible in test?
Is it happening only for certain users?
Is it happening only under load?
Does it only happen when memory usage is high, or when CPU usage is over 80%?
Does it always happen at 8 am when the first people in the office start logging in?

And so on. Any and all conditions relating to the issue are interesting.

4. When did it start happening?

If the application has been working well for a while and suddenly starts behaving weird, it stands to reason that perhaps something that happened around that time could have affected it.

Often, though, there is a difference between the time when it started happening and when it was first discovered, so make sure to verify the time that it started happening against any available data (event logs, IIS logs, performance logs etc.) that you might have.
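One way to check the reported start time against the logs is to look for the earliest request that already shows the symptom. The sketch below assumes IIS logs in W3C extended format with a `#Fields:` header that includes `date`, `time`, `cs-uri-stem` and `time-taken` (in milliseconds); which fields are actually logged depends on the server configuration, and the function name, page name, threshold and sample rows are all invented for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def first_slow_request(lines: Iterable[str], uri: str, threshold_ms: int) -> Optional[datetime]:
    """Return the timestamp of the earliest request to `uri` taking at least `threshold_ms`."""
    fields: List[str] = []
    earliest: Optional[datetime] = None
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        try:
            if row["cs-uri-stem"].lower() != uri.lower():
                continue
            if int(row["time-taken"]) < threshold_ms:
                continue
            stamp = datetime.strptime(f'{row["date"]} {row["time"]}', "%Y-%m-%d %H:%M:%S")
        except (KeyError, ValueError):
            continue  # skip rows that lack the needed fields or are malformed
        if earliest is None or stamp < earliest:
            earliest = stamp
    return earliest


# Hypothetical log excerpt.
sample = """#Fields: date time cs-uri-stem sc-status time-taken
2009-08-24 08:01:12 /login.aspx 200 1900
2009-08-25 08:03:40 /login.aspx 500 90000
2009-08-26 09:15:02 /default.aspx 200 120
2009-08-27 08:02:11 /login.aspx 500 90010
""".splitlines()
print(first_slow_request(sample, "/login.aspx", 5000))
```

If the date that comes back is earlier than the first customer report, the list of changes worth looking at moves with it.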

All in all, knowing when it started happening is often one of the most important clues to finding the root cause.

5. How does this problem affect you?

When I work on an issue, I like to know what kind of impact the problem has on the customer and the users of the system.

The reason I ask about this is that if a problem prevents users from logging in, for example, and this is a critical application for the business, we should obviously start by finding a temporary fix (like recycling when memory reaches 600 MB if the problem is a memory leak, to avoid an unexpected crash) and only then prepare for root cause analysis, rather than going straight into root cause analysis.

It also tells me a little bit about the priority that troubleshooting is going to get. If the issue is very severe, maybe it is ok to perform some troubleshooting actions that have a lot of impact on the system, if it is going to help us find the problem faster.

6. What do you think the problem is, and what data are you basing this on?

This is especially interesting if you come in as an external troubleshooter, as you get a lot of insight into known problem areas of the application by asking what they think the problem might be.

There is a temptation to follow down the path of what is already guessed because it often sounds very plausible, but when I look at a problem I try to keep a very open mind to avoid getting stuck in a tunnel.

If someone believes that the problem is “x” then oftentimes, because of how our minds work, we tend to look at the data from that direction, even if it doesn’t fit. The second part of this question is: what data are you basing your theory on? I try to look at that data with fresh eyes, to see if it really corroborates the theory or if there are some holes in it. To be honest, if a person comes to me with an issue and already has a theory, the first thing I try to do is to disprove it.

7. What have you tried so far?

This question serves two purposes.

Finding out what has been done so we don’t need to re-invent the wheel.
Finding out the results of those actions as it gives us more data about the problem.

Keep in mind here as well that reliability is important. If something was presumably done but there is no documentation of it or the results, it is probably worthwhile re-doing it, depending on how complex the task was.

8. What is the expected resolution?

This might sound like a repeat of question 2 (What did you expect to happen), but it is very different.

An expected resolution might be to “avoid the crash” which you can do by preemptive recycling, separating the app into a different application pool, moving to 64-bit, reverting back to a previous build etc.

Another expected resolution might be to “find the root cause to avoid that it happens again”.

And a third may be, “if we get the pages to consistently respond in less than 5 seconds under load we are good”.

Defining the expected resolution is crucial as this is what you will measure and verify the solution against to determine when you are done troubleshooting. If you don’t have this, how will you know when you are done?

This expected resolution, just like requirements in a software project, should also be agreed upon so that all involved parties work against the same goal.
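To make “how will you know when you are done” concrete, an agreed resolution can often be phrased as a check that either passes or fails. The sketch below assumes two hypothetical criteria taken from the examples in this post: no recurrence for an agreed quiet period, and every measured response under an agreed limit. The function names and numbers are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Sequence


def quiet_long_enough(last_failure: datetime, now: datetime, quiet_period: timedelta) -> bool:
    """True when no failure has been seen for at least `quiet_period`."""
    return now - last_failure >= quiet_period


def all_under_limit(response_times_s: Sequence[float], limit_s: float) -> bool:
    """True when there are measurements and every one is below the agreed limit."""
    return bool(response_times_s) and all(t < limit_s for t in response_times_s)


# Hypothetical values: a problem that used to occur once or twice a day,
# and pages that should respond in less than 5 seconds under load.
last_failure = datetime(2009, 9, 1, 8, 0)
now = datetime(2009, 9, 3, 9, 0)
print(quiet_long_enough(last_failure, now, timedelta(days=2)))  # True
print(all_under_limit([2.1, 3.4, 4.8], limit_s=5.0))            # True
```

The thresholds themselves are the part that has to be agreed with everyone involved; the check only makes the agreement explicit.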

9. Is there anything that would prohibit certain troubleshooting steps or solutions?

Knowing the limitations on data gathering or on certain solutions both determines what actions you can take and changes the expectations of what can be achieved.

On this question I would expect answers like “we cannot install any tools on the server without going through a process of 10 change requests” or “we can reproduce this on a test machine, so you can live debug if you need to”.

Summarizing everything

Once I have all the facts from the questions above I usually sit down and summarize everything in a format that looks something like this:

Problem description:
===========================
…
Expected resolution:
===========================
…
Troubleshooting done:
===========================
…
Next steps:
===========================
…
Timeline for next steps:
===========================
…

This is a format I use to document problems throughout the course of the troubleshooting, updating it as the troubleshooting progresses. We use it, in different variations, in both our support cases and our bug reports.

It has proven to be a pretty effective way to keep tabs on issues especially when we have to collaborate on a case or hand over an issue to someone else for some reason or another.
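For anyone who wants to generate this kind of summary rather than type it by hand, here is a minimal sketch in Python. The section titles are the ones from the format above; the names `SECTIONS`, `SEPARATOR` and `render_summary`, and the sample notes, are assumptions for illustration.

```python
from typing import Dict, List

# Section titles from the summary format shown earlier.
SECTIONS: List[str] = [
    "Problem description",
    "Expected resolution",
    "Troubleshooting done",
    "Next steps",
    "Timeline for next steps",
]
SEPARATOR = "==========================="  # width is arbitrary


def render_summary(notes: Dict[str, str]) -> str:
    """Render the notes in the summary layout, leaving '...' where nothing is known yet."""
    parts: List[str] = []
    for title in SECTIONS:
        parts.append(f"{title}:")
        parts.append(SEPARATOR)
        parts.append(notes.get(title, "").strip() or "...")
    return "\n".join(parts)


# Hypothetical notes early in a case; the remaining sections stay as placeholders.
print(render_summary({
    "Problem description": "Login page times out once or twice a day until IIS is restarted.",
    "Next steps": "Gather memory dumps during the next occurrence.",
}))
```

Keeping the empty sections visible makes it obvious what still has to be agreed on before the case is handed over.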

Happy troubleshooting, Tess


Tess Ferrandez