Basic Troubleshooting

I run into this all the time. You’ve got a complex application that runs across a multitude of servers and users are randomly reporting errors. How do you figure out what’s producing the error?

Let’s run through a scenario, then I’ll give you my process on it:

Users launch the application through Citrix. The app launches an IE session which runs on 5 different Citrix servers. All Citrix servers are (supposedly) built the same. Even though the application is purely web-based there are Java controls and vendor-proprietary controls that need to run on the machine running IE. Rather than have that all installed on each machine you put it on Citrix.

IE connects to a virtual IP address (VIP) that points to a BIG-IP (or any kind of load balancer). Behind the load balancer are 10 web servers (again, supposedly built the same). Then behind the web servers are application servers, SQL servers, file servers, etc.

So a random user calls the Helpdesk and reports that when they launch the application and log in, they get an error. They attach a screenshot to the ticket, and the error is definitely being produced by the web site. In this case it was an assembly error.

So what do you do?

1)    Determine if this error is happening to multiple users. If it’s just 1 user it’s probably a profile issue or a local caching issue. In that case just delete their roaming profile and their local profile on their machine to see if that fixes it.
2)    If multiple users are having the problem, as in this case, time to go on to the next step.
3)    Is it happening to all users all the time? In this case, it isn’t. It’s “random”.
4)    This, unfortunately, is where the tediousness comes in. Your error is being produced in IE, but is it a web site issue or a Citrix issue? Or a backend issue?
5)    You can quickly eliminate a backend issue, since it works for some people. If it weren’t working for anybody, that would be a pretty good indicator you’ve got major problems going on.
6)    So the first thing to try is to see if you can log into all 5 of your Citrix servers. Usually I have a test app (Notepad.exe) published against each server just for this purpose. You can also try RDP into each machine, but that doesn’t tell you whether Citrix is working or not.
7)    Assuming your test app works on every server, you know Citrix itself is all right. Now RDP into each Citrix server.
8)    Open IE on your Citrix server. Browse to the login page of your first web server (not the load-balanced address) and test the login. If it works, go on to the next one. Do this against all 10 web servers. If they all work, then you’ve got a problem with your load balancer. If 1 or 2 produce an error, you now know that it’s a problem with those web servers. If the problem only happens on 1 Citrix server against all the web servers, it’s probably a problem with your web controls on that box.
9)    Confused yet? You know where you’re going with this. Test all 10 web servers on all 5 Citrix servers. See if your problem is consistent or if it travels at all.
10)    In this case all Citrix servers had a problem against web06. So we knew we had a problem with web06. Quick fix is to take it out of the load balancer and bypass the problem for the users.
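
If you write your results down as you go, reading the pattern out of the grid is mechanical. Here’s a rough Python sketch of that reasoning, not a tool I actually ran. The server names (ctx01 through ctx05, web01 through web10) and the classify function are made up for illustration, and it assumes you tested every Citrix server against every web server and recorded a simple pass/fail for each.

```python
def classify(results):
    """Read the pattern out of a full test grid.

    results maps (citrix_server, web_server) -> True if the test login worked.
    Assumes every Citrix/web pair was tested (a complete grid).
    """
    citrix = sorted({c for c, _ in results})
    web = sorted({w for _, w in results})

    if all(results.values()):
        # Every direct test passed, so the servers look fine individually.
        return ("load_balancer", [])
    if not any(results.values()):
        # Nothing works anywhere: look at the backend (or the test itself).
        return ("everything", [])

    # A web server that fails from every Citrix server is the likely culprit.
    bad_web = [w for w in web if all(not results[(c, w)] for c in citrix)]
    if bad_web:
        return ("web_server", bad_web)

    # A Citrix server that fails against every web server points at that box.
    bad_citrix = [c for c in citrix if all(not results[(c, w)] for w in web)]
    if bad_citrix:
        return ("citrix_server", bad_citrix)

    # Failures that follow neither a row nor a column need a closer look.
    failed = sorted(f"{c}->{w}" for (c, w), ok in results.items() if not ok)
    return ("intermittent", failed)


# Hypothetical names; the grid below mirrors the web06 case from this post.
citrix_servers = [f"ctx{i:02d}" for i in range(1, 6)]
web_servers = [f"web{i:02d}" for i in range(1, 11)]
results = {(c, w): w != "web06" for c in citrix_servers for w in web_servers}

print(classify(results))  # ('web_server', ['web06'])
```

The point isn’t the script. It’s that a failure following a column (one web server) or a row (one Citrix server) tells you which box to go look at.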

The ultimate fix is to figure out what the heck is wrong with your web server. In this case the controls hadn’t been updated correctly during the last update. Problem solved.

Your takeaway from this is that you need to know how your apps map out and the total flow of data. Then you just need to chip away at the problem, eliminating options as you go, until you figure it out. In the above case, if we’d just kept trying the load-balanced VIP, we’d have gotten nowhere from the get-go.