Recently we have been getting quite a few similar hangs in ASP.NET. From the first look they really aren’t all that similar but when we get down to the nitty-gritty we see that they all hang or rather have a very high latency, in the exact same location.
On startup, the first request for an aspx page, no matter how simple it is, will take around 2 minutes. In fact, all subsequent requests will also take 2 minutes to execute, so the server is essentially out of order.
All the servers in the web cluster behave the same way, and have done so since the servers were installed.
Debugging the problem
They took two dumps on the server, during the hang and the only thread that seemed mildly interesting was this one…
0:009> kp 2000 ChildEBP RetAddr 0130e41c 7d4e091f ntdll!NtQueryDirectoryFile+0x12 0130e498 79e7a0b3 kernel32!FindNextFileW+0xaa 0130eb38 79e7a23c aspnet_isapi!CPerfCounterServer::CleanupNames+0x103 0130eb3c 79e79e58 aspnet_isapi!CPerfCounterServer::InitPerfPipe+0x5 0130eb44 79e79de2 aspnet_isapi!CPerfCounterServer::Initialize+0x37 0130eb50 79e7642b aspnet_isapi!CPerfCounterServer::StaticInitialize+0x39 0130eb58 5a3233f8 aspnet_isapi!GetExtensionVersion+0x11 0130ec84 5a322dc9 w3isapi!ISAPI_DLL::Load+0x219 0130eca8 5a3228ca w3isapi!ISAPI_DLL_MANAGER::GetIsapi+0x1cd 0130ecd4 5a3968ff w3isapi!ProcessIsapiRequest+0x96 0130ed08 5a3967e0 w3core!W3_ISAPI_HANDLER::IsapiDoWork+0x3fd 0130ed28 5a396764 w3core!W3_ISAPI_HANDLER::DoWork+0xb0 0130ed48 5a3966f4 w3core!W3_HANDLER::MainDoWork+0x16e 0130ed58 5a3966ae w3core!W3_CONTEXT::ExecuteCurrentHandler+0x53 0130ed74 5a3c0a52 w3core!W3_CONTEXT::ExecuteHandler+0x51 0130f01c 5a3c41d9 w3core!W3_CONTEXT::ExecuteChildRequest+0x29d 0130fb74 5a3c427a w3core!W3_STATIC_FILE_HANDLER::HandleDefaultLoad+0x5cc 0130fba8 5a3c4cf2 w3core!W3_STATIC_FILE_HANDLER::DirectoryDoWork+0x55 0130fbcc 5a3c53f8 w3core!W3_STATIC_FILE_HANDLER::RealDoWork+0x113 0130fe78 5a396764 w3core!W3_STATIC_FILE_HANDLER::DoWork+0x368 0130fe98 5a3966f4 w3core!W3_HANDLER::MainDoWork+0x16e 0130fea8 5a3966ae w3core!W3_CONTEXT::ExecuteCurrentHandler+0x53 0130fec4 5a396648 w3core!W3_CONTEXT::ExecuteHandler+0x51 0130feec 5a392264 w3core!W3_STATE_HANDLE_REQUEST::DoWork+0x9a 0130ff10 5a3965ea w3core!W3_MAIN_CONTEXT::DoWork+0xa6 0130ff2c 5a36169f w3core!W3_MAIN_CONTEXT::OnNewRequest+0x55 0130ff38 5a361650 w3dt!UL_NATIVE_REQUEST::DoStateProcess+0x48 0130ff48 5a3616ca w3dt!UL_NATIVE_REQUEST::DoWork+0x7f 0130ff5c 5a3024de w3dt!OverlappedCompletionRoutine+0x1a 0130ff8c 5a3026bc w3tp!THREAD_POOL_DATA::ThreadPoolThread+0x73 0130ffa0 5a301db9 w3tp!THREAD_POOL_DATA::ThreadPoolThread+0x24 0130ffb8 7d4dfff1 w3tp!THREAD_MANAGER::ThreadManagerThread+0x39 0130ffec 00000000 kernel32!BaseThreadStart+0x34
How did I figure out that this one was interesting? Check out Things to ignore when debugging an ASP.NET Hang - Update for .NET 2.0.
Ok, so interesting hmm… well depends on who you ask :)
This thread is processing an ISAPI request (for example an ASP.NET request since all ASP.NET requests go through the
aspnet_isapi.dll), then it is loading an ISAPI dll, which just happens to be our friend
For this one to load it apparently has to initialize some performance counters, and then it goes into some strange function
kernel32!FindNextFileW and sticks around there…
It is in the
NtQueryDirectoryFile the similarities exists between the similar hang issues we’ve had.
These functions make a transition from user mode (our process space) to kernel mode, and when you take a process dump with adplus or debug diag you will only get the user mode stacks and information, so the buck stops here…
Since we know that we are blocking in kernel mode we can make a few assumptions and work on those unless we want to go the whole way and take a kernel dump which is a bit of a bother, both to gather and to debug. We can assume that we are blocking on some kind of driver in kernel mode, and since we are in
NtQueryDirectoryFile, the suspects are drivers for removable storage or antivirus software.
Before I continue I want to just say that before I would ever suggest doing anything to remove antivirus software, I would strongly advice to take backup measures to protect the machine from virus attacks. To test the assumption of an antivirus scanner intercepting the call, most people will just exclude the directory from virus scanning or disable the antivirus software. However, experience has shown that this is not always enough since there are still kernel drivers present, so a full removal of the antivirus software is needed in order to make sure that av software is not intercepting the calls.
We have had quite a few issues where av software has gotten stuck because of very outdated signatures, and in this particular case it happened to be antivirus software getting stuck because they were running a version of the antivirus software that was not designed for use with 64-bit operating systems. Once they installed the version of the av software that had a 64-bit engine it ran like a charm.
So the moral of the story is: keep your av software up to date and make sure you use the one that ia appropriate for your target operating system