11 minute read

This is an issue I come across once in a while and since I am not a COM expert by any means I actually had to check in with our COM guys to see what was going on in this case, and as it turned out (this is embarrassing :)) I had already written about the root cause of this issue before (here. So what I am learning from this is that perhaps I should at least check my blog before asking questions :)

Problem description

Intermittently the application will stall, so all requests will hang until they run an IIS Reset. At this point the CPU usage for the process is very low.

Gathering memory dumps

A memory dump was gathered using Debug Diagnostics 1.1, by right clicking on the w3wp.exe process and choosing “Create Full Userdump” while the application was hanging.

Debugging the issue

I opened up the memory dump in windbg and ran the basic commands to set up the environment

.symfix c:\websymbols  (to set up the symbol path)
.reload (to reload the symbols)
.loadby sos mscorwks  (to load up the sos dll in order to view .net objects and stacks)

As always with hangs, the most interesting thing to look at is what the threads are doing, so the 3 standard commands I run are

~* kb 2000 (to see the native stacks)
~* e !clrstack (to see the .net stacks)
!threads (to get information about the managed/.net threads)

When I browse through the native call stacks I noticed that plenty of them were sitting in this type of call stack

  71  Id: 91d0.9740 Suspend: 0 Teb: 7ff0c000 Unfrozen
ChildEBP RetAddr  Args to Child
10f2f0f0 7c822114 77e6711b 00000001 10f2f140 ntdll!KiFastSystemCallRet
10f2f0f4 77e6711b 00000001 10f2f140 00000001 ntdll!NtWaitForMultipleObjects+0xc
10f2f19c 79f8ead4 00000001 04b0e580 00000000 kernel32!WaitForMultipleObjectsEx+0x11a
10f2f204 79f17522 00000001 04b0e580 00000000 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x6f
10f2f224 79f17493 00000001 04b0e580 00000000 mscorwks!Thread::DoAppropriateAptStateWait+0x3c
10f2f2a8 79f1732f 00000001 04b0e580 00000000 mscorwks!Thread::DoAppropriateWaitWorker+0x144
10f2f2f8 79f8ea4d 00000001 04b0e580 00000000 mscorwks!Thread::DoAppropriateWait+0x40
10f2f354 79e77f50 ffffffff 00000001 00000000 mscorwks!CLREvent::WaitEx+0xf7
10f2f364 7a0fd9c3 ffffffff 00000001 00000000 mscorwks!CLREvent::Wait+0x17
10f2f3f0 7a0fdbbf 0c9dbe58 ffffffff 04b0e56c mscorwks!AwareLock::EnterEpilog+0x94
10f2f40c 7a0fdd2a 83f96676 0c9dbe58 0c9dbe58 mscorwks!AwareLock::Enter+0x61
10f2f470 7a094352 ffffffff 83f96716 151d4058 mscorwks!AwareLock::Contention+0x16c
10f2f510 0be7cbbb 1d107334 00000001 00000000 mscorwks!JITutil_MonContention+0xa3
WARNING: Frame IP not in any known module. Following frames may be wrong.
10f2f540 0be7c7d2 10f2f580 00000000 15142c80 0xbe7cbbb
10f2f64c 6866b5d5 00000000 10f2f6a0 151a6e0c 0xbe7c7d2
10f2f664 6865a071 689a26e1 00000000 00000000 System_Web_ni+0xab5d5
10f2f668 689a26e1 00000000 00000000 15142724 System_Web_ni+0x9a071
10f2f688 6865a00b 4190ab00 151a6e0c 151a7980 System_Web_ni+0x3e26e1
00000000 00000000 00000000 00000000 00000000 System_Web_ni+0x9a00b

If you have followed the blog you will probably know that the fact that we are sitting in mscorwks!AwareLock::Enter means that we are waiting for a .net lock. i.e. on a line like the bolded line below, or alternatively in a Monitor.Enter()

  // do something

If we switch to this thread (71) and run !clrstack on we get the .net stack so that we know where we are waiting for this lock/monitor

0:030> ~71s
eax=00001c5a ebx=10f2f140 ecx=04b0e56c edx=26064dec esi=10f2f140 edi=7ffdf000
eip=7c82ed54 esp=10f2f0f4 ebp=10f2f19c iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
7c82ed54 c3              ret
0:071> !clrstack
OS Thread Id: 0x9740 (71)
ESP       EIP
10f2f38c 7c82ed54 [GCFrame: 10f2f38c]
10f2f4c4 7c82ed54 [HelperMethodFrame: 10f2f4c4] System.Threading.Monitor.Enter(System.Object)
10f2f518 0be7cbbb MyFramework.User.VerifyUser()
10f2f548 0be7c7d2 MyFramework.Http.HttpModules.HttpAuthenticationModule.Authenticate(System.Object, System.EventArgs)
10f2f654 6866b5d5 System.Web.HttpApplication+SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
10f2f66c 6865a071 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
10f2f6ac 6865a39b System.Web.HttpApplication.ResumeSteps(System.Exception)
10f2f6f4 6865912d System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
10f2f710 6865e0bd System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
10f2f744 6865dd72 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
10f2f750 6865c447 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
10f2f900 79f1ef33 [ContextTransitionFrame: 10f2f900]
10f2f950 79f1ef33 [GCFrame: 10f2f950]
10f2faa8 79f1ef33 [ComMethodFrame: 10f2faa8]

and in this case it is in the MyFramework.User.VerifyUser() method that we are waiting for the lock.

Another interesting point about this stack is that we can see that this request hasn’t started executing yet, it is still in the Authenticate module which explains why the requests don’t simply time out.

The next step is to figure out who is holding this lock to determine why we are stuck here, and we find the owner by running !syncblk

0:071> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
   42 04b0e4dc            3         1 04ba8bc8  9c88  41   1d107964 System.Object
   45 04b0e56c           61         1 04c987c0  9da0  40   1d107334 System.Object
Total           74
CCW             2
RCW             0
ComClassFactory 0
Free            43

We have two different sync blocks (locks) in this process and it is a bit hard to guess which one is which. The first one locking on object 1d107964 is held by thread 41 and it has 3 monitors which means that it has one waiter. The way you calculate how many waiters a monitor has is by taking (MonitorHeld-1)/2. This is because the owner will take one monitor, and each waiter takes two monitors.

The second lock (1d107334) is owned by thread 40 and has 30 waiters ((61-1)/2).

Oh, btw, when I say owner, I mean the thread that has entered into the lock statement and is now doing something while holding the lock so that thread is essentially keeping the “waiters” waiting.

If we look at our native stack we had

10f2f510 0be7cbbb 1d107334 00000001 00000000 mscorwks!JITutil_MonContention+0xa3

The 1st parameter to JITUtil_MonContention (the 3rd column here) identifies the object we are waiting on, so with this information we now know that our thread (thread 71) is waiting for the lock held by thread 40.

Next we check what this thread is doing and we find that this thread is also waiting for a lock, but this time in ConfigManager.get_Instance(), and lower down in the call chain we can see our VerifyUser method which is why we are holding the lock that thread 71 is waiting for.

0:071> ~40s
eax=00004a4d ebx=0c82edbc ecx=04b0e4dc edx=1ccfb4d6 esi=0c82edbc edi=7ffdf000
eip=7c82ed54 esp=0c82ed70 ebp=0c82ee18 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
7c82ed54 c3              ret
0:040> kb
ChildEBP RetAddr  Args to Child
0c82ed6c 7c822114 77e6711b 00000001 0c82edbc ntdll!KiFastSystemCallRet
0c82ed70 77e6711b 00000001 0c82edbc 00000001 ntdll!NtWaitForMultipleObjects+0xc
0c82ee18 79f8ead4 00000001 04b0e4f0 00000000 kernel32!WaitForMultipleObjectsEx+0x11a
0c82ee80 79f17522 00000001 04b0e4f0 00000000 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x6f
0c82eea0 79f17493 00000001 04b0e4f0 00000000 mscorwks!Thread::DoAppropriateAptStateWait+0x3c
0c82ef24 79f1732f 00000001 04b0e4f0 00000000 mscorwks!Thread::DoAppropriateWaitWorker+0x144
0c82ef74 79f8ea4d 00000001 04b0e4f0 00000000 mscorwks!Thread::DoAppropriateWait+0x40
0c82efd0 79e77f50 ffffffff 00000001 00000000 mscorwks!CLREvent::WaitEx+0xf7
0c82efe0 7a0fd9c3 ffffffff 00000001 00000000 mscorwks!CLREvent::Wait+0x17
0c82f06c 7a0fdbbf 04c987c0 ffffffff 04b0e4dc mscorwks!AwareLock::EnterEpilog+0x94
0c82f088 7a0fdd2a 9f8962ea 04c987c0 04c987c0 mscorwks!AwareLock::Enter+0x61
0c82f0ec 7a094352 ffffffff 9f89638a 00000000 mscorwks!AwareLock::Contention+0x16c
0c82f18c 0be7d4b6 1d107964 00000000 00000000 mscorwks!JITutil_MonContention+0xa3
WARNING: Frame IP not in any known module. Following frames may be wrong.
0c82f1d0 79e74ebd 00000000 00000000 191240e4 0xbe7d4b6
0c82f1d8 00000000 191240e4 00000000 00000000 mscorwks!HelperMethodFrame::LazyInit+0x17
0:040> !clrstack
OS Thread Id: 0x9da0 (40)
ESP       EIP
0c82f008 7c82ed54 [GCFrame: 0c82f008]
0c82f140 7c82ed54 [HelperMethodFrame: 0c82f140] System.Threading.Monitor.Enter(System.Object)
0c82f194 0be7d4b6 MyFramework.Configuration.ConfigManager.get_Instance()
0c82f2e4 0be7cd8a MyFramework.User..ctor()
0c82f398 0be7cbd1 MyFramework.User.VerifyUser()
0c82f3c8 0be7c7d2 MyFramework.Http.HttpModules.HttpAuthenticationModule.Authenticate(System.Object, System.EventArgs)
0c82f4d4 6866b5d5 System.Web.HttpApplication+SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
0c82f4ec 6865a071 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
0c82f52c 6865a39b System.Web.HttpApplication.ResumeSteps(System.Exception)
0c82f574 6865912d System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
0c82f590 6865e0bd System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
0c82f5c4 6865dd72 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
0c82f5d0 6865c447 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
0c82f780 79f1ef33 [ContextTransitionFrame: 0c82f780]
0c82f7d0 79f1ef33 [GCFrame: 0c82f7d0]
0c82f928 79f1ef33 [ComMethodFrame: 0c82f928]

We go through the same drill again, finding the owner in the !syncbkl output and this time our first parameter to JITUtil_MonContention is 1d107964 which was the one held by thread 41.

Thread 41 is a bit different than the other threads, we can see from the .net call stack that we are doing Marshal.ReleaseComObject from the OnConfigRemoveCallback method, and going into the native call stack we can see that this in turn has to make a transition to an STA thread to clean up (GetToSTA). The reason for this is because the component we are trying to release is an STA component.

0:040> ~41s
eax=00002be2 ebx=049d75c0 ecx=00000000 edx=00000003 esi=00001464 edi=00000000
eip=7c82ed54 esp=0a25ea30 ebp=0a25eaa0 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
7c82ed54 c3              ret
0:041> kb
ChildEBP RetAddr  Args to Child
0a25ea2c 7c822124 77e6baa8 00001464 00000000 ntdll!KiFastSystemCallRet
0a25ea30 77e6baa8 00001464 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0a25eaa0 77e6ba12 00001464 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
0a25eab4 776c54ef 00001464 ffffffff 0a25ebb8 kernel32!WaitForSingleObject+0x12
0a25ead0 77789905 049d75c0 0ca91600 00000000 ole32!GetToSTA+0x6f
0a25eaf0 77787ed7 0a25ebb8 0a25ecb8 049d2850 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xcb
0a25ebd0 77695349 049d2850 0a25ecc8 0a25ecb8 ole32!CRpcChannelBuffer::SendReceive2+0xc1
0a25ec3c 776c4eee 049d2850 0a25ecc8 0a25ecb8 ole32!CAptRpcChnl::SendReceive+0xab
0a25ec90 77ce127e 00000001 0a25ecc8 0a25ecb8 ole32!CCtxComChnl::SendReceive+0x91
0a25ecac 77ce13ca 04be013c 0a25ecf4 0600016e rpcrt4!NdrProxySendReceive+0x43
0a25f094 77ce11bd 77676298 77679608 0a25f0cc rpcrt4!NdrClientCall2+0x206
0a25f0b4 77c53a12 0000000c 00000008 0a25f14c rpcrt4!ObjectStublessClient+0x8b
0a25f0c4 776c4c23 04be013c 0a25f100 0a25f200 rpcrt4!ObjectStubless+0xf
0a25f14c 777128f3 0008b420 79f277a5 0a25f1f4 ole32!CObjectContext::InternalContextCallback+0x126
0a25f19c 79f2340f 0008b430 79f277a5 0a25f1f4 ole32!CObjectContext::ContextCallback+0x85
0a25f1e8 79f23362 0008b430 00000000 00000000 mscorwks!CtxEntry::EnterContextOle32BugAware+0x2b
0a25f308 79f278ec 79f27898 04ae3eb8 992e6146 mscorwks!CtxEntry::EnterContext+0x2db
0a25f340 79f279dc 04ae3eb8 992e6176 00000001 mscorwks!RCW::ReleaseAllInterfacesCallBack+0x5c
0a25f370 7a042312 992e61b2 04ba8bc8 04ba8bc8 mscorwks!RCW::Cleanup+0x22
0a25f3b4 7a07484b 04ae3eb8 992e665a 1946c038 mscorwks!RCW::ExternalRelease+0x158
0:041> !clrstack
OS Thread Id: 0x9c88 (41)
ESP       EIP
0a25f408 7c82ed54 [HelperMethodFrame_1OBJ: 0a25f408] System.Runtime.InteropServices.Marshal.InternalReleaseComObject(System.Object)
0a25f464 793b69cd System.__ComObject.ReleaseSelf()
0a25f468 79461890 System.Runtime.InteropServices.Marshal.ReleaseComObject(System.Object)
0a25f498 0c88715b MyFramework.Configuration.ConfigManager.OnConfigRemoveCallback(System.String, System.Object, System.Web.Caching.CacheItemRemovedReason)
0a25f4d8 686bd802 System.Web.Caching.CacheEntry.CallCacheItemRemovedCallback(System.Web.Caching.CacheItemRemovedCallback, System.Web.Caching.CacheItemRemovedReason)
0a25f514 686bdaad System.Web.Caching.CacheEntry.Close(System.Web.Caching.CacheItemRemovedReason)
0a25f550 689a024c System.Web.Caching.CacheSingle.UpdateCache(System.Web.Caching.CacheKey, System.Web.Caching.CacheEntry, Boolean, System.Web.Caching.CacheItemRemovedReason, System.Object ByRef)
0a25f620 686bb838 System.Web.Caching.CacheInternal.DoRemove(System.Web.Caching.CacheKey, System.Web.Caching.CacheItemRemovedReason)
0a25f62c 686c2f72 System.Web.Caching.ExpiresBucket.FlushExpiredItems(System.DateTime, Boolean)
0a25f6a0 686c310f System.Web.Caching.CacheExpires.FlushExpiredItems(Boolean, Boolean)
0a25f6e8 686c3205 System.Web.Caching.CacheExpires.TimerCallback(System.Object)
0a25f6ec 793d9c1a System.Threading._TimerCallback.TimerCallback_Context(System.Object)
0a25f6f0 793683dd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0a25f708 793d9c7f System.Threading._TimerCallback.PerformTimerCallback(System.Object)
0a25f898 79e88f63 [GCFrame: 0a25f898]
0a25f9dc 79e88f63 [ContextTransitionFrame: 0a25f9dc]

Preferably we shouldn’t be using STA components in ASP.NET because of the need for transitions between threads and potential waits which of course cause performance degradations. If this would have been an MTA component we could have simply released it on the same thread.

However, still it should not be sitting like this, waiting indefinitely. The appropriate STA thread should handle the request and this thread should return, releasing the lock and thus releasing the other locks as well.

We can grab the 1st parameter to GetToSTA to figure out what thread it is trying to get to

0a25ead0 77789905 049d75c0 0ca91600 00000000 ole32!GetToSTA+0x6f

If we dump it out with dc, the 2nd and 3rd dword will tell us what process it is trying to get to as well as the thread id (the reason we have to go in a little roundabout fashion to find these is because public symbols don’t have type information, otherwise we could have dumped out the parameter with dt or looked at it in windbg, but now you will just have to rely on the fact that it is dword 2 and 3 (at least in this version of the dll)

0:041> dc 0x049d75c0
049d75c0  049d7640 049d7540 000091d0 00009a04  @v..@u..........
049d75d0  db94d897 0b8e692b e9cda17b 88696fd7  ....+i..{....oi.
049d75e0  e9cda17b 88696fd7 00001822 9a0491d0  {....oi.".......
049d75f0  80605d98 1cdb921b 00000103 03b80260  .]`.........`...
049d7600  049e86c0 00000000 00000000 00000000  ................
049d7610  00000001 ffffffff 00097ec0 04be013c  .........~..<...
049d7620  00000003 00000000 00000000 00000000  ................
049d7630  00000000 00070005 00000000 00000000  ................

The PID (process ID) here is 91d0 and the thread id is 9a04

Just to verify that we are going to a thread in this process we can run | to get basic process information to compare against

0:041> |
.  0    id: 91d0    examine    name: c:\WINDOWS\system32\inetsrv\w3wp.exe
And then we can dump out the threads using !threads to identify the thread with the OS thread ID 9a04

0:041> !threads
ThreadCount: 106
UnstartedThread: 0
BackgroundThread: 44
PendingThread: 0
DeadThread: 62
Hosted Runtime: no
                                      PreEmptive   GC Alloc           Lock
       ID OSID ThreadOBJ    State     GC       Context       Domain   Count APT Exception
  30    1 9a04 049a6c28   3808220 Enabled  00000000:00000000 04b0f008     1 STA (Threadpool Worker)
  36    2 8ce8 04b41ac8      b220 Enabled  00000000:00000000 04b5a728     0 MTA (Finalizer)
  37    3 9cb0 04991518    80a220 Enabled  00000000:00000000 04b5a728     0 MTA (Threadpool Completion Port)
  38    4 984c 04a07458      1220 Enabled  00000000:00000000 04b5a728     0 Ukn
  28    6 9a18 04ba8530   880a220 Enabled  1d1c14e8:1d1c24b4 04b5a728     0 MTA (Threadpool Completion Port)
  26    8 9c04 04be0b80       220 Enabled  00000000:00000000 04b5a728     0 STA
  40    9 9da0 04c987c0   380b220 Enabled  00000000:00000000 04b0f008     2 MTA (Threadpool Worker)
  41    7 9c88 04ba8bc8   180b220 Enabled  00000000:00000000 04b0f008     1 MTA (Threadpool Worker)
  42    a 9898 04bc6030   380b220 Enabled  00000000:00000000 04b0f008     1 MTA (Threadpool Worker)
  48    d 9bb0 0ca3a3a0       220 Enabled  00000000:00000000 04b5a728     0 STA
  53    f 9044 0c99b7d8       220 Enabled  00000000:00000000 04b5a728     0 STA
  47    c 9924 04ba7f28       220 Enabled  00000000:00000000 04b5a728     0 STA

Looking at the 1st thread in the list (Thread 30), we can see that the OSID is 9a04 so this is our guy.

  30    1 9a04 049a6c28   3808220 Enabled  00000000:00000000 04b0f008     1 STA (Threadpool Worker)

There are a few other things that are interesting about it as well.

  1. It is an STA thread (as it should be since we are doing GetToSTA to get here) and
  2. It is a ThreadPool Worker thread, meaning that it is a regular .net thread, so not really a thread that would be pumping messages like an STA thread should

If we would look at the stack for it we would see that this thread is actually also waiting for the lock held by thread 40 so we are stuck in a deadlock where

thread 30 is waiting for a lock held by thread 40 thread 40 is waiting for a lock held by thread 41 thread 41 needs to get to thread 30 in order to release an STA component

What about the solution?

The problem we need to solve here is that we need to make sure that thread 41 is not trying to get to a managed thread (or another thread not pumping messages) to release the STA component.

As it turns out, the reason that thread 30 is marked as an STA thread (it should really be MTA) is because the first thread created by the threadpool is created without specifying the apartment model. If this thread happens to handle a request where it uses an STA component this thread will then be marked as STA for the rest of it’s life, but unfortunately, as opposed to proper STA threads, it will not be pumping messages so you may block while trying to get to it.

To resolve the issue, install this hotfix