
Hope you have had a chance to go through the high CPU lab… Some people already commented with the solution in the lab instructions but here are my answers and commentary on the lab.

Some parts of the instructions are removed for brevity - and as usual, my comments are inline

Reproduce the issue

  1. Start the application and browse to All Products; this should take about 5 seconds or more depending on what machine you’re on
  2. Watch the CPU usage in task manager while browsing to this page

    • What is the CPU usage like for iisexpress.exe during the processing of this request?

      very high, approaching 100%

Reproduce the issue again and gather the memory dump

Note: Watch the performance counters - specifically the collections - while reproducing

  1. Generate load with .\ToolsAndScripts\tinyget.ps1 -url https://localhost:44350/Products -numTimes 5

     [System.Runtime]
         % Time in GC since last GC (%)                                  36
         Allocation Rate (B / 1 sec)                          2 266 546 976
         CPU Usage (%)                                                   46
         Exception Count (Count / 1 sec)                                  0
         GC Heap Size (MB)                                              219
         Gen 0 GC Count (Count / 1 sec)                                  41
         Gen 0 Size (B)                                          28 191 264
         Gen 1 GC Count (Count / 1 sec)                                  41
         Gen 1 Size (B)                                          12 270 032
         Gen 2 GC Count (Count / 1 sec)                                  41
         Gen 2 Size (B)                                                 192
         LOH Size (B)                                           118 228 792
         Monitor Lock Contention Count (Count / 1 sec)                    0
         Number of Active Timers                                          0
         Number of Assemblies Loaded                                    139
         ThreadPool Completed Work Item Count (Count / 1 sec)             2
         ThreadPool Queue Length                                          0
         ThreadPool Thread Count                                          9
         Working Set (MB)                                               309
    

    If you look at the counters in Visual Studio (Debug > Performance Profiler > .NET Counters) it looks like this

    [screenshot: the .NET counters view in the Visual Studio Performance Profiler]

    • How high does the CPU Usage (%) get?

      Around 50-60% in my case. This will vary greatly depending on whether you are on a single processor or multi processor box, since the server GC and the workstation GC are optimized for different things. The server GC, which runs in the ASP.NET service on multi processor boxes, does more full collections, more relocation, and compacts memory more. The workstation version omits some of this at the expense of higher memory usage.

      % Time in GC since last GC is the percentage of elapsed time that was spent performing a garbage collection since the last GC cycle. This counter is usually an indicator of the work done by the garbage collector on behalf of the application to collect and compact memory. It is updated only at the end of every GC and reflects the last observed value; it’s not an average. In other words this is bound to spike up a bit after a Gen 2 collection, but the important part here is that the average time in GC doesn’t end up too high.

    • What is the ratio between Gen 0, Gen 1, and Gen 2 GC Count? What should it be in an optimal scenario and why?

      Gen 0 collections are basically free. Gen 1 collections (which include Gen 0 collections) are also pretty cheap since we are still dealing with a small amount of memory that needs to be collected, relocated etc… Gen 2 collections on the other hand can get really expensive since they deal with the whole .NET GC heap. An optimal ratio would be to see 100 Gen 0 collections per 10 Gen 1 collections per 1 Gen 2 collection, i.e. 100:10:1. In my case the numbers are

      around 40 Gen 0 collections per second, and the same for Gen 1 and Gen 2, so a ratio of 1:1:1

      That means that pretty much every single collection that was done was a Gen 2 collection (since a Gen 2 collection also triggers a Gen 1 and Gen 0 collection), which is really bad: it means we take no advantage of having a generational GC and basically scavenge through all memory all the time.

    • What can cause a ratio like this?

      Generally two things cause this type of ratio: either the allocation pattern is such that pretty much all your allocations are for large objects, so the large object heap segments get exhausted all the time, or you constantly call GC.Collect, directly or through something like GC.GetTotalMemory(true). Both patterns are sketched out at the end of this section.

  2. Reproduce the issue again

  3. When the CPU usage is high, press enter in the procdump/dotnet-dump window to get the dump - alternatively you can also right-click in task manager to generate a memory dump.
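
To make the point about the 1:1:1 ratio concrete, here is a minimal sketch of the two patterns mentioned above. This is purely my own illustration (the class, the loop and the sizes are made up, not code from the lab), but it shows how either pattern makes the Gen 0/1/2 counts climb in lockstep, which you can verify with GC.CollectionCount:

     using System;

     class GcRatioDemo
     {
         static void Main()
         {
             for (int i = 0; i < 100; i++)
             {
                 // Pattern 1: every allocation is >= 85,000 bytes, so it goes straight
                 // to the large object heap and only a full (Gen 2) GC can reclaim it.
                 byte[] large = new byte[100_000];
                 large[0] = (byte)i; // touch it so it isn't optimized away

                 // Pattern 2: forcing a full collection on every iteration, either via
                 // GC.Collect() or indirectly via GC.GetTotalMemory(true).
                 long used = GC.GetTotalMemory(forceFullCollection: true);

                 // With either pattern the three counts climb together, i.e. roughly 1:1:1.
                 Console.WriteLine($"{i}: {used} B, GC 0/1/2 = " +
                     $"{GC.CollectionCount(0)}/{GC.CollectionCount(1)}/{GC.CollectionCount(2)}");
             }
         }
     }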

Verify that you took the dump at the right time

  1. Run !threadpool to see the CPU usage of the system to make sure that you got the dump at the right time

     0:000> !threadpool
     CPU utilization: 81% Worker Thread: Total: 7 Running: 6 Idle: 1 MaxLimit: 32767 MinLimit: 8
     Work Request in Queue: 0
     --------------------------------------
     Number of Timers: 0
     --------------------------------------
     Completion Port Thread:Total: 1 Free: 1 MaxFree: 16 CurrentLimit: 1 MaxLimit: 1000 MinLimit: 8
    
    • What is the CPU Utilization?

      81%

    • How does the CPU Utilization shown here correlate to the CPU usage of the process?

      It doesn’t really correlate - a couple of things to note here… 1. The CPU usage shown here is for the whole system, so it will not correlate directly with the CPU usage of the process, but if the process is running at high CPU it is typically taking most of the CPU. 2. 81% is a very special number. At 80% the threadpool stops creating new threadpool threads, so when the process is in the middle of a garbage collection the threadpool artificially reports a CPU usage of 81% to stop new threads from being created. In other words, if you see 81% you are likely in a GC, but you can’t really say anything about the actual CPU usage.

Determine which threads have been consuming most of the CPU time

  1. Run .time to see the uptime and CPU User time of the process
  2. Run !runaway to see the User Mode Time for all the threads

     0:000> .time
     Debug session time: Wed Feb  3 13:29:25.000 2021 (UTC + 1:00)
     System Uptime: 0 days 5:35:05.910
     Process Uptime: 0 days 0:44:51.000
     Kernel time: 0 days 0:00:58.000
     User time: 0 days 0:02:12.000
     0:000> !runaway
     User Mode Time
     Thread       Time
     39:26d0     0 days 0:00:08.906
     38:53a4     0 days 0:00:08.796
     37:3780     0 days 0:00:08.609
     40:69e0     0 days 0:00:08.250
     42:63e8     0 days 0:00:08.187
     41:6e9c     0 days 0:00:07.562
     43:40bc     0 days 0:00:07.546
     44:6ec0     0 days 0:00:07.437
     45:6974     0 days 0:00:01.625
     ...
    
    • From !runaway which threads are consuming most of the CPU?

      Threads 37 through 44 - the top 8 threads in the list, each with roughly 7.5-9 seconds of user mode time

    • How much user mode CPU time was consumed by the process in total? (from .time)

      From .time, about 2 minutes 12 seconds of user mode time in total; roughly 65 seconds of that (the sum of the 8 threads above) was consumed by these threads

    Note: Looking at !runaway output can be a bit tricky for multiple reasons.

    • First off, on a multi processor machine you have to remember that the user mode time (i.e. clock cycles spent in user mode code) is CPU time spent on all processors, therefore the user mode time may add up to more than elapsed time (process uptime).
    • Secondly, the !runaway output shows the total user mode time spent on that thread since it was started. In ASP.NET for example a thread may be reused for multiple requests, so a high user mode time on a thread does not necessarily mean that the request currently running on that thread is a big CPU consumer.
    • And finally, some threads like the GC threads (in multi processor, server GC processes) stay alive for the duration of the process, so they have a higher chance of accumulating a lot of user mode time than worker threads do. Because of this it is really more interesting to look at two consecutive dumps and compare the differences in user mode time.
  3. Pick the threads that have the highest user mode time and look at their stacks

     ~#s         # set thread context, pick a thread number from !runaway
     kb 2000     # look at the native stack
     !clrstack   # look at the .net stack
    
     0:039> k
     # Child-SP          RetAddr               Call Site
     00 00000031`fb5ff258 00007fff`9a1aa34e     ntdll!NtWaitForSingleObject+0x14
     01 00000031`fb5ff260 00007fff`0b595db8     KERNELBASE!WaitForSingleObjectEx+0x8e
     02 (Inline Function) --------`--------     coreclr!CLREventWaitHelper2+0x6
     03 00000031`fb5ff300 00007fff`0b81b154     coreclr!CLREventWaitHelper+0x20
     04 (Inline Function) --------`--------     coreclr!CLREventBase::WaitEx+0x10
     05 (Inline Function) --------`--------     coreclr!CLREventBase::Wait+0x10
     06 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0x17
     07 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x17
     08 00000031`fb5ff360 00007fff`0b6477f4     coreclr!SVR::gc_heap::bgc_thread_function+0x94
     09 00000031`fb5ff3c0 00007fff`0b62335a     coreclr!<lambda_d75637c55c295607723a3ad666cdddee>::operator()+0x7c
     0a 00000031`fb5ff3f0 00007fff`9b377034     coreclr!Thread::intermediateThreadProc+0x8a
     0b 00000031`fb5ffb30 00007fff`9c7bd0d1     kernel32!BaseThreadInitThunk+0x14
     0c 00000031`fb5ffb60 00000000`00000000     ntdll!RtlUserThreadStart+0x21
     0:039> !clrstack
     OS Thread Id: 0x26d0 (39)
             Child SP               IP Call Site
    
    • What are they doing? Can you make an assumption based on this as to what is causing the high CPU usage?

      This looks like a GC thread (with server GC on a multi processor machine we get one GC thread per processor, which correlates well with the 8 busy threads), so it looks like we’re spending most of our time in garbage collection.

  4. [Only on multi-proc machines since they have dedicated GC threads]. Sum up the user mode time for the GC threads in !runaway and divide this by total user mode time from .time.

    • How much of the total user mode time is spent by the GC threads?

      About 50% - the 8 GC threads account for roughly 65 seconds out of the 132 seconds (2:12) of total user mode time.

Look at the memory dump to figure out what caused the high CPU in GC

  1. Run ~* kb 2000 to get all native stacks and search for the thread that triggered the GC (coreclr!SVR::GCHeap::GarbageCollectGeneration)

     0:052> k
     # Child-SP          RetAddr               Call Site
     00 00000031`8837d8e8 00007fff`9a1aa34e     ntdll!NtWaitForSingleObject+0x14
     01 00000031`8837d8f0 00007fff`0b595db8     KERNELBASE!WaitForSingleObjectEx+0x8e
     02 (Inline Function) --------`--------     coreclr!CLREventWaitHelper2+0x6
     03 00000031`8837d990 00007fff`0b83bce0     coreclr!CLREventWaitHelper+0x20
     04 (Inline Function) --------`--------     coreclr!CLREventBase::WaitEx+0x41
     05 (Inline Function) --------`--------     coreclr!CLREventBase::Wait+0x41
     06 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0x41
     07 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x41
     08 00000031`8837d9f0 00007fff`0b811239     coreclr!SVR::gc_heap::wait_for_gc_done+0xc0
     09 00000031`8837da20 00007fff`0b839fea     coreclr!SVR::GCHeap::GarbageCollectGeneration+0x2c9
     0a 00000031`8837da70 00007fff`0b6bd753     coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x22
     0b 00000031`8837dab0 00007fff`0b5df95b     coreclr!SVR::gc_heap::try_allocate_more_space+0xddca3
     0c 00000031`8837db20 00007fff`0b62e6c4     coreclr!SVR::gc_heap::allocate_more_space+0x5b
     0d 00000031`8837dbb0 00007fff`0b6bd57c     coreclr!SVR::gc_heap::allocate_large_object+0x64
     0e 00000031`8837dc20 00007fff`0b54c9e5     coreclr!SVR::GCHeap::Alloc+0xdddac
     0f (Inline Function) --------`--------     coreclr!Alloc+0x12f
     10 00000031`8837dc70 00007fff`0b5fd302     coreclr!AllocateString+0x195
     11 00000031`8837dd20 00007ffe`ac4d9a34     coreclr!FramedAllocateString+0x82
     12 00000031`8837de70 00007ffe`ac4d480b     0x00007ffe`ac4d9a34
     ...
    
     0:052> !clrstack
     OS Thread Id: 0x56b8 (52)
             Child SP               IP Call Site
     000000318837DD78 00007fff9c80c034 [HelperMethodFrame: 000000318837dd78]
     000000318837DE70 00007ffeac4d9a34 System.String.Concat(System.String[])
     000000318837DEE0 00007ffeac4d480b BuggyBits.Controllers.ProductsController.Index()
     000000318837DFE0 00007ffeac1e20fd SOS Warning: Loading symbols for dynamic assemblies is not yet supported
     DynamicClass.lambda_method(System.Runtime.CompilerServices.Closure, System.Object, System.Object[])
    
    • Why was a GC triggered?

      We were trying to allocate a large object (gc_heap::allocate_large_object) and in doing so we needed more space and that triggered the garbage collection.

    • What type of object was it allocating?

      Based on the managed stack it looks like we are allocating a string.

    • What is the thread doing?

      Concatenating strings in the ProductsController.

    • Could this be causing the bad Gen 0, Gen 1, Gen 2 ratio? Why?

      Concatenating large strings is a well-known cause of high CPU in GC. The reason for this is explained in a lot more detail in this article so I won’t go through it in depth here, but basically what it comes down to is that when you do something like str = str + str2 in a loop you generate new strings rather than expanding str, because strings are immutable. If the loop goes on for a while and the strings are big, you constantly create new strings where each one is just slightly bigger than the previous one. This causes the LOH segments to be exhausted frequently, which in turn causes a lot of full GCs and high CPU in GC.

  2. Find out what is on the large object heap

    Note: if you are unfortunate you may have caught the GC in the plan or relocate phase, in which case the !dumpheap output may not be correct. If that is the case, try to make do with what you get from the stacks and !dumpheap -min 85000.

     !dumpheap -min 0n85000
    
     0:052> !dumpheap -type System.String -min 0n85000
             Address               MT     Size
     000001d8911938a8 00007ffeabba1e18   495176
     000001d89120c710 00007ffeabba1e18   495322
     000001d891285610 00007ffeabba1e18   539122
     000001d891309028 00007ffeabba1e18   539560
     000001d891ad0230 00007ffeabba1e18   514448
     000001d891b4dbe0 00007ffeabba1e18   537662
     000001d891bd1040 00007ffeabba1e18   514594
     000001d891c4ea88 00007ffeabba1e18   538246
     000001d8a11d10a8 00007ffeabba1e18   578688
     000001d8a125e548 00007ffeabba1e18   538830
     000001d8a12e1e38 00007ffeabba1e18   515032
     000001d8a135fa30 00007ffeabba1e18   515178
     000001d8a13dd6c0 00007ffeabba1e18   515324
     000001d8a145b3e0 00007ffeabba1e18   515470
    
    • What is on the large object heap?

      Mostly strings

    • Is there any pattern in the sizes of the strings?

      They are not all exactly the same size, but very similar - around 500-580 KB each, and several consecutive ones differ by exactly 146 bytes (73 characters, roughly one table row), which fits a string that is rebuilt slightly larger for every product added

  3. Dump out a few of the strings to see the contents

     !do <address of string>
    
     0:052> !DumpObj /d 000001d901a83c18
     Name:        System.String
     MethodTable: 00007ffeabba1e18
     EEClass:     00007ffeabb92150
     Size:        538100(0x835f4) bytes
     File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.11\System.Private.CoreLib.dll
     String:
     Fields:
                 MT    Field   Offset                 Type VT     Attr            Value Name
     00007ffeabaeb1f0  4000242        8         System.Int32  1 instance           269039 _stringLength
     00007ffeabae8008  4000243        c          System.Char  1 instance               3c _firstChar
     00007ffeabba1e18  4000244      110        System.String  0   static 000001d591191360 Empty
    
     0:052> du 000001d901a83c18+0xc
     000001d9`01a83c24  "<tr><th>Product Name</th><th>Des"
     000001d9`01a83c64  "cription</th><th>Price</th></tr>"
     000001d9`01a83ca4  "<tr><td>Product0</td><td>Descrip"
     000001d9`01a83ce4  "tion for product0</td><td>$100</"
     000001d9`01a83d24  "td><tr><td>Product1</td><td>Desc"
     000001d9`01a83d64  "ription for product1</td><td>$10"
     000001d9`01a83da4  "0</td><tr><td>Product2</td><td>D"
     000001d9`01a83de4  "escription for product2</td><td>"
     000001d9`01a83e24  "$100</td><tr><td>Product3</td><t"
     000001d9`01a83e64  "d>Description for product3</td><"
     000001d9`01a83ea4  "td>$100</td><tr><td>Product4</td"
     000001d9`01a83ee4  "><td>Description for product4</t"
    
    • Based on the stacks, the knowledge of the collection ratios and the string contents, can you determine what went wrong? You might want to do some internet searches on some of the calls on the stacks if you don’t know the answer immediately.

      Looks like the ProductsController is building large HTML tables through string concatenation, which is what is causing the high CPU in GC
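
      Putting the pieces together - String.Concat on the stack under ProductsController.Index, the 1:1:1 collection ratio, and LOH strings that all contain the same HTML table and grow by roughly one row at a time - the code presumably looks something like the sketch below. To be clear, this is my guess at the shape of the code rather than the actual lab source; the method name and row format are made up for illustration.

        // Hypothetical reconstruction of the kind of code that produces these symptoms:
        // an HTML table built with repeated string concatenation. Because strings are
        // immutable, every += allocates a brand new, slightly larger copy; once the
        // string passes ~85,000 bytes each copy lands on the LOH, the LOH segments get
        // exhausted over and over, and full GCs dominate.
        static string BuildProductTable(int productCount)
        {
            string tableHtml = "<tr><th>Product Name</th><th>Description</th><th>Price</th></tr>";
            for (int i = 0; i < productCount; i++)
            {
                tableHtml += $"<tr><td>Product{i}</td><td>Description for product{i}</td><td>$100</td>";
            }
            return tableHtml;
        }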

Verify your assumption in the code and modify the code to resolve the issue

  1. Look at the code for the function you identified

    • What line of code is causing the problem?
    • How do you resolve it? Why does this resolve the problem?

      Use a StringBuilder and append the data rather than concatenating strings. The reason this resolves the issue is that the StringBuilder has a buffer that it fills up as you append, so most appends just write into the existing buffer instead of creating a new string every time. When it does need to expand, it roughly doubles the buffer, so the number of new allocations the app has to make on the LOH is decreased by a factor of 100 or 1000 in this case. If you change it to use a StringBuilder instead you will see what I mean… the CPU usage drops, the execution time is vastly different, and if you look at the number of GCs that occur you will notice that instead of around 5000 or so garbage collections we are talking about less than 10 full GCs.
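
      For comparison, here is the hypothetical table-building method from the previous sketch rewritten with a StringBuilder - again an illustration of the idea rather than the exact fix in the lab code:

        using System.Text;

        // Append writes into an existing buffer and only occasionally allocates a
        // larger one, so instead of one new (and ever larger) LOH string per row we
        // end up with a handful of buffer growths for the whole table.
        static string BuildProductTable(int productCount)
        {
            var tableHtml = new StringBuilder();
            tableHtml.Append("<tr><th>Product Name</th><th>Description</th><th>Price</th></tr>");
            for (int i = 0; i < productCount; i++)
            {
                tableHtml.Append($"<tr><td>Product{i}</td><td>Description for product{i}</td><td>$100</td>");
            }
            return tableHtml.ToString();
        }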

Laters, Tess