A: background

1. Tell a story

Since this pure memory project entered a big client, I am now very sensitive to memory and CPU, running a few gigabytes of data memory, especially insecure, always want to use Windbg to catch a few dump to see which part is caused by it, is it my code or my colleague’s code? Many old friends who have seen my blog always leave a message to let me out of a winDBG series or video, I will not ah, there is no way, people in rivers and rivers, sooner or later have to get a few knives, forced will also have a few showy 😄😄😄, nonsense not to say, this one will look at how HashSet expansion.

Two: Capacity expansion mechanism of HashSet

1. How to check

The best way to understand how to expand is to look at the underlying HashSet source code, the crudest entry point being the hashset.add method.

Int Prime = hashhelpers.getPrime (capacity); int prime = hashhelpers.getPrime (capacity); Get a prime number. Get a prime number. Simply put, the number that can only be divisible by 1 and itself is called prime number, and that curiosity comes to see how prime number is calculated together! Screenshot again.

In other words, when the number of elements is greater than 719W, the IsPrime method can only be used to dynamically calculate prime numbers.

public static bool IsPrime(int candidate) { if ((candidate & 1) ! = 0) { int num = (int)Math.Sqrt(candidate); for (int i = 3; i <= num; i += 2) { if (candidate % i == 0) { return false; } } return true; } return candidate == 2; }Copy the code

After reading the whole process, I think you should understand that when you Add for the first time, the default space occupation is 3, the smallest of 72 predefined prime numbers. Friends who have read my previous article know that the default size of List is 4, followed by a simple and rude * 2 processing, as shown in the following code.

private void EnsureCapacity(int min) { if (_items.Length < min) { int num = (_items.Length == 0) ? 4 : (_items.Length * 2); }}Copy the code

2. Secondary expansion exploration of HashSet

When you get to three hashsets, it’s obvious that you’re going to have to do a second capacity expansion, unlike a List that you can just do with an Enrecapacity method and see how you can do that.


public static int ExpandPrime(int oldSize)
{
    int num = 2 * oldSize;
    if ((uint)num > 2146435069u && 2146435069 > oldSize)
    {
        return 2146435069;
    }
    return GetPrime(num);
}
Copy the code

The final expansion is done in the ExpandPrime method. The process is to first multiply by 2, then take the nearest prime number (7), and use 7 as the new Size of the HashSet. If you must see the demo, I’ll write a little code to prove it.

2. Do you smell risk?

<1> Time risk

For the sake of demonstration, I have shown the 72 predefined last few prime numbers.

public static readonly int[] primes = new int[72] { 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};Copy the code

In other words, when the number of elements in HashSet is 2893249, the trigger expansion becomes 2893249 * 2 => 5786498. 5999471, namely 289W suddenly increased to 599W, it is 599W-289W = 310W at a stroke of space empty, but this increased twice more oh, scary not? So let’s write a code to verify that.

static void Main(string[] args) { var hashSet = new HashSet<int>(Enumerable.Range(0, 2893249)); hashSet.Add(int.MaxValue); Console.Read(); } 0:00 0 >! clrstack -l 000000B8F4DBE500 00007ffaf00132ae ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 16] LOCALS: 0x000000B8F4DBE538 = 0x0000020e0b8fcc08 0:000> ! DumpObj /d 0000020e0b8fcc08 Name: System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]] Size: 64(0x40) bytes File: C: \ Program Files\dotnet\shared\Microsoft.NET Core App \ 5.0.0 - preview 5.20278.1 \ System. Collections. DLL Fields: MT Field Offset Type VT Attr Value Name 00007ffaf0096d10 4000017 8 System.Int32[] 0 instance 0000020e2025e9f8 _buckets 00007ffaf00f7ad0 4000018 10 ... ivate.CoreLib]][] 0 instance 0000020e2bea1020 _slots 00007ffaeffdf828 4000019 28 System.Int32 1 instance 2893250 _count 0:00 0 >! DumpObj /d 0000020e2025e9f8 Name: System.Int32[] Size: 23997908(0x16e2dd4) bytes Array: Rank 1, Number of elements 5999471, Type Int32 (Print Array) Fields: NoneCopy the code

And most importantly, it’s a one-time expansion, rather than a gradual expansion like the one implemented in Redis, and the time cost is also worth noting.

<2> Spatial risk

What are the risks? Take a look at the size of the two hashsets, 289W and 599w, which I’m most sensitive to.

static void Main(string[] args) { var hashSet1 = new HashSet<int>(Enumerable.Range(0, 2893249)); var hashSet2 = new HashSet<int>(Enumerable.Range(0, 2893249)); hashSet2.Add(int.MaxValue); Console.Read(); } 0:00 0 >! clrstack -l OS Thread Id: 0x4a44 (0) 000000B1B4FEE460 00007ffaf00032ea ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 18] LOCALS: 0x000000B1B4FEE4B8 = 0x000001d13363cc08 0x000000B1B4FEE4B0 = 0x000001d13363d648 0:000> ! objsize 0x000001d13363cc08 sizeof(000001D13363CC08) = 46292104 (0x2c25c88) bytes (System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]]) 0:000> ! objsize 0x000001d13363d648 sizeof(000001D13363D648) = 95991656 (0x5b8b768) bytes (System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]])Copy the code

HashSet1:46292104/1024/1024 = 44.1M 95991656/1024/1024 = 91.5m, wasted: 91.5-44.1 = 47.4m.

If you think that only 47.4MB is being wasted, you are wrong. Remember that the old size is overwritten by the new size set in the underlying capacity expansion, and this old size set will always occupy space on the heap before the GC is collected. The diagram below:

Int [] m_buckets = Slot[] m_slots = int[] m_buckets

static void Main(string[] args) { var hashSet2 = new HashSet<int>(Enumerable.Range(0, 2893249)); hashSet2.Add(int.MaxValue); Console.Read(); 1 >} 0:01! dumpheap -stat 00007ffaf84f7ad0 3 123455868 System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]Copy the code

Slot[] = Slot[]; Slot[] = Slot[]; Slot[] = Slot[];

1 > 0:01! DumpHeap /d -mt 00007ffaf84f7ad0 Address MT Size 0000016c91308048 00007ffaf84f7ad0 16743180 0000016c928524b0 00007ffaf84f7ad0 34719012 0000016ce9e61020 00007ffaf84f7ad0 71993676 0:011> ! gcroot 0000016c91308048 Found 0 unique roots (run '! gcroot -all' to see all roots). 0:011> ! gcroot 0000016c928524b0 Found 0 unique roots (run '! gcroot -all' to see all roots). 0:011> ! gcroot 0000016ce9e61020 Thread 2b0c: 0000006AFAB7E5F0 00007FFAF84132AE ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 15] rbp-18: 0000006afab7e618 -> 0000016C8000CC08 System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]] -> 0000016CE9E61020 System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]Copy the code

As can be seen from the above, I used gcroot to find the reference root of these three addresses, but two of them were missing. The last one had the new size of 599W, right? Do prints the values of the three addresses.

1 > 0:01! do 0000016c91308048 Name: System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][] Size: 16743180(0xff7b0c) bytes Array: Rank 1, Number of elements 1395263, Type VALUETYPE (Print Array) Fields: None 0:011> ! do 0000016c928524b0 Name: System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][] Size: 34719012(0x211c524) bytes Array: Rank 1, Number of elements 2893249, Type VALUETYPE (Print Array) Fields: None 0:011> ! do 0000016ce9e61020 Name: System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][] Size: 71993676(0x44a894c) bytes Array: Rank 1, Number of elements 5999471, Type VALUETYPE (Print Array) Fields: NoneCopy the code

(Rank 1, Number of elements) (Rank 1, Number of elements) 1395263, so in this case: the total size of managed heap is approximately: 23.7m + 47.4m + 91.5m = 162.6m, I go, not simply put… In other words, there is 162.6-91.5 =71.1M of uncollected garbage on the managed heap ➕ 47.4M of empty space just now, the total waste is :118.5M, I hope I did my math correctly…

3. Is there a solution?

Capacity can be used to control the Size of a List. Unfortunately, there is no such solution in HashSet, except for a clunky clipping method called TrimExcess, which extends the current Size to the nearest prime value, as shown in the following code:

public void TrimExcess() { int prime = HashHelpers.GetPrime(m_count); Slot[] array = new Slot[prime]; int[] array2 = new int[prime]; int num = 0; for (int i = 0; i < m_lastIndex; i++) { if (m_slots[i].hashCode >= 0) { array[num] = m_slots[i]; int num2 = array[num].hashCode % prime; array[num].next = array2[num2] - 1; array2[num2] = num + 1; num++; }}}Copy the code

The application in this case is to limit 289W to 347W, and still occupy 58W space. The diagram below:

Three:

Struct Slot {int hashCode; struct Slot {int hashCode; internal int next; internal T value; }, so use it with caution after you understand the principle.