This article was first published on my WeChat official account (Yuebanfeiyu) and then synced to my personal site: xiaoflyfish.cn/

If you like it, I'll share more articles later!

If you found this helpful, a like or a share would be much appreciated. Thank you!

  • WeChat search: Yuebanfeiyu — make a friend and join the interview discussion group

Preface

While investigating a slow interface at work, I found a function that was taking a long time, and it eventually turned out the culprit was List deduplication.

Because the data volume was small early on, both in the test environment and in production, this interface's performance problem drew little attention; only later, when timeouts became frequent, did it finally surface.

The “Alibaba Java Development Manual” contains this description:

You see, the Alibaba authors have already summed this up for us for free, yet you will still see people using List's contains method for lookups……

If you can't remember that, write it out 100,000 times as punishment.

If you need the book as an online resource, message me privately and I can send it to you.

Today, walking through the source code, let's look at how Set guarantees the uniqueness of its elements, and why the performance gap between the two deduplication approaches is so large.

HashSet source code

First look at the class comment:

From the class comment, we can glean the following information:

  • The underlying implementation is based on HashMap, so iteration order is not guaranteed to match insertion order, or any other particular order;
  • The time taken by add, remove, contains, size, and similar methods does not grow with data volume, which comes down to the array structure at the bottom of HashMap: ignoring hash collisions, the time complexity is O(1) no matter how large the data set is;
  • It is not thread-safe; if you need thread safety, lock externally or use Collections.synchronizedSet;
  • Iteration is fail-fast: if the underlying structure is modified during iteration, a ConcurrentModificationException is thrown.
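The fail-fast behavior in the last point is easy to reproduce. A minimal sketch (the class and method names are my own, for illustration only):

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;

public class FailFastDemo {
    // Structurally modifying the set mid-iteration makes the iterator's
    // modCount check fail on the next call to next().
    public static boolean modificationFailsFast() {
        Set<String> set = new HashSet<>();
        set.add("a");
        set.add("b");
        set.add("c");
        try {
            for (String s : set) {
                set.remove(s); // structural modification during iteration
            }
        } catch (ConcurrentModificationException e) {
            return true; // fail-fast kicked in
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(modificationFailsFast()); // prints "true"
    }
}
```

To remove elements safely while iterating, use Iterator.remove instead of modifying the set directly.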

The implementation of HashSet is based on HashMap. In Java, there are generally two ways to implement a new class on top of a base class:

  • Inherit the base class and override its methods, for example extending HashMap and overriding methods such as put;
  • Compose the base class, reusing its capabilities by holding an instance and delegating calls to its methods.

HashSet composes a HashMap, which has the following advantages:

Inheritance implies that the subclass is the same kind of thing as the parent class, but Set and Map are meant to express two different things, so inheritance is inappropriate here. Moreover, Java's syntax only allows a class to inherit from a single parent, which makes later extension difficult.

Composition is more flexible: you can freely combine existing base classes, extend and arrange functionality on top of their methods, and name your methods whatever you like rather than having to match the base class's method names.
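As a toy illustration of the composition approach (TinySet is a hypothetical name, not part of the JDK), here is a minimal set that delegates to a HashMap in the same way HashSet does:

```java
import java.util.HashMap;

// Hypothetical example: a set that composes a HashMap rather than
// inheriting from it, mirroring what java.util.HashSet does internally.
public class TinySet<E> {
    // Dummy value shared by every key in the backing map
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    // Delegates to HashMap.put; returns true only if e was not already present
    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    public boolean contains(E e) {
        return map.containsKey(e);
    }

    public int size() {
        return map.size();
    }
}
```

Note that TinySet exposes a one-argument add and hides the map entirely; the caller never sees that a key-value structure is doing the work.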

Composing a HashMap means holding it as a field of your own class. Here is how HashSet composes it:

// The map's keys are the HashSet's elements; the values are all the PRESENT object below
private transient HashMap<E,Object> map;

// Dummy value shared by every entry in the backing HashMap
private static final Object PRESENT = new Object();

From these two lines of code, we can see the following:

When we call HashSet's add method, we pass only one argument, but the composed map's put method takes two: a key and a value. The map's key is our add argument, and the value is the PRESENT constant declared above. The design here is very clever: every value in the map is simply the shared placeholder PRESENT.

Let’s look at the add method again:

public boolean add(E e) {
    // Delegate directly to HashMap's put; the add succeeds only if the key was absent
    return map.put(e, PRESENT) == null;
}

Now let's drill into the underlying source, java.util.HashMap#put:

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

Now look at the hash method:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

You can see that if the key is null, the hash is 0; otherwise the key's hashCode is XORed with itself shifted right by 16 bits, mixing the high bits into the low bits of the final hash value.
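To see the effect of that spread, here is a small sketch using the same function (copied from HashMap's source; the class name HashSpread is my own). Two Integer keys whose hashCodes differ only in the high 16 bits would land in the same bucket of a 16-slot table without the spread; with it, they land in different buckets:

```java
public class HashSpread {
    // Same spreading function as java.util.HashMap#hash
    public static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16; // a small table size; the bucket index is (n - 1) & hash
        // Integer.hashCode() is simply the int value itself
        System.out.println((n - 1) & 0x10000);       // 0: raw hashCodes collide
        System.out.println((n - 1) & 0x20000);       // 0: same bucket
        System.out.println((n - 1) & hash(0x10000)); // 1: spread hashes differ
        System.out.println((n - 1) & hash(0x20000)); // 2: different bucket
    }
}
```

Since the bucket index only ever uses the low bits, folding the high bits down reduces collisions for keys with poor low-bit entropy.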

Now let's step into java.util.HashMap#putVal:

In java.util.HashMap#putVal, (n - 1) & hash directly computes the element's slot in the node array. If that slot is empty, a new node is constructed and stored there. If the slot is occupied, the condition

p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k)))

determines whether the elements are equal.

If they are equal, the old value is replaced with the new one; otherwise, a new node is appended to the linked list or red-black tree.

Conclusion: The uniqueness of HashSet elements is guaranteed by the uniqueness of HashMap keys.
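This conclusion can be seen directly from the return values (a quick sketch; UniquenessDemo is my own name for illustration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

public class UniquenessDemo {
    // HashSet.add returns false for a duplicate because the backing
    // HashMap.put returns a non-null old value for an existing key.
    public static boolean[] addTwice(String value) {
        Set<String> set = new HashSet<>();
        return new boolean[] { set.add(value), set.add(value) };
    }

    // HashMap.put on an existing key replaces the value and returns the old one.
    public static Integer[] putTwice() {
        HashMap<String, Integer> map = new HashMap<>();
        Integer first = map.put("k", 1);  // null: key was absent
        Integer second = map.put("k", 2); // 1: old value returned, now replaced
        return new Integer[] { first, second, map.get("k") };
    }
}
```

The second add is a no-op: the key already exists, put returns PRESENT (non-null), and add reports false.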

One last look back:

The “Alibaba Java Development Manual” contains this description:

By now, do you understand the reasoning behind points 2 and 3?

The performance comparison

In fact, the core of the performance difference between HashSet and ArrayList deduplication comes down to the performance of their contains methods.

Let's look at the implementations of java.util.HashSet#contains and java.util.ArrayList#contains in turn.

The source of java.util.HashSet#contains:

public boolean contains(Object o) {
    return map.containsKey(o);
}

Ultimately, this too is delegated to HashMap.

If hash collisions are not severe (and in most cases they are rare), then checking and inserting n elements one by one into a Set takes close to O(n) overall, and a single lookup is O(1).

The source of java.util.ArrayList#contains:

public boolean contains(Object o) {
    return indexOf(o) >= 0;
}

public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i] == null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i]))
                return i;
    }
    return -1;
}

The core logic: if the argument is null, traverse the whole list looking for a null element; otherwise, traverse the whole list, testing each element for equality with o via equals, and return the index of the first match.

So, checking and inserting n elements one by one into a List this way takes close to O(n^2) overall, and a single lookup is O(n).

Compared in terms of time complexity, then, the performance gap is self-evident.
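To make the contrast concrete, here is a sketch of both deduplication strategies (DedupCompare is my own name; both produce the same result, only the amount of work differs):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupCompare {
    // O(n^2) overall: every contains() call scans the output list linearly
    public static List<Integer> dedupWithList(List<Integer> input) {
        List<Integer> out = new ArrayList<>();
        for (Integer x : input) {
            if (!out.contains(x)) { // linear scan on each element
                out.add(x);
            }
        }
        return out;
    }

    // Close to O(n) overall: each add is a hash lookup, O(1) barring collisions.
    // LinkedHashSet keeps insertion order, matching the list-based result.
    public static List<Integer> dedupWithSet(List<Integer> input) {
        Set<Integer> seen = new LinkedHashSet<>(input);
        return new ArrayList<>(seen);
    }
}
```

With a few hundred elements the difference is invisible; with hundreds of thousands, the list version's quadratic scanning dominates the interface's response time.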

We plotted the two time-complexity functions; the difference in their growth rates is stark:

If the data volume is small, deduplicating with a List is acceptable; but once the data volume grows, interface response times become unbearably slow, and can even cause large numbers of threads to block and the service to fail.

Finally, your likes, shares, and retweets are what motivate me to keep sharing. Thank you!