Before getting into hashCode, I’ll pose a few questions that I’ll answer later

  1. What is a hashCode?
  2. What does hashCode do?
  3. Why do hash conflicts occur, and can they be avoided?
  4. Why does overriding equals() require overriding hashCode()?
  5. How is hashCode generated in Java?

HashCode () : java8 Object: hashCode() : java8 Object: hashCode(

    /**
     * Returns a hash code value for the object. This method is
     * supported for the benefit of hash tables such as those provided by
     * {@link java.util.HashMap}.
     * <p>
     * The general contract of {@code hashCode} is:
     * <ul>
     * <li>Whenever it is invoked on the same object more than once during
     *     an execution of a Java application, the {@code hashCode} method
     *     must consistently return the same integer, provided no information
     *     used in {@code equals} comparisons on the object is modified.
     *     This integer need not remain consistent from one execution of an
     *     application to another execution of the same application.
     * <li>If two objects are equal according to the {@code equals(Object)}
     *     method, then calling the {@code hashCode} method on each of
     *     the two objects must produce the same integer result.
     * <li>It is <em>not</em> required that if two objects are unequal
     *     according to the {@link java.lang.Object#equals(java.lang.Object)}
     *     method, then calling the {@code hashCode} method on each of the
     *     two objects must produce distinct integer results.  However, the
     *     programmer should be aware that producing distinct integer results
     *     for unequal objects may improve the performance of hash tables.
     * </ul>
     */
Copy the code

I can roughly sum up his meaning in the following points

  1. Each object returns a hashCode, which is used to speed up the hash table
  2. The return of hashCode() should and should not change if the equals() comparison has not changed when the same Java application is called more than once on the same object during execution. (In particular, different Java programs can allow hashCode to change within the same Java program.)
  3. According to equals(), if two objects are equal, hashCode must also be equal
  4. If two objects are not equal according to equals(), hashCode may be equal, but make them as different as possible to improve hash table performance

What is a hashCode?

HashCode is used to speed up hash tables. In the same Java application, hashCode is a signature of a Java object, and equal objects must be equal hashCode.

What does hashCode do?

It’s used to speed up the hash table, as I mentioned, in space for time.

This blog post is not about hashing tables, so I won’t expand it here

The idea is to use hashcode to find the corresponding hash bucket in the hash table, and then call equals() to determine whether the elements mounted in the hash bucket are equal.

Why do equal elements appear in the same hash bucket?

As mentioned earlier, hashcode objects that are equal must be equal, but hashcode objects that are equal are not necessarily equal, so you need to call equals() again

Why do hash conflicts occur, and can they be avoided?

Let’s take a look at the method signature of hashCode in Java

public native int hashCode(a);
Copy the code

It can be seen that hashCode is essentially an int. The range of data represented by int is limited, while the object is infinite. When an infinite set is compressed into a finite set, hash conflict becomes an inevitable event, and hash conflict is inevitable.

Of course, in our application, objects cannot be infinite, so choosing a good hash algorithm can reduce the probability of hash conflicts

Why does overriding equals() require overriding hashCode()?

This issue has actually been mentioned several times before. According to the Java specification, hashCode() must be equal to equals()

Overwriting equals() invariably results in a change in whether objects are equal, so the hashCode() method needs to be modified

If you change equals() without changing hashCode(), the hash bucket cannot be accurately located when using the hash table, causing the hash table to work improperly

How is hashCode generated in Java?

This is a very complicated problem, so let me extract some comments from java8 Object

    /**
     * As much as is reasonably practical, the hashCode method defined by
     * class {@code Object} does return distinct integers for distinct
     * objects. (This is typically implemented by converting the internal
     * address of the object into an integer, but this implementation
     * technique is not required by the
     * Java&trade; programming language.)
     */
Copy the code

Most of the time hashCode is a map of an object’s memory address, but in Java it is not.

So how is hashCode generated in Java?

In fact, Java’s default HashCode is generated by a set of random algorithms that depend only on the order and thread in which objects are generated.

So let me prove it in code

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            Object o = new Object();
            System.out.println("hashCode:" + o.hashCode());
            System.out.println("address:"+ VM.current().addressOf(o)); }}Copy the code

The Unsafe class was straightforward to retrieve the memory address, but it was a bit cumbersome, so I used a packaged toolkit

        <dependency>
            <groupId>org.openjdk.jol</groupId>
            <artifactId>jol-core</artifactId>
            <version>0.10</version>
        </dependency>
Copy the code

Repeated execution of the above code results in the following

For the first time, The second time The third time
hashCode:531885035 address:31856529176 hashCode:705265961 address:31874395832 hashCode:428746855 address:31874396520 hashCode:317983781 address:31874397208 hashCode:987405879 address:31874397896 hashCode:531885035 address:31856529344 hashCode:705265961 address:31874395856 hashCode:428746855 address:31874396544 hashCode:317983781 address:31874397232 hashCode:987405879 address:31874397920 hashCode:531885035 address:31856529920 hashCode:705265961 address:31874407200 hashCode:428746855 address:31874407888 hashCode:317983781 address:31874408576 hashCode:987405879 address:31874409264

We can clearly see that the same order of hashCode is the same, and the memory address is not the same, so we can be sure that Java hashCode and memory are independent, and seem to be related to the creation order.

Next we use code to see what happens when a hash conflict occurs

    public static void main(String[] args) {
        HashSet<Integer> set = new HashSet<Integer>();
        int count = 0, num = 0;
        while (true) {
            count++;
            Object o = new Object();
            if (set.contains(o.hashCode())) {
                num++;
                System.out.println("count:" + count);
                System.out.println("hashCode:" + o.hashCode());
                if (num == 5) break; } set.add(o.hashCode()); }}Copy the code

Repeat the above code

For the first time, The second time The third time
count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992 count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992 count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992

We can see that the location of the hash collision is the same, and the hashCode is the same when the hash collision occurs, so we can conclude that the generation of hashCode is related to the order in which the object is generated.

Next we repeat the above operation with different threads

    public static void main(String[] args) throws InterruptedException {
        new Thread(() ->{
            HashSet<Integer> set = new HashSet<Integer>();
            int count = 0, num = 0;
            while (true) {
                count++;
                Object o = new Object();
                if (set.contains(o.hashCode())) {
                    num++;
                    System.out.println("count:" + count);
                    System.out.println("hashCode:" + o.hashCode());
                    if (num == 5) break;
                }
                set.add(o.hashCode());
            }
        }).start();
        Thread.sleep(1000L);
    }
Copy the code

The results are as follows

For the first time, The second time The third time
count:53869 hashCode:820543451 count:87947 hashCode:951498229 count:99859 hashCode:175788428 count:100940 hashCode:1979813773 count:123438 hashCode:916565907 count:51579 hashCode:1573800482 count:124111 hashCode:1834341042 count:139761 hashCode:2021276643 count:142276 hashCode:1807321828 count:143515 hashCode:1522017140 count:69186 hashCode:742597489 count:102967 hashCode:2084692455 count:144290 hashCode:424089733 count:176112 hashCode:53258565 count:178639 hashCode:1046986547

There seems to be no pattern, which requires to dig Java source code to solve this phenomenon

Object source code address

In Object we can see that hashCode actually calls IHashCode

static JNINativeMethod methods[] = {
    {"hashCode"."()I",                    (void *)&JVM_IHashCode},
    {"wait"."(J)V",                   (void *)&JVM_MonitorWait},
    {"notify"."()V",                    (void *)&JVM_MonitorNotify},
    {"notifyAll"."()V",                    (void *)&JVM_MonitorNotifyAll},
    {"clone"."()Ljava/lang/Object;",   (void *)&JVM_Clone},
};
Copy the code

It then finds IHashCode in jvm.cpp and actually calls FastHashCode

JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
  JVMWrapper("JVM_IHashCode");
  // as implemented in the classic virtual machine; return 0 if object is NULL
  return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
JVM_END
Copy the code

Then steak FastHashCode FastHashCode in synchronizer. CPP

intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
  if (UseBiasedLocking) {
    // NOTE: many places throughout the JVM do not expect a safepoint
    // to be taken here, in particular most operations on perm gen
    // objects. However, we only ever bias Java instances and all of
    // the call sites of identity_hash that might revoke biases have
    // been checked to make sure they can handle a safepoint. The
    // added check of the bias pattern is to avoid useless calls to
    // thread-local storage.
    if (obj->mark()->has_bias_pattern()) {
      // Box and unbox the raw reference just in case we cause a STW safepoint.
      Handle hobj (Self, obj) ;
      // Relaxing assertion for bug 6320749.assert (Universe::verify_in_progress() || ! SafepointSynchronize::is_at_safepoint(),"biases should not be seen by VM thread here");
      BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current()); obj = hobj() ; assert(! obj->mark()->has_bias_pattern(),"biases should be revoked by now"); }}// hashCode() is a heap mutator ...
  // Relaxing assertion for bug 6320749.assert (Universe::verify_in_progress() || ! SafepointSynchronize::is_at_safepoint(),"invariant"); assert (Universe::verify_in_progress() || Self->is_Java_thread() ,"invariant") ;
  assert (Universe::verify_in_progress() ||
         ((JavaThread *)Self)->thread_state() != _thread_blocked, "invariant"); ObjectMonitor* monitor =NULL;
  markOop temp, test;
  intptr_t hash;
  markOop mark = ReadStableMark (obj);

  // object should remain ineligible for biased lockingassert (! mark->has_bias_pattern(),"invariant");// Mark is the object header
  if (mark->is_neutral()) {
    hash = mark->hash();              // Retrieve the hash value.
    if (hash) {                       // if it has hash, just return it
      return hash;
    }
    hash = get_next_hash(Self, obj);  // This is the core method for generating HashCode
    temp = mark->copy_set_hash(hash); // merge the hash code into header
    // use (machine word version) atomic operation to install the hash
    test = (markOop) Atomic::cmpxchg_ptr(temp, obj->mark_addr(), mark);
    if (test == mark) {
      return hash;
    }
    // If atomic operation failed, we must inflate the header
    // into heavy weight monitor. We could add more code here
    // for fast path, but it does not worth the complexity.
  } else if (mark->has_monitor()) {
    monitor = mark->monitor();
    temp = monitor->header();
    assert (temp->is_neutral(), "invariant"); hash = temp->hash();if (hash) {
      return hash;
    }
    // Skip to the following code to reduce code size
  } else if (Self->is_lock_owned((address)mark->locker())) {
    temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned
    assert (temp->is_neutral(), "invariant"); hash = temp->hash();// by current thread, check if the displaced
    if (hash) {                       // header contains hash code
      return hash;
    }
    // WARNING:
    // The displaced header is strictly immutable.
    // It can NOT be changed in ANY cases. So we have
    // to inflate the header into heavyweight monitor
    // even the current thread owns the lock. The reason
    // is the BasicLock (stack slot) will be asynchronously
    // read by other threads during the inflate() function.
    // Any change to stack may not propagate to other threads
    // correctly.
  }

  // Inflate the monitor to set hash code
  monitor = ObjectSynchronizer::inflate(Self, obj);
  // Load displaced header and check it has hash code
  mark = monitor->header();
  assert (mark->is_neutral(), "invariant"); hash = mark->hash();if (hash == 0) {
    hash = get_next_hash(Self, obj);
    temp = mark->copy_set_hash(hash); // merge hash code into header
    assert (temp->is_neutral(), "invariant"); test = (markOop) Atomic::cmpxchg_ptr(temp, monitor, mark);if(test ! = mark) {// The only update to the header in the monitor (outside GC)
      // is install the hash code. If someone add new usage of
      // displaced header, please update this code
      hash = test->hash();
      assert (test->is_neutral(), "invariant") ;
      assert (hash != 0."Trivial unexpected object/monitor header usage."); }}// We finally get the hash
  return hash;
}
Copy the code

get_next_hash

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  // Get a random one
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
  // Calculate one based on the memory address
  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations. This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
  } else
  / / returns 1
  if (hashCode == 2) {
     value = 1 ;            // for sensitivity testing
  } else
  // What do you mean by this
  if (hashCode == 3) {
     value = ++GVars.hcSequence ;
  } else
  // Return the memory address directly
  if (hashCode == 4) {
     value = cast_from_oop<intptr_t>(obj) ;
  }
  // This is an algorithm for generating random numbers
   else {
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ;
     t ^= (t << 11); Self->_hashStateX = Self->_hashStateY ; Self->_hashStateY = Self->_hashStateZ ; Self->_hashStateZ = Self->_hashStateW ;unsigned v = Self->_hashStateW ;
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)); Self->_hashStateW = v ; value = v ; } value &= markOopDesc::hash_mask;if (value == 0) value = 0xBAD; assert (value ! = markOopDesc::no_hash,"invariant"); TEVENT (hashCode: GENERATE) ;return value;
}
Copy the code

Here we basically know how the hashcode is generated, in jdk1.8 is actually in the fifth generation of way, we can under the Linux system input: Java – XX: + PrintFlagsFinal – version | grep hashcode command to see

Ok, let’s take a look at the fifth way. My first thought when I saw this code was “what the hell is _hashStateX?” Let’s see how thead.cpp defines it:

// thread-specific hashCode stream generator state - Marsaglia shift-xor form
  _hashStateX = os::random() ;
  _hashStateY = 842502087 ;
  _hashStateZ = 0x8767 ;    // (int)(3579807591LL & 0xffff) ;
  _hashStateW = 273326509 ;
Copy the code

Thread defines a random number and three constants, which are used to generate hashcode according to the algorithm above.

For details, please refer to Xorshift RNGs

Because the above algorithm needs to obtain a random number in the thread as an initial value, the above algorithm has a causal relationship before and after, and the later result is calculated according to the previous result. Therefore, for the same thread, the generation of object hashcode is sequentially related in a sense.

Why is the generation of hashCode in the main function always in order? Because there are fixed values in the main thread.

You can see that hashCode in 1.8 is independent of the memory address, but is dependent on the thread (or random number generated) and the order in which it is generated