BitSet

The BitSet class implements a bit-vector that grows on demand, which is actually a Vector of “bits”. Each bit is a Boolean value representing true or false. If we want to store data efficiently such that there are only two types of data, we can use bitsets.

First of all, it needs to be explained that BitSet does not belong to the Set framework, and there is no List, Map or Set interface. BitSet is more of a switching information. For mass non-repeated data, using index to represent data will greatly save space.

The bitmap

Vector of bits, also known as a bitmap, is often used in various algorithm designs because it can represent contiguous data of a given range in a very compact format, which refers to arrays.

The basic principle is that the value (switch) at a particular location (usually the subscript position of the array), 0 is not present, 1 is present, that is, the use of a position is 0 to indicate whether the number (this position represents the number, usually the subscript) is present.

To help you understand, we can explain the data in the figure above, which is that I cut part of a bit vector, namely the position with subscript 1000-1009. A switch of 1(value 1) means that the data was present, which means that the data with the array index, 1000, was present, 1003 was present, and 1008 was present, and nothing else.

Here we can actually para figure have a basic understanding of, anyway it is expressed in an array of the data on the specific location have no, because there is no will only have two kinds of results, is the true and false, so we can use 0, 1, 0 1 here refers to the binary 0 1, So instead of using some other data type, like an Int or a String, because that’s a space consuming data type, that’s why we use this data structure, right

Why bitmaps save space

Earlier we said that bitmaps are mainly used to represent information such as switches. We used bitmaps mainly to save space, but we didn’t say how. Let’s say we have a product that has 100 million daily users, a total of about a billion users, and our user ids range from zero to a billion, and now we want to calculate UV and PV. So the requirement is this simple requirement, PV is very simple and we mainly calculate UV, in order to calculate UV we need to store the information of the user who is logged in first.

SET to implement

We can add the user ID of type long to a Set. Then we calculate the space consumption. Assuming 100 million daily lives, we have 100 million integers of type long. So we need about 763M of storage space

vector of bits

What we need is a bit array of 1 billion size, 1000000000/8/1024/1024=119M. If we use the bit vector, we only need 119M. In fact, we can see that the gap between these is quite large

So the essence of a bitmap is to choose the right data type to represent a particular meaning. In this case, there is no, here we use Boolean to look at a test. Suppose I want to represent information like 1024 switches, Then we expected a memory footprint of 1024 bits, assuming that bit would represent what we wanted to express

@Test
public void test2(a){
    boolean[] bits = new boolean[1024];
    System.out.println(ClassLayout.parseInstance(bits).toPrintable());
}
Copy the code

The output

[Z object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1) 4 4 (object  header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0) 8 4 (object header) 05 00 00 f8 (00000101 00000000 00000000 11111000) (-134217723) 12 4 (object header) 00 04 00 00 (00000000 00000100 00000000 00000000) (1024) 16 1024 boolean [Z.<elements> N/A Instance size: 1040 bytes Space losses: 0 bytes internal + 0 bytes external = 0 bytes totalCopy the code

We see that it has a memory footprint of 1040 bytes, but excluding the object header it still has a memory footprint of 1024bytes instead of 1024 bits.

In fact, we can summarize a formula here, first of all, let’s figure out what data type we’re using, 4 bytes if it’s an int, 8 bytes if it’s a long, and then we know how many times the memory cost of using a normal type is as much as using a bit, for example, 32 times if we’re using an int, In the uv example above, our array size is 1 billion, and our set size is 100 million, so we use 64/10 is a multiple of our memory overhead using non-bit and bit. That’s about 6.4 times

Bitmap usage scenarios

We also mentioned above have mainly use in judgment or whether such a scenario, such as UV calculation is to determine the users have a login, on the basis of this, of course, also spawned very for applications, mainly used for filtering and cache, for example, before we query the DB can determine the user first time there, If there is no need to go DB, this can prevent malicious login what to reduce the pressure on the system, the main is to learn this idea, and then to apply.

Interview questions are often asked, such as counting four billion data points that don’t appear, and sorting four billion different data points.

Another example: there are 10 million random numbers, ranging from 100 million to 100 million. Now the requirement to write an algorithm, between 100 million to 100 million numbers not in the random number out (Baidu).

There is also a topic in Programming Pearls about using bitsets to find phone numbers.

A common extension of a Bitmap is to use two or more bits to represent more information about the number, such as how many times it occurred.

BitSet met

Bit [] bits =new bit[1024]; bit[] =new bit[1024]; So we have to think of some other way to do this, because int is 32 bits, long is 64 bits, Java doesn’t provide bits but Java does provide ints provide long, so we can think of it as a collection of bits, However, there will be some changes in our bit operation, that is, our operation is for a set of bits, we can call it a bit-wise operation

Although the Bitset structure is simple, there are some details to implement. The key is some bit operations, such as how to reverse the finger position, set, query the state of the specified bit (0 or 1), etc.

I. BitSet specification

In accordance with international practice, let’s take a look at the BitSet specification first

/**
 * This class implements a vector of bits that grows as needed. Each
 * component of the bit set has a {@code boolean} value. The
 * bits of a {@code BitSet} are indexed by nonnegative integers.
 * Individual indexed bits can be examined, set, or cleared. One
 * {@code BitSet} may be used to modify the contents of another
 * {@codeBitSet} through logical AND, logical exclusive OR, AND * logical exclusive OR operations. Each bit of the bit vector is a Boolean, each bit of the bit vector is indexed by a non-negative int * each indexed bit, * A BitSet can be initially have the value {<p>By default, all bits in the set initially have the value {@codeFalse}. * Default, * <p>Every bit set has a current size, which is the number of bits * of space currently in use by the bit set. Note that the size is * related to the implementation of a bit set, so it may change with * implementation. The length of a bit set relates to logical length * of a bit set and is defined * Bitsets have a current size, which is the number of bits currently used by the BitSet. Note that this size is dependent on how the BitSet is implemented, The length of a BitSet is the logical size of the BitSet, * <p>Unless otherwise noted, passing a null parameter to any of the methods in a {@code BitSet} will result in a {@codeNullPointerException}. * Any method that passes A null value to A BitSet throws A NullPointerException * <p>A {unless otherwise specified.@codeBitSet} is not safe for multithreaded use without external synchronization@author  Arthur van Hoff
 * @author  Michael McCloskey
 * @author  Martin Buchholz
 * @sinceJDK1.0 * /
 public class BitSet implements Cloneable.java.io.Serializable {
    /* * BitSets are packed into arrays of "words." Currently a word is * a long, which consists of 64 bits, requiring 6 address bits. * The choice of word size is determined purely by performance concerns. */
    private final static int ADDRESS_BITS_PER_WORD = 6;
    private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;
    private final static int BIT_INDEX_MASK = BITS_PER_WORD - 1;

    /* Used to shift left or right for a partial word mask */
    private static final long WORD_MASK = 0xffffffffffffffffL;

    / * * *@serialField bits long[]
     *
     * The bits in this BitSet.  The ith bit is stored in bits[i/64] at
     * bit position i % 64 (where bit position 0 refers to the least
     * significant bit and 63 refers to the most significant bit).
     */
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("bits".long[].class),
    };

    /** * The internal field corresponding to The serialField "bits". So the size of the BitSet is an integer multiple of the size of the long type (64-bit). * /
    private long[] words;

    /** * The number of words in the logical size of this BitSet. */
    private transient int wordsInUse = 0;

    /** * Whether the size of "words" is user-specified. If so, we assume * the user knows what he's doing and try harder to preserve it. */
    private transient boolean sizeIsSticky = false;

    /* use serialVersionUID from JDK 1.0.2 for interoperability */
    private static final long serialVersionUID = 7997698588986878753L;

 }
Copy the code

2. Construction method

2.1 No-parameter construction

The first constructor creates a default object with all bits initialized to 0.

/**
 * Creates a new bit set. All bits are initially {@code false}.
 */
public BitSet(a) {
  	// BITS_PER_WORD=64
    initWords(BITS_PER_WORD);
    sizeIsSticky = false;
}
Copy the code
2.2 Parametric construction

The second method allows the user to specify the initial size, with all bits initialized to 0.

/**
 * Creates a bit set whose initial size is large enough to explicitly
 * represent bits with indices in the range {@code 0} through
 * {@code nbits-1}. All bits are initially {@code false}.
 *
 * @param  nbits the initial size of the bit set
 * @throws NegativeArraySizeException if the specified initial size
 *         is negative
 */
public BitSet(int nbits) {
    // nbits can't be negative; size 0 is OK
    if (nbits < 0)
        throw new NegativeArraySizeException("nbits < 0: " + nbits);

    initWords(nbits);
    sizeIsSticky = true;
}
Copy the code

In the above two constructors, the actual space is controlled by the initWords method, where we instantiate an array of longs

private void initWords(int nbits) {
    words = new long[wordIndex(nbits-1) + 1];
}
/** * Given a bit index, return word index containing it. */
private static int wordIndex(int bitIndex) {
    return bitIndex >> ADDRESS_BITS_PER_WORD;
}
Copy the code

WordIndex is a constant ADDRESS_BITS_PER_WORD. The ADDRESS_BITS_PER_WORD is a constant ADDRESS_BITS_PER_WORD.

private final static int ADDRESS_BITS_PER_WORD = 6;  
Copy the code

Obviously 2^6=64 (8 bytes, corresponding to longs). So, when we pass 129 as an argument, we apply for an array of longs [(129-1)>>6+1]. A BitSet is an integer representing a number of bits.

We know that bitsets can be expanded internally, but if we can predict the size of the space in advance, we can reduce the number of arrays expanded and the number of elements copied.

2.4 Other construction methods

Of course we can also create bitsets based on other data sets, such as long[], Byte [], LongBuffer, and ByteBuffer

BitSet bitSet = BitSet.valueOf(new long[] { 42.12 });
Copy the code

3. How BitSet works

3.1 set method

To simplify the problem we use byte instead of int or long to represent the set of bits. First we initialize all the positions to 0

Now if we want to set the third position to 1, we move 1 three places to the left

For example, the following logical or operation results are as follows

Also if we want to set the position at subscript 7 to true

In the figure above, we first move 1 7 places to the right, and then do the logic or operation with the result of moving 1 3 places to the right

3.2 the get method

The get method is used to check whether a particular position is true or false(1 or 0). The main implementation principle of this method is to move first and then do the logical sum

For example: we check if the position of 3 is 1

  1. Move 1 to the right by 3 bits
  2. Make a logical sum between the new value and the original value
  3. If the result is positive, the value at that location is 1

If we examine the value at position 4, we will find that it is equal to 0, and then we will prove that the value at position 4 is 0

3.2 Capacity Expansion Method

Currently we can only store one 8-bit bit vector. To break this limitation, we can only use one byte array representation instead of a single byte, as shown below

But once we use byte array representation, we need to “set get clear” to a specific byte in the array. For example, we want to set the value of position 14. What we find is the second byte in the array. Of course, if you want to set the byte beyond 15, you have to expand the array, which happens automatically in bitsets.

4. Implementation of BitSet

As mentioned above, byte is used instead of bit set to simplify operations. Our design is also to simplify the principle of BitSet, but BitSet uses long instead of bit set. So far, we have mastered a lot about the principle. Let’s take a look at the implementation in Java and how to use it. First, let’s verify that BitSet is beautiful and saves us space

@Test
public void test2(a){
    boolean[] bits = new boolean[1024];
    System.out.println(ClassLayout.parseInstance(bits).toPrintable());
    BitSet bitSet = new BitSet(1024);
    System.out.println(GraphLayout.parseInstance(bitSet).toPrintable());
}
Copy the code

The output

[Z object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 4 (object header) 01 00 00 00 (00000001 00000000 00000000 00000000) (1) 4 4 (object  header) 00 00 00 00 (00000000 00000000 00000000 00000000) (0) 8 4 (object header) 05 00 00 f8 (00000101 00000000 00000000 11111000) (-134217723) 12 4 (object header) 00 04 00 00 (00000000 00000100 00000000 00000000) (1024) 16 1024 boolean [Z.<elements> N/A Instance size: 1040 bytes Space losses: 0 bytes internal + 0 bytes external = 0 bytes total java.util.BitSet@e50a6f6d object externals: ADDRESS SIZE TYPE PATH VALUE 71821ef90 24 java.util.BitSet (object) 71821efa8 144 [J .words [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]Copy the code

We see the same functionality for the Boolean [] array that uses 1040 bytes, 16 bytes containing the object header, and all that is left is the actual size of the Boolean [] array. From this we can see that in the array, a Boolean value occupies one byte of storage

And then we notice that the internal array of the BitSet is 16, because it’s long, So 16*64=1024 bits (true and false), even though 168bytes are used, which is a lot less than Boolean [] ‘s 1040bytes.

The BitSet method uses a long instead of a set of bits

/**
 * Sets the bit at the specified index to {@code true}.
 *
 * @param  bitIndex a bit index
 * @throws IndexOutOfBoundsException if the specified index is negative
 * @sinceJDK1.0 * /
public void set(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
		BitIndex = bitIndex; bitIndex = bitIndex; bitIndex = bitIndex; bitIndex = bitIndex; bitIndex = bitIndex
    int wordIndex = wordIndex(bitIndex);
  	// If yes, expand the capacity
    expandTo(wordIndex);
   
    words[wordIndex] |= (1L << bitIndex); // Restores invariants
    checkInvariants();
}

/** * Given a bit index, return word index containing it
private static int wordIndex(int bitIndex) {
  	ADDRESS_BITS_PER_WORD=6 2 to the sixth power is the number of bytes a long takes
    return bitIndex >> ADDRESS_BITS_PER_WORD;
}

/**
 * Ensures that the BitSet can accommodate a given wordIndex, temporarily violating the invariants.  The caller must
 * restore the invariants before returning to the user,possibly using recalculateWordsInUse().
 * @param wordIndex the index to be accommodated.
 */
private void expandTo(int wordIndex) {
    int wordsRequired = wordIndex+1;
    if(wordsInUse < wordsRequired) { ensureCapacity(wordsRequired); wordsInUse = wordsRequired; }}Copy the code

conclusion

  1. Bitmap has many use scenarios, such as filtering or data caching scenarios, the use of good can improve the performance of our program
  2. Since There is no bit data type in Java, Java bitsets use long instead of bit