1. BitMap

The basic idea of a bitmap is to use a Bit to mark the Value of an element, and the Key is the element. Since Bit is used to store data, the storage space can be greatly saved. (PS: highlight to save storage space)

Suppose you have a requirement to find out if some number m exists among 2 billion random integers, and assume a 32-bit operating system with 4 gigabytes of memory

In Java, int is 4 bytes, 1 byte =8 bits.

If each number is stored as an int, there are 2 billion ints, thus occupying approximately (2000000000*4/1024/1024/1024)≈ 7.45g

If the storage is different, number is 2 billion 2 billion, take up the space of about (1024/1024/1024/2000000000/8) material 0.233 G

There is no need to say more

So, the question is, how do you represent a number?

As I said, each digit represents a number, 0 for nonexistence, 1 for existence, which is exactly binary

So we can easily represent {1,2,4,6} :

The picture

The smallest unit of computer memory allocation is the byte, or 8 bits, so what about {12,13,15}?

Of course, in another 8-bit representation:

The picture

So it looks like a two-dimensional array

Int TMP [1+N/32] = TMP [1+N/32]

TMP [0] : the value ranges from 0 to 31

TMP [1] : the value ranges from 32 to 63

TMP [2] : indicates 64 to 95

.

So, given any integer M, then M/32 gets the index, and M%32 knows where it is in that index

add

So here’s the question, how do we put a number in there? For example, if you want to put the number 5 in, how do you do that?

5/32=0, 5/32= 5, 5/32= 5, 5/32= 5, 5/32= 5, 5/32= 5, 5/32= 5

The picture

Binary is

The picture

This is equivalent to 86 | 32 = 118

86 | (1 < < 5) = 118

b[0] = b[0] | (1<<5)

That is, to insert a number, move the 1 to the left with the digit that represents the number, and then do the bitwise or operation with the original number

Reduction, is the 86 + (5/8) | (1 < < 5% (8))

As a result, the formula can be summarized as: p + 1 < < (I / 8) | (8) (I %), p said the value of the now, I said to be inserted into the number

remove

So that’s adding, so what do I do if I want to clear?

Again, in the example above, suppose we want to remove 6, what do we do?

The picture

From the diagram, you just need to set the position of the number to 0

If you move 1 six places to the left, you reach the position represented by the number 6, and then invert it, and finally place it with the original number, so that the position is 0

b[0] = b[0] & (~(1<<6))

b[0] = b[0] & (~(1<<(i%8)))

To find the

As we said earlier, each digit represents a number, with 1 representing presence and 0 representing absence. The number of bits added and cleared by setting the value to 1 or 0, so the existence of a number is determined by determining whether the number’s bit is 0 or 1

Let’s say we want to know if 3 is there, so we just say b[0] & (1<<3) if this value is 0, it doesn’t exist, and if it’s 1, it does exist

2. What are Bitmaps for

Rapid sorting, search and de-duplication of large amounts of data

Quick sort

If we want to sort 5 elements (4,7,2,5,3) from 0 to 7 (assuming that the elements are not duplicated), we can use a bitmap to sort.

To represent 8 numbers, we need only 8 bits (1Bytes). First we create a space of 1 byte, set all the bits in this space to 0, and then set the corresponding position to 1.

Finally, the Bit area is traversed, and the number of the Bit that is one is output (2,3,4,5,7), so as to achieve the purpose of sorting, time complexity O(n).

Advantages:

  • High operation efficiency, no need for comparison and shift;

  • Low memory usage, for example, N=10000000; Only N/8=1250000Byte=1.25 MB memory is required

Disadvantages:

  • All data cannot be duplicated. That is, duplicate data cannot be sorted or searched.

  • The advantage is only when data is dense

Quick go to heavy

Find the number of non-repeating integers in 2 billion integers. Memory is not enough to hold 2 billion integers.

First of all, there is not enough memory space for 0.5 billion integers. The next key question is how to design our bitmap to represent the states of these two billion numbers. In fact, this problem is very simple, there are only three states of a number, respectively, there is no existence, only one, there is repetition. Therefore, we need only 2bits to store the state of a number. Suppose we set a number that exists once to 00, once to 01, and twice or more to 11. So we need about 2 gigabytes of storage.

The next task is to put the two billion numbers in (store), and if the corresponding state bit is 00, change it to 01, indicating that it exists once; If the corresponding status bit is 01, change it to 11, indicating that there is already one, that is, multiple occurrences. If it is 11, the corresponding status bit remains unchanged, indicating multiple occurrences.

Finally, the number of numbers with status bit 01 is counted, and the number of non-repeating numbers is obtained with time complexity O(n).

Quickly find

If an element in an int array is 4 bytes and 32 bits, divide it by 32 to find the element’s subscript, and take the remainder (%32) of 32 to find where it is. If the bit is 1, it exists.

Summary & review

Bitmap is mainly used to quickly retrieve keyword states. Usually, keywords are required to be a continuous sequence (or most of the keywords in a continuous sequence). In the most basic case, 1bit is used to represent the state of a keyword (two states can be identified), but 2 bits (four states can be identified) can also be used as required. Three bits (eight states).

The main use of bitmaps is to represent the state of a continuous (or near-continuous, mostly present) sequence of keywords (the smaller the number of states/keywords, the better).

On 32-bit machines, an integer, such as int a=1, takes up 32 bits of memory to facilitate computing. But for some applications, this is a huge waste, because we can store decimal numbers from 0 to 31 with the corresponding 32bit bits, which is the basic idea of a bitmap. The bitmap algorithm uses this idea to sort, query, and de-duplicate large amounts of data.

Add 1

So without overrunning the numbers, for both positive and negative numbers, moving one place to the left is the same thing as multiplying by 2 to the first power, moving n places to the left is the same thing as multiplying by 2 to the n, moving one place to the right is the same thing as dividing by 2, moving n places to the right is the same thing as dividing by 2 to the n.

<< to the left is equivalent to multiplying 2 to the n, for example: 1<<6 is equivalent to 1×64=64, 3<<4 is equivalent to 3×16=48

> to the right is equivalent to dividing by 2 to the n, for example: 64>>3 is equivalent to 64÷8=8

48^32 = 48%32=16

Add 2

Instead of using third-party variables, swap the values of both variables

// a = a + b; b = a - b; a = a - b; // a = a ^ b; b = a ^ b; a = a ^ b;Copy the code

3. BitSet

Bitsets implement a bit vector that can grow as needed. Each digit has a Boolean value. The bits of a BitSet can be indexed by non-negative integers. You can find, set, and clear a bit. The contents of another BitSet can be modified by logical operators. By default, all bits have a default value of false. Also recommended: Java advanced video resources

The picture

The picture

The picture

The picture

The picture

As you can see, it’s pretty much what we thought

I’m going to store it in an array of long, 64 bytes, and when I set it, I move it 6 bits to the right (equal to dividing by 64), and then I change the status bits

It doesn’t matter if you don’t understand anything else. It’s enough to understand these two sentences:

int wordIndex = wordIndex(bitIndex);  
words[wordIndex] |= (1L << bitIndex);  

Copy the code

4. Bloom Filters

The picture

Bloom filter is a data structure that can be used to determine whether an element is in a collection. It has the characteristics of fast running and small memory footprint.

The trade-off for efficient inserts and queries is that the Bloom Filter is a probabilistic data structure: it can only tell us that an element is definitely not in the collection or might be.

The Bloom filter’s underlying data structure is a vector of bits (think of as an array).

It is mainly applied to large-scale data scenarios that do not need accurate filtering, such as checking spam addresses, reptilian URL address deduplication, and solving cache penetration problems

If you want to determine whether an element is in a set, the general idea is to store all the elements in the set and then compare them. Linked lists, trees, hash tables, and other data structures are the same idea, but as the number of elements in the collection increases, the storage space needs to increase; At the same time, the retrieval speed is slower and slower, and the retrieval time complexity is O(n), O(log n), O(1) respectively.

Bloom filters work by mapping an element as it is added to a collection to K points in a Bit array set to 1 through K hash functions. When you retrieve it, you can tell if the element is in the set by looking at whether all of these points are 1; If any of these points have a 0, the element being checked must not be there; If both are 1, then the element being examined is likely to be.

BloomFilter process

  1. First, we need k hash functions, each of which hashes the key into an integer.

  2. To initialize, an array of n bits is required, with each bit initialized to 0.

  3. When a key is added to the set, k hash functions are used to calculate k hash values, and the corresponding bit position in the array is 1.

  4. When determining whether a key is in the set, k hash functions are used to calculate K hash values and query the corresponding bit bits in the array. If all the bit bits are 1, the key is in the set.

The picture

< the dependency > < groupId > com. Google. Guava < / groupId > < artifactId > guava < / artifactId > < version > 28.1 jre < / version > </dependency>Copy the code

(Thanks for reading, hope you all help)