A simple question

Given that there are n whole integers, the range of these integers is [0, 100], please design a data structure, use an array to store these data, and provide two methods, respectively addMember and isExist, the following is the class definition of this data structure, please implement its two methods

function FindClass(){ var datas = []; Member this.addMember = function(member){datas.push(member); }; This. isExist = function(member){for(var I = 0; i < datas.length; i++){ if(datas[i]==member){ return true; } } return false; }; This.isexist = function(member){if(datas.indexof (member) >= 0){return true; } return false; }; } var fc = new FindClass(); var arr = [0, 3, 5, 6, 9, 34, 23, 78, 99]; for(var i = 0; i < arr.length; i++){ fc.addMember(arr[i]); } console.log(fc.isExist(3)); console.log(fc.isExist(7)); console.log(fc.isExist(78));Copy the code

A faster way

The time complexity of isExist is O(n) regardless of whether to use the for loop or indexOf method. The more elements added, the slower the speed of isExist method. We need an algorithm with time complexity of O(1). IsExist’s execution speed is constant time, no matter how much data is added to it.

Data [3] = 1; data[2] = 0; data[2] = 0; data[2] = 0; data[2] = 0; data[2] = 0; The isExist method uses the index to determine whether member exists. The implementation code is as follows:

function FindClass(){ var datas = new Array(100); For (var I = 0; i < datas.length; i++){ datas[i] = 0; AddMember = function(member){datas[member] = 1; }; This. isExist = function(member){if(datas[member] == 1){return true; } return false; }; } var fc = new FindClass(); var arr = [0, 3, 5, 6, 9, 34, 23, 78, 99]; for(var i = 0; i < arr.length; i++){ fc.addMember(arr[i]); } console.log(fc.isExist(3)); console.log(fc.isExist(7)); console.log(fc.isExist(78));Copy the code

1.2 implementation of the algorithm, has been very fast, but it is facing a new problem, if the data is very much, up to 100 million, each integer is 4 bytes, then 100 million integer number is 400 million bytes, 1024 bytes is 1 KB, 1024 KB is 1M, 400 million bytes on 381M memory space.

What if I don’t have that much memory? We need a data compression algorithm that uses very little space to represent the existence or absence of 100 million numbers.

1.3 More space-saving algorithms

1.2 implementation of the algorithm, has been very fast, but it is facing a new problem, if the data is very much, up to 100 million, each integer is 4 bytes, then 100 million integer number is 400 million bytes, 1024 bytes is 1 KB, 1024 KB is 1M, 400 million bytes on 381M memory space.

What if I don’t have that much memory? We need a data compression algorithm that uses very little space to represent the existence or absence of 100 million numbers.

2. Street lights

There are 8 stacks of street lamps on the street, numbered 1, 2, 3, 4, 5, 6, 7 and 8 respectively. Street lamps no.2, no.5, No.7 and No.8 are on, while the rest are not on. Please design a simple method to express the status of these 8 stacks of street lamps.

The answer is 2578, which means the numbered streetlights are on and the rest are off.

A computer science student looking at this question should say 75, because the binary representation of 75 is 0, 1, 0, 0, 1, 1, 1, 1, which corresponds to the state when the lights are off

1 2 3 4 5 6 7 8 0 1 0 0 1 0 1Copy the code

Only 8 bit bits can represent the on and off condition of 8 stacks of street lamps, so an integer with 32 bit bits can represent the on and off condition of 32 stacks of street lamps. Recall the problem we met in the first section, isExist method to determine whether a number exists, can it also use this expression?

Value is an int, initialized to 0. When addMember passes 0, it sets the first bit of value to 1

00000000 00000000 00000000 00000001
Copy the code

If value = 1 and a 3 is added, set the fourth digit of the binary value to 1

00000000 00000000 00000000 00001001
Copy the code

In this case, value = 9, 9 can indicate the presence of both 0 and 3, an integer, in fact, can indicate the presence or absence of 0, 31, if I create an array of size 10, and I store integers in it, then this array can indicate the presence or absence of 0, 319

Datas [0] Indicates whether 0 to 31 exists. Datas [1] indicates whether 32 to 63 exists.... Datas [9] Indicates whether 288 to 319 existCopy the code

In this way, we can reduce the space usage to 32nd of the original, store the existence of 100 million integers, only need 12M of memory space, so how to manipulate the binary bits of integers.

3. Bit operations

The data in memory is ultimately stored in binary mode. Bitwise operation is to operate on the binary bits of integers in memory.

3.1 Bitwise and &

The bitwise and operation is performed on two integers. If the digits of the same binary bit are both 1, the result is 1, and if one bit is 0, the result is 0. The following is the calculation process of 3 & 7

Binary integer 0 1 1 3 1 1 1 7 0 1 1 3Copy the code

3 times 7 is 3

3.2 bitwise or |

Two integers for bitwise or operations, the same binary number if one is 1, the result is 1, 0, the result is 0, here are 5 | 8 calculation process

Binary integer 0 1 0 1 5 1 0 0 0 8 1 1 0 1 13Copy the code

5 | 8 = 13

3.3 the left < <

The binary moves n bits to the left, adding n zeros to the end and then 3<<1.

Binary integer 1, 1, 3, 1, 0, 6Copy the code

3 < < 1 = 6

3.4 Test the cat

A group of numbers, 3, 9, 19, 20, represented by an integer

var value = 0;
value = value | 1<<3;
value = value | 1<<9;
value = value | 1<<19;
value = value | 1<<20;
console.log(value); 
Copy the code

The program output is 1573384bitmap

New implementation

After the previous series of analysis and bit operation learning, we will now redesign a class to implement addMember and isExist methods with faster speed and less memory.

  • The data range is 0 to 100, so only 4 integers are needed to indicate the presence or absence of 4*32 numbers, creating an array of size 4

  • AddMember /32 specifies the arr_index of member, and member%32 specifies the bit_index of the integer. Finally perform bit_arr [arr_index] = bit_arr [arr_index] | 1 < < bit_index;

  • When executing isExist, use member/32 to determine the arr_index of member. Then use member%32 to determine the bit_index of the integer. Finally, execute bit_arr[arr_index] & 1<<bit_index. If the result is not 0, memeber exists

New implementation method

function BitMap(size){ var bit_arr = new Array(size); for(var i=0; i<bit_arr.length; i++){ bit_arr[i] = 0; } this.addMember = function(member){ var arr_index = Math.floor(member / 32); Var bit_index = member % 32; / / decision in 32 bit integer bit on which one bit_arr [arr_index] = bit_arr [arr_index] | 1 < < bit_index; }; this.isExist = function(member){ var arr_index = Math.floor(member / 32); Var bit_index = member % 32; Var value = bit_arr[arr_index] &1 <<bit_index; var value = bit_arr[arr_index] &1 <<bit_index; if(value ! = 0){ return true; } return false; }; } var bit_map = new BitMap(4); var arr = [0, 3, 5, 6, 9, 34, 23, 78, 99]; for(var i = 0; i < arr.length; i++){ bit_map.addMember(arr[i]); } console.log(bit_map.isExist(3)); console.log(bit_map.isExist(7)); console.log(bit_map.isExist(78));Copy the code

Big data sorting

There are as many as 1 billion unordered integers, with a maximum of 1.5 billion given. Sort the 1 billion numbers.

BitMap stores the maximum set of 1.5 billion, only needs 180M space, space use is completely acceptable, as for the speed, the storage and comparison process of the bit operation speed is very fast, the first traversal, put 1 billion numbers into the BitMap, the second traversal, from 0 to 1.5 billion, if in BitMap, Prints the value so that after two traversals, you can sort so much data.

For the sake of demonstration, just use a very small array, [0, 6, 88, 7, 73, 34, 10, 99, 22], known array maximum 99, using BitMap sorting algorithm is as follows

var arr = [0, 6, 88, 7, 73, 34, 10, 99, 22]; var sort_arr = []; var bit_map = new BitMap(4); for(var i = 0; i < arr.length; i++){ bit_map.addMember(arr[i]); } for(var i = 0; i <= 99; i++){ if(bit_map.isExist(i)){ sort_arr.push(i); } } console.log(sort_arr);Copy the code

Bloom filter

The foregoing BitMap is really bad, however, has a very strong limitations, BitMap can be used to handle an integer, cannot be used to handle strings, assuming that let you write a powerful crawler, crawl every day hundreds of millions of web pages, you will need a data structure that can store take the url, you have to climb, so it doesn’t repeat crawl.

You might think of using a hash function to process the URL to integers, which would seem like a BitMap again, but that would still be a problem. Assuming that the maximum value that a BitMap can map is M, the hash value of a URL needs to be modulo M. In this case, conflicts will occur, and the conflict rate will increase with the increase of stored data.

The idea of Bloom filter is very simple, and its basic idea is the same as BitMap. Bloom filter can be regarded as an extension of BitMap. In order to solve the conflict rate, bloom filter requires k hash functions. When creating a new key, hash the key into K integers, and then set the binary bit corresponding to the k integers to 1 in the array. When determining whether a key exists, hash the key using K hash functions to obtain K integers. If the binary bits of each k integer are 1, the key exists; otherwise, the key does not exist.

For a Bloom filter, two parameters need to be set, one is the estimated maximum amount of data to be stored, and one is the acceptable collision rate.

The hash function

A hash function is a mapping of an object of indefinite length to another object of fixed length. If you’re confused by this concept, you can think of it another way. You pass a string to the hash function, which returns an integer. To implement a Bloom filter, we need a good hash function that computs quickly and has few collisions. Fortunately, there are many open source hash algorithms out there. I found an implementation of MurmurHash on Github, with the following code

The final complete implementation code

function murmurhash3_32_gc(key, seed) { var remainder, bytes, h1, h1b, c1, c1b, c2, c2b, k1, i; remainder = key.length & 3; // key.length % 4 bytes = key.length - remainder; h1 = seed; c1 = 0xcc9e2d51; c2 = 0x1b873593; i = 0; while (i < bytes) { k1 = ((key.charCodeAt(i) & 0xff)) | ((key.charCodeAt(++i) & 0xff) << 8) | ((key.charCodeAt(++i) & 0xff) << 16) | ((key.charCodeAt(++i) & 0xff) << 24); ++i; k1 = ((((k1 & 0xffff) * c1) + ((((k1 >>> 16) * c1) & 0xffff) << 16))) & 0xffffffff; k1 = (k1 << 15) | (k1 >>> 17); k1 = ((((k1 & 0xffff) * c2) + ((((k1 >>> 16) * c2) & 0xffff) << 16))) & 0xffffffff; h1 ^= k1; h1 = (h1 << 13) | (h1 >>> 19); h1b = ((((h1 & 0xffff) * 5) + ((((h1 >>> 16) * 5) & 0xffff) << 16))) & 0xffffffff; h1 = (((h1b & 0xffff) + 0x6b64) + ((((h1b >>> 16) + 0xe654) & 0xffff) << 16)); } k1 = 0; switch (remainder) { case 3: k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16; case 2: k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8; case 1: k1 ^= (key.charCodeAt(i) & 0xff); k1 = (((k1 & 0xffff) * c1) + ((((k1 >>> 16) * c1) & 0xffff) << 16)) & 0xffffffff; k1 = (k1 << 15) | (k1 >>> 17); k1 = (((k1 & 0xffff) * c2) + ((((k1 >>> 16) * c2) & 0xffff) << 16)) & 0xffffffff; h1 ^= k1; } h1 ^= key.length; h1 ^= h1 >>> 16; h1 = (((h1 & 0xffff) * 0x85ebca6b) + ((((h1 >>> 16) * 0x85ebca6b) & 0xffff) << 16)) & 0xffffffff; h1 ^= h1 >>> 13; h1 = ((((h1 & 0xffff) * 0xc2b2ae35) + ((((h1 >>> 16) * 0xc2b2ae35) & 0xffff) << 16))) & 0xffffffff; h1 ^= h1 >>> 16; return h1 >>> 0; } function BoolmFilter (max_count, error_rate) {var bitMap = []; Var max_count = max_count; Var error_rate = error_rate; Var bit_size = math.ceil (max_count * (- math.log (error_rate)/(math.log (2) * math.log (2)))); Var hash_ount = math.ceil (math.log (2) * (bit_size/max_count)); Var set_bit = function (bit) {var arr_index = math.floor (bit / 31); var bit_index = Math.floor(bit % 31); bitMap[arr_index] |= (1 << bit_index); }; Var get_bit = function (bit) {var arr_index = math.floor (bit / 31); var bit_index = Math.floor(bit % 31); return bitMap[arr_index] &= (1 << bit_index); }; // Add key this.add = function (key) {if (this.isexist (key)) {return -1; } for (var I = 0; i < hash_ount; i++) { var hash_value = murmurhash3_32_gc(key, i); set_bit(Math.abs(Math.floor(hash_value % (bit_size)))); }}; This. isExist = function (key) {for (var I = 0; i < hash_ount; i++) { var hash_value = murmurhash3_32_gc(key, i); if (! get_bit(Math.abs(Math.floor(hash_value % (bit_size))))) { return false; } } return true; }; }; Var bloomFilter = New BoolmFilter(1000000, 0.01); bloomFilter.add('https://blog.csdn.net/houzuoxin/article/details/20907911'); bloomFilter.add('https://www.jianshu.com/p/888c5eaebabd'); console.log(bloomFilter.isExist('https://blog.csdn.net/houzuoxin/article/details/20907911')); console.log(bloomFilter.isExist('https://www.jianshu.com/p/888c5eaebabd')); console.log(bloomFilter.isExist('https://www.jianshu.com/p/888c5eaebabd435'));Copy the code

Intersection of two sets (normal mode)

Two arrays of [1, 4, 6, 8, 9, 10, 15], [6, 14, 9, 2, 0, 7], calculate their intersection using a BitMap

var arr1 = [1, 4, 6, 8, 9, 10, 15]; var arr2 = [6, 14, 9, 2, 0, 7]; var intersection_arr = [] var bit_map = new BitMap(); for(var i = 0; i<arr1.length; i++){ bit_map.addMember(arr1[i]); } for(var i= 0; i<arr2.length; i++){ if(bit_map.isExist(arr2[i])){ intersection_arr.push(arr2[i]); } } console.log(intersection_arr);Copy the code

Support for negative numbers (hard mode)

The BitMap described in the course can only store integers greater than or equal to 0. Please modify the BitMap class to store positive or negative numbers and determine whether they exist

function SuperBitMap(size){
    var positive_bit_map = new BitMap.BitMap(size);
    var negative_bit_map = new BitMap.BitMap(size);

    this.addMember = function(member){
        if(member >= 0){
            positive_bit_map.addMember(member);
        }else{
            negative_bit_map.addMember(member);
        }
    };

    this.isExist = function(member){
        if(member >= 0){
            return positive_bit_map.isExist(member);
        }else{
            return negative_bit_map.isExist(member);
        }
    };
}

var arr = [1, 3 ,-6, -8, 8, 9];
var super_bm = new SuperBitMap();

for(var i =0;i<arr.length;i++){
    super_bm.addMember(arr[i]);
}

console.log(super_bm.isExist(-8));
console.log(super_bm.isExist(8));
console.log(super_bm.isExist(9));
console.log(super_bm.isExist(-6));
console.log(super_bm.isExist(-5));
Copy the code

Find numbers that are not repeated

There is an array of [1, 3, 4, 5, 7, 4, 8, 9, 2, 9], some values are repeated, please modify the BitMap class, add isRepeat(member) method, determine whether member is repeated, and use this method to find the number in the array is not repeated.

With a bit can represent a number if there is, but can’t say whether the number of repeat, because a bit only two states, the existence of the corresponding number, then we can use even a bit to indicate the presence or absence of a number and whether to repeat, so, a 32 bit integer, can represent the existence of the 16 digits and repeat or not. Take 0 as an example, when addMember(0), the 0th bit position is 1, and the binary representation can be written as

00000000 00000000 00000000 00000011 function BitMap(size){ var bit_arr = new Array(size); for(var i=0; i<bit_arr.length; i++){ bit_arr[i] = 0; } this.addMember = function(member){ var arr_index = Math.floor(member / 16); Var bit_index = member % 16; // Decide on which of the 32 bits of the integer if(! this.isExist(member)){ bit_arr[arr_index] = bit_arr[arr_index] | 1<<bit_index*2; }else{ bit_arr[arr_index] = bit_arr[arr_index] | 1<<(bit_index*2+1); }}; this.isExist = function(member){ var arr_index = Math.floor(member / 16); Var bit_index = member % 16; Var value = bit_arr[arr_index] &1 <<bit_index*2; if(value ! = 0){ return true; } return false; }; this.isRepeat = function(member){ var arr_index = parseInt(member / 16); Var bit_index = member % 16; Var value = bit_arr[arr_index] &1 <<(bit_index*2 + 1); var value = bit_arr[arr_index] &1 <<(bit_index*2 + 1); if(value ! = 0){ return true; } return false; }; } var arr_1 = [1, 3, 4, 5, 7, 4, 8, 9, 2, 9]; var bm = new BitMap(2); for(var i = 0; i < arr_1.length; i++){ bm.addMember(arr_1[i]); } var arr = [] for(var i = 0; i <=9; i++){ if(! bm.isRepeat(i)){ arr.push(i); } } console.log(arr);Copy the code