1. What is a Bloom filter?
The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a long binary vector together with a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms; its disadvantages are a certain false positive rate and difficulty deleting elements.
If you want to determine whether an element is in a set, the usual approach is to store all the elements and then compare against them, using a linked list, a tree, or some other data structure. But as the number of elements in the set grows, the storage required keeps increasing and retrieval gets slower and slower. For example:
- linked list: O(n)
- tree: O(log n)
A hash table, however, is fast:
- hash table: O(1)
A Bloom filter maps an element to a point in a bit array using a hash function, so to check membership we only need to look at whether that point is 1. That is the basic idea: hash the key to get an index, then set the bit at that index to 1. The problem with hashing is collisions. Even assuming the hash function is good, if our bit array is m bits long, then to keep the collision rate down to, say, 1%, the table can only hold about m/100 elements, which is obviously no longer space-efficient. The solution is simply to apply several different hash functions to the same key and set every corresponding index to 1.
As long as a key was hashed and inserted earlier, the indexes our program computes for it now are the same as the ones computed then, and the corresponding bits are already 1. Because of collisions we cannot guarantee that such a key really was inserted, but if any of the computed indexes holds a 0, the key definitely was not. It is this "definitely not present" guarantee that lets a Bloom filter solve the cache penetration problem in Redis.
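To put a number on that "certain error rate": for a filter with an m-bit array, k hash functions and n inserted elements, the textbook approximation of the false-positive probability is as follows (this is standard Bloom filter analysis, not something specific to the code in this article):

$$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \quad \text{(probability that a given bit is still 0)}$$

$$p \approx \left(1 - e^{-kn/m}\right)^{k} \quad \text{(all k probed bits of a never-inserted key are 1)}$$

$$k_{\mathrm{opt}} = \frac{m}{n}\ln 2, \qquad p_{\min} \approx 0.6185^{\,m/n}$$

For example, with about 10 bits of bitmap per element (m/n = 10) and the optimal k ≈ 7, the false-positive rate comes out just under 1%.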
2. Write a simple Bloom filter using BitSet
To make things easier to understand, the demo below uses BitSet. First, a few notes on the Java BitSet class (a short BitSet usage sketch follows these notes):
- A Java BitSet stores a sequence of bits. If you want to store a sequence of bits compactly, BitSet is the right tool: because it packs the bits into machine words instead of boxing each one, it is both faster and far more memory-efficient than a List of Boolean objects.
- A BitSet is a bit-manipulation object whose entries are only 0 or 1, i.e. false and true. Internally it maintains an array of longs; initially there is only one long, so the minimum size of a BitSet is 64 bits, and as it grows it ends up backed by N longs.
Why does BitSet use a long array for internal storage?
- The JDK chose a long array as the internal storage structure of BitSet for performance. BitSet provides and/or operations that must combine every bit of two BitSets, i.e. traverse all the array elements. Using long keeps the number of loop iterations to a minimum, so Java uses a long array as the internal storage structure.
- On the stack there is basically no difference between long and byte, other than the compiler wasting up to 7 bytes for address alignment, and there is no difference when reading elements from memory either, because the assembly level has mov instructions for different data widths. So the underlying reason the JDK chose a long array is to reduce the number of loop iterations in and/or operations and improve performance.
- For example, when we perform and, or and xor on a BitSet we have to touch every bit, reading all of its words in sequence. With a long array we read 64 bits per word; with an int array we could only read 32. Likewise, when searching for the next bit that is 1, each word is first compared with 0; a word equal to 0 contains no 1 bits and can be skipped entirely, so a long array lets us skip 64 bits at a time instead of only 32.
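As a quick illustration of the BitSet API described above, here is a minimal sketch using only standard java.util.BitSet methods (independent of the Bloom filter code that follows):

```java
import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet(64);   // initially backed by a single long

        bits.set(3);                    // turn bit 3 on
        bits.set(10, true);             // same effect as set(10)
        System.out.println(bits.get(3));        // true
        System.out.println(bits.get(4));        // false
        System.out.println(bits.cardinality()); // 2  -> number of bits set to 1
        System.out.println(bits.nextSetBit(4)); // 10 -> next 1 bit at or after index 4

        BitSet other = new BitSet(64);
        other.set(3);
        other.and(bits);                // word-by-word AND over the backing long[]
        System.out.println(other);      // {3}
    }
}
```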
Good, back to business. The code below defines eight hash functions. Every time a value is inserted, it is run through all eight hash functions and the corresponding positions are set to 1. Every membership check also runs all eight functions: if any of them lands on a position whose value is not 1, the check returns false, meaning the value definitely does not exist; only if all positions are 1 does it return true. Of course, as mentioned earlier, returning true does not guarantee that the element actually exists.
import java.util.BitSet;

public class MyBloomFilter {

    /** Bit array size: 256 << 22 = 2^30, roughly 1 billion bits */
    private static final int DEFAULT_SIZE = 256 << 22;

    /**
     * To reduce the error rate, an additive hash algorithm is used with an array of 8 prime
     * seeds, which is equivalent to building 8 different hash algorithms
     */
    private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};

    /** The 8 different hash functions built from the seeds above */
    private static HashFunction[] functions = new HashFunction[seeds.length];

    /** The bitmap of the Bloom filter */
    private static BitSet bitset = new BitSet(DEFAULT_SIZE);

    /**
     * Add data.
     *
     * @param value the value to be added
     */
    public static void add(String value) {
        if (value != null) {
            for (HashFunction f : functions) {
                // Compute the hash value and set the corresponding bit to true
                bitset.set(f.hash(value), true);
            }
        }
    }

    /**
     * Determine whether the corresponding element may exist.
     *
     * @param value the element to check
     * @return the result
     */
    public static boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (HashFunction f : functions) {
            ret = bitset.get(f.hash(value));
            // If any hash function lands on a 0 bit, break out of the loop
            if (!ret) {
                break;
            }
        }
        return ret;
    }

    /** Test */
    public static void main(String[] args) {
        for (int i = 0; i < seeds.length; i++) {
            functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
        }
        // Add 100 million entries
        for (int i = 0; i < 100000000; i++) {
            add(String.valueOf(i));
        }
        System.out.println(contains("99999999"));  // true
        String id = "123456789";
        add(id);
        System.out.println(contains(id));          // true
        System.out.println(contains("234567890")); // false
    }
}

class HashFunction {

    private int size;
    private int seed;

    public HashFunction(int size, int seed) {
        this.size = size;
        this.seed = seed;
    }

    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        return (size - 1) & result;
    }
}
3. Cache penetration in Redis
Cache penetration means requests go straight through the cache and hit MySQL or another relational database directly, for example by querying data that does not exist in the database at all, such as the details of a product whose name does not exist. Every such request reaches the DB, and if someone does this maliciously it can easily put excessive pressure on the DB. There are currently two common solutions:
- When a key is used to query data and the corresponding data does not exist in the database, we set the value for that key to a default placeholder such as "NULL" and give it a cache expiration time. Until the cache entry expires, all access through this key is absorbed by the cache. If data for this key later exists in the DB, then once the cached placeholder expires, the next access through this key fetches and caches the new value.
- The more common approach is to use a Bloom filter (which can hold a lot of data in a small amount of memory): hash all possibly existing data into a sufficiently large bitmap, so that data which definitely does not exist is intercepted by the bitmap, avoiding query pressure on the underlying storage system (see the sketch after this list).
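A minimal sketch of a read path that combines both ideas. The names here (ProductQueryService, ProductMapper, the product: key prefix, the TTL values) are hypothetical illustrations, not part of the article's code; the Bloom filter is assumed to have been pre-loaded with all existing product ids:

```java
import java.util.concurrent.TimeUnit;

import org.redisson.api.RBloomFilter;
import org.springframework.data.redis.core.StringRedisTemplate;

public class ProductQueryService {

    private final RBloomFilter<String> productIdFilter;  // pre-loaded with all existing product ids
    private final StringRedisTemplate redisTemplate;
    private final ProductMapper productMapper;           // hypothetical DB access interface

    public ProductQueryService(RBloomFilter<String> productIdFilter,
                               StringRedisTemplate redisTemplate,
                               ProductMapper productMapper) {
        this.productIdFilter = productIdFilter;
        this.redisTemplate = redisTemplate;
        this.productMapper = productMapper;
    }

    public String getProductJson(String id) {
        // 1. Bloom filter: if the id was never added, it definitely does not exist in the DB
        if (!productIdFilter.contains(id)) {
            return null;                                  // rejected without touching cache or DB
        }
        // 2. Cache lookup (may return the "NULL" placeholder cached by an earlier miss)
        String cached = redisTemplate.opsForValue().get("product:" + id);
        if (cached != null) {
            return "NULL".equals(cached) ? null : cached;
        }
        // 3. Fall through to the database
        String json = productMapper.findJsonById(id);
        if (json == null) {
            // first solution above: cache a short-lived placeholder for the missing key
            redisTemplate.opsForValue().set("product:" + id, "NULL", 60, TimeUnit.SECONDS);
            return null;
        }
        redisTemplate.opsForValue().set("product:" + id, json, 30, TimeUnit.MINUTES);
        return json;
    }
}

/** Hypothetical DB access interface, e.g. a MyBatis mapper. */
interface ProductMapper {
    String findJsonById(String id);
}
```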
To recap: a Bloom filter is essentially a long binary vector plus a series of random mapping functions, and it can be used to test whether an element is in a set. Its space efficiency and query time far exceed those of general algorithms, but it has a certain false positive rate and elements are difficult to delete.
This article focuses on using the Bloom filter, which has the following advantages and disadvantages:
Advantages:
- Compared with other data structures, a Bloom filter has a huge advantage in both space and time: its storage space and insert/query time are both constant. In addition, its hash functions are independent of one another, so they can be computed in parallel in hardware. A Bloom filter does not store the elements themselves, which is an advantage in situations where confidentiality matters. And a Bloom filter can represent the complete (universal) set, which no other data structure can.
Disadvantages:
- As mentioned above, a Bloom filter can determine that an element definitely does not exist, but it cannot determine that an element definitely exists; it has a certain false positive rate.
- Normally you cannot remove elements from a Bloom filter. It is easy to imagine turning the bit array into an array of counters, incrementing the counters for each element you insert and decrementing them when you delete one (a counting Bloom filter; a sketch follows this list). However, deleting elements safely is not that simple: we must first be sure the element really is in the Bloom filter, and that is exactly what the filter alone cannot guarantee.
- The cuckoo filter is claimed to fix the Bloom filter's weak point around deletion, but for deletion to work a cuckoo filter must limit how many times the same element can be inserted to a certain threshold, so its usability is not high.
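Here is a minimal sketch of the counter idea mentioned in the list above, reusing the HashFunction class from section 2; it is an illustration under that assumption, not a production-safe structure, because removing an element that was never added still corrupts the counters:

```java
/**
 * Counting Bloom filter sketch: each bit of the classic filter becomes a small
 * counter, so remove() becomes possible at the cost of extra memory.
 * Reuses the HashFunction class from the BitSet demo above, so size must be a
 * power of two (HashFunction masks the hash with size - 1).
 */
public class CountingBloomFilter {

    private final int[] counters;
    private final HashFunction[] functions;

    public CountingBloomFilter(int size, int[] seeds) {
        this.counters = new int[size];
        this.functions = new HashFunction[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            functions[i] = new HashFunction(size, seeds[i]);
        }
    }

    public void add(String value) {
        for (HashFunction f : functions) {
            counters[f.hash(value)]++;          // increment instead of just setting a bit
        }
    }

    public boolean mightContain(String value) {
        for (HashFunction f : functions) {
            if (counters[f.hash(value)] == 0) { // any zero counter -> definitely absent
                return false;
            }
        }
        return true;                            // possibly present (false positives remain)
    }

    /** Only safe if the caller is certain the value was added before. */
    public void remove(String value) {
        for (HashFunction f : functions) {
            counters[f.hash(value)]--;          // corrupts the state if value was never added
        }
    }
}
```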
4. Bloom filter in Redis
4.1 RedisBloom
The bitmap provided by Redis can be used to implement a Bloom filter, but you would have to design the mapping functions and handle all the details yourself, which is no different from the custom implementation above.
Since Redis 4.0, Bloom filters have been available as a module (plug-in): RedisBloom is loaded into the Redis server as a module and gives Redis powerful Bloom-filter deduplication capabilities.
There are two ways to install RedisBloom with Redis already installed:
4.1.1 Directly Compile and Install
git clone https://github.com/RedisBloom/RedisBloom.git
cd RedisBloom
make                                            # compiling generates a rebloom.so file
redis-server --loadmodule /path/to/rebloom.so
4.1.2 Using Docker for Installation
- Pull image (this image contains Redis and RedisBloom)
docker pull redislabs/rebloom:latest   # pull the image
- Create a new container with the image you just pulled, and run it
docker run -p 6379:6379 --name redisbloom_ysw redislabs/rebloom:latest   # run the container
- Use the Redis client to connect to the newly launched Redis service with RedisBloom
redis-cli
Basic commands of the Bloom filter (an example session follows this list):
- bf.add: adds an element to the Bloom filter
- bf.exists: determines whether an element is in the Bloom filter
- bf.madd: adds multiple elements to the Bloom filter, whereas bf.add can only add one
- bf.mexists: determines whether multiple elements are in the Bloom filter
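For example, a quick redis-cli session against the container started above might look like the following; the key name user_filter is arbitrary, and the replies follow the usual RedisBloom convention of 1 for added/present and 0 otherwise:

```
127.0.0.1:6379> bf.add user_filter "YuShiwen"
(integer) 1
127.0.0.1:6379> bf.exists user_filter "YuShiwen"
(integer) 1
127.0.0.1:6379> bf.exists user_filter "nobody"
(integer) 0
127.0.0.1:6379> bf.madd user_filter "a" "b" "c"
1) (integer) 1
2) (integer) 1
3) (integer) 1
127.0.0.1:6379> bf.mexists user_filter "a" "nobody"
1) (integer) 1
2) (integer) 0
```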
4.2 Using Lettuce + Redisson in Spring Boot projects
- Jedis and Lettuce are fairly plain Redis clients that provide very few high-level features. Jedis performs worse, so if you do not need Redis's advanced features we recommend Lettuce first.
- Redisson's advantage is that it provides many advanced Redis features out of the box. If your application needs Redis's advanced features, use Redisson; in that case it is common to use Lettuce and Redisson together.
4.2.1 Using Lettuce
- Introduce the dependencies in the POM:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
    <version>2.3.5.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-pool2</artifactId>
</dependency>
- The configuration file
spring:
  # redis configuration
  redis:
    database: 0
    host: 99.248.217.222
    port: 6379
    password:
    timeout: 6000
    lettuce:
      pool:
        max-active: 8
        max-wait: -1
        max-idle: 10
        min-idle: 2
- Writing configuration classes
/**
 * <p>Redis configuration class.
 * <p>Because the default template can only store strings, we customize the RedisTemplate so it can
 * store more types of data, and set up the serializers so that instance objects can be manipulated
 * conveniently.
 *
 * @author ysw
 * @date 2022-01-12
 */
@Configuration
public class LettuceRedisConfig {

    @Bean
    public RedisTemplate<String, Serializable> redisTemplate(LettuceConnectionFactory connectionFactory) {
        RedisTemplate<String, Serializable> redisTemplate = new RedisTemplate<>();
        // Set the serializer for keys
        redisTemplate.setKeySerializer(new StringRedisSerializer());
        // Set the serializer used for values
        redisTemplate.setValueSerializer(new GenericJackson2JsonRedisSerializer());
        // Hash keys also use String serialization
        redisTemplate.setHashKeySerializer(new StringRedisSerializer());
        redisTemplate.setHashValueSerializer(new Jackson2JsonRedisSerializer(Object.class));
        // Inject the LettuceConnectionFactory created by the starter;
        // you can also define your own factory and pass it in via @Qualifier("")
        redisTemplate.setConnectionFactory(connectionFactory);
        return redisTemplate;
    }
}
- Write your own Redis utility classes
The following five methods of RedisTemplate correspond to the five basic data structures in Redis (a one-line example of each follows the list):
- opsForValue(): corresponds to String (string) in Redis
- opsForHash(): corresponds to Hash in Redis
- opsForList(): corresponds to List in Redis
- opsForSet(): corresponds to Set in Redis
- opsForZSet(): corresponds to ZSet (sorted set) in Redis
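A minimal sketch with one call on each of these operations; the key names are made up, the RedisTemplate<String, Serializable> bean configured above is assumed, and imports are omitted as in the article's other snippets:

```java
public void demoOps(RedisTemplate<String, Serializable> redisTemplate) {
    redisTemplate.opsForValue().set("str:key", "hello");          // String
    redisTemplate.opsForHash().put("hash:key", "field", "value"); // Hash
    redisTemplate.opsForList().rightPush("list:key", "item");     // List
    redisTemplate.opsForSet().add("set:key", "member");           // Set
    redisTemplate.opsForZSet().add("zset:key", "member", 1.0);    // ZSet with score 1.0
}
```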
The specific tools are as follows:
/**
 * <p>Redis utility class.
 *
 * @author ysw
 */
@Component
public class RedisUtils {

    @Resource
    private RedisTemplate redisTemplate;

    /** Set a key-value pair. */
    public void set(String key, String value) {
        redisTemplate.opsForValue().set(key, value);
    }

    /** Set a key-value pair with a time to live, in the given unit. */
    public void set(String key, String value, long timeout, TimeUnit unit) {
        redisTemplate.opsForValue().set(key, value, timeout, unit);
    }

    /** Set the time to live of an existing key, in seconds. */
    public void expire(String key, long time) {
        redisTemplate.expire(key, time, TimeUnit.SECONDS);
    }

    /** Get the value for a key. */
    public String get(String key) {
        return String.valueOf(redisTemplate.opsForValue().get(key));
    }

    /** Delete the specified key and return whether the deletion succeeded. */
    public boolean delete(String key) {
        return redisTemplate.delete(key);
    }

    /** Add a member with a score to a sorted set. */
    public void zSetOrderPrepare(String key, BackOrderPrepareVO member, int score) {
        ZSetOperations<String, BackOrderPrepareVO> operations = redisTemplate.opsForZSet();
        operations.add(key, member, score);
    }

    /** Check whether the specified key exists. */
    public boolean existsOrderPrepareKey(String key) {
        return redisTemplate.hasKey(key);
    }

    /** Get all members of a sorted set, highest score first. */
    public Set getAllOrderPrepareValueAndScore(String key) {
        return redisTemplate.opsForZSet().reverseRange(key, 0, -1);
    }
}
At this point the integration is complete, and elsewhere in the project we can call the methods of this utility class to interact with the Redis service.
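For example, a minimal usage sketch of the utility class; the service name and key prefix are hypothetical, and imports are omitted as in the article's other snippets:

```java
@Service
public class UserCacheService {

    @Resource
    private RedisUtils redisUtils;

    public void cacheNickname(String userId, String nickname) {
        // cache the nickname for 30 minutes
        redisUtils.set("user:nickname:" + userId, nickname, 30, TimeUnit.MINUTES);
    }

    public String getNickname(String userId) {
        return redisUtils.get("user:nickname:" + userId);
    }

    public void evictNickname(String userId) {
        redisUtils.delete("user:nickname:" + userId);
    }
}
```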
4.2.2 Using Redisson
- Introduce the dependency in the POM:
<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.11.1</version>
</dependency>
- The configuration file is the same as for Lettuce.
- Write relevant configuration classes
/**
 * <p>Redisson configuration class.
 *
 * @author ysw
 * @date 2022-01-12
 */
@Configuration
public class RedissonConfig {

    @Value("${spring.redis.host:99.248.217.222}")
    private String host;

    @Value("${spring.redis.port:6379}")
    private String port;

    @Bean
    public RedissonClient redissonClient() {
        Config config = new Config();
        // Redis runs in single-server mode here;
        // the address passed to Redisson must start with redis://
        config.useSingleServer()
                .setAddress("redis://" + host + ":" + port);
        return Redisson.create(config);
    }
}
- Bloom filter demo
@Slf4j
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class RedissonBloomFilter {

    @Resource(name = "redissonClient")
    private RedissonClient redissonClient;

    @Test
    public void bloomFilterDemo() {
        RBloomFilter<Object> bloomFilter = redissonClient.getBloomFilter("bloom-filter");
        // Initialize: expected insertions 10000, false positive rate 1/1000
        bloomFilter.tryInit(10000, 0.001);
        // Add 10000 elements
        for (int i = 0; i < 10000; i++) {
            bloomFilter.add("YuShiwen" + i);
        }
        // Count the number of false positives
        int count = 0;
        // Query non-existent data 1000 times
        for (int i = 0; i < 1000; i++) {
            if (bloomFilter.contains("xiaocheng" + i)) {
                count++;
            }
        }
        System.out.println("Number of false positives: " + count);
        System.out.println("Is YuShiwen9999 present in the filter: " + bloomFilter.contains("YuShiwen9999"));
        System.out.println("Is YuShiwen11111 present in the filter: " + bloomFilter.contains("YuShiwen11111"));
        System.out.println("Expected number of insertions: " + bloomFilter.getExpectedInsertions());
        System.out.println("False positive rate: " + bloomFilter.getFalseProbability());
        System.out.println("Number of hash functions: " + bloomFilter.getHashIterations());
        System.out.println("Number of inserted objects: " + bloomFilter.count());
    }
}
Output result:
Is YuShiwen9999 present in the filter: true
Is YuShiwen11111 present in the filter: false
Expected number of insertions: 10000
False positive rate: 0.001
Number of hash functions: 10
Number of inserted objects: 9999