A common step in the actual development process is to determine whether the current key exists.

This blog is divided into three parts:

1. Compare the performance of the current key in several ways. 2. Redis implements Bloom filter and inserts data in batches, and determines whether the current key value exists. 3. Make a summary for the above. I. Performance Comparison The following methods are tested and compared:

1, List contains method

2. Map containsKey method

3, Google Bloom filter mightContain method

The premise to prepare

When the SpringBoot project starts, 5 million 32-bit strings are distributed to the List, Map, and Google Bloom filters.

@slf4j @RestController Public Class PerformanceController {Slf4j @restController public class PerformanceController {

Public static final int SIZE = 5000000; / * * * the list collection store data * / public static a list < String > list = Lists. NewArrayListWithCapacity (SIZE); * / / * * * map collections stored data public static map < String, Integer > map = Maps. NewHashMapWithExpectedSize (SIZE); / * * * guava bloom filter * / BloomFilter < String > BloomFilter = BloomFilter. Create (Funnels. UnencodedCharsFunnel (), the SIZE). */ public static List<String> exist = Lists. NewArrayList (); / timing tool class * * * * / public static Stopwatch Stopwatch = a Stopwatch. CreateUnstarted (); @postconstruct public void insertData() {for (int I = 0; i < SIZE; i++) { String data = UUID.randomUUID().toString(); data = data.replace("-", ""); Add (data); Map. put(data, 0); //3, save the local bloomfilter. put(data); If (I % 1000000 == 0) {exist. Add (data); if (I % 1000000 == 0) {exist. */ @requestMapping ("/list") public void existsList() {// Start (); For (String s: exist) {if (list. The contains (s)) {the info (" list collection is the data = = = = = = = = = = = = = {} ", s); Stopwatch.stop (); Log. Info (" Elapsed ", stopwatch.elapsed(MILLISECONDS)); stopwatch.reset(); */ @requestMapping ("/map") public void existsMap() {// start stopwatch.start(); For (String s: exist) {if (map) either containsKey (s)) {the info (" map collections is the data = = = = = = = = = = = = = {} ", s); Stopwatch.stop (); Log. Info ("map set test to determine if elapsed time exists :{}", stopwatch.Elapsed (MILLISECONDS)); stopwatch.reset(); */ @requestMapping ("/bloom") public void existsBloom() {// Start the timer stopwatch.start(); For (String s: exist) {if (bloomfilter.mightcontain (s)) {log.info("guava bloomfilter.mightcontain (s) ============= data {}", s); Stopwatch.stop (); Log. Info (" Bloom set test, time :{}", stopwatch.Elapsed (MILLISECONDS)); stopwatch.reset(); }Copy the code

} 2. Test the output

The test results

Each of these methods is executed five times, so if you count them once, then you can roughly get 5 million pieces of data, each of which is a 32-bit String.

1. The contains method of List takes about 80 milliseconds to execute. 2. Map’s containsKey method takes less than 1 millisecond to execute. 3, Google Bloom filter mightContain method, no more than 1 ms. conclusion

The reason why maps are more efficient than lists is not to mention that they are all so fast. I also tested a million pieces of data and traversed the key through the list in less than a millisecond. This shows that in the actual development process, if the data

In small quantities, it doesn’t matter where you use it.

From the above execution efficiency point of view, Google Bloom filter actually has no advantage at all, indeed if the amount of data is small, completely through the above can be solved, do not need to consider the bloom filter, but if the amount of data is huge, tens of millions or even hundreds of millions of levels

Don’t do that. You can’t do collections, not because the execution efficiency is unacceptable, but the memory footprint is unacceptable.

Let’s figure out how much memory it takes to store 5 million rows of data in a List with a key of 32 bytes.

5 million * 32 = 16000000 bytes ≈ 152MB

This is obviously unacceptable for a collection to take up so much memory.

So let’s figure out how much memory the Bloom filter takes up

Let the bit array size be M, the sample number be N, and the error rate be P.

N = 5 million, p = 3% (Google Bloom filter default is 3%, we can also change it)

The formula is obtained as follows:

M material 16.7 MB

Is it possible to receive more.

The Google Bloom filter also has major drawbacks

1. Re-store data into Google Bloom filters every time a project starts, consuming additional resources. 2. In distributed cluster deployment architecture, a copy of the same data must be stored in bloom filter on each cluster node. 3. As the data volume increases, bloom filters take up a large amount of JVM memory, which is clearly not reasonable. A better solution is to use Redis as a Bloom filter for distributed clusters.

If you are not using Docker, you will need to deploy Redis on the server first and then install the Redis filter plugin rebloom.

If you’ve used Docker before, deployment is very simple with the following command:

Docker run -p 6379:6379 Redislabs /rebloom #

2, Lua batch insert script SpringBoot complete code HERE I will not paste out, at the end of the article I will attach the github address of the whole project, here is only the meaning of the script:

bloomFilter-inster.lua

local values = KEYS local bloomName = ARGV[1] local result_1 for k,v in ipairs(values) do result_1 = Redis. Call (‘ bf. ADD’,bloomName,v) end return result_1 1) Parameter Description

KEYS and ARGV[1] are passed in the Java code. RedisTemplate has a method

execute(RedisScript script, List keys, Object… Args) script entities that encapsulate bulk insert Lua scripts. Keys Keys for the script. ARGV[1] For the first variable argument, if you enter more than one variable argument, you can use ARGV[2]….. To obtain. 2) traversal

Lua traversals scripts in two different ways: ipairs and pairs. I don’t want to expand it here. Here’s a blog for reference.

Note that Lua traversals are a little different from Java, where we start at 0, whereas Lua script k starts at 1.

3) Insert command

ADD is the command to insert data into the Bloom filter, return true on success.

3. Check whether the Bloomfilter-exist. Lua script exists

local bloomName = KEYS[1] local value = KEYS[2] — bloomFilter local result_1 = redis.call(‘BF.EXISTS’, bloomName, Value) return result_1 KEYS[1] is set to get(0), so Lua traversal starts from 1.

Bf. EXISTS Indicates whether the data command EXISTS in a Bloom filter. If it EXISTS, true is returned.

4, test let’s test whether it is successful.

@Slf4j @RestController public class RedisBloomFilterController {

@Autowired private RedisService redisService; public static final String FILTER_NAME = "isMember"; @requestMapping ("/save-redis-bloom") public Object saveReidsBloom() {List<String>  exist = Lists.newArrayList("11111", "22222"); Object object = redisService.addsLuaBloomFilter(FILTER_NAME, exist); Log.info (" saved successfully ====object:{}",object); return object; } @requestMapping ("/ existsReidsBloom ") public void existsReidsBloom() {if (! RedisService. ExistsLuabloomFilter (FILTER_NAME, "00000")) {the info (" redis bloom filter does not exist the data = = = = = = = = = = = = = {} ", "00000"); } / / there is output if (redisService existsLuabloomFilter (FILTER_NAME, "11111")) {the info (" redis bloom filter is the data = = = = = = = = = = = = = {} ", "11111"); }}Copy the code

}} Insert interface first, insert two data, if return true, it indicates success, if the same data is inserted successfully for the first time, return false for the second time, it indicates failure to insert the same value repeatedly.

Then adjust the query interface, here should be two logs will output, because the above “00000” is inverse, many! Number.

Let’s look at the final result.

As we expected, the Redis Bloom filter was successful from deployment to integration with SpringBoot.

The following is a personal summary of the whole. The main point is to think about the circumstances under which one of the above methods can be considered to determine the presence of the element.

1, the amount of data is not large, and there can be no error. So you can use either List or Map. Although List is used to judge whether the element exists by traversing the set, its performance is worse than Map, but just like the above test, 1 million data,

List traversal and Map are within 1 millisecond, so it doesn’t matter which one you choose. Why care about 0? A few milliseconds.

2. The amount of data is small and error is allowed. This is a good place to consider using Google Bloom filters. Although the efficiency of querying data is almost the same, the key is that they reduce the memory overhead, which is key.

3. Large amount of data, and no error. If a large number of data is used to improve the efficiency of querying the existence of elements, I think it is not right to use Map, because if the amount of data is large, it will occupy more memory, so I recommend using Map

Redis’s Map data structure is used to store data, which can greatly reduce JVM memory overhead and does not require data to be stored in the collection with every restart.

4. Large amount of data, and error is allowed. If it is a single application and the amount of data can be received by the memory, then consider Google Bloom filter, because it will be faster than Redis query. After all, it requires no network IO overhead.

If you have a distributed cluster architecture or a very large amount of data, consider using a Redis Bloom filter because it does not need to store data on every node and does not use JVM memory.