Bloom filter research and design Go implementation

1. Definition:

A Bloom filter is a set-like data structure that is spatially more efficient than traditional set-like data structures, such as hash tables or trees. A Bloom filter can be 100 percent sure that something is not in a collection, but it cannot be 100 percent sure that something is in a collection.

2. Type:

Local memory based Bloom filter
File-based Bloom filter
Redis based Bloom filter (distributed available) (plug-in (not considered), redis comes with its own BIT)

3. Application Scenarios

Prevent cache penetration

4. Implementation:

var bloomInstance *RedisBloomFilter const RedisMaxLength = 8 * 512 * 1024 * 1024 type BloomFilter interface { Put([]byte) error PutString(string) error Has([]byte) (bool, error) HasString(string) (bool, error) getKeyAndOffset(offset uint) (string, uint) } func BloomInstance() *RedisBloomFilter { if bloomInstance ! = nil{ return bloomInstance } Init() return bloomInstance } type RedisBloomFilter struct { keyPrefix string n uint k Func HashData(data []byte) [4]uint64 {A1 := []byte{1} // To grab another bit of data hasher := murmur3.New128() hasher.Write(data) // #nosec v1, v2 := hasher.Sum128() hasher.Write(a1) // #nosec v3, Sum128() RETURN [4] Uint64 {v1, v2, v3, V4,}} // Create a Redis bloomfilter func NewRedisBloomFilter(muint, p float64, kp string) *RedisBloomFilter { n, k := bloom.EstimateParameters(m, p) filter := &RedisBloomFilter{ n: n, k: k, keyPrefix: kp, } cli := redis.RedisInstance() key, value := filter.getKeyAndOffset(n) cli.Do("SETBIT", key, value, 0) return filter } func Init(){ fmt.Println(os.Args[0]) key := conf.GetConfig().MustValue("bloom","key_prefix","") if key == ""{ panic("bloom:get key fail") } data_number := conf.GetConfig().MustValue("bloom","data_number","") False_positive := conf.getconfig ().mustvalue ("bloom","false_positive","") m,_ := strconv.parseuint (data_number,10,64) p,_  :=strconv.ParseFloat(false_positive,64) bloomInstance = NewRedisBloomFilter(uint(m),p,key) fmt.Println("redis init Func (filter *RedisBloomFilter) Put(data []byte) error {cli := redis.redisinstance () h := HashData(data) for i := uint(0); i < filter.k; i++ { key, value := filter.getKeyAndOffset(uint(location(h, i) % uint64(filter.n))) _, err := cli.Do("SETBIT", key, value, 1) if err ! Func (filter *RedisBloomFilter) PutString(data String) error {return Filter.put ([]byte(data))} func (filter *RedisBloomFilter) Has(data []byte) (bool, error) { cli := redis.RedisInstance() h := HashData(data) for i := uint(0); i < filter.k; i++ { key, value := filter.getKeyAndOffset(uint(location(h, i) % uint64(filter.n))) bitValue,err := cli.GetBitInt(key, value) if err ! = nil { return false, err } if bitValue == 0 { return false, nil } } return true, Func (filter *RedisBloomFilter) HasString(data string) (bool, error) { return filter.Has([]byte(data)) } func (filter *RedisBloomFilter) getKeyAndOffset(offset uint) (string, uint) { n := uint(offset / RedisMaxLength) thisOffset := offset - n*RedisMaxLength key := fmt.Sprintf("%s:%d", filter.keyPrefix, n) return key, thisOffset } func location(h [4]uint64, i uint) uint64 { ii := uint64(i) return h[ii%2] + ii*h[2+(((ii+(ii%2))%4)/2)] }Copy the code

5. Pressure test report:

Configuration: memory 16GB Tool: JMeter

The test uses 100000 threads

Test results:

The query does not have data

	QPS	Error rate
The use of Bloom	13832.4 / SEC	0%
Do not use the Bloom	2470/sec	0%

QPS(Bloom)/QPS(Bloom)≈5.60

Conclusion: Using bloom filters improves the response time for user input with no data returned, and also protects the database from cache penetration

6. Question:

Problem setting the initial bitset size

func EstimateParameters(n uint, p float64) (m uint, k uint) {
   m = uint(math.Ceil(-1 * float64(n) * math.Log(p) / math.Pow(math.Log(2), 2)))
   k = uint(math.Ceil(math.Log(2) * float64(m) / float64(n)))
   return
}
Copy the code

Hashing algorithm selection problem

Hashing algorithm collision rate (common)

	Average execution time of 10 million	The average execution time of 100 million	The average length of 100 million times
MurmurHash	0.0066868	0.01194736	19
CityHash	0.0066179	0.01129171	19
crc64	0.0064459	0.01242473	19

This implementation mainly uses Murmur3 algorithm

Is it better to set it before or after caching

This parameter is selected based on service scenarios. To improve user access efficiency, add it after cache

How to ensure the consistency of data in bloom filter and database data, and how to prevent leakage

Flush all existing data in the database into the Bloom filter

How to synchronize new data to bloom filter

Use coroutines to synchronize newly added data to the Bloom filter

Bloom filter research and design Go implementation

Related Posts

Macos installed FFMPEG and the solution to the problem, a success

Java multithreaded access to eight use scenarios of Synchronized methods

2021 Year-end summary – Review is important!!