1. Definition:
A Bloom filter is a set-like data structure that is spatially more efficient than traditional set-like data structures, such as hash tables or trees. A Bloom filter can be 100 percent sure that something is not in a collection, but it cannot be 100 percent sure that something is in a collection.
2. Type:
  • Local memory based Bloom filter
  • File-based Bloom filter
  • Redis based Bloom filter (distributed available) (plug-in (not considered), redis comes with its own BIT)
3. Application Scenarios
Prevent cache penetration

4. Implementation:

var bloomInstance *RedisBloomFilter const RedisMaxLength = 8 * 512 * 1024 * 1024 type BloomFilter interface { Put([]byte) error PutString(string) error Has([]byte) (bool, error) HasString(string) (bool, error) getKeyAndOffset(offset uint) (string, uint) } func BloomInstance() *RedisBloomFilter { if bloomInstance ! = nil{ return bloomInstance } Init() return bloomInstance } type RedisBloomFilter struct { keyPrefix string n uint k Func HashData(data []byte) [4]uint64 {A1 := []byte{1} // To grab another bit of data hasher := murmur3.New128() hasher.Write(data) // #nosec v1, v2 := hasher.Sum128() hasher.Write(a1) // #nosec v3, Sum128() RETURN [4] Uint64 {v1, v2, v3, V4,}} // Create a Redis bloomfilter func NewRedisBloomFilter(muint, p float64, kp string) *RedisBloomFilter { n, k := bloom.EstimateParameters(m, p) filter := &RedisBloomFilter{ n: n, k: k, keyPrefix: kp, } cli := redis.RedisInstance() key, value := filter.getKeyAndOffset(n) cli.Do("SETBIT", key, value, 0) return filter } func Init(){ fmt.Println(os.Args[0]) key := conf.GetConfig().MustValue("bloom","key_prefix","") if key == ""{ panic("bloom:get key fail") } data_number := conf.GetConfig().MustValue("bloom","data_number","") False_positive := conf.getconfig ().mustvalue ("bloom","false_positive","") m,_ := strconv.parseuint (data_number,10,64) p,_  :=strconv.ParseFloat(false_positive,64) bloomInstance = NewRedisBloomFilter(uint(m),p,key) fmt.Println("redis init Func (filter *RedisBloomFilter) Put(data []byte) error {cli := redis.redisinstance () h := HashData(data) for i := uint(0); i < filter.k; i++ { key, value := filter.getKeyAndOffset(uint(location(h, i) % uint64(filter.n))) _, err := cli.Do("SETBIT", key, value, 1) if err ! Func (filter *RedisBloomFilter) PutString(data String) error {return Filter.put ([]byte(data))} func (filter *RedisBloomFilter) Has(data []byte) (bool, error) { cli := redis.RedisInstance() h := HashData(data) for i := uint(0); i < filter.k; i++ { key, value := filter.getKeyAndOffset(uint(location(h, i) % uint64(filter.n))) bitValue,err := cli.GetBitInt(key, value) if err ! = nil { return false, err } if bitValue == 0 { return false, nil } } return true, Func (filter *RedisBloomFilter) HasString(data string) (bool, error) { return filter.Has([]byte(data)) } func (filter *RedisBloomFilter) getKeyAndOffset(offset uint) (string, uint) { n := uint(offset / RedisMaxLength) thisOffset := offset - n*RedisMaxLength key := fmt.Sprintf("%s:%d", filter.keyPrefix, n) return key, thisOffset } func location(h [4]uint64, i uint) uint64 { ii := uint64(i) return h[ii%2] + ii*h[2+(((ii+(ii%2))%4)/2)] }Copy the code

5. Pressure test report:

Configuration: memory 16GB Tool: JMeter

The test uses 100000 threads

Test results:

The query does not have data

QPS Error rate
The use of Bloom 13832.4 / SEC 0%
Do not use the Bloom 2470/sec 0%

QPS(Bloom)/QPS(Bloom)≈5.60

Conclusion: Using bloom filters improves the response time for user input with no data returned, and also protects the database from cache penetration

6. Question:

  • Problem setting the initial bitset size
func EstimateParameters(n uint, p float64) (m uint, k uint) {
   m = uint(math.Ceil(-1 * float64(n) * math.Log(p) / math.Pow(math.Log(2), 2)))
   k = uint(math.Ceil(math.Log(2) * float64(m) / float64(n)))
   return
}
Copy the code
  • Hashing algorithm selection problem

Hashing algorithm collision rate (common)

100000 500000 1 million 5 million 10 million 100 million Average execution time of 10 million The average execution time of 100 million The average length of 100 million times
MurmurHash 0 0 0 0 0 0 0.0066868 0.01194736 19
CityHash 0 0 0 0 0 0 0.0066179 0.01129171 19
crc64 0 0 0 0 0 0 0.0064459 0.01242473 19

This implementation mainly uses Murmur3 algorithm

  • Is it better to set it before or after caching

This parameter is selected based on service scenarios. To improve user access efficiency, add it after cache

  • How to ensure the consistency of data in bloom filter and database data, and how to prevent leakage

Flush all existing data in the database into the Bloom filter

  • How to synchronize new data to bloom filter

Use coroutines to synchronize newly added data to the Bloom filter