A colleague on the project team came to me today for help with a Redis problem.

It turns out that he uses a Redis Bitmap to implement a Bloom filter, which records the IDs of the content a user has already read so that we can tell whether an item has been read or not. This costs far less memory than storing the IDs in a Set. He performs two main operations on the Bitmap:

The write operation

After the user has read a piece of content, the setBit method is called to mark the corresponding offset.

redisTemplate.opsForValue().setBit(key, i, true);

The read operation

When the client requests the next page of data, the recalled content has to be deduplicated against what the user has already read. The usual approach is to call the getBit method in a loop, as follows:

  for (int i : offset) {
    if (!redisTemplate.opsForValue().getBit(key, i)) {
      return false;
    }
  }

However, for performance reasons, and to avoid a large number of frequent requests to Redis, my colleague did not call getBit directly. Instead, he read the whole value as a string, converted it into a byte array, and then expanded each bit into its own byte for storage.

        byte[] bitmapByte = new byte[0];
        String value = Optional.ofNullable(redisTemplate.opsForValue().get(key)).orElse("");
        if (StringUtils.isBlank(value)) {
            return Collections.emptyList();
        }
        try {
            bitmapByte = value.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            log.error("Failed to get byte array");
        }
        List<Byte> bitMap = new ArrayList<>(bitmapByte.length * 8);
        for (byte b : bitmapByte) {
            bitMap.addAll(getByteArray(b));
        }

The core of his scheme is to write the Bitmap one bit at a time and read it back in a single request. But the actual result was not what he expected: in the converted byte array only the first eight bits are correct, and all the subsequent bytes are wrong.
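
The getByteArray helper called in the code above is not shown in the original. As a rough sketch only, assuming it expands one byte into eight 0/1 values with the most significant bit first (so that the list index lines up with the Redis bit offset), it might look like this. The class name BitUnpack is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BitUnpack {
    // Hypothetical stand-in for the getByteArray helper, which the original code does not show.
    // Expands one byte into eight 0/1 values, most significant bit first,
    // so that list index i corresponds to Redis bit offset i within that byte.
    static List<Byte> getByteArray(byte b) {
        List<Byte> bits = new ArrayList<>(8);
        for (int shift = 7; shift >= 0; shift--) {
            bits.add((byte) ((b >> shift) & 1));
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(getByteArray((byte) 0x64)); // [0, 1, 1, 0, 0, 1, 0, 0]
        System.out.println(getByteArray((byte) 0x80)); // [1, 0, 0, 0, 0, 0, 0, 0]
    }
}
```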

What’s the problem?

Setting bits 1, 2, 5, and 8 as test data works fine. Reading the key from the command line returns d\x80, where \x80 is the escaped form of a non-printable byte, so the stored data is also correct.

Going back over the code, nothing looked wrong at first glance, and we were at a loss.

Think back to the Bitmap implementation.

Inside Redis, bitmaps are stored as strings. A Bitmap is a RedisObject of type REDIS_STRING. The ptr pointer of the RedisObject points to an SDS. SDS is an enhanced character array object implemented in C, and its content is stored in the buf array.

So with bits 1, 2, 5, and 8 set to 1, this is how the data is actually stored in Redis:

  • buf[0] = 0b01100100
  • buf[1] = 0b10000000

Note that within each byte of buf, offsets are counted from the highest bit down to the lowest bit, the opposite of the usual low-to-high order.
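
To make the layout concrete, here is a small sketch that rebuilds buf from a list of bit offsets using the high-to-low order described above. It does not talk to Redis; the class and method names (BitLayout, toBuf, toBinary) are made up for illustration.

```java
final class BitLayout {
    // Builds the byte array that results from setting the given bit offsets,
    // using the high-to-low layout: offset 0 is the highest bit of buf[0].
    static byte[] toBuf(int... offsets) {
        int max = 0;
        for (int o : offsets) {
            max = Math.max(max, o);
        }
        byte[] buf = new byte[max / 8 + 1];
        for (int o : offsets) {
            buf[o / 8] |= (byte) (0x80 >> (o % 8));
        }
        return buf;
    }

    // Formats a byte as an 8-digit binary literal, treating it as unsigned.
    static String toBinary(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "0b" + String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] buf = toBuf(1, 2, 5, 8);
        System.out.println(toBinary(buf[0])); // 0b01100100
        System.out.println(toBinary(buf[1])); // 0b10000000
    }
}
```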

I noticed that buf[0] holds an ASCII character and converts correctly, while buf[1] is -128 (as a signed Java byte), which is not an ASCII character, and its conversion fails.

So I suspected that the problem was in the character encoding.

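It turns out that a lone byte of -128 (0x80) is not a valid UTF-8 sequence, so decoding it yields the replacement character U+FFFD, and encoding that character back to UTF-8 produces three bytes: -17, -65, and -67. The two's complement representation of -17 is 11101111, which is exactly the wrong second group of 8 bits that we saw.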
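
The round trip can be reproduced without Redis. The following sketch decodes the two stored bytes as UTF-8 and encodes the resulting string back, which is roughly what happens when the value passes through a UTF-8 string and then getBytes("UTF-8"). The class name Utf8RoundTrip is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
    // Decodes raw bytes as UTF-8 and encodes the resulting String back to UTF-8.
    // Invalid input bytes are replaced by U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {0x64, (byte) 0x80}; // the two bytes stored for bits 1, 2, 5, 8
        System.out.println(Arrays.toString(roundTrip(raw))); // [100, -17, -65, -67]
    }
}
```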

Wouldn't it be enough to change the encoding to ASCII, then? My colleague tried that immediately.

It doesn't work, because -128 is not an ASCII character, and when you convert it to ASCII it becomes "?", which is always the value 63.

ASCII uses only 7 bits, and UTF-8 is variable-length (8, 16, 24, or 32 bits per character), so we looked for a fixed 8-bit character encoding and found ISO_8859_1.
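
The difference between the two single-byte charsets can be checked in isolation. This sketch decodes and re-encodes the same bytes with a given charset; the class name SingleByteCharsets is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SingleByteCharsets {
    // Decodes raw bytes with the given charset and encodes them back with the same charset.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] raw = {0x64, (byte) 0x80};
        // ASCII cannot represent 0x80, so it comes back as '?' (63).
        System.out.println(Arrays.toString(roundTrip(raw, StandardCharsets.US_ASCII)));   // [100, 63]
        // ISO_8859_1 maps every byte value to exactly one character, so nothing is lost.
        System.out.println(Arrays.toString(roundTrip(raw, StandardCharsets.ISO_8859_1))); // [100, -128]
    }
}
```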

Would replacing the encoding in the code above with ISO_8859_1 fix it? Our test result was still 63. The string returned by the RedisTemplate has already been decoded as UTF-8, so the conversion is still wrong. The raw binary data transmitted between our service and Redis is correct, so the problem must occur inside RedisTemplate, where a ValueSerializer converts the byte array.

Checking the StringRedisTemplate source code shows that its ValueSerializer is a StringRedisSerializer.
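
Why the result is still 63 can be shown with a small simulation. It assumes that the template's value deserialization amounts to decoding the raw bytes with the serializer's charset; the deserialize method and the class name SerializerMismatch are stand-ins made up for illustration, not the library's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SerializerMismatch {
    // Stand-in (assumption) for the template's value deserialization:
    // decode the raw bytes from Redis using the serializer's charset.
    static String deserialize(byte[] raw, Charset serializerCharset) {
        return new String(raw, serializerCharset);
    }

    public static void main(String[] args) {
        byte[] raw = {0x64, (byte) 0x80};

        // UTF-8 serializer, ISO_8859_1 in our own code: 0x80 was already replaced
        // by U+FFFD, which ISO_8859_1 cannot encode, so it becomes '?' (63).
        String viaUtf8 = deserialize(raw, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(viaUtf8.getBytes(StandardCharsets.ISO_8859_1))); // [100, 63]

        // ISO_8859_1 on both sides: the original bytes survive.
        String viaLatin1 = deserialize(raw, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(viaLatin1.getBytes(StandardCharsets.ISO_8859_1))); // [100, -128]
    }
}
```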

    public StringRedisTemplate() {
        RedisSerializer<String> stringSerializer = new StringRedisSerializer();
        this.setKeySerializer(stringSerializer);
        this.setValueSerializer(stringSerializer);
        this.setHashKeySerializer(stringSerializer);
        this.setHashValueSerializer(stringSerializer);
    }

In the implementation of StringRedisSerializer, the default character encoding is UTF-8:

    private final Charset charset;

    public StringRedisSerializer() {
        this(StandardCharsets.UTF_8);
    }

    public StringRedisSerializer(Charset charset) {
        Assert.notNull(charset, "Charset must not be null!");
        this.charset = charset;
    }


Now the cause is clear: we just need to change the character encoding used by StringRedisSerializer. But this raises a new problem. StringRedisTemplate is already instantiated as a singleton, and modifying it directly would affect its use elsewhere.

So we created a new BitRedisTemplate modeled on StringRedisTemplate. The only difference is that its StringRedisSerializer uses the ISO_8859_1 character set. We then configured the BitRedisTemplate in RedisConfiguration.

    // Named bitRedisTemplate so the bean name does not collide with the existing stringRedisTemplate bean.
    @Bean
    @ConditionalOnMissingBean
    public BitRedisTemplate bitRedisTemplate(RedisConnectionFactory redisConnectionFactory) {
        BitRedisTemplate template = new BitRedisTemplate();
        template.setConnectionFactory(redisConnectionFactory);
        return template;
    }

Then replace the redisTemplate in the code above with the BitRedisTemplate, and everything works.
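
The BitRedisTemplate class itself is not shown in the original. As a rough sketch only, assuming it mirrors the StringRedisTemplate constructor quoted earlier and swaps in ISO_8859_1, it might look like the following. It needs spring-data-redis on the classpath and is not the author's actual class.

```java
import java.nio.charset.StandardCharsets;

import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.serializer.StringRedisSerializer;

// Sketch (assumption): same shape as StringRedisTemplate, but every serializer uses ISO_8859_1,
// so each byte stored in Redis maps to exactly one char and back.
public class BitRedisTemplate extends RedisTemplate<String, String> {

    public BitRedisTemplate() {
        StringRedisSerializer latin1Serializer = new StringRedisSerializer(StandardCharsets.ISO_8859_1);
        setKeySerializer(latin1Serializer);
        setValueSerializer(latin1Serializer);
        setHashKeySerializer(latin1Serializer);
        setHashValueSerializer(latin1Serializer);
    }
}
```

On the reading side, the string returned by this template would then be converted with value.getBytes(StandardCharsets.ISO_8859_1) rather than "UTF-8", so that both conversions agree. One design caveat: with ISO_8859_1 as the key serializer too, keys containing characters outside Latin-1 would not survive, so this sketch assumes plain ASCII keys.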

To sum up, the key point is this: if every stored byte happens to be an ASCII character, there is no problem. But as soon as the highest bit of any byte is set, character encoding problems appear. ISO_8859_1 is a good choice, but make sure that every step in the chain uses ISO_8859_1.