preface
Lucene code volume is still more, in the case of not looking very clear, first write a new tool class to write some operations ~ is also a lot of harvest.
When Lucene writes to index files, it often compresses the data to save space. This article introduces a useful compression method for int, long. Variable-length storage.
It is widely used in Lucene, so in order to understand the code proficiently, let’s take a look
In The 8.7.0 version of Lucene code, it is not defined as a separate class, probably because it is a small function point
Of variable length data written to the implementation in the org.. Apache lucene. Store. DataOutput# writeVInt, for the implementation of reading longer data in org.. Apache lucene. Store. DataInput# readVInt.
define
What is variable-length storage? Let’s take writeVInt as an example and look at the comment:
Writes an int in a variable-length format. Writes between one and five bytes. Smaller values take fewer bytes. Negative numbers are supported, but should be avoided.
VByte is a variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, Values from 128 to 16,383 May be stored in two bytes, and so on.
A brief translation:
Writes an integer in variable length format. Write 1-5 bytes. Smaller values take up fewer bytes. Support negative numbers but try not to use them.
VByte is a variable-length format of positive integers. The high end of each byte is used to identify whether more bytes need to be read. The lowest seven bits represent the actual data. Append the progressively read lower and higher highs to the original integer.
0 127 takes only one byte, 128 16383 takes two bytes, and so on.
The smaller the number, the higher the compression rate. If the number is full of the largest ints, the more bytes are needed to store them.
implementation
We implement a simple utility class, can achieve the above variable length storage (Lucene code copy out), in addition to providing some methods to help us look at the source.
public class VariableInt {
/** * transfer int to byte[] use variable format */
public static byte[] writeVInt(int i) {
int bytesRequired = bytesRequired(i);
byte[] res = new byte[bytesRequired];
int idx =0;
while ((i & ~0x7F) != 0) {
res[idx++] = ((byte) ((i & 0x7F) | 0x80));
i >>>= 7;
}
res[idx] = (byte) i;
return res;
}
/** * transfer byte[] to int use variable format */
public static int readVInt(byte [] vs) throws IOException {
int idx = 0;
byte b = vs[idx++];
// If the value is greater than 0, the first digit is 0, indicating that no data needs to be read
if (b >= 0) return b;
int i = b & 0x7F;
b = vs[idx++];
i |= (b & 0x7F) < <7;
if (b >= 0) return i;
b = vs[idx++];
i |= (b & 0x7F) < <14;
if (b >= 0) return i;
b = vs[idx++];
i |= (b & 0x7F) < <21;
if (b >= 0) return i;
b = vs[idx];
// Warning: the next ands use 0x0F / 0xF0 - beware copy/paste errors:
i |= (b & 0x0F) < <28;
if ((b & 0xF0) = =0) return i;
throw new IOException("Invalid vInt detected (too many bits)");
}
/** * compute int need bytes. */
public static int bytesRequired(int i) {
if (i < 0) throw new RuntimeException("I Don't Like Negative.");
if ((i >>> 7) = =0) return 1;
if ((i >>> 14) = =0) return 2;
if ((i >>> 21) = =0) return 3;
if ((i >>> 28) = =0) return 4;
return 5; }}Copy the code
In addition to read-write accidents, provides a method to calculate how many bytes an int number needs to be stored. When we debug the source code, this helps us analyze the index files that are written.
The code for VariableLong will not be posted. It’s basically the same thing as Variable, but the length goes from 1-5 to 1-9.
Zigzag encoding
In Lucene’s implementation of DataOutPut, we can see the writeZint(int I) method, which, as we learned, uses ZigZag encoding + variable length storage to store an integer.
What is ZigZag coding?
First, let’s review computer code:
- Source code: the highest bit is a sign bit, the rest of the absolute value;
- Reverse code: in addition to the symbol bits, the remaining bits of the original code are reversed successively;
- Complement: For positive numbers, the complement is itself; For negative numbers, the remaining bits of the source code are reversed and then +1, except for the sign bits.
For convenience and other problems, computers use complement codes to store integers.
So we have a problem with our variable-length integers. He was very hostile to negative numbers.
- The value 1 is an int that stores 4 bytes. The variable length encoding above uses 1 byte.
- -1 is an int whose complement is:
11111111111111111111111111111111
1. You use variable-length encoding for storage, which requires 5 bytes, and the purpose of compression is not met. It takes up more space.
So, based on a common understanding: small integers are used a lot, so variable length coding is needed. There are also many small negative integers, variable length coding will compress the rate is not high or even reverse compression.
Hence the ZigZag code, which can handle negative numbers effectively. H (n) = (n << 1) ^ (n >> 31), h(n) = (n << 1) ^ (n >> 31)
The hash function looks like this:
Imagine what this hash function does?
For small negative integers:
- Moving 1 bit to the left cancels the sign bit and fills the low position with 0
- If the sign is moved 31 bits to the right, the sign bit is moved to the lowest. The highest value of negative numbers is supplemented by 1, and the highest value of positive numbers is supplemented by 0
- For positive numbers, the least significant sign bit is 0 and the other bits remain the same. For negative numbers, the least significant sign bit is 1 and the other bits are reversed
Then said into a 00000000000000000000000000000001-1, the relatively small, suitable for the use of variable length coding. 1 said into 00000000000000000000000000000010, although a little increase, but still very small, also suitable for the use of variable length coding.
To sum up:
Zigzag coding addresses the problem of low compression rates for small negative integers when using variable-length coding, based on the consensus that we use more small integers (both positive and negative). So we’re mapping negative integers to positive integers.
The corresponding table is:
The integer | zigzag |
---|---|
0 | 0 |
– 1 | 1 |
1 | 2 |
2 – | 3 |
2 | 4 |
– 3 | 5 |
3 | 6 |
Zigzag implementation
This ZigZag implementation is relatively simple, based on the variable-length coding already implemented above. Just implement a simple hash function.
/** * transfer int to byte[] use zig-zag-variable format */
public static byte[] writeZInt(int i) {
/ / zigzag encoding
i = (i >> 31) ^ (i << 1);
return writeVInt(i);
}
/** * transfer byte[] to int use zig-zag-variable format */
public static int readZInt(byte[] vs) throws IOException {
int i = readVInt(vs);
return ((i >>> 1) ^ -(i & 1));
}
Copy the code
Perfect.
conclusion
This article briefly introduced.
- Variable length encoding is used to compress integers. Good compression rate can be achieved for small positive integers.
- Zigzag coding for integers can solve the difficulty of low compression rate of variable length coding for small negative integers.
So, when you’re sure your numbers are small integers, use zigZag + variant-length coding to compress them. 25-50% compression is still possible.
Many open source programs that need serialization, all use ZigZag + variable length encoding to compress integers, such as Google’s Protobuf, Apache’s Avro project, Apache’s Lucene project, all use this combination in some scenarios, use it quickly ~
To the end.
All the above are personal thoughts, if there is any mistake welcome to comment.
Welcome to reprint, please sign and keep the original link.
Email address:[email protected]
For more study notes, see personal blog or follow wechat official account < huyan10 >——>HuYan ten