What is a DocValue?
After the row-oriented StoredField in the previous chapter, we'll move on to the column-oriented DocValue.
What is column storage, and how does it differ from row storage? Suppose we want to store a table with three fields and four docs. Row storage works like StoredField in the previous chapter: data is written to disk one doc at a time, and the fields of each doc are stored tightly together. Column storage writes the data to disk one field at a time, and under each field the values are stored in docID order:
So why do we need both row and column storage? There are two main reasons: performance and storage cost.
First, performance. The advantage of row storage is that it is very easy to retrieve a doc's original field information by docID: only one disk access is needed. But what if I want to sort, filter, or aggregate by fieldC, say to compute the sum of all values under fieldC? With row storage alone you would have to read almost the entire disk, even though only fieldC's data is useful. As we all know, disk reads are slow, so for sorting and filtering on a single field, row storage has no advantage. This is where column storage comes in, and its idea is very simple: since you only care about fieldC, I only read the fieldC portion, which greatly reduces the read time.
Second, storage cost. Suppose fieldC is a score used for internal ranking: it is only used for statistics, sorting, and filtering, never for display, and the scores of doc1 through doc4 are 34, 30, 24, and 32. With column storage you can easily subtract the minimum 24 to get 10, 6, 0, 8, and then pack them tightly with the number of bits needed for the maximum value, requiring only 4 bits * 4 = 16 bits, i.e. 2 bytes. With row storage, however, there is no global information (you only ever see a single value), so at best you can use VInt encoding, which needs at least 1 byte per number, 4 bytes in total.
Lucene has five types of docValues: SortedDocValues, SortedSetDocValues, NumericDocValues, SortedNumericDocValues, and BinaryDocValues. NumericDocValues stores numeric types, while SortedDocValues stores byte/string types. SortedSetDocValues and SortedNumericDocValues, compared with SortedDocValues and NumericDocValues, allow a field to have multiple values.
So what does "sorted" mean here? For SortedNumericDocValues, it means the multiple values within a field are stored in order. For SortedDocValues, it means the strings are sorted lexicographically and replaced by their ordinals; for example doc1: apple, doc2: banana, doc3: add ends up being stored as 1, 2, 0.
The reason I dwell on "sorted" is an unforgettable experience. At the time, my lead asked me to implement disk storage of a sorted doc value for float types and told me to learn from Lucene's approach. In our project, sorted_doc_value actually stored a sorted value->docID mapping, which can be handled in memory with a red-black tree or a skip list, so I had always assumed that this is what "sorted doc value" meant. Reading Lucene's source code, I felt something was off; it turned out the two have different definitions of "sorted".
Does Lucene provide a value->docID mechanism similar to a MySQL index? Lucene does provide index sorting (a sort map) that reorders the documents in the index by the values of one field, which achieves the same early-termination effect, but it only works for a single field, not multiple fields at once. So Lucene does not actually maintain a value->docID mapping internally.
Five types of DocValue indexes
Demo
Let's look at a demo that covers basically all the DocValues types:
public static void main(String[] args) throws IOException {
    // 0. Specify the analyzer for tokenizing text.
    //    The same analyzer should be used for indexing and searching
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // 1. create the index
    Directory directory = FSDirectory.open(Paths.get("tempPath"));
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter w = new IndexWriter(directory, config);
    addDoc(w, "Lucene in Action", "193398817", -5, new int[]{1, 2}, new String[]{"los angles", "beijing"});
    addDoc(w, "Lucene for Dummies", "55320055Z", 4, new int[]{5, 1}, new String[]{"shanghai", "beijing"});
    addDoc(w, "Managing Gigabytes", "55063554A", 12, new int[]{0, 1, 2}, new String[]{"shenzhen", "guangzhou"});
    addDoc(w, "The Art of Computer Science", "9900333X", 2, new int[]{10, 4, 3}, new String[]{"shanghai", "los angles"});
    addDoc(w, "C++ Primer", "914324235", 11, new int[]{0, 5, 2, 3}, new String[]{"beijing", "shenzhen"});
    addDoc(w, "I like Lucene", "fdsjfa2313", 1, new int[]{0, 1, 2, 4}, new String[]{"nanjing", "tianjin"});
    addDoc(w, "Lucene and C++ Primer", "fdsfaf", 10, new int[]{0, 1, 2}, new String[]{"shenzhen", "guangzhou"});
    addDoc(w, "C++ api", "411223432", 2, new int[]{0, 11, 2}, new String[]{"shenzhen", "shanghai"});
    addDoc(w, "C++ Primer", "914324236", 50, new int[]{3, 2, 6, 1}, new String[]{"beijing"});
    w.close();
    // 2. query
    String querystr = "lucene";
    // the "title" arg specifies the default field to use
    // when no field is explicitly specified in the query.
    // Query q = new TermQuery(new Term("title", querystr));
    // BooleanQuery query = new BooleanQuery();
    MatchAllDocsQuery q = new MatchAllDocsQuery();
    // sort
    SortField visitSort = new SortedNumericSortField("visit", SortField.Type.INT, true);
    Sort sort = new Sort(visitSort);
    // 3. search
    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(q, hitsPerPage, sort);
    ScoreDoc[] hits = docs.scoreDocs;
    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title") + "\t" + d.get("visit"));
    }
    // reader can only be closed when there
    // is no need to access the documents any more.
    reader.close();
}

private static void addDoc(IndexWriter w, String title, String isbn, int visit, int[] sale_list, String[] locations) throws IOException {
    Document doc = new Document();
    doc.add(new StoredField("visit", visit));
    doc.add(new TextField("title", title, Field.Store.YES));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    doc.add(new SortedDocValuesField("title", new BytesRef(title)));
    if (!title.equals("C++ api")) {
        doc.add(new NumericDocValuesField("visit", visit));
    }
    for (int sale : sale_list) {
        doc.add(new SortedNumericDocValuesField("sale", sale));
    }
    for (String location : locations) {
        doc.add(new SortedSetDocValuesField("city", new BytesRef(location)));
    }
    w.addDocument(doc);
}
Processing logic
As with StoredField, the core processing starts from DefaultIndexingChain. As mentioned before, DefaultIndexingChain processes data with the field as the basic unit. Depending on the docValue type of the field, a different DocValuesWriter is chosen (NumericDocValuesWriter, SortedDocValuesWriter, and so on), but in the end they are all handled by Lucene70DocValuesConsumer. During that processing some write functions can be reused, for example writeValues, which is really just a wrapper around writing numeric values; procedures that cannot be reused are handled separately. Frankly, I think Lucene is a bit over-encapsulated here. In simple terms, it obtains a different docValueWriter for each type and applies a different write method to each docValue, but that is indeed how the code is written. Below I'll go into more detail about how each type is written to disk.
NumericDocValues
The simplest of the docValues is NumericDocValues. As mentioned earlier, a docValue stores a field's docID->value information; when the value is numeric, that is NumericDocValues.
Storage format
When storing any type of docValue there are at least two files: a meta file and a data file. For docValues the meta file has the .dvm suffix and the data file has the .dvd suffix. In general, the meta information is read from .dvm first, and then the original values are restored from .dvd.
Compression algorithm
In other words, given an unordered set of values of type long, how do we store them with a good compression ratio and good compression speed:
- Direct storage
The simplest and most intuitive way is to write the docID->docValue mapping to disk as a plain KV store. That is equivalent to no compression at all.
- Dictionary coded storage
If there are a lot of duplicate values, write the unique values into the meta information and record only their ordinals in the data. For example, the "city" field has only a few dozen distinct values, so there is no need to write each one repeatedly: mark "shanghai" as 1, "beijing" as 2, and write the original strings into the meta information. Note that the unique values can be sorted before encoding.
- Compressed coded storage
In most cases the values do span a range and case 2 does not apply. Suppose our docValues are [6, 15, 12, 3, 9, 12, 21], which fall in the range [3, 21]. Intuitively we can subtract the minimum from every value, shrinking the range to [0, 18], so the values become [3, 12, 9, 0, 6, 9, 18]. Then encode each number with the bits needed for the maximum 18, i.e. 5 bits (10010), consuming 35 bits in total. Can we do better? If we also divide by the GCD 3, we get [1, 4, 3, 0, 2, 3, 6]; now each number needs only the bits for the maximum 6, i.e. 3 bits (110), 21 bits in total. So the general rule is: for any group of unordered integers, transform each num into (num - min) / gcd to get the best compression; the number of bits needed per value is bitsRequired((max - min) / gcd), called bitsPerValue. A small sketch of this transform follows.
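To make the transform concrete, here is a minimal, self-contained sketch of the (num - min) / gcd idea using the example above. It is only an illustration; the class and variable names are made up, and this is not Lucene's actual code.

import java.math.BigInteger;
import java.util.Arrays;

public class GcdEncodingSketch {
    public static void main(String[] args) {
        long[] values = {6, 15, 12, 3, 9, 12, 21};
        long min = Arrays.stream(values).min().getAsLong();                 // 3
        // gcd of all (value - min) deltas
        long gcd = Arrays.stream(values)
                .map(v -> v - min)
                .reduce(0L, (a, b) -> BigInteger.valueOf(a).gcd(BigInteger.valueOf(b)).longValue()); // 3
        long[] normalized = Arrays.stream(values).map(v -> (v - min) / gcd).toArray(); // [1, 4, 3, 0, 2, 3, 6]
        long maxNormalized = Arrays.stream(normalized).max().getAsLong();   // 6
        int bitsPerValue = 64 - Long.numberOfLeadingZeros(maxNormalized);   // 3 bits per value, 21 bits total
        System.out.println(Arrays.toString(normalized) + ", bitsPerValue=" + bitsPerValue);
    }
}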
If the array is long, its GCD is very likely to be 1 and max - min is likely to be too large to compress well. The solution is to store the array in blocks, each containing at most N values, which restores a good compression ratio.
There is a further optimization. As mentioned, bitsPerValue = bitsRequired((max - min) / gcd) maximizes the compression ratio, but retrieval cares about decompression speed as well as size. Experiments show that when bitsPerValue is one of a few fixed constants, compression and decompression speed are optimal, so Lucene implements DirectWriter and DirectReader, which restrict bitsPerValue to the following values:
final static int SUPPORTED_BITS_PER_VALUE[] = new int[] {
    1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64
};
Why is that? Suppose we want to compress the numbers [121, 68, 23, 25]; note that DirectWriter works on values of type long:
bitsRequired is 7, so the whole encoding costs 4 bytes. To read the second number, 68, you have to read across two bytes, which is obviously a big performance cost for column data like docValues that is read frequently. In addition, this cross-byte encoding means one number may be processed twice across multiple loop iterations.
But if we use a bitsRequired of 8 instead, the situation is as follows:
Read and write performance is much better, and very little compression is lost.
Notice that constants such as 1, 2, 4, 8, 16, and 24 are fairly easy to understand: they either divide a byte evenly or are whole bytes. But think about it, why are 12, 20, and 28 in the list? These still require reading and writing across bytes.
Borrowing the explanation from that blog post: suppose we encode the four numbers [69, 25, 261, 23]:
With bitsRequired = 9, to fetch the second number 25 you must work out that it starts at bit 1 of the second byte; the third number 261 starts at bit 2 of the third byte; the fourth number starts at bit 3 of the fourth byte, and so on. Every value needs this start-position calculation until the ninth value, which again starts exactly on a byte boundary.
With bitsRequired = 12, however, the second value starts exactly in the middle of the second byte and the third value starts exactly at the fourth byte; the alignment repeats every two values, so locating a value's start position is trivial.
To minimize the start-address calculations and speed up encoding and decoding, while still keeping a reasonable compression ratio, Lucene encodes and decodes 12 bits per value when 9 bits would otherwise be required. The following figure, also from that blog, shows for each bitsPerValue between 8 and 16 the average number of reads needed to locate a value's start address:
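As a concrete illustration of the cross-byte reads discussed above, here is a small sketch of fetching the i-th value from a bit-packed byte array (values packed MSB-first). This is only a sketch for intuition, not Lucene's DirectReader: the point is that the start-address arithmetic repeats every 2 values for 12-bit packing but only every 8 values for 9-bit packing.

class PackedReadSketch {
    // Illustrative only: read the index-th value from a byte[] in which
    // fixed-width values are packed MSB-first, back to back.
    static long get(byte[] packed, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;  // first bit of the value
        int bytePos = (int) (bitPos >>> 3);         // byte holding that bit
        int bitOffset = (int) (bitPos & 7);         // bit offset inside that byte
        // bitsPerValue = 12 -> bitOffset cycles 0, 4; bitsPerValue = 9 -> cycles through all 8 offsets
        long result = 0;
        int bitsLeft = bitsPerValue;
        while (bitsLeft > 0) {
            int available = 8 - bitOffset;
            int take = Math.min(available, bitsLeft);
            int chunk = (packed[bytePos] >>> (available - take)) & ((1 << take) - 1);
            result = (result << take) | chunk;
            bitsLeft -= take;
            bytePos++;
            bitOffset = 0;
        }
        return result;
    }
}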
.dvm
Header: present in every file; it records the codec and version used for this encoding, among other information.
DocsWithField: Lucene is schema-free, i.e. document A can have a price field while document B does not. MySQL, by contrast, requires a schema that every row must follow. Specifically, there are three cases to discuss:
- No document contains this field
Write -2 followed by 0 to represent this case.
- All documents contain this field
This is the most common case; for example, when ES is used for log collection, every doc has a timestamp field. Write -1 followed by 0 to represent this case.
- Some documents contain this field
Lucene puts the docs containing the field into a bitset and writes the bitset to the .dvd; the .dvm records the offset and length of that bitset within the .dvd file.
NumValues: for NumericDocValues this is the number of documents containing the field; for SortedNumericDocValues it is the sum of the number of values in each document. It is written as a long.
NumBitsPerValueMetaData is a name I coined, and it is a bit tricky. Back to the essence of the problem: how do you encode a group of values of the same type? There are several ways. First, if all values are identical, we only need to record that one value. Second, if values repeat heavily, we can write the distinct values into the meta information and record only their ordinals in the data; for the "city" field, for instance, there are only a few distinct values, so mark "shanghai" as 1, "beijing" as 2, write the original strings into the meta information, and store just 1, 2, 3 as the values. Third, if values do not repeat much, use the compressed encoding from the compression-algorithm section above.
Lucene chooses among the following methods case by case, based on the estimated number of bits each would use; the meta information only records which method was chosen:
- All values are identical
Just write -1.
- Dictionary encoding (case 2)
First count the distinct values. If the bits needed to encode a distinct-value ordinal are fewer than the bits needed for (max - min) / gcd, dictionary encoding is used: write the number of distinct values first, then write the distinct values in turn.
- Non-dictionary encoding (case 3)
If the bits needed for the distinct-value ordinals are not fewer than those needed for (max - min) / gcd, non-dictionary encoding is used. For block storage write -16; otherwise write the actual bitsPerValue. Block storage, discussed earlier, is a good way to reduce max - min and increase the gcd, which optimizes the (num - min) / gcd encoding. When is block storage used? When it is estimated to save about 10% of the space compared with non-block storage. How is that estimated? Record the number of bits required under the global max - min without blocks, and compare it with the sum of per-block max - min bits when blocking (one block per NUMERIC_BLOCK_SIZE = 16384 values); a simplified sketch of this estimate follows below.
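Here is a rough, self-contained sketch of that 10% estimate. The helper names are invented for illustration and this is not Lucene's actual code.

class BlockEstimateSketch {
    // Illustrative only: decide whether splitting into blocks of 16384 values
    // saves at least ~10% over encoding everything with the global max - min.
    static boolean shouldUseBlocks(long[] values) {
        final int BLOCK_SIZE = 16384;
        long globalMin = Long.MAX_VALUE, globalMax = Long.MIN_VALUE;
        for (long v : values) {
            globalMin = Math.min(globalMin, v);
            globalMax = Math.max(globalMax, v);
        }
        long noBlockBits = (long) bitsRequired(globalMax - globalMin) * values.length;
        long blockBits = 0;
        for (int start = 0; start < values.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, values.length);
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            blockBits += (long) bitsRequired(max - min) * (end - start);
        }
        return noBlockBits > 0 && blockBits <= 0.9 * noBlockBits;
    }

    static int bitsRequired(long maxDelta) {
        return maxDelta == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxDelta);
    }
}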
Min: the min in the compression formula (num - min) / gcd. When all stored docValues are identical, this field holds that value.
GCD: similarly, the gcd in the compression formula (num - min) / gcd.
StartOffset: the position in the .dvd file where this field's values start.
DataLength
The dataLength bytes starting at startOffset in the .dvd are the stored values of all docValues under this field.
.dvd
I won't go over Header and Footer again here, we've covered them before. The DocIDField is a roaring bitmap (Lucene's IndexedDISI) that stores the docIDs having this field; we won't go into detail here because it comes up again when storing Norms. So how are the FieldValues stored?
FieldValues
This diagram should be read together with the .dvm diagram above:
Therefore, there are several cases to discuss:
**All values are the same**: nothing is written to the .dvd file; the Min value in the .dvm is that value.
**Dictionary encoding**: used when bitsRequired(number of unique values - 1) < bitsRequired((max - min) / gcd) for this group of docValues (readers may think about why this is the condition; consider the number of bytes the final encoding would need).
For example, to encode [-5, 4, 12, 2, 11, 1, 10]: the number of unique values is 7, minus 1 is 6, so bitsRequired is 3, rounded up to 4 for faster decoding (see the 'Compression algorithm' section above). For (max - min) / gcd, which is 17, bitsRequired is 5, rounded up to 8 for faster decoding. So the dictionary-encoding condition is met.
First sort the values and replace each by its ordinal, giving [0, 3, 6, 2, 5, 1, 4]; the problem then becomes encoding these numbers with packed4, i.e. 4 bits per number, 4 bytes in total. The resulting bytes are [3, 98, 81, 64]. These bytes are our FieldValues, and the original values behind the ordinals are placed in the .dvm file, as sketched below.
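A minimal sketch (not Lucene's packed-ints code) that packs those ordinals into 4-bit nibbles and reproduces the [3, 98, 81, 64] bytes above:

public class Pack4Sketch {
    public static void main(String[] args) {
        int[] ords = {0, 3, 6, 2, 5, 1, 4};            // ordinals from the example above
        byte[] packed = new byte[(ords.length * 4 + 7) / 8];
        for (int i = 0; i < ords.length; i++) {
            int shift = (i % 2 == 0) ? 4 : 0;          // high nibble first, then low nibble
            packed[i / 2] |= (byte) (ords[i] << shift);
        }
        for (byte b : packed) {
            System.out.print((b & 0xFF) + " ");        // prints: 3 98 81 64
        }
    }
}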
**Non-dictionary encoding, single block**: when block storage is not worthwhile, all values are written as a single block; each value is transformed with (value - min) / gcd and then bit-packed.
For example, when the docValues to be encoded are [6, 15, 12, 3, 9, 12, 21], bitsRequired(unique_num - 1) is 4 and bitsRequired((max - min) / gcd) is also 4, so non-dictionary encoding is chosen. First transform the docValues with (v - min) / gcd into [1, 4, 3, 0, 2, 3, 6], then pack them with packed4 as the FieldValues.
**Non-dictionary encoding, multiple blocks**: values are split into blocks of 16384 docs each, and within each block there are two cases: all data in the block is identical, or it is not.
When the data in a block is identical, write a 0 followed by that identical value. For example, BLOCK2:
If the data in a block is not identical, the block's own bitsPerValue and min are recomputed and written, followed by the block's packed values; the end position of the packed values is recorded as well, as shown in the diagram below:
SortedNumericDocValues
Frankly this class is oddly named; NumericMultiDocValues would be more appropriate, because what it adds is the ability for a field to have multiple values. There are many scenarios for this. For example, when we store a city's temperature for a given day, there may be readings for morning, noon, and night, so the value can be [20, 25, 18], and a query can pick the time period it wants.
SortedNumericDocValues is written to disk in a way very similar to NumericDocValues, because its fieldValues are still flattened before being stored: if doc1's docValues are [3, 2, 4], doc2's are [1, 2], and doc3's are [0, 8], the fieldValues are stored as [2, 3, 4, 1, 2, 0, 8], plus the boundaries [0, 3, 5, 7] needed to recover the per-doc values. Since the boundary array is an increasing array, a DirectMonotonicWriter can be introduced to store it. A small sketch follows.
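A tiny illustrative sketch (not Lucene's code) of the flatten-plus-boundaries idea, using the example above:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiValueFlattenSketch {
    public static void main(String[] args) {
        long[][] perDocValues = {{3, 2, 4}, {1, 2}, {0, 8}};  // doc1, doc2, doc3
        List<Long> flat = new ArrayList<>();
        List<Long> boundaries = new ArrayList<>();
        boundaries.add(0L);
        for (long[] docValues : perDocValues) {
            long[] sorted = docValues.clone();
            Arrays.sort(sorted);                               // values within one doc are stored sorted
            for (long v : sorted) {
                flat.add(v);
            }
            boundaries.add((long) flat.size());
        }
        System.out.println(flat);        // [2, 3, 4, 1, 2, 0, 8]
        System.out.println(boundaries);  // [0, 3, 5, 7] -> increasing, a fit for DirectMonotonicWriter
    }
}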
DirectMonotonicWriter
How do we store a monotonically increasing array? One option is to reuse the machinery above: take the deltas, apply (value - min) / gcd, compute bitsRequired, and pack with DirectWriter. For example, given the array [0, 3, 5, 7, 11, 14, 15, 17, 19], the naive idea is to store the differences [0, 3, 2, 2, 4, 3, 1, 2, 2] and then compress them with the same method used for unordered arrays.
That is the easy case. What if the array looks like [0, 300, 500, 700, 1100, 1400, 1500, 1700, 1900]? The differences become [0, 300, 200, 200, 400, 300, 100, 200, 200], and the problem is that the numbers to store are too big: this way we need bitsRequired(400) * 9 = 81 bits.
Some readers, inspired by the earlier section, might say: just divide by the GCD 100 as well, giving [0, 3, 2, 2, 4, 3, 1, 2, 2]. That is too naive; real data is never that tidy. If any single value is bumped by 1, say 300 becomes 301, the differences become [0, 301, 199, 200, 400, 300, 100, 200, 200] and the GCD instantly degenerates to 1. Dividing by the GCD only works when the data is perfect, so what else can we do?
DirectMonotonicWriter works like this:
- Calculate the average increment avgInc
- Subtract the expected value avgInc * i from each element
- Calculate the minimum value min and subtract it, so all values become non-negative
- Finally store the results as packed ints
Python code looks like this:
def convert(nums):
    avgInc = (nums[-1] - nums[0]) / len(nums)
    for i in range(0, len(nums)):
        expected = avgInc * i
        nums[i] -= expected
    min_num = nums[0]
    for i in range(1, len(nums)):
        min_num = min(min_num, nums[i])
    for i in range(0, len(nums)):
        nums[i] -= min_num
Let me give you an example:
Focus on several properties:
- The scaled result is a better result, not an optimal one.
- The calculation must use long values, otherwise precision is lost.
- It compresses best for arrays whose growth rate is roughly constant; in other words, you want the residual (value - expected) to stay as close to 0 as possible.
- DirectMonotonicWriter also applies this compression in blocks, which keeps the compression rate within each block high.
Compared with NumericDocValues, SortedNumericDocValues additionally needs to store how many docValues each doc has under the field, but what is actually stored is the boundary array (the prefix sums of the counts). Why store boundaries rather than counts? Reading docValues must be very fast; if counts were stored, they would have to be converted to boundaries at read time, which is a lot of overhead. Most importantly, the boundary array usually satisfies property 3 above: its growth rate is constant, i.e. the docValue array length is the same for most docs under a field, which matches most scenarios. For example, if your SortedNumericDocValues records morning, noon, and evening temperatures, the array length is always 3, and what needs compressing is an array like [3, 6, 9, 12] whose increment is always 3; DirectMonotonicWriter is the perfect fit.
.dvm
The format of the meta file is as follows
Compared with NumericDocValues, it just has an extra DocValueCountMetaData.
NumDocsWithField: how many docs contain this field. What does it mean if the total number of docValues equals this number? It means the field has exactly one value per doc, so the reader can treat the SortedNumericDocValues as a NumericDocValues, and the offset, blockShift, and related fields no longer need to exist.
Offset: the position of the DocValueCount data in the .dvd file, i.e. its start offset.
DIRECT_MONOTONIC_BLOCK_SHIFT: a constant used when DirectMonotonicWriter is initialized; its current value is 16. It determines the block size (and hence the buffer size) the writer uses.
Min: the min described earlier for DirectMonotonicWriter.
AvgInc: the average increment described earlier for DirectMonotonicWriter, stored as a float.
DataLength: the length of the DocValueCount information in the data file.
BitsRequired: the number of bits required per value.
.dvd
The only difference from Numeric's .dvd is the extra DocValueCount section, which is the data written by DirectMonotonicWriter, i.e. an array of packed ints.
SortedDocValues
thinking
Having covered how numeric docValues are stored, let's talk about character docValues. What is the simplest way to store a character type? For example, given the doc0-doc4 values ["ab", "abcde", "cba", "nba", "abc"], the simplest idea is to concatenate them one after another into ["ababcdecbanbaabc"] and then store the offset array [0, 2, 7, 10, 13]; this offset array is increasing, so it can be compressed with the DirectMonotonicWriter described earlier. A small sketch follows.
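A minimal sketch of this naive concatenate-plus-offsets approach (illustrative only), reproducing the numbers above:

import java.util.Arrays;

public class ConcatOffsetsSketch {
    public static void main(String[] args) {
        String[] docValues = {"ab", "abcde", "cba", "nba", "abc"};
        StringBuilder concatenated = new StringBuilder();
        int[] offsets = new int[docValues.length];
        for (int i = 0; i < docValues.length; i++) {
            offsets[i] = concatenated.length();   // where this value starts
            concatenated.append(docValues[i]);
        }
        System.out.println(concatenated);              // ababcdecbanbaabc
        System.out.println(Arrays.toString(offsets));  // [0, 2, 7, 10, 13]
    }
}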
This approach works, but it is not the best; there are two problems:
- The character data itself is not compressed, and if you apply a general-purpose compressor such as LZ4, lookup efficiency suffers because you have to read and decompress a whole block, which is too expensive.
- If two values are exactly the same, they are stored twice; for long values this duplication is a serious waste.
For problem 1: notice that many values share the same prefix, such as "ab"; can we exploit this to compress?
For problem 2: we can borrow from NumericDocValues and replace each distinct value with its ordinal in lexicographic order. Then, if a value repeats many times, the extra cost is just one more ordinal instead of storing the original value repeatedly.
Storage methods
Based on the strategies above, we can now talk about how Lucene stores SortedDocValues, which is not a small topic. Take the titles from the demo as the values to store:
[“Lucene in Action”, “Lucene for Dummies”, “Managing Gigabytes”, “The Art of Computer Science”, “C++ Primer”, “I like Lucene”, “Lucene and C++ Primer”, “C++ api”, “C++ Primer”],
Their termIDs are [0, 1, 2, 3, 4, 5, 6, 7, 4].
Sorted lexicographically, the distinct values are: ["C++ Primer", "C++ api", "I like Lucene", "Lucene and C++ Primer", "Lucene for Dummies", "Lucene in Action", "Managing Gigabytes", "The Art of Computer Science"]
Recording the termID at each sorted position gives [4, 7, 5, 6, 1, 0, 2, 3]; this establishes the ord -> termID relation.
The OrdMap records, for each termID, its position in the sorted order: term0 corresponds to ord 5, so the ordMap is [5, 4, 6, 7, 0, 2, 3, 1]. This establishes the termID -> ord relation.
Note that these two arrays are very important and will be used a lot later; a small sketch that derives both follows.
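Here is a small illustrative sketch (not Lucene's code) that derives both arrays from the example. Note that real Lucene compares UTF-8 bytes, which for these ASCII titles matches Java's String ordering.

import java.util.Arrays;
import java.util.Comparator;

public class OrdMapSketch {
    public static void main(String[] args) {
        String[] terms = {"Lucene in Action", "Lucene for Dummies", "Managing Gigabytes",
                "The Art of Computer Science", "C++ Primer", "I like Lucene",
                "Lucene and C++ Primer", "C++ api"};           // unique terms, indexed by termID
        Integer[] sortedTermIds = new Integer[terms.length];   // ord -> termID
        for (int i = 0; i < terms.length; i++) {
            sortedTermIds[i] = i;
        }
        Arrays.sort(sortedTermIds, Comparator.comparing((Integer id) -> terms[id]));
        int[] ordMap = new int[terms.length];                  // termID -> ord
        for (int ord = 0; ord < sortedTermIds.length; ord++) {
            ordMap[sortedTermIds[ord]] = ord;
        }
        System.out.println(Arrays.toString(sortedTermIds));    // [4, 7, 5, 6, 1, 0, 2, 3]
        System.out.println(Arrays.toString(ordMap));           // [5, 4, 6, 7, 0, 2, 3, 1]
    }
}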
Personally I think the docValue storage format is the most complex of all the storage formats, but taken step by step it is not hard. Computing is like that: from the outside a system looks very complex, but in essence it is built up from the simplest NAND gates; as long as you work through it slowly, no single concept is hard to understand.
.dvd
Let’s look at the data file
Header and Footer will not be repeated from here on; the format is the same as before. The DocIDField, as discussed earlier, is essentially a bitset storing which docIDs have the field, because Lucene is schema-free and not every doc has it.
What’s left looks like a lot of clutter, but it actually stores just three things: OrdMap, TermDict, and TermIndex
OrdMap
These are the Ords in the figure; as mentioned, this essentially stores termID -> ord, where ord is the term's position after sorting. Why store this relation? TermDict stores the terms in sorted order, so given a termID you need to know its position in TermDict before you can find it there. OrdMap is an array indexed by termID whose value is the ord, and it is written with DirectWriter: compute the number of bits required by the maximum value, then represent each entry with that many bits in sequence.
TermDict
This is the essence of the whole SortedDocValues: it stores all the terms under this field. Because the terms are in sorted order, prefix compression is used. During encoding, every group of 16 values forms a block; the first value of each block is written in full (length plus bytes), and each of the following 15 values is compared with its predecessor to compute the shared prefix length prefixLength and the differing suffix length suffixLength, and only the differing suffix bytes suffixValue are written. Note that the two lengths are packed into a single byte to save space (prefixLength capped at 15 in the low 4 bits, suffixLength - 1 capped at 15 in the high 4 bits); if prefixLength >= 15, the excess prefixLength - 15 is appended as a VInt, and if suffixLength >= 16, the excess suffixLength - 16 is appended as a VInt.
In addition, a BlockIndex records each block's offset relative to the first block in the .dvd. Because this is an increasing sequence, it is stored with DirectMonotonicWriter. To access the n-th block, read the BlockIndex and jump straight to the corresponding block.
This part of the code is in Lucene70DocValuesConsumer; the Java code is as follows:
private void addTermsDict(SortedSetDocValues values) throws IOException {
final long size = values.getValueCount();
meta.writeVLong(size);
meta.writeInt(Lucene70DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
// addressBuffer is actually the BlockIndex
RAMOutputStream addressBuffer = new RAMOutputStream();
meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
long numBlocks = (size + Lucene70DocValuesFormat.TERMS_DICT_BLOCK_MASK) >>> Lucene70DocValuesFormat.TERMS_DICT_BLOCK_SHIFT;
DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(meta, addressBuffer, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
BytesRefBuilder previous = new BytesRefBuilder();
long ord = 0;
long start = data.getFilePointer();
int maxLength = 0;
TermsEnum iterator = values.termsEnum();
for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
// If it is in the first element of the block, write it as it is
if ((ord & Lucene70DocValuesFormat.TERMS_DICT_BLOCK_MASK) == 0) {
writer.add(data.getFilePointer() - start);
data.writeVInt(term.length);
data.writeBytes(term.bytes, term.offset, term.length);
} else {
// The 15 elements after the first element need to calculate prefixes equal to the last value and unequal suffixes to write
final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
final int suffixLength = term.length - prefixLength;
assert suffixLength > 0; // terms are unique
data.writeByte((byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
if (prefixLength >= 15) {
data.writeVInt(prefixLength - 15);
}
if (suffixLength >= 16) {
data.writeVInt(suffixLength - 16);
}
data.writeBytes(term.bytes, term.offset + prefixLength, term.length - prefixLength);
}
maxLength = Math.max(maxLength, term.length);
previous.copyBytes(term);
++ord;
}
writer.finish();
meta.writeInt(maxLength);
meta.writeLong(start);
meta.writeLong(data.getFilePointer() - start);
start = data.getFilePointer();
addressBuffer.writeTo(data);
meta.writeLong(start);
meta.writeLong(data.getFilePointer() - start);
// Now write the reverse terms index
writeTermsIndex(values);
}
TermIndex
With TermDict and Ords, you can find a term's value from its termID, but there is still a problem: given a term, how do you quickly determine whether it exists? That requires an index over the terms. Here the author borrows the idea of a skip list: every n terms (n is 1024 in the current version), take one term and record the shortest leading byte sequence that distinguishes it from the previous term (the shared prefix plus one more byte). Finally, a PrefixValueIndex, which plays the same role as the BlockIndex, records the offset of each prefix value relative to the first one and is stored with DirectMonotonicWriter.
For example, suppose the 1023rd term is "hello" and the 1024th term is "hellowen"; the first value recorded by TermIndex is then "hellow". The 2047th term is "hex" and the 2048th term is "hexical", so the second recorded value is "hexi". Now, given the term "help": because "help" > "hellow" and "help" < "hexi", the term, if it exists, lies between the 1024th and the 2048th term, so the search can be confined to blocks 64 through 128, as sketched below.
(I think 1024 is a bit high, especially for Chinese content; 256 would be faster at the cost of more storage.)
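As a rough illustration of how the sampled index narrows a lookup (a simplified, hypothetical helper; not Lucene's reader code):

import java.util.Arrays;

public class TermIndexLookupSketch {
    // prefixes[i] is the sampled sort key written for term number i * termsPerSample
    static int[] candidateBlockRange(String[] prefixes, String target, int termsPerSample, int termsPerBlock) {
        int idx = Arrays.binarySearch(prefixes, target);
        if (idx < 0) {
            idx = -idx - 2;                        // last sampled prefix <= target
        }
        int firstTerm = Math.max(idx, 0) * termsPerSample;
        int lastTerm = firstTerm + termsPerSample; // target must lie before the next sample
        return new int[] {firstTerm / termsPerBlock, lastTerm / termsPerBlock};
    }

    public static void main(String[] args) {
        String[] prefixes = {"", "hellow", "hexi"};            // samples for terms 0, 1024, 2048
        int[] range = candidateBlockRange(prefixes, "help", 1024, 16);
        System.out.println(range[0] + " .. " + range[1]);      // 64 .. 128
    }
}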
TermIndex code is as follows:
private void writeTermsIndex(SortedSetDocValues values) throws IOException {
final long size = values.getValueCount();
meta.writeInt(Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_SHIFT);
long start = data.getFilePointer();
long numBlocks = 1L + ((size + Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) >>> Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_SHIFT);
RAMOutputStream addressBuffer = new RAMOutputStream();
DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(meta, addressBuffer, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
TermsEnum iterator = values.termsEnum();
BytesRefBuilder previous = new BytesRefBuilder();
long offset = 0;
long ord = 0;
for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
if ((ord & Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) == 0) {
writer.add(offset);
final int sortKeyLength;
if (ord == 0) {
// no previous term: no bytes to write
sortKeyLength = 0;
} else {
sortKeyLength = StringHelper.sortKeyLength(previous.get(), term);
}
offset += sortKeyLength;
data.writeBytes(term.bytes, term.offset, sortKeyLength);
  } else if ((ord & Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) == Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) {
    previous.copyBytes(term);
  }
  ++ord;
}
writer.add(offset);
writer.finish();
meta.writeLong(start);
meta.writeLong(data.getFilePointer() - start);
start = data.getFilePointer();
addressBuffer.writeTo(data);
meta.writeLong(start);
meta.writeLong(data.getFilePointer() - start);
}
.dvm
The .dvm describes the meta information of the .dvd:
FieldNumber
Just like sortedNumericDocValue, it’s the field number
DocValueType
So this is sorted.
DocIDIndex
Like SortedNumericDocValue
NumDocsWithField
The number of docs that own the field
OrdIndex
This is the meta information for the Ords stored in the .dvd, and there are two cases:
- When the number of unique values <=1, fill in 3 zeros
- When the number of unique values is greater than 1, write the metadata numBitsPerDoc, plus the startOffset and length of the Ords in the .dvd.
TermDictMeta
size
Represents how many elements there are
BLOCK_SHIFT
1 shifted left by BLOCK_SHIFT gives the block size; here the value is 4, meaning each block holds 16 terms.
DIRECT_MONOTONIC_BLOCK_SHIFT
This parameter is used by DirectMonotonicWriter to initialize the size of its byte buffer; that writer is used to hold the BlockIndex.
DirectMonotonicWriterMeta: as mentioned under SortedNumericDocValues, whenever an increasing sequence is described, this metadata records the writer's meta information. Here the writer records the BlockIndex values.
Max_length Indicates the maximum term length
BlockMeta represents where block information starts and how long it is.
BlockIndexMeta represents where the BlockIndex information starts and how long it is.
TermIndexMeta
A description of TermIndex, which is essentially a data structure created to speed up Term lookup.
TERMS_DICT_REVERSE_INDEX_SHIFT
That is, one term is sampled every this many terms; currently one per 1024 (personally I think this number is too high). The shift is 10, and 1 << 10 = 1024.
DirectMonotonicWriterMeta
Used to record the relative address of a PrefixValue, which is also an increasing sequence.
PrefixValueMeta
Where the prefixValue starts in the .dvd and how long it is.
PrefixValueIndexMeta
Where the PrefixValueIndex starts in the .dvd and how long it is.
summary
In a word, for SortedDocValues Lucene sorts the terms first, applies prefix compression after sorting, and maintains the mapping between ord and the original termID; comparisons between docValues are also done on the ord rather than on the original value.
SortedSetDocValues
If you understand SortedDocValues for a single value, then SortedSetDocValues for multiple values is easy to understand. Just like SortedNumericDocValues, the .dvm and .dvd each gain only one extra place to record each doc's range of values.
Compared with SortedDocValues, what SortedSetDocValues adds is essentially the OrdAddress:
OrdAddress
As mentioned earlier, the Ords essentially store the termID -> ord mapping (the OrdMap). In SortedSetDocValues, however, all docs' ords are put together, so we cannot tell which ords belong to which doc; we need an OrdAddress to tell us, in addition, that the 1st to 3rd belong to doc1, the 3rd to 5th to doc2, and the 5th to 9th to doc3. In this case the OrdAddress is [3, 5, 9], clearly an increasing sequence, so DirectMonotonicWriter is used to record it. It's that simple. And since DirectMonotonicWriter is used, its metaData must be written as before, so the .dvm format is as follows:
SingleValue
SingleValue: if during writing every doc turns out to have a single value, this degenerates to the single-value case: write 0, and the rest of the data structure stays the same as SortedDocValues. If any doc has more than one value, write 1, and in addition to numDocsWithField also write the OrdAddress metaData.
OrdsAddressMeta
OrdAddress is written with DirectMonotonicWriter, so the meta information is BLOCK_SHIFT, Min, AvgInc, Length, and BitsRequired.
BinaryDocValues
As we said earlier, for string docValues we usually use SortedDocValues, converting each value to an ord and storing the terms with prefix compression. Is there a more modest approach, say storing the string contents directly and just reading them back when needed? Yes, that is exactly the idea of BinaryDocValues, and when your terms are short enough its overhead can even be lower than SortedDocValues.
Since BinaryDocValues does no extra prefix compression or deduplication, its encoding structure is much simpler than the previous ones. The code is as follows:
public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
meta.writeInt(field.number);
meta.writeByte(Lucene70DocValuesFormat.BINARY);
BinaryDocValues values = valuesProducer.getBinary(field);
long start = data.getFilePointer();
meta.writeLong(start);
int numDocsWithField = 0;
int minLength = Integer.MAX_VALUE;
int maxLength = 0;
for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
    numDocsWithField++;
    BytesRef v = values.binaryValue();
    int length = v.length;
data.writeBytes(v.bytes, v.offset, v.length);
minLength = Math.min(length, minLength);
maxLength = Math.max(length, maxLength);
}
assert numDocsWithField <= maxDoc;
meta.writeLong(data.getFilePointer() - start);
if (numDocsWithField == 0) {
meta.writeLong(-2);
meta.writeLong(0L);
} else if (numDocsWithField == maxDoc) {
meta.writeLong(-1);
meta.writeLong(0L);
} else {
long offset = data.getFilePointer();
meta.writeLong(offset);
values = valuesProducer.getBinary(field);
IndexedDISI.writeBitSet(values, data);
meta.writeLong(data.getFilePointer() - offset);
}
meta.writeInt(numDocsWithField);
meta.writeInt(minLength);
meta.writeInt(maxLength);
if (maxLength > minLength) {
start = data.getFilePointer();
meta.writeLong(start);
meta.writeVInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
final DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(meta, data, numDocsWithField + 1, DIRECT_MONOTONIC_BLOCK_SHIFT);
long addr = 0;
writer.add(addr);
values = valuesProducer.getBinary(field);
for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
addr += values.binaryValue().length;
writer.add(addr);
}
writer.finish();
meta.writeLong(data.getFilePointer() - start);
}
}
.dvd
Header, DocIDData, and Footer are actually the same as before, so I won’t go over them again.
TermsValue simply writes each value's bytes (not necessarily strings, hence Binary) as-is, with no deduplication, sorting, or compression. After each term is written, the offset of the current position relative to the start is recorded, and these offsets are encoded with DirectMonotonicWriter to form the TermsIndex part.
.dvm
This is the meta information for the .dvd. In TermValueMeta, Offset is the start position of TermsValue in the .dvd, and Length is its length.
conclusion
This chapter is quite long. The key points are the compressed storage of NumericDocValues and the compressed storage of SortedDocValues and BinaryDocValues for character data. DirectMonotonicWriter is used over and over again and is the most important encoding in this chapter. For multi-valued storage, all we really need is an extra OrdAddress to tell us which values belong to which doc.