takeaway: In our implementation to test whether a string contains another string, simply use a string matching algorithm can be achieved, if we want to realize to test whether a string contains N strings, tens of millions, the N could recycle simple string matching algorithm can’t meet our requirements, tens of millions of words need to be flexible and maintenance, When the business party matches, it can get its own words to match. The matching speed of ten million words needs to be guaranteed, and the results should be produced within seconds. Therefore, we need a solution to this kind of problem — the glossary service.
Full text 5370 words, estimated reading time 12 minutes.
The background,
Content review platforms need to detect whether the author’s article contains special sensitive words. For different lines of business, the requirements of these words are also different, some strict and some loose; Some need words, some need more words; Some need to detect implied words and variant words; Some in the title, some in the body of the effective; Some detect to send a person to try, some detect to refuse directly; Some need thousands of words, some need tens of thousands, millions, even tens of millions of words. For these words, each line of business can maintain itself, convenient to add, delete, modify, each business can configure the effective rules of words according to their own needs; At the time of detection, the business party can get the words it maintains to test the article, and it needs to ensure the time of detection and get the test results in real time.
Second, architecture,
Above is the overall architecture of the glossary service:
(1) Thesaurus management: each line of business maintains its own thesaurus on the thesaurus management platform. Each line of business can add multiple thesaurus groups. Sensitive words can be maintained in each thesaurus group and properties of sensitive words can be added dynamically; The glossary management platform uses ES to achieve efficient word segmentation retrieval ability for glossaries and tens of millions of words. The glossary management will periodically generate the glossary BOS files of each line of business and upload them to the BOS service.
(2) Service layer: the business side calls the glossary service to unify the external matching interface, and the service layer sends the matching task to the policy operator layer to complete the matching function of the glossary. The unified service to the outside of the glossary acts as a simple gateway, providing authentication to verify whether a request is legitimate; Provides the function of flow limit, can set the flow limit value for each requester; It provides the function of result processing, and the attribute of sensitive words returned by the policy operator is only a part. According to the demand of the business side, the attribute of sensitive words returned by the policy operator can be improved. The function of traffic forwarding is provided, and the requests of each line of business can be sent to different clusters according to the configuration, so as to realize the deployment of each business strategy operator in clusters.
(3) Strategy operator layer: the strategy operator realizes the matching of sensitive words in the text. The matching patterns include inclusion matching, strong filtering matching and multi-mode matching. The hit sensitive words will be returned to the glossary service layer. The word list of each line of business will be loaded into memory by each operator of the policy in the way of full refresh or real-time synchronous incremental data, which supports the matching function of operators. Full refresh method: the glossary management platform will regularly generate BOS files by dividing the glossary into business lines and upload them to the BOS service. The policy operator will synchronize the sensitive words from the BOS file to the memory on a regular basis. Real-time synchronization: the policy operator will scan the thorn list database in real time and load the incremental word list into memory.
(4) Basic services: GDP framework realizes the development of glossary service, Pandora platform realizes the deployment of glossary service, MySQL realizes the storage of glossary data, ES realizes the retrieval of glossary by word segmentation, BDRP realizes the function of stream limiting and caching, and BOS service realizes the transmission of glossary files.
Three, glossary management platform
Vocabulary management platform to achieve the various lines of business to maintain their own glossary, each business line you can create multiple word group, convenient business classification management own words, the meaning of each word is set by business side to give, when hit the sensitive word embodies the group belongs to the glossary, whether business according to the different disposal of word is set to do; Sensitive words can be maintained under each word group, and the attributes of the sensitive words can be selected by the business party. For example, the audit type property can be selected by the business party according to the hit of a specific sensitive word to be submitted, and the value of the attribute of the rejected word can be selected if it is rejected.
3.1 glossary management
A glossary that can be added or modified by each line of business and can be retrieved.
(1) Add the glossary, select the business line belonging to it, add the name and remarks, and create the glossary to multiple business lines at one time. If the glossary of other business lines can be reused, you can also copy the glossary of other business lines directly to the newly created glossary, which is convenient and fast, and it is convenient for management personnel to manage the glossary. As shown in figure 1:
(1) Modify the glossary, you can modify the name and remarks of the glossary, and you can re-designate the glossary to the business line. If other business lines have glossaries that can be reused, you can also copy the glossaries of other business lines directly to your own glossaries, which is convenient and fast, and it is convenient for managers to manage the glossaries. As shown in figure 2:
(2) Word list retrieval, which supports the retrieval of word list by word list ID, word list name, line of business and created time; The retrieval of thesaurus name, using the characteristics of ES can realize the word segmentation retrieval of thesaurus name; In the retrieved list, you can see the attributes of thesaurus, such as the id of thesaurus, the name of thesaurus, the line of business, the creation time of thesaurus, the update time, the number of entries under each thesaurus, the remark of thesaurus, the operator of the effective status of thesaurus, etc. You can click in the list of states to change the glossary to the effective or invalid state; In the action bar, click modify, modify glossary, click add to glossary to add a word, and click view to view glossary details. As shown in figure 3:
3.2 Maintenance of sensitive words
In the word list can be efficient and quick maintenance of sensitive words. The attributes of important sensitive words include:
(1) Type of entry: identify whether the sensitive word is submitted for review or filtered word;
(2) Sensitive type: identifies the sensitive classification of the entry;
(3) Matching mode: include matching – detect whether the text contains sensitive words; strong filtering matching – detect whether the text contains sensitive words after the combination of Chinese characters, letters, numbers and special characters; multi-mode matching – detect whether two or three words are hit in the text, and the spacing of multiple words is within the effective range.
(4) Effective position: the effective position of sensitive words in the article, such as the title, the body, the text given in the picture, etc.
(5) Exemption word: contains the attribute of the sensitive word in the match. If the sensitive word is A, the exempt word is B, and there is the word AB in the text, the sensitive word A will not be hit.
(6) Extension strategy: multi-mode position displacement — if there is the multi-mode word AB and the word BA in the text, the AB sensitive word can be hit; Alphanumeric case conversion – ignore case, if the sensitive word is CD, CD, CD, CD in the text, will hit the CD word.
(7) Expiration time: it provides two options of long-term validity and specific expiry time.
Sensitive word maintenance provides a single addition, batch addition, single modification, batch modification, glossary retrieval, entry retrieval and other functions:
(1) A single append, the name of the appended glossary has been determined, the business party can select the attribute of the word according to its own business, the operation in the append, if the matching pattern attribute of the word has selected the containing word, can add the exemption word of this word. As shown in figure 4:
(2) the batch add, support synchronous biggest add 3000, can be added to the different lines of business at the same time different words in the table, convenient and quick, convenient administrator maintenance work of words, all the words to add attributes must agree to use this feature, phase ii don’t support to contain words add immunity properties, You can enter more than one line break in the sensitive word input box. As shown in figure 5:
(3) Batch creation. The business party can maintain the sensitive words and attributes into the Excel table according to its own business, with the maximum support of 30,000 words for each file. After submission, a creation task can be generated and run in the background.
(3) Single modification, you can modify any attribute of the entry, if the sensitive word is a synchronized batch addition of the containing word, you want to add the sensitive word exemption word can be modified here. As shown in figure 7:
(3) Batch modification. The business party can maintain the sensitive words and attributes to be modified into the Excel table according to its own business. Each file can support a maximum of 30,000 words. Figure 8:
(4) words retrieval, can according to the words of effective audit type, matching mode, position, sensitive type and operation, its business lines, subordinate to the glossary, sensitive word creation time, such as property index, sensitive word retrieval using the characteristics of the ES participle retrieval, can support participle retrieval, also can achieve precise retrieval; The retrieved list shows the name, business line, word list, operator, operation time, remarks and other fields of the sensitive word. You can view the total number, export and remove in batch. In the operation bar, you can click “Modify” to enter the modification page, and click “Remove” to remove this sensitive word. As shown in figure 9:
IV. Unified entry of glossary services
The unified entry of the glossary service provides a standard API interface, the business side calls the glossary service to unify the external matching interface, and the service layer sends the matching task to the policy operator layer to complete the matching function of the glossary. The unified service to the outside of the glossary acts as a simple gateway, providing authentication to verify whether a request is legitimate; Provides the function of flow limit, can set the flow limit value for each requester; It provides the function of result processing, and the attribute of sensitive words returned by the policy operator is only a part. According to the demand of the business side, the attribute of sensitive words returned by the policy operator can be improved. The function of traffic forwarding is provided, and the requests of each line of business can be sent to different clusters according to the configuration, so as to realize the deployment of each business strategy operator in clusters. The specific process is shown in Figure 10.
V. Policy loading vocabulary
After the iteration of multiple schemes, the scheme is finally mature and stable.
The effective scheme of the first version of the glossary in the policy: the glossary management platform will generate a glossary file of all business lines and upload it to BOS, and the glossary policy will be scanned and loaded once every 30 minutes. All lines of business are concentrated in a glossary file and loaded at once, resulting in a slow policy loading glossary.
In the second version of the scheme, the 30 minutes effective time could not meet the needs of the business party, so the glossary management platform generated multiple glossary files according to the business lines and pushed them to the BOS system. When the glossary strategy was fixed, the multi-thread loading of the glossary was started by different business lines, and the glossary effective time was reduced from 30 minutes to 5 minutes.
Single and three versions of the scheme, the 5min time is not enough for special scenes, so we add the real-time synchronization scheme of the glossary. The strategy of the glossary is 10s timing to scan the database and load the incremental data into the memory. However, this scheme is not suitable for the loading of tens of thousands of incremental data, but only for the loading of words within ten thousand levels.
Now the glossary strategy loads the glossary. The second edition and the third edition exist at the same time with complementary advantages. The whole evolution process is shown in Figure 11:
BOS file format, multiple columns separated with tabs, multimode words connected with a & symbol, contains the word prefix + number recognition, the main information are sensitive word id, name of the words, vocabulary words belong to id, multimode word word spacing, the types of failure time, the types of audit, matching effect, its lines of business, position, sensitive type and extension strategy, exemption from word. As shown in figure 12:
Full load and incremental real-time synchronous loading process. Full load will be loaded once at startup, and the loading frequency will be more than half an hour, which can be configured according to the line of business. Incremental real-time synchronization 10s goes to the database to check for incremental data once, and then paging loads into memory. As shown in figure 13:
The loading structure of the policy cache vocabulary into memory is as follows:
(1) Line of business, effective location, sensitive word, sensitive word ID dictionary mapping. To match the sensitive word, the ID of the sensitive word can be quickly found according to the line of business and the effective position, and the attribute rules of the sensitive word can be obtained through the ID of the sensitive word, which is used to calculate whether the matched sensitive word is valid. As shown in figure 14:
(2) Sensitive word ID and sensitive word attribute rule dictionary mapping, BOS file for each line of sensitive word processing storage. The attribute rules of the sensitive words can be quickly searched through the sensitive word ID, which is used to calculate whether the matched sensitive words are valid or not. As shown in figure 15:
(3) Sensitive words are attached to the Trie tree, and each line of business and effective position generates a dictionary tree. The dictionary tree is the core of thesaurus strategy, and tens of millions of matches of sensitive words can return the configuration results within 10ms. As shown in figure 16:
VI. Implementation of lexical strategy matching
6.1 The process of glossary policy matching
Policy configuration matching process, as shown in Figure 17:
(1) Enter matching parameters, request\_id request unique identification, used for upstream and downstream location, req\_from request source, used to identify the request business party, token used for permission verification, service\_line business line identification, used to identify the word list used for matching, conent to match text, And the configuration of the text to identify the sensitive words that need to be in effect. As shown in figure 18:
(2) Chinese, letters, numbers and special symbols in the text are extracted and combined to generate text fragments of different combinations for strong filtering and matching. As shown in figure 19:
(3) According to the business line and the position of the text, the text is sent to the corresponding dictionary tree to match a single sensitive word. The information includes the sensitive word, the position of the sensitive word in the text, and the length of the sensitive word. The position and length are used for multimodal words, and whether the word spacing is effective is calculated. The matching result is shown in Figure 20:
(4) Through the line of business, the effective position and the sensitive word are obtained from the match\_data (Figure 13) cache, and then the sensitive word ID is obtained from the line\_cahe cache through the sensitive word ID. If matched to the sensitive word is contain word or filter word, directly hit the output; If it is a multi-modality word, then find out whether other words in the multi-modality word are hit or not. If the order and spacing of the two words meet the attribute rules of the multi-modality word, then the output is hit. The result is returned, as shown in Figure 21:
6.2 Large text matching timeout solution
PGC text and text often have hundreds of thousands of words in the word list, because the number of words is too large, the number of recalled words can reach tens of thousands of words, these words in the calculation of matching rules is too long, resulting in matching timeout.
The optimization scheme is shown in Figure 21:
(1) Before optimization, a large text article has 100 words in the title and 19.9W in the body. When the word list is matched, the title is first matched, and the time is 10ms, and then the body is matched. Due to the large number of words in the body, the time is 19s, and the total time for final matching reaches 20S.
(2) After optimization, large text articles are processed through the glossary, and the text with more than 5000 words is first split into multiple texts less than or equal to 5000. When the glossary is matched, multiple text fragments are matched in parallel, and the final time-consuming result is the one that takes the largest time among the multiple parallel computations. The example I gave is 50ms.
6.3 Implementation of dictionary tree (TRIE tree)
Dictionaries tree matching algorithm uses Dictmatch, an open source C++ library in the factory. Dictmatch implements the simplest Trie tree algorithm without any threading improvements, so it needs to be backtraced. However, it uses two tables to represent the TRIE tree, and it takes up a lot of space to optimize the problem, characteristic is in the time of tree construction is slow, but in the time of query is very fast.
Dictionary tree structure, as shown in Figure 23:
Seven, development & thinking
Support for special characters of thesaurus: the current storage of thesaurus words and dictionary tree matching algorithm do not support expressions and other special characters. The next optimization iteration of thesaurus service will focus on the support for special characters to meet more business requirements.
The deployment of thesaurus is divided into business lines: now the service of thesaurus is 60+ business parties, and each business line is mixed. The vocabulary of all business lines is loaded in the instance, which consumes a lot of memory, and the problem of thesaurus service will affect all business parties. If each line of business is clustered, it will increase maintenance costs, so we are exploring a way to automatically cluster each line of business.
Recommended reading:
| reveals Baidu micro-service monitoring: the evolution of Baidu game service monitoring
How | Optimizes User Experience Like Baidu Live (Part 1)
| Baidu search stability analysis story (second)
———- END ———-
Baidu said Geek
Baidu’s official technical public account has been launched!
Technology dry goods, industry information, online salon, industry conference
Recruitment information, internal information, technical books, Baidu surrounding
Welcome to your attention