1. Introduction

In IM clients, full-text search over local data plays an important role in features such as chat-history search and contact search, as shown in the following figure.

▲ WeChat chat history search

Features such as chat-record search and contact search become far more efficient with full-text search capability; without it, users have to scan content manually, which hurts the user experience.

This article describes how NetEase Yunxin implements full-text search in its IM client, and we hope it offers some inspiration.

2. About the author

Li Ning: senior front-end development engineer at NetEase Yunxin, responsible for application, component, and solution development for the audio/video and IM SDKs, with extensive practical experience in React component design, PaaS componentization, and multi-platform development and build tooling.

3. Related articles

Articles on IM client full-text search:

  1. Optimization of Full-Text Search of Local Data on the WeChat Mobile Client
  2. WeChat Team Sharing: Solving the Problem of Full-Text Search for Polyphonic Characters on the WeChat Mobile Client

Other articles shared by the NetEase technical team:

  1. NetEase Video Cloud Technology Sharing: A Quick Introduction to Audio Processing and Compression Technology
  2. Some Optimization Ideas for NetEase Yunxin Real-Time Video Live Streaming at the TCP Data Transport Layer
  3. NetEase Cloud Communication Technology Sharing: A Practical Summary of the 10,000-Person Group Chat Solution in IM
  4. Instant Messaging Practice on the Web: How to Make Your WebSocket Reconnect Faster After a Network Drop?
  5. Behind Bullet SMS: The Chief Architect of NetEase Yunxin Shares the Technical Practice of a 100-Million-Level IM Platform

4. What is full-text search

Full-text search is the technique of finding the occurrence of a word in a large amount of content.

In a traditional relational database, this can only be implemented with a LIKE condition (e.g. WHERE content LIKE '%keyword%'), which has several disadvantages:

  • 1) It cannot use database indexes and must scan the whole table, so performance is poor;
  • 2) The search quality is poor: it only does simple fuzzy matching on the keyword and cannot express complex search requirements;
  • 3) There is no way to measure how relevant the content is to the search criteria.

In our IM product, local full-text search has already been implemented on iOS, Android, and the desktop client based on SQLite and similar libraries, but it was missing on the Web and Electron clients.

On the Web side, the only local database available under browser constraints is IndexedDB, so there is little room for discussion there. On Electron, however, although the Chromium kernel is built in, the ability to use Node.js opens up a much wider range of choices. In this article we therefore take the Electron-based IM client as the example for discussing how to implement full-text search (the technical approach is the same and is not tied to any particular platform).

PS: If you are not familiar with Electron, see "A Quick Introduction to Electron: The Next Generation of Web-Based Cross-Platform Desktop Technology".

Let’s take a concrete look at how to achieve full text retrieval.

Implementing full-text search depends on the following two key techniques:

  • 1) Inverted index;
  • 2) Word segmentation.

These two techniques are the core (and the difficulty) of full-text search, and their implementation is relatively complex, so before discussing our implementation we first look at how each of them works.

5. Full-text search fundamentals 1: the inverted index

The concept of an inverted index is best understood in contrast with a forward index:

  • 1) Forward index: uses the unique ID of a document as the index, with the document content as the record;
  • 2) Inverted index: uses each word appearing in the document content as the index, with the IDs of the documents containing that word as the record (see the sketch below).
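To make the contrast concrete, here is a minimal sketch of the two structures as plain JavaScript objects (the message IDs and contents are made up for illustration):

// Forward index: document ID -> document content
const forwardIndex = {
  idClient1: '今天天气真好',
  idClient2: '天气预报',
}

// Inverted index: term (here a single character) -> IDs of the documents containing it
const invertedIndex = {
  '天': ['idClient1', 'idClient2'],
  '气': ['idClient1', 'idClient2'],
  '好': ['idClient1'],
}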

Here is a practical example using search-index, an inverted index library:

In our IM, every message object has an idClient as its unique ID. Suppose the user types “今天天气真好” (“nice weather today”); we split it into the single characters “今”, “天”, “天”, “气”, “真”, “好” (the concept of word segmentation is covered in more detail below), and the result is then written to the library with search-index's PUT method.
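As a rough sketch of this step, assuming a recent (3.x-style) search-index API where PUT and QUERY are obtained from the opened index (the exact options and query shape may differ):

const si = require('search-index')

async function indexMessage() {
  // Open (or create) the inverted index store
  const { PUT, QUERY } = await si({ name: 'message-index' })

  // Split the content into single characters joined by spaces, so the default
  // whitespace tokenizer treats each character as a separate term
  const idClient = 'idClient-1'
  const text = '今天天气真好'.split('').join(' ')

  // Write the message into the inverted index, keyed by its idClient
  await PUT([{ _id: idClient, text }])

  // Querying a character later returns the idClients of the messages containing it
  const result = await QUERY({ AND: ['天'] })
  console.log(result)
}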

Finally, take a look at the structure of the content stored in the above example:

As the figure shows, this is the structure of an inverted index: each key is a single segmented Chinese character, and each value is an array of the idClients of the message objects containing that character.

search-index also stores other data, such as weights, counts, and per-field data, which are used for sorting, paging, and field-scoped search.

6. Full-text search fundamentals 2: word segmentation

6.1 Basic Concepts

Word segmentation splits the original message content into words or phrases according to its semantics. Given the quality of its Chinese segmentation and the need to run on Node.js, we chose nodejieba as the underlying segmentation library.
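A minimal usage example on Node.js (using nodejieba's default dictionary):

const nodejieba = require('nodejieba')

// Cut a sentence into words; the HMM handles words not found in the dictionary
const words = nodejieba.cut('去北京大学玩')
console.log(words) // typically [ '去', '北京大学', '玩' ]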

The following is the flow chart of Jieba segmentation:

Taking “去北京大学玩” (“go to Peking University to play”) as an example, we will analyze the most important modules.

6.2 Loading a Dictionary

When jieba is initialized, it first loads the dictionary, whose content is as follows:
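Each line of the dictionary records a word, its frequency, and its part-of-speech tag, roughly in the following shape (the frequencies here are placeholders for illustration, not the real dict.txt values):

北京大学 2053 nt
北京 34488 ns
大学 20025 n
去 123402 v
玩 4207 v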

6.3 Building a Prefix Dictionary

The prefix dictionary is then constructed from the dictionary, with the following structure:

Here “北京大” (a prefix of “北京大学”, i.e. “Peking University”) is recorded with a word frequency of 0; keeping such prefixes makes the subsequent construction of the DAG easier.
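The construction itself can be sketched as follows (a simplified illustration of the idea described above, not jieba's actual code; the dictionary entries reuse the placeholder frequencies from the previous example):

// dict: array of [word, frequency] pairs loaded from the dictionary file
function buildPrefixDict(dict) {
  const freq = new Map()
  for (const [word, f] of dict) {
    freq.set(word, f)
    // Record every proper prefix of the word with frequency 0,
    // unless the prefix is itself a dictionary word
    for (let i = 1; i < word.length; i++) {
      const prefix = word.slice(0, i)
      if (!freq.has(prefix)) freq.set(prefix, 0)
    }
  }
  return freq
}

const prefixDict = buildPrefixDict([
  ['北京大学', 2053], ['北京', 34488], ['大学', 20025], ['去', 123402], ['玩', 4207],
])
// prefixDict.get('北京大') === 0  -- a prefix only, not a real word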

6.4 Building a DAG Diagram

DAG is short for Directed Acyclic Graph.

Based on the prefix dictionary, the input content is segmented.

Specifically:

  • 1) “去” has no prefix, so there is only one way to segment it;
  • 2) for “北” there are three possible segmentations: “北”, “北京”, and “北京大学”;
  • 3) “京” has only one segmentation;
  • 4) for “大” there are two segmentations: “大” and “大学”;
  • 5) “学” and “玩” each still have only one segmentation.

In this way we obtain, for each character, every segmentation that starts at it.
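A sketch of the DAG construction based on the prefix dictionary (jieba represents the DAG as a map from each character index to the list of end indexes of candidate words; simplified here for illustration):

// sentence: the input string; freq: the prefix dictionary built above
function buildDAG(sentence, freq) {
  const dag = {}
  for (let i = 0; i < sentence.length; i++) {
    const ends = []
    let j = i
    let fragment = sentence[i]
    while (j < sentence.length && freq.has(fragment)) {
      // A frequency greater than 0 means the fragment is a real dictionary word
      if (freq.get(fragment) > 0) ends.push(j)
      j++
      fragment = sentence.slice(i, j + 1)
    }
    // A single character is always kept as a fallback segmentation
    if (ends.length === 0) ends.push(i)
    dag[i] = ends
  }
  return dag
}

// For “去北京大学玩” this yields something like:
// { 0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5] }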

Its DAG diagram is as follows:

6.5 Maximum Probability path calculation

All paths of the above DAG diagram are as follows:

去 / 北 / 京 / 大 / 学 / 玩

去 / 北京 / 大 / 学 / 玩

去 / 北京 / 大学 / 玩

去 / 北京大学 / 玩

Each node has a weight; for words present in the prefix dictionary the weight is the word frequency. The problem is therefore to find the path that maximizes the total weight of the whole sentence.

This is a typical dynamic-programming problem, so we first check the two conditions for dynamic programming.

1) Repetitive sub-problems:

For a node i and its possible successors j and k:

  • 1) The weight of any path that reaches j through i = the weight of the path to i + the weight of j, i.e. R(i -> j) = R(i) + W(j);
  • 2) The weight of any path that reaches k through i = the weight of the path to i + the weight of k, i.e. R(i -> k) = R(i) + W(k).

That is, for j and k, which share the predecessor node i, the weight of the path to i would otherwise be computed repeatedly.

2) Optimal substructure:

Suppose the optimal path for the whole sentence has weight Rmax, its last node is x, and the possible predecessor nodes of x are i, j, and k.

The formula is as follows:

Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

The problem thus reduces to solving Rmax_i, Rmax_j, and Rmax_k: the optimal solution of each substructure is part of the global optimal solution.
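The corresponding dynamic program can be sketched as follows (it follows jieba's idea of walking the sentence from right to left; for simplicity it uses raw word frequencies as weights instead of jieba's log-normalized ones):

// route[i] = [best total weight from position i to the end, end index of the word chosen at i]
function calcRoute(sentence, dag, freq) {
  const n = sentence.length
  const route = new Array(n + 1)
  route[n] = [0, 0]
  // Walk backwards so route[x + 1] is already known: each sub-problem is solved exactly once
  for (let i = n - 1; i >= 0; i--) {
    let best = [-Infinity, i]
    for (const x of dag[i]) {
      const word = sentence.slice(i, x + 1)
      const score = (freq.get(word) || 1) + route[x + 1][0]
      if (score > best[0]) best = [score, x]
    }
    route[i] = best
  }
  return route
}

// Reading route from position 0 and jumping to route[i][1] + 1 each time
// recovers the best segmentation, e.g. “去 / 北京大学 / 玩” for our example.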

Following this procedure, the optimal path works out to “去 / 北京大学 / 玩” (“go / Peking University / play”).

6.6 HMM (Hidden Markov Model)

For out-of-vocabulary words (words not found in the dictionary), jieba uses an HMM (Hidden Markov Model).

Segmentation is treated as a sequence-labeling problem: the sentence is the observation sequence, and the segmentation result is the state sequence.

The author of jieba mentioned in an issue that the HMM parameters were derived from a segmented 1998 People's Daily corpus available online, an MSR corpus, and some TXT novels he collected himself and segmented with ICTCLAS, with the word frequencies finally counted by a Python script.

The model consists of a quintuple and has two basic assumptions.

The 5-tuple:

  • 1) Set of state values;
  • 2) Set of observed values;
  • 3) State initial probability;
  • 4) State transition probability;
  • 5) State emission probability.

Basic assumptions:

  • 1) Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on its state at time t-1, and is independent of the states and observations at all other times, and of t itself;
  • 2) Observation-independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of all other observations and states.

The set of state values is {B: begin, E: end, M: middle, S: single}, describing the position of each character within a word: B is the beginning of a word, E the end, M the middle, and S a single-character word.

The set of observations is simply the set of characters in the input sentence.

The initial state probabilities give the probability that the first character of a sentence is in state B, M, E, or S; the probabilities of E and M are 0, because the first character can only be B or S, which matches reality.
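For example, the initial state probabilities can be written as a log-probability object in the same style as the tables below (the shape mirrors jieba's prob_start table; the exact numbers are illustrative):

P = {
  B: -0.263,      // most sentences start at the beginning of a multi-character word
  S: -1.465,      // or with a single-character word
  M: -Infinity,   // impossible: a sentence cannot start in the middle of a word
  E: -Infinity,   // impossible: a sentence cannot start at the end of a word
}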

The state transition probabilities give the probability of moving from one state to the next, in line with the homogeneity assumption. They can be represented as a nested object:

P = {
  B: { E: -0.510825623765990, M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
}

The values stored are natural logarithms of probabilities. P['B']['E'] = -0.51 means the probability of moving from state B to state E is exp(-0.51) ≈ 0.6; similarly, P['B']['M'] gives a probability of about 0.4 that the next state is M. In other words, when a character begins a word, the next character is more likely to end the word than to sit in its middle, which is intuitive because two-character words are more common than longer words.

The state emission probabilities give, for the current state, the probability of observing a particular character, in line with the observation-independence assumption. The structure is the same as above and can also be represented by a nested object:

P = {
  B: { '突': -2.70366861046, '…': -10.2782270947, '…': -5.57547658034 },
  M: { '…': -4.26625051239, '…': -2.1517176509, '…': -5.11354837278 },
  S: { … },
  E: { … },
}

P['B']['突'] means that the log of the probability of observing the character “突” while in state B is -2.70366861046.

Finally, the Viterbi algorithm is applied: it takes the sequence of observations as input, uses the initial state probabilities, state transition probabilities, and state emission probabilities as parameters, and outputs the state sequence, i.e. the maximum-probability segmentation. The details of the Viterbi algorithm are not covered in this article; interested readers can look them up.
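For readers who want a concrete feel for it, here is a heavily simplified sketch of the Viterbi recursion over the four states (illustrative only, not jieba's actual implementation; probStart, probTrans, and probEmit stand for the three log-probability tables above):

function viterbi(observations, states, probStart, probTrans, probEmit) {
  // V[t][s] = best log-probability of any state path that ends in state s at step t
  const V = [{}]
  let path = {}

  // Initialisation with the first observed character
  // (missing entries are treated as extremely unlikely, hence the -1e100 fallback)
  for (const s of states) {
    V[0][s] = probStart[s] + (probEmit[s][observations[0]] ?? -1e100)
    path[s] = [s]
  }

  // Recursion: extend the best path into each state, one character at a time
  for (let t = 1; t < observations.length; t++) {
    V.push({})
    const newPath = {}
    for (const s of states) {
      let best = [-Infinity, states[0]]
      for (const prev of states) {
        const score = V[t - 1][prev]
          + (probTrans[prev][s] ?? -1e100)
          + (probEmit[s][observations[t]] ?? -1e100)
        if (score > best[0]) best = [score, prev]
      }
      V[t][s] = best[0]
      newPath[s] = [...path[best[1]], s]
    }
    path = newPath
  }

  // Termination: pick the best final state and return its B/M/E/S label sequence
  const last = observations.length - 1
  const finalState = states.reduce((a, b) => (V[last][a] > V[last][b] ? a : b))
  return path[finalState]
}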

7. Technical implementation

The two techniques described above form the technical core of full-text search. Building on them, we reworked the technical architecture of the IM Electron client, described in detail below.

7.1 Architecture Diagram

Since full-text search is only one feature of the IM client, and we wanted neither to affect other IM functionality nor to slow down iteration, we adopted the following architecture.

The architecture diagram is as follows:

The previous architecture, shown on the right, uses indexedDB as the underlying store, with read and write modules layered on top.

The specific functions of the read-write module are:

  • 1) When the user sends, synchronizes, deletes, or receives a message, the message object is synchronized into indexedDB;
  • 2) When the user queries a keyword, every message object in indexedDB is traversed, and indexOf is used to check whether each one contains the keyword (similar to LIKE).

As a result, queries become very slow once the data volume is large.
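In other words, the old lookup boils down to a full scan, roughly like this (a simplified sketch; getAllMessages stands in for whatever indexedDB query the read module actually performs):

// Old approach: walk every message and test it with indexOf (LIKE-style matching)
async function searchOld(keyword) {
  const messages = await getAllMessages() // reads every message object from indexedDB
  return messages.filter((msg) => msg.text.indexOf(keyword) !== -1)
}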

On the left is the new architecture, which adds word segmentation and an inverted index database. It does not change the previous solution at all; it simply adds a layer on top of it.

Now, the read/write module’s working logic:

  • 1) When the user sends, synchronizes, deletes, or receives messages, the content of each message object is segmented and then synchronized into the inverted index database;
  • 2) When the user queries a keyword, the idClients of the matching messages are first looked up in the inverted index database, and the corresponding message objects are then fetched from indexedDB by idClient and returned to the user (see the sketch below).
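A simplified sketch of the new read/write flow (segmentMessage, putIntoInvertedIndex, queryInvertedIndex, and getMessagesByIdClients are placeholder names, not the actual module API):

// Write path: every message that reaches indexedDB is also segmented and indexed
async function onMessage(message) {
  const terms = segmentMessage(message.text)            // e.g. via nodejieba
  await putIntoInvertedIndex(message.idClient, terms)   // e.g. via search-index PUT
}

// Read path: resolve idClients from the inverted index first, then load the full objects
async function searchNew(keyword) {
  const idClients = await queryInvertedIndex(segmentMessage(keyword))
  return getMessagesByIdClients(idClients)              // fetched from indexedDB by idClient
}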

7.2 Architecture Advantages

The scheme has the following four advantages:

  • 1) Fast: searches go through the inverted index built with search-index, which greatly improves search speed;
  • 2) Cross-platform: search-index and indexedDB are both based on LevelDB, and search-index also supports the browser environment, which opens up the possibility of full-text search on the Web side;
  • 3) Independent: the inverted index library is kept separate from indexedDB, the main IM business store;
  • 4) Flexible: full-text search is added in the form of a plug-in.

When data is written to indexedDB, the write module of the inverted index library is notified automatically; it segments the message content into words, pushes them into a storage queue, and finally inserts them into the inverted index database. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, and the message objects are then fetched from indexedDB by idClient and returned.

Regarding point “4)” above: the plug-in exposes a higher-order function that wraps IM and returns a new, extended IM. Thanks to JavaScript's prototype mechanism, any method that does not exist on the new IM is automatically looked up along the prototype chain (i.e. on the original IM), so the plug-in only needs to care about implementing its own methods and does not depend on the specific IM version. The plug-in also supports user-defined segmentation functions to meet different users' segmentation needs.
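The plug-in shape described above can be sketched like this (the function and option names are illustrative, not the actual NetEase Yunxin API; putIntoInvertedIndex and friends are the same placeholders as in the earlier sketch):

// Higher-order function: wraps an existing IM instance and returns an extended one
function withFullTextSearch(im, options = {}) {
  // The extended object inherits from the original IM through the prototype chain,
  // so any method not defined here is automatically looked up on the original IM
  const extendedIM = Object.create(im)

  // The plug-in accepts a user-defined segmentation function
  const segment = options.segment || defaultSegment

  extendedIM.indexMessage = async function (message) {
    await putIntoInvertedIndex(message.idClient, segment(message.text))
  }

  extendedIM.queryMessages = async function (keyword) {
    const idClients = await queryInvertedIndex(segment(keyword))
    return getMessagesByIdClients(idClients)
  }

  return extendedIM
}

// Usage: const im2 = withFullTextSearch(im, { segment: myCustomSegment })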

7.3 Effect

After adopting this architecture, our tests show that with 200,000 records the search time drops from about ten seconds to less than one second, making search roughly 20 times faster.

8. Summary

In this article we implemented full-text search of IM chat messages on Electron based on nodejieba and search-index, which significantly speeds up chat-record search.

We also plan further optimizations in the future, for example in the following two areas:

1) Write performance: in practice we found that when the data volume is large, LevelDB, the underlying database that search-index relies on, hits a write-performance bottleneck and consumes a lot of CPU and memory. Our investigation showed that SQLite's write performance is much better: write time grows roughly in proportion to the data volume, while CPU and memory usage stay fairly stable. We may therefore compile SQLite into a Node native module to replace search-index.

2) Extensibility: the current decoupling of business logic is not thorough enough, and some business fields are still stored in the inverted index library. We may later have the inverted index library look up message idClients by keyword only, and move searches involving business attributes into indexedDB, so that the inverted index library is completely decoupled from the main business store.

That is all for this article; I hope it has been helpful. (This article has also been published at: www.52im.net/thread-3651…)