Requirements
The company has a project to build a semantic text analysis tool that uses an NLP word segmentation tool for model training. The front end has two main interactions: text highlighting and manual training.
Let's start with text highlighting.
Text highlighting
The text is organized into paragraphs, and each paragraph contains multiple sentences.
For example: 'Wuhan University is located at Luojiashan, Wuhan'
The back end returns data in the following format:
// Note: startIndex and endIndex are character offsets into the original
// Chinese paraText, so they do not line up with this English translation.
const data = [
  {
    "paraId": 0,
    "paraText": "Wuhan University is located at Luojiashan",
    "paraEntity": [
      { "category": "noun", "labelText": "Wuhan university", "startIndex": 0, "endIndex": 4, "color": "rgba(240, 215, 12, 5)" },
      { "category": "name", "labelText": "Wuhan", "startIndex": 7, "endIndex": 9, "color": "#00baff" }
    ]
  }
];
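To see how the offsets pick out each entity, here is a minimal sketch. It assumes startIndex is inclusive and endIndex is exclusive (which fits a four-character entity spanning 0 to 4), and it uses a hypothetical English sample rather than real back-end output; the helper name entityTexts is also an assumption.

```javascript
// Hypothetical sample paragraph; offsets are character positions in paraText.
const sample = {
  paraId: 0,
  paraText: 'Wuhan University is located in Wuhan',
  paraEntity: [
    { category: 'noun', labelText: 'Wuhan University', startIndex: 0, endIndex: 16 },
    { category: 'place', labelText: 'Wuhan', startIndex: 31, endIndex: 36 },
  ],
};

// Return the text each entity covers, assuming endIndex is exclusive.
function entityTexts(para) {
  return para.paraEntity.map((ent) => para.paraText.slice(ent.startIndex, ent.endIndex));
}

console.log(entityTexts(sample));
```

Because each entity carries its own offsets, the two occurrences of "Wuhan" can be told apart even though the text is identical.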
How does the front end highlight the text?
The initial method:
Hardcore replace
The data returned by the back end is extracted into an array of objects. We iterate over the array and replace each word that needs highlighting with highlighted markup.
The problem is obvious: the same word cannot be given different parts of speech.
For example:
Wuhan University is located at Luojiashan, Wuhan
Here "Wuhan University" is a noun and "Wuhan" is a place name, but with the method above every "Wuhan" gets the same part of speech, whether that is place name or something else.
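The problem can be reproduced with a short sketch. This is not the project's code; naiveHighlight and the sample sentence are assumptions made for illustration.

```javascript
// Naive approach: wrap every occurrence of each label in a coloured tag.
// Labels are matched by text only, so position information is lost.
function naiveHighlight(text, entities) {
  let html = text;
  entities.forEach((ent) => {
    const tagged = `<em style="background: ${ent.color};">${ent.labelText}</em>`;
    html = html.split(ent.labelText).join(tagged);
  });
  return html;
}

const html = naiveHighlight('Wuhan University is located in Wuhan', [
  { labelText: 'Wuhan', color: '#00baff' },
]);
// Both occurrences of "Wuhan" receive the place-name colour,
// including the one inside "Wuhan University".
console.log(html);
```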
So how can this be improved? My idea was to split the text.
Split method
Iterate over each character in the text and wrap it in an em tag.
Once the text is split into em tags, each character can be given its own color, that is, its part of speech.
Each em below wraps one character of the original Chinese sentence (rendered here by its English gloss):
<em>wu</em><em>han</em><em>large</em><em>to learn</em><em>by</em><em>fall</em><em>to</em><em>wu</em><em>han</em><em>,</em><em>no</em><em>yoga</em><em>mountain</em>
So how do we locate each character?
Here we use the data attribute in HTML.
The data attribute
The data-* global attributes are a class of attributes called custom data attributes. They let us embed custom data on any HTML element and exchange proprietary information between the HTML and its DOM representation through scripts.
Using the getAttribute method in JS, we can locate the corresponding text based on the starting position and length given by the back end.
function createElement(textStr, tagName = 'em', k) {
  let newTextStr = '';
  // Wrap each character in a tag that records its paragraph number and index
  for (let i = 0; i < textStr.length; i += 1) {
    newTextStr += `<${tagName} data-paranum="${k}" data-index="${i}">${textStr[i]}</${tagName}>`;
  }
  return newTextStr;
}
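Reading the attributes back could look like the sketch below. The helper names readPosition, elementsInRange and fakeEm are assumptions, and the range check assumes endIndex is exclusive. The stand-in objects only mimic getAttribute so the example runs outside a browser.

```javascript
// Read the paragraph number and character index stored on one <em> element.
// Accepts anything with a getAttribute method, such as a DOM element.
function readPosition(el) {
  return {
    paraNum: Number(el.getAttribute('data-paranum')),
    index: Number(el.getAttribute('data-index')),
  };
}

// Keep only the elements inside [startIndex, endIndex) of one paragraph.
function elementsInRange(elements, paraNum, startIndex, endIndex) {
  return elements.filter((el) => {
    const pos = readPosition(el);
    return pos.paraNum === paraNum && pos.index >= startIndex && pos.index < endIndex;
  });
}

// Stand-in elements so the sketch runs outside a browser. In a page, use
// Array.from(document.querySelectorAll('em[data-index]')) instead.
const fakeEm = (paraNum, index) => ({
  getAttribute: (name) => ({ 'data-paranum': String(paraNum), 'data-index': String(index) }[name]),
});

const ems = [fakeEm(0, 6), fakeEm(0, 7), fakeEm(0, 8), fakeEm(1, 7)];
console.log(elementsInRange(ems, 0, 7, 9).map(readPosition));
```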
Note that special characters in the text, such as book title marks 《》 and parentheses (), need to be escaped:
/* escape */
function escapeStr(str) {
  let regStr = '';
  const specialArry = ['(', ')', '[', ']', '\\', '+', '*', '?', '.', '|'];
  for (let k = 0; k < str.length; k += 1) {
    if (specialArry.indexOf(str[k]) > -1) {
      // Prefix special characters with a backslash
      regStr += `\\${str[k]}`;
    } else {
      regStr += str[k];
    }
  }
  return regStr;
}
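As a usage sketch, an escaped string can be passed to the RegExp constructor so that a label matches literally. The version below is a compact alternative to the loop above, not the author's code: it uses a widely used character-class pattern, and the name escapeRegExp is an assumption.

```javascript
// Escape every regex metacharacter so the string matches itself literally.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const label = 'f(x)';
// Without escaping, the parentheses would be read as a capture group.
const pattern = new RegExp(escapeRegExp(label), 'g');
const marked = 'f(x) and f(x)'.replace(pattern, '<em>$&</em>');
console.log(marked);
```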
Summary of the approach
1. First, split the paragraph text into characters, wrap each one in an em tag, and add data attributes (paragraph number and character position).
2. Loop through the data. The smallest unit is a single character, so there are three nested loops:
data.forEach((e) => {
  // each entity in the paragraph
  for (let m = 0; m < e.paraEntity.length; m += 1) {
    // each character of the entity
    for (let i = 0; i < e.paraEntity[m].labelText.length; i += 1) {
      // handle one character here
    }
  }
});
3. Inside the loops, build the unhighlighted string and the highlighted string.
Note: notHightStr and hightStr are variables declared inside forEach
notHightStr = `<em data-paranum="${k}" data-index="${i + e.paraEntity[m].startIndex}">${e.paraEntity[m].labelText[i]}</em>`;
hightStr = `<em data-paranum="${k}" data-index="${i + e.paraEntity[m].startIndex}" style="background: ${e.paraEntity[m].color};">${e.paraEntity[m].labelText[i]}</em>`;
4. Replace the unhighlighted string with the highlighted one.
Note: emStr is declared outside the loops
let emStr = createElement(fileText, 'em', k);
emStr = emStr.replace(notHightStr, hightStr);
5. Generate the concatenated, highlighted HTML
markTxt += `<p>${emStr}</p>`;
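Put together, the five steps might look like the following sketch. It is a reconstruction under assumptions rather than the project's actual code: the paragraph number k is taken from the forEach index, labelText is assumed to equal the text between startIndex and endIndex, and buildMarkup and the sample data are hypothetical.

```javascript
// Step 1: wrap every character in a tag carrying its paragraph and position.
function createElement(textStr, tagName = 'em', k = 0) {
  let out = '';
  for (let i = 0; i < textStr.length; i += 1) {
    out += `<${tagName} data-paranum="${k}" data-index="${i}">${textStr[i]}</${tagName}>`;
  }
  return out;
}

// Steps 2 to 5: for each paragraph, entity and character, swap the plain tag
// for a highlighted one, then wrap the paragraph in <p>.
function buildMarkup(data) {
  let markTxt = '';
  data.forEach((e, k) => {
    let emStr = createElement(e.paraText, 'em', k);
    e.paraEntity.forEach((ent) => {
      for (let i = 0; i < ent.labelText.length; i += 1) {
        const idx = i + ent.startIndex;
        const ch = ent.labelText[i];
        const notHightStr = `<em data-paranum="${k}" data-index="${idx}">${ch}</em>`;
        const hightStr = `<em data-paranum="${k}" data-index="${idx}" style="background: ${ent.color};">${ch}</em>`;
        // A replacer function keeps "$" in the text from being read as a pattern.
        emStr = emStr.replace(notHightStr, () => hightStr);
      }
    });
    markTxt += `<p>${emStr}</p>`;
  });
  return markTxt;
}

// Hypothetical sample data with English text and matching offsets.
const sampleData = [{
  paraId: 0,
  paraText: 'Wuhan University is located in Wuhan',
  paraEntity: [
    { category: 'noun', labelText: 'Wuhan University', startIndex: 0, endIndex: 16, color: '#f0d70c' },
    { category: 'place', labelText: 'Wuhan', startIndex: 31, endIndex: 36, color: '#00baff' },
  ],
}];

const markup = buildMarkup(sampleData);
console.log(markup);
```

Because every plain tag contains a unique data-index, each replace call touches exactly one character, which is what lets the two occurrences of "Wuhan" receive different colors.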
The complete code
Github.com/642134542/H…
Final thoughts
In a real project, the returned data will differ somewhat depending on the back-end developer: besides the field names, the data structure and the position of each word may vary, so the code needs adjusting to the actual situation. In general, though, the splitting approach makes positions easy to locate.
It also has drawbacks: it adds many HTML tags and nests loops deeply, both of which affect performance.
Besides recognition by the semantic model, manual adjustment is also needed. The typical interaction is to right-click to obtain the data attributes and the term, then send them to the back end to apply the manual correction.