Background
Our project needs to filter text. The emojis in the text are mainly WeChat emojis, which are written in the form "[emoji name]". Since they are plain strings, I plan to match and replace them with regular expressions.
Steps
Find the WeChat emoji library
To replace the emojis, we first need the WeChat emoji library, so the first step is to find out which emojis WeChat has. After some searching, we found a web page that lists the WeChat emojis. Because it is a web page, the names cannot simply be copied and pasted, so we need another way to extract them.
Get the WeChat emoji name list
We can use JavaScript to read the values of the page's DOM nodes. Open the browser console, inspect the nodes that hold the emoji names, and write a short piece of JS that prints each name based on how it starts.
var doms = document.getElementsByClassName('emoji_card_list');
for (var i = 0; i < doms.length; i++) {
    var tds = doms[i].getElementsByTagName('td');
    for (var j = 0; j < tds.length; j++) {
        var text = tds[j].innerText;
        if (text.indexOf('[') === 0 || text.indexOf('/') === 0) {  // names start with '[' or '/'
            console.log(text);
        }
    }
}
After pasting this into the console and running it, you get output in the following format (only part of it is shown):
[Let me see] debugger eval code:7:21
[666] debugger eval code:7:21
[eye roll] debugger eval code:7:21
/smile debugger eval code:7:21
/Mouth curl debugger eval code:7:21
/Color debugger eval code:7:21
/Shy debugger eval code:7:21
/Shut up debugger eval code:7:21
/sleep debugger eval code:7:21
/Wancha debugger eval code:7:21
/Ladybug
Output in this form can be cleaned up with find-and-replace in a text editor, or of course with Python.
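If you would rather avoid the browser console altogether, the same extraction can be sketched in Python on a saved copy of the page. The following is only a sketch under assumptions: the class name emoji_card_list comes from the JS above, while the EmojiNameParser class and the sample HTML are hypothetical and the real page may be structured differently.

```python
from html.parser import HTMLParser


class EmojiNameParser(HTMLParser):
    """Collect <td> text found inside an element with class 'emoji_card_list'."""

    def __init__(self):
        super().__init__()
        self.container_tag = None  # tag name of the enclosing emoji_card_list element
        self.depth = 0             # nesting depth of tags with that same name
        self.in_td = False
        self.buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self.container_tag is None and 'emoji_card_list' in classes:
            self.container_tag, self.depth = tag, 1
        elif tag == self.container_tag:
            self.depth += 1
        elif self.container_tag and tag == 'td':
            self.in_td, self.buffer = True, []

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            text = ''.join(self.buffer).strip()
            # Same filter as the JS: keep names that start with '[' or '/'.
            if text.startswith('[') or text.startswith('/'):
                self.names.append(text)
            self.in_td = False
        elif tag == self.container_tag:
            self.depth -= 1
            if self.depth == 0:
                self.container_tag = None


# Hypothetical HTML standing in for the saved page.
sample_html = """
<table class="emoji_card_list">
  <tr><td><img src="a.png"></td><td>[666]</td></tr>
  <tr><td><img src="b.png"></td><td>/smile</td></tr>
</table>
<table><tr><td>[not an emoji table]</td></tr></table>
"""

parser = EmojiNameParser()
parser.feed(sample_html)
print(parser.names)  # ['[666]', '/smile']
```

This yields the names directly, without the console's location suffix, so the only cleanup left is converting the "/name" form to "[name]".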
Process the WeChat emoji names
This is easy to do in the Python console. The following code produces the final list of emoji names.
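# Paste the console output between the triple quotes.
data = """ """
# Names may contain spaces, so cut each line at the console's location suffix
# instead of at the first space. Blank lines are skipped.
data_list = [x.split(' debugger eval code')[0].strip() for x in data.split('\n') if x.strip()]
emoji_list = []
for x in data_list:
    if x[0] == '/':
        # Convert the "/name" form to the "[name]" form.
        emoji_list.append('[%s]' % x[1:].strip())
    else:
        emoji_list.append(x)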
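To make this step concrete, here is a small self-contained sketch run on a few sample lines. It assumes each line is an emoji name optionally followed by the console's "debugger eval code:7:21" location suffix. The parse_emoji_names helper is a hypothetical name introduced for illustration.

```python
def parse_emoji_names(console_output):
    """Turn raw console output into a list of '[name]' strings."""
    names = []
    for line in console_output.split('\n'):
        # Cut at the location suffix rather than the first space,
        # because a name such as "[Let me see]" contains spaces.
        name = line.split(' debugger eval code')[0].strip()
        if not name:
            continue  # skip blank lines
        if name.startswith('/'):
            name = '[%s]' % name[1:].strip()  # "/smile" -> "[smile]"
        names.append(name)
    return names


# Sample lines in the assumed console format.
sample = """[Let me see] debugger eval code:7:21
[666] debugger eval code:7:21
/smile debugger eval code:7:21
/Ladybug"""

print(parse_emoji_names(sample))
# ['[Let me see]', '[666]', '[smile]', '[Ladybug]']
```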
Match the emojis with a regular expression
With the emoji list in hand, we use the re module to match the emojis in the text as follows:
import re

def remove_emoji(text):
    """
    return: str
    """
    # Every emoji name starts with '[', so text without it needs no work.
    if '[' not in text:
        return text
    # re.escape handles the square brackets (and any other special characters).
    reg_expression = '|'.join([re.escape(x) for x in emoji_list])
    pattern = re.compile(reg_expression)
    matched_words = pattern.findall(text)
    # Remove every distinct match from the text.
    for matched_word in set(matched_words):
        text = text.replace(matched_word, '')
    return text
The test result is as follows and meets the requirements. Names that are in the emoji list are removed, while bracketed text that is not in the list is left untouched:
>>> remove_emoji("[ha ha][shut up][shut up]")
'[ha ha]'
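As a possible variation, the removal can be done in a single pass by compiling the pattern once and calling its sub method, instead of collecting matches and replacing them one by one. This is a sketch, not the code used in the project. The three-item emoji_list and the function name remove_emoji_single_pass are hypothetical and exist only for illustration.

```python
import re

# Hypothetical emoji names for illustration; the real list comes from the page.
emoji_list = ['[shut up]', '[smile]', '[666]']

# Compile once at import time; re.escape makes the brackets literal.
EMOJI_PATTERN = re.compile('|'.join(re.escape(name) for name in emoji_list))


def remove_emoji_single_pass(text):
    """Delete every listed emoji name from text in one regex pass."""
    return EMOJI_PATTERN.sub('', text)


print(remove_emoji_single_pass("[ha ha][shut up][shut up]"))  # [ha ha]
```

Compiling the pattern once avoids rebuilding it on every call, which may matter when the emoji list is long and many texts are filtered.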
Final notes
Our project deals with WeChat emojis, which are not standard Unicode emojis, hence the approach above. If your project uses standard Unicode emojis, you can use the Python emoji package, which ships with a list of standard emojis. For details, you can refer to an article online about using a Python environment to filter the emojis in text.
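For a rough idea of what filtering standard Unicode emojis involves, here is a standard-library-only sketch. It is not the emoji package and not its API. It is an approximation that assumes a few common code point ranges are enough, and it misses some emojis (flags, for example). The names UNICODE_EMOJI_PATTERN and remove_unicode_emoji are hypothetical.

```python
import re

# Rough, non-exhaustive code point ranges that cover many common emojis.
UNICODE_EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001FAFF'  # pictographs, emoticons, transport, supplemental symbols
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    '\uFE0F\u200D'           # variation selector-16 and zero-width joiner
    ']+'
)


def remove_unicode_emoji(text):
    """Strip characters that fall in the approximate emoji ranges above."""
    return UNICODE_EMOJI_PATTERN.sub('', text)


print(repr(remove_unicode_emoji('good morning \U0001F600')))  # 'good morning '
```

A maintained emoji list, such as the one in the emoji package, is likely to be more complete than hand-written ranges like these.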