Filtering emoticons from text in Python

Background

Our project needs to filter emoticons out of text. Most of the emoticons in the text are WeChat emoticons written in the form "[emoticon name]". Since they are plain strings, I plan to match and replace them with regular expressions.

Steps

Find the WeChat emoji library

To replace the emoticons we first need the WeChat emoji library, so the first step is to find out which emoticons WeChat has. After searching, we found a page that lists the WeChat emoticons. Because it is a web page, the names cannot simply be copied and pasted, so we need another way to extract them.

Get the list of WeChat emoji names

The names can be read from the page's DOM nodes with JavaScript. Open the browser console, inspect the nodes, and write a short piece of JS that prints the emoticon names based on what they have in common: every name starts with "[" or "/".

var doms = document.getElementsByClassName('emoji_card_list');
for (var i = 0; i < doms.length; i++) {
    var tds = doms[i].getElementsByTagName('td');
    for (var j = 0; j < tds.length; j++) {
        var text = tds[j].innerText;
        if (text.indexOf('[') === 0 || text.indexOf('/') === 0) {  // names start with "[" or "/"
            console.log(text);
        }
    }
}

Running this in the console produces text in the following format (only part of it is shown):

[Let me see] debugger eval code:7:21
[666] debugger eval code:7:21
[eye roll] debugger eval code:7:21
/smile debugger eval code:7:21
/Mouth curl debugger eval code:7:21
/Color debugger eval code:7:21
/Shy debugger eval code:7:21
/Shut up debugger eval code:7:21
/sleep debugger eval code:7:21
/Wancha debugger eval code:7:21
/Ladybug debugger eval code:7:21

Output in this form can be cleaned up with search and replace in a text editor or, of course, in Python.
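If the page is saved locally, the names could also be collected directly in Python, which avoids the console suffix altogether. The following is a minimal sketch using only the standard library's html.parser. The class name `emoji_card_list` comes from the JS snippet, while `EmojiNameParser`, `VOID_TAGS` and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumption about how the page is laid out.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EmojiNameParser(HTMLParser):
    """Collect <td> texts inside elements whose class is 'emoji_card_list'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside an emoji_card_list element
        self.in_td = False  # True while reading the text of a <td>
        self.buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "emoji_card_list" in classes:
            self.depth += 1
        if self.depth and tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "td" and self.in_td:
            text = "".join(self.buffer).strip()
            # Same filter as the JS snippet: names start with "[" or "/"
            if text.startswith(("[", "/")):
                self.names.append(text)
            self.in_td = False
        self.depth -= 1

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)


# Hypothetical markup, only meant to mimic the structure the JS relies on.
SAMPLE_HTML = """
<div class="emoji_card_list">
  <table>
    <tr><td><img src="a.png"></td><td>[666]</td></tr>
    <tr><td><img src="b.png"></td><td>/smile</td></tr>
    <tr><td>not an emoji</td></tr>
  </table>
</div>
<table><tr><td>[outside]</td></tr></table>
"""

parser = EmojiNameParser()
parser.feed(SAMPLE_HTML)
print(parser.names)
```

The depth counter assumes reasonably well-formed markup. A page with many unclosed tags would likely need a more forgiving HTML library.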

Process the WeChat emoji names

This is easy to do in the Python console. The following code produces the final list of emoji names and converts the "/name" form into the "[name]" form.

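# Paste the console output between the triple quotes, one entry per line
data = """ """
# Keep the text before the first space to drop the console suffix;
# this assumes the emoji name itself contains no spaces
data_list = [x.strip().split(' ')[0] for x in data.split('\n') if x.strip()]
emoji_list = []
for x in data_list:
    if x[0] == '/':
        # Convert the "/name" form to the unified "[name]" form
        emoji_list.append('[%s]' % x[1:])
    else:
        emoji_list.append(x)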
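Names that contain spaces would be truncated if each line is cut at its first space. As a hedged variation on the processing step, the sketch below cuts each line at the console suffix instead. The sample lines, the `SUFFIX` marker and the function name `parse_emoji_names` are assumptions made for illustration, and the suffix text depends on the browser console used.

```python
# Hypothetical sample lines in the shape of the console output.
SAMPLE_OUTPUT = """[Let me see] debugger eval code:7:21
[666] debugger eval code:7:21
/smile debugger eval code:7:21
/Shut up debugger eval code:7:21"""

# Assumed marker where the console's source-location suffix begins.
SUFFIX = " debugger eval code"


def parse_emoji_names(raw):
    """Return the emoji names in the unified '[name]' form."""
    names = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on an empty string
        # Cut at the suffix rather than at the first space, so that names
        # containing spaces stay intact.
        name = line.split(SUFFIX, 1)[0].strip()
        if name.startswith("/"):
            name = "[%s]" % name[1:]
        names.append(name)
    return names


print(parse_emoji_names(SAMPLE_OUTPUT))
```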

Match the emoticons with a regular expression

With the emoji list in hand, we use the re module to match the emoticons in the text. The implementation is as follows:

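import re

def remove_emoji(text):
    """
    Remove WeChat emoticons from text.
    return: str
    """
    # Every emoticon name contains "[", so text without it needs no work
    if '[' not in text:
        return text
    # Escape each name so that "[" and "]" are matched literally
    reg_expression = '|'.join([re.escape(x) for x in emoji_list])
    pattern = re.compile(reg_expression)
    matched_words = pattern.findall(text)
    # Remove every distinct emoticon that was found
    for matched_word in set(matched_words):
        text = text.replace(matched_word, '')
    return text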

The test result is as follows and meets the requirement:

>>> remove_emoji("[ha ha][shut up][shut up]")
'[ha ha]'
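The find-then-replace logic can also be expressed as a single substitution. The sketch below is a variation, not the article's original code: it compiles the pattern once and lets `re.sub` remove all matches in one pass. `SAMPLE_EMOJI_LIST`, `build_pattern` and `remove_emoji_sub` are hypothetical names, and the short list stands in for the full emoji_list.

```python
import re

# Hypothetical stand-in for the full list of WeChat emoji names.
SAMPLE_EMOJI_LIST = ["[shut up]", "[smile]", "[666]"]


def build_pattern(names):
    """Compile one alternation that matches any of the names literally."""
    # re.escape makes "[" and "]" literal instead of a character class
    return re.compile("|".join(re.escape(name) for name in names))


# Compile once at import time instead of on every call.
EMOJI_PATTERN = build_pattern(SAMPLE_EMOJI_LIST)


def remove_emoji_sub(text):
    """Remove every known emoji name from text in a single pass."""
    return EMOJI_PATTERN.sub("", text)


print(remove_emoji_sub("[ha ha][shut up][shut up]"))
```

Bracketed text that is not in the list, such as "[ha ha]" here, is left untouched, which matches the behaviour shown in the test above.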

Closing notes

Our project uses non-standard emoticons rather than standard Unicode emojis. If your project uses standard Unicode emojis, you can use the Python emoji package, which includes a list of the standard emojis. For details, see the articles available online on filtering emojis from text in Python.
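For readers who cannot install a third-party package, a rough approximation is possible with the standard library alone. The sketch below is not the emoji package's API and not an exact emoji detector: it assumes that dropping every character in the Unicode category "So" (Symbol, other), plus skin-tone modifiers and the joiners used inside emoji sequences, is acceptable. That also removes non-emoji symbols such as "©", and it misses keycap sequences. The name `strip_unicode_symbols` is hypothetical.

```python
import unicodedata

# Characters that glue emoji sequences together:
# zero-width joiner and variation selector-16.
JOINERS = {"\u200d", "\ufe0f"}


def strip_unicode_symbols(text):
    """Drop 'So' symbols, skin-tone modifiers and emoji joiners from text."""
    kept = []
    for ch in text:
        if ch in JOINERS:
            continue
        if unicodedata.category(ch) == "So":
            continue  # most pictographic emoji fall into this category
        if 0x1F3FB <= ord(ch) <= 0x1F3FF:
            continue  # skin-tone modifiers
        kept.append(ch)
    return "".join(kept)


print(strip_unicode_symbols("good job \U0001F44D\U0001F3FD!"))
```

When precision matters, a maintained list of emoji, such as the one shipped with the emoji package, is the safer choice.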
