Weibo sentiment analysis and crawler: water army detection with a stable 95% accuracy

I won't go into what Weibo is. The benefit of sentiment analysis is that you can segment users into categories and target ads at them, or you can open your imagination and brainstorm further. The sentiment-analysis task I chose here is to divide users into real users and water army (paid posters/marketers). According to most papers on Weibo water-army detection, the mainstream method is to classify users with logistic regression over various user indicators (number of followed accounts, number of fans, average posting time, etc.). In my opinion, this method is inaccurate and unstable across different test sets. I think this kind of task needs the help of an NLP model, because the biggest difference between the water army and real people lies in their habits when writing posts.

GitHub link: github.com/timmmGZ/Wei… If you like it, please give it a star.

Important things are said three times:

Please use Google Colab because you want XX; please use Google Colab because you want XX; please search Baidu for "not xx, use colab" and substitute other words for xx. I won't say much about xx; as a programmer you can probably guess, haha. To keep the model training ("alchemy") from going astray, this repo free-rides Google Colab's free TPU, so please be sure to use Colab. Two notebooks are attached here: this one is the water-army test program, and this one is the training and testing process.

Structure of model inputs and outputs

```
Input
├── User information indices: [number of followed accounts, number of fans, number of
│                              interactions, member level, member type, total number of
│                              posts, Weibo level, verified or not, verification type]
├── Latest post 1
│   ├── Text
│   ├── Topic / super-topic words
│   ├── Forwarded-post features: [topic/super-topic, has video or key information?,
│   │                             has photos or key information?] or just "forward"
│   └── Post statistics: [number of images, video plays, reposts, comments, likes]
├── Latest post 2
├── Latest ... post
├── Latest post n-1
└── Latest post n

Output
├── Is a real user
└── Is water army
```
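Assuming Python, the input record above can be sketched as a typed structure (the field names are my own illustrative choices; the original only lists the labels):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Post:
    text: str                                 # body text of the post
    topics: List[str]                         # topic / super-topic words
    forwarded_features: Optional[List[str]]   # features of the forwarded post,
                                              # or None if this is not a repost
    stats: List[int]                          # [images, video plays, reposts,
                                              #  comments, likes]

@dataclass
class UserSample:
    user_info: List[int]   # [followed, fans, interactions, member level, member type,
                           #  total posts, Weibo level, verified?, verification type]
    posts: List[Post]      # the latest n posts
    label: int             # 0 = real user, 1 = water army
```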

Architecture of the model

In my opinion, it is not enough to analyze a user through one random post; we need to analyze consecutive posts from a single user. In other words, perform sentiment analysis on each of n posts in parallel (I use a Bi-LSTM model), then feed these n outputs (think of them as tokens of one sentence) into another network, and finally get the classification. In addition, humans have writing habits. For example, some people post happy content for three days and then serious content the next day, while others may post only sad content every day. Suppose Xiao Ming (Xiao Ming: "why me again? Go pick on Xiao Gang") posted his last eight tweets as [happy, happy, happy, serious, happy, happy, happy, serious], so we know two out of every eight of his posts are serious. Although [serious, serious, happy, happy, happy, happy, happy, happy] has the same frequency of post types, the "shape" of the sequence is different because the order is different, so we cannot call it Xiao Ming's habit. Then [happy, happy, serious, happy, happy, happy, serious, happy] is just the sequence shifted one unit to the left (how many units does not matter), but the "shape" is the same, so this is still Xiao Ming's habit. Therefore, the network that connects the n outputs should also be a recurrent model (I use an LSTM again). Since this is a nested parallel LSTM model, to keep the gradient from vanishing, most of the activation functions I use are tanh, and I apply 40% dropout to some layers to prevent overfitting. Here is the structure of the model.

The crawler

I have a user_id dataset with 568 samples (274 water-army accounts and 294 real users). All samples are annotated by manual inspection to keep the dataset as consistent and well-distributed as possible, which helps guarantee that the test-set accuracy is objective and fair. Feed the user_id dataset into my crawler, which then outputs a new dataset in the format the model expects (the "Structure of model inputs and outputs" described above). Meanwhile, the body text and the {topic/super-topic; features of the forwarded post} fields have very different syntax, vocabulary, and sentence lengths. Therefore, when embedding them, I build a separate dictionary for each, which also reduces the dimensionality of the {topic/super-topic; features of the forwarded post} one-hot encoding before embedding, killing two birds with one stone.
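The separate-dictionary idea can be sketched in pure Python (the function names are my own): each text field gets its own vocabulary built from its own training tokens, and anything unseen at training time maps to a shared `<UNK>` index.

```python
def build_vocab(token_lists):
    """Build a vocabulary from training tokens; index 0 is reserved for <UNK>."""
    vocab = {"<UNK>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Map tokens to ids, sending out-of-vocabulary tokens to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

# Separate dictionaries: one for body text, one for topic/super-topic features,
# so the small topic vocabulary is not drowned in the large body-text one.
body_vocab = build_vocab([["今天", "天气", "好"], ["转发", "微博"]])
topic_vocab = build_vocab([["#旅行#"], ["#美食#"]])
```

Because the topic dictionary only ever contains topic words, its one-hot/index space stays small, which is the dimension-reduction effect the paragraph above describes.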

Some Baselines results

| Training-set split | Test-set accuracy | n posts | Baseline file |
| --- | --- | --- | --- |
| 85% | 98.84% | 20 | weibo_baselines |
| 50% | 90.14% | 20 | |
| 15% | 90.48% | 20 | |

The "20 posts" dataset dictionary has 27,890 tokens. Each training split yields a different dictionary: when the training set is 85% of the data, the dictionary has about 25,000 tokens; at 15%, about 10,000. However, even with many unknown tokens in the test set, the accuracy on every test set stays above 90%.

Digression

Anyone interested can annotate more data in the following format:

| uid | water army? |
| --- | --- |
| 532871947 (made up) | 0 |
| 214839591 (made up) | 1 |

Or labels for some other sentiment analysis:

| uid | music | art | dance | ... |
| --- | --- | --- | --- | --- |
| 532871947 (made up) | 1 | 1 | 0 | ... |
| 214839591 (made up) | 0 | 1 | 1 | ... |

| uid | how much you like music |
| --- | --- |
| 532871947 (made up) | 5 |
| 214839591 (made up) | 0 |