After writing this up, I found the article ran a bit long. If you are only here for the pretty girls, skip to the end and see them all at once; and if you read the whole thing, please leave a like before you go.
I. Background introduction
Jianshu has a topic called "Making Friends on Jianshu", where people write about themselves, post a few photos of themselves, and then submit the article to the topic. A few of the introductions are as orderly and comprehensive as the one shown in the picture below.
Most articles, however, are a jumble: city, age and the rest may appear anywhere in the text. Extracting high-quality structured data from such unstructured text and digging information out of it is not easy. I was very interested but felt my skills were not up to it, so long ago, when a reader messaged me asking me to crawl and analyze this topic, I just kept pushing it off to some indefinite "year of the monkey, month of the horse".
Now I am back. I crawled more than 2,700 articles from the topic (you may ask: why not crawl all of them?) and threw a flurry of techniques at them: all kinds of text mining, face recognition, appearance scoring, photo-wall stitching and so on. The flattering name for it is a "messy stew" of Jianshu dating data; in reality it is just practice, a way to get familiar with, review and apply various techniques.
II. A "Messy Stew" of Jianshu Dating Data
2.1 Data Overview
Due to a small problem in the crawler part, I will skip it for now. The crawled data looks like the following and mainly covers these fields: author, homepage URL, article title, publication time, number of views, number of comments, number of likes, article summary, article URL, list of image URLs in the article, and the article content.
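For readers who want to follow along, a minimal loading sketch might look like this (the file name and column names are my assumptions, not the actual crawler output):

import pandas as pd

df = pd.read_csv("jianshu_making_friends.csv")  # hypothetical file name
# hypothetical column names mirroring the fields listed above
print(df.shape)
print(df.columns.tolist())
df["publish_time"] = pd.to_datetime(df["publish_time"])
print(df[["author", "title", "views", "comments", "likes"]].head())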
First of all, in which years and months were the articles published? For the visualization I reused the ECharts3 code template from an earlier post ("What to do when your charts are too ugly? ECharts will make them fly!"). It is clear that articles from 2018 account for close to 75%, so there are plenty of active people on Jianshu. The topic has also held several writing contests, which seem to have gotten a good response.
Looking at the distribution of publication times across the 24 hours of the day, there is a small peak at 22 o'clock, but the differences are otherwise fairly small. Apart from the sleeping hours between 1 and 8 o'clock, there is no obvious pattern of lonely late-night posting in search of company.
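The original charts were made with that ECharts template; a rough matplotlib equivalent of these two distributions, reusing the hypothetical df from the loading sketch above, might be:

import matplotlib.pyplot as plt

year_share = df["publish_time"].dt.year.value_counts(normalize=True).sort_index()
hour_counts = df["publish_time"].dt.hour.value_counts().sort_index()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
year_share.plot(kind="bar", ax=ax1, title="Share of articles by year")
hour_counts.plot(kind="bar", ax=ax2, title="Articles by hour of publication")
plt.tight_layout()
plt.show()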
Next, a 3D scatter plot of views, comments and likes. Some articles clearly have high view counts along with more comments and likes; I will not list or dig into those articles here, but anyone interested can browse the "popular" column of the topic. I had also wanted to see whether K-means could cluster these points, but as the figure shows they do not separate well, so I let it go.
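For reference, the K-means attempt can be sketched roughly like this (again assuming the hypothetical column names above); it is not the exact code used:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df[["views", "comments", "likes"]])  # scale so no single metric dominates
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)
print(df["cluster"].value_counts())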
Next, look at the relationship between the number of characters and the number of pictures in each article. Neither is provided directly, but both can be computed. The linear correlation is weak, but it turns out that some articles contain dozens of pictures, which is quite surprising.
So how do the number of pictures and the length of an article affect views, likes and comments? I used the seaborn library to draw a correlation heatmap and a pairplot; only the numbers of comments and likes show a clear linear correlation with the number of views (and with each other).
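A sketch of the derived columns and the two seaborn plots, assuming the content and image-URL columns are called "content" and "image_urls" (my names, not necessarily the crawler's):

import seaborn as sns
import matplotlib.pyplot as plt

df["char_count"] = df["content"].str.len()
df["pic_count"] = df["image_urls"].apply(lambda s: len(str(s).split(",")))

cols = ["char_count", "pic_count", "views", "comments", "likes"]
sns.heatmap(df[cols].corr(), annot=True, cmap="coolwarm")  # correlation heatmap
plt.show()

sns.pairplot(df[cols])  # pairwise scatter plots
plt.show()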
2.2 Text Mining
With a first impression of the topic's articles in hand, let's do some simple mining of the article text. First, jieba is used to segment the Chinese text; after removing stop words, the Top 30 high-frequency words are:
like (9535), a (9314), don't (4949), know (3571), together (3481), Jianshu (2948), live (2787), hope (2735), feel (2636), friend (2621), now (2365), many (2363), won't (2069), article (1981), all along (1926), really (1697), time (1606), maybe (1567), see (1539), actually (1505), story (1452), text (1448), work (1440), feel (1368), certain (1326), has (1290), things (1283), I will (1264), university (1231), world (1229)
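A minimal sketch of this segmentation-and-counting step (the stop-word file is an assumption; any Chinese stop-word list will do):

import jieba
from collections import Counter

contents = " ".join(df["content"])  # all article text as one string
with open("stopwords.txt", encoding="utf-8") as f:  # hypothetical stop-word list
    stopwords = set(line.strip() for line in f)

words = [w for w in jieba.cut(contents) if w.strip() and w not in stopwords]
for word, count in Counter(words).most_common(30):
    print(word, count)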
A table of counts is not very intuitive; a word cloud might work better. The images produced by the wordcloud library were not especially pretty, though, so I generated one with an online tool instead: HTML5 Word Cloud. Does the figure below match your expectations for this topic?
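For completeness, here is roughly what the wordcloud-library version looks like (I ended up preferring the online tool); a Chinese font path has to be supplied for CJK text:

from wordcloud import WordCloud
from collections import Counter

freq = Counter(words)  # `words` from the segmentation sketch above
wc = WordCloud(font_path="simhei.ttf",  # path to any Chinese font (assumption)
               width=800, height=600, background_color="white")
wc.generate_from_frequencies(freq)
wc.to_file("wordcloud.png")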
Doing the same for the article titles gives the Top 30 high-frequency words:
Jianshu (733), friends (658), tree hole (303), a (190), teenager (158), classmate (149), heartache (144), like (119), future (109), meet (91), love letter (80), write (78), friend (73), a letter (63), special topic (59), boyfriend (54), beg to leave (53), hello (52), together (50), girl (48), roll call (47), essay (47), Tanabata (47), story (44), activity (43), hope (42), united (41), battle (40), soul (40), fun (39)
And draw the corresponding word cloud:
You can see that a large share of the articles indeed belong to writing-contest series, built around prompts such as "Jianshu friends", "heartache", "Tanabata" and "love letter".
I also still had the data on popular Jianshu articles from an earlier post (Jianshu = chicken soup? Crawling "Today's Highlights": visualizing 1,916 popular Jianshu articles), where bosonNLP was called to draw a word cloud of the Top 100 keywords. A side-by-side comparison shows some differences. I did not compare against more popular topics this time; those in the know can dig further.
Returning to the article text: high-frequency words such as "a" and "together" appear often but carry little information, so let's turn to jieba's keyword extraction instead.
import jieba.analyse as analyse

# Top 200 TextRank keywords, keeping only common nouns (n) and place names (ns)
textrank = " ".join(analyse.textrank(contents, topK=200, withWeight=False, allowPOS=('ns', 'n')))
print(textrank)
Extracting the Top 200 common-noun (n) and place-name (ns) keywords with the TextRank algorithm gives:
When Jane book friend article everyone make friends time university text story Jane friend feel teacher I will love school topic world career life contribute film place some author city things classmate student tree hole place experience things photo reading child problem major graduation character Girl Beijing campus exchange novel nickname girl Shanghai photography mobile phone inner girl Chinese individual dream constellation parents name boy music youth age hometown time affection literature each other culture appearance unable company beautiful material literature social record brother soul mother single Food day family home love letter platform mood relation result Gender reason ability eye aspect coffee editor singing game comment hour reality drawing voice childhood history sister emotion ideal way running man imagine mortal library content times meet contest body Clothes dormitory guest woman general meeting Public scenery society Unfamiliar interest Basic education spirit Mr. Chengdu good friends Alumni habit works Classroom art Thought primary school boyfriend offline contact community Wuhan Family information appearance gift world grow up taste stranger Guangzhou father circle of friends Impression opportunity female weight space sister rose memory marriage people Chongqing enthusiasm Hangzhou plan situation reader boy Xi ‘an small partner inspirational member girl train experience Shenzhen fantasy character accompany mood family meaning roommate college student country girl shandong state programmer sky Link thinking criteria
You can see that this does provide a lot more information. I had originally wanted to train word2vec on the article content with gensim and look at how the word vectors of the terms above are distributed, but I did not get it to work, so that is left for later.
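For reference, the gensim training I had in mind would look roughly like this (a sketch, not the code that failed); note that older gensim versions use size= instead of vector_size=:

import jieba
from gensim.models import Word2Vec

sentences = [[w for w in jieba.cut(text) if w.strip()] for text in df["content"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(model.wv.most_similar("朋友", topn=10))  # words closest to "friend"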
However, while looking for examples and figures of Chinese word vectors, I came across the natural language processing API documentation of Baidu Cloud AI, so after a free registration I simply called its word-embedding endpoint to obtain the corresponding Chinese word vectors.
# pip install baidu-aip
from aip import AipNlp

""" Your APP_ID / API_KEY / SECRET_KEY from Baidu Cloud AI """
APP_ID = 'your APP_ID'
API_KEY = 'your API_KEY'
SECRET_KEY = 'your SECRET_KEY'
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

word = "张飞"  # sample word (Zhang Fei)
""" Call the word-embedding endpoint """
data = client.wordEmbedding(word)
print(data)
Each word is represented as a 1024-dimensional vector (isn't that a bit high?). I used t-SNE to visualize these high-dimensional word vectors, starting with a reduction to 2 dimensions. Most points are lumped together, with no clusters of semantically related words and no clear separation of unrelated ones; perhaps the corpus Baidu Cloud AI used to train its word2vec model is not a great match for this topic.
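The dimensionality reduction itself is a few lines of scikit-learn, assuming vectors is an (n_words, 1024) array of the Baidu embeddings and labels is the matching word list:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
coords = tsne.fit_transform(vectors)  # vectors: (n_words, 1024) numpy array

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, labels):  # labels: list of words, same order
    plt.annotate(word, (x, y), fontsize=8)
plt.show()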
Reducing to 3 dimensions, the result is still not good: "brother" and "sister" end up far apart, which is a bit puzzling, although the pairs "brother"-"woman" and "man"-"sister" do seem to sit closer together... indescribable, indescribable......
2.3 LDA topic model
Next, let's look at the topics covered by all the articles. With more than 2,700 of them, reading one by one is not feasible, so we need a topic model. Each article (document) can be thought of as a mixture of several topics, and each term or word can be attributed to one of those topics.
LDA (Latent Dirichlet Allocation) is a generative topic model with three layers: word, topic and document.
The mathematics behind it is fairly involved, so I will skip it here (truth be told, I don't fully grasp it either). If you are interested, read "LDA Mathematical Gossip" on your own.
After extracting high-frequency words from the corpus with a bag-of-words model and building an LDA model with gensim, I printed 10 topics with the top 6 words of each. The results left me a bit at a loss. A side note: although LDA is called a topic model, the topics it produces are not tidy everyday categories like "entertainment", "sports" or "economy", and the right number of topics is not known in advance; it has to be tuned for each specific case:
"Jane's book" + 0.004 * 0.005 * "like" + 0.004 * "a" "article" + 0.003 + 0.003 * * "no" + 0.002 * 0.015 * "life", "a" "like" + 0.007 + 0.011 * * "no" + "Friends" + 0.005 * 0.005 * "know" + 0.005 * 0.009 * "now", "a" + "with" + 0.005 * 0.005 * "like" "many" + 0.004 + 0.004 * * "know" + 0.003 * "life" + 0.006 * 0.006 * "a" "like" "no" + 0.004 + 0.004 * * "know" "feel" + 0.003 + 0.004 * * "together" "like" + 0.012 * 0.014 * "a" + 0.005 x + "know" "No" + 0.004 * 0.004 * "hope" + "with" 0.014 * 0.004 * "a" + "like" + 0.011 * 0.012 * "no" "know" + 0.006 + 0.007 * * "with" + 0.004 * "feel" "Like" + 0.010 * 0.029 * "a" + "with" + 0.006 * 0.007 * "no" "hope" + 0.005 + 0.005 * * "Jane books" 0.011 * "a" "take off a single" + 0.007 + 0.008 * * "no" + "Mortal" + 0.005 * 0.006 * "like" + 0.005 * 0.006 * "know", "a" "like" + 0.005 + 0.006 * * "no" + 0.003 * "feel" "hope" + 0.003 + 0.003 * * "together" + 0.009 * 0.009 * "a" "tree" "submission" + 0.008 + 0.008 * * "like" "no" + 0.007 + 0.008 * * "know"Copy the code
Later I added the parameter passes=15 to the LDA modeling and again printed 10 topics with the top 6 words of each; the result improved:
(top 6 words per topic; weights omitted)
Topic 1: little hen, father-in-law, boom, wutong, winter, key
Topic 2: like, a, Jianshu, article, friends, no
Topic 3: a, like, no, know, with, life
Topic 4: stop being single, like, mortal, a, work, life
Topic 5: programmer, play, technology, liar, the cloud, marriage
Topic 6: activity, making friends, special topic, Jianshu, author, time
Topic 7: rabbit, Mr., dear, answer, small base, is
Topic 8: teacher, students, school, a, travel, is
Topic 9: together, no, a, like, classmate, campus
Topic 10: tree hole, submission, a, know, no, Jianshu
Then I swapped the bag-of-words representation for a TF-IDF model, which gives high weight to words frequent within a single document and low weight to words that appear across many documents. After LDA modeling, the 10 topics and their top 6 words are still awkward; further improvement is needed:
"Mother-in-law" + 0.000 * 0.001 * "tree" "submission" + 0.000 + 0.000 * * "played a" + + 0.000 * 0.000 * "huanhuan", "the authors" "tree" + 0.001 * 0.002 * + 0.001 * "anonymous" + "submit" "Account" + 0.001 * 0.001 * "venue" + "is derived from the" 0.001 * 0.001 * "tree" "like" + 0.001 + 0.001 * * "submission" + 0.000 * "together" "university" + 0.000 + 0.000 * * "no" "School" + 0.000 * 0.000 * "like" "Jane books" + 0.000 + 0.000 * * "no" "friends" + 0.000 + 0.000 * * "submission" 0.001 * "like" "compare" + 0.000 + 0.000 * * "hope" + + 0.000 * 0.000 * "Jane books" "of" company + 0.000 * 0.001 * "friend" "tree" "submission" + 0.000 + 0.001 * * "anonymous" "accounts" + 0.000 + 0.000 * * "venue" + 0.000 * "from" "Alumni" + 0.001 * 0.001 * "mortal" "campus" + 0.000 + 0.001 * * "to take off the list" "like" + 0.000 + 0.000 * * "activity" 0.000 * "child" "like" + 0.000 + 0.000 * * + "know" 0.000 * "no" "work" + 0.000 + 0.000 * * "text" "tree" + 0.001 * 0.001 * "submission" "like" + 0.000 + 0.000 * * "trouble" "with" + 0.000 + 0.000 * * "talk" "Like" + 0.001 * 0.001 * "with" + "hope" + 0.001 * 0.001 * "life" "feel" + 0.001 + 0.001 * * "know"Copy the code
2.4 Face detection and appearance level scoring
Now for the climax. In the "Making Friends on Jianshu" topic many people post photos, even selfies. So I took the article-URL column from the CSV mentioned at the start, crawled all the photos and renamed them by MD5 hash, ending up with 9,887 photos totaling 6.96 GB.
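The download-and-deduplicate step boils down to hashing each image; a small sketch (the URL and folder are placeholders):

import hashlib
import os
import requests

def save_image(url, out_dir="photos"):
    os.makedirs(out_dir, exist_ok=True)
    data = requests.get(url, timeout=10).content
    name = hashlib.md5(data).hexdigest() + ".jpg"  # identical images get identical names
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    return path

save_image("https://example.com/some_photo.jpg")  # placeholder URL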
But how to browse that many photos conveniently? First, using the approach from my earlier post Image Retrieval (1): Preparation and Outlook (a pit I still have not filled, to my shame), I extracted features from over 1,000 of the photos with a pretrained deep-learning image-recognition model, then reduced the dimensionality and visualized them. It does not really pull similar images together, but it is one way to look at a large number of photos at once; another is to stitch them into photo walls, which can then be fed into face recognition.
The next step is to automatically identify faces from the nearly 10,000 photos and screen out the better-looking ones.
I had noticed articles on this topic before and originally planned to follow Python crawler + face detection: grabbing high-appearance-level images from Zhihu. But the recent article How to find pretty young ladies on Douyin with a robot? has become quite popular, and its author shared the registered ID, KEY and other parameters, sparing me the trouble of signing up myself, so I made a few small changes and used it directly. I did not touch its appearance-scoring logic, which is slightly flawed but generally usable. After face detection and appearance scoring, the cropped face thumbnails were collected into one folder; the effect is as follows (will remove upon request):
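The post above relies on a third-party face API (with the shared ID/KEY); as a rough local stand-in for just the detect-and-crop step, a sketch using the open-source face_recognition library, without any appearance scoring, could look like this:

import os
import face_recognition
from PIL import Image

def crop_faces(photo_path, out_dir="faces"):
    os.makedirs(out_dir, exist_ok=True)
    image = face_recognition.load_image_file(photo_path)
    for i, (top, right, bottom, left) in enumerate(face_recognition.face_locations(image)):
        face = Image.fromarray(image[top:bottom, left:right])
        face.save(os.path.join(out_dir, f"{os.path.basename(photo_path)}_{i}.jpg"))

crop_faces("photos/example.jpg")  # placeholder path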
I stitched together a few more photo walls (for more examples, see Easily stitching a hundred photos with Python's PIL library), and you can see that most of these people were probably, at some point, interested in making friends here. Apart from a few stray memes and celebrity pictures, this is close to a group portrait of the "Making Friends on Jianshu" topic.
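A minimal PIL sketch of the photo-wall stitching (folder name and grid size are arbitrary):

import os
from PIL import Image

def make_wall(folder="faces", cols=10, size=100, out="wall.jpg"):
    files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".jpg"))
    rows = max(1, (len(files) + cols - 1) // cols)
    wall = Image.new("RGB", (cols * size, rows * size), "white")
    for i, name in enumerate(files):
        img = Image.open(os.path.join(folder, name)).resize((size, size))
        wall.paste(img, ((i % cols) * size, (i // cols) * size))
    wall.save(out)

make_wall()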
Please view them over Wi-Fi or with plenty of mobile data to spare. As for who these people are and which articles they appear in, I naturally cannot tell you, so as not to cause unnecessary harassment:
III. Loose Ends and Summary
This article does not include much code, so think of it as the "free edition". If people are interested, I will release the code as the "code/full edition", but since pasting it into a post is cumbersome and long-winded, it will probably just be dumped on GitHub as a Jupyter notebook.
The "messy stew" in the title was the plan from the start: throw a pile of techniques I already know, from text mining to image processing, into one pot. As for how the resulting stew turned out, I honestly do not know how to judge it.
I also picked up some coding tricks in this project, which was quite fun. The shortcomings of this article: the mined information is scattered rather than systematic; the word2vec and LDA topic-model parts need further study; I did not use NLP to extract named entities such as cities and occupations from the articles, which is worth trying later; pandas still needs practice in real projects to become fluent; and I would like to find a Python library that can turn the mass of photos into a GIF or short video that I and others could flip through quickly.
This article more or less fills the pit from that original private message, but the image-retrieval series is still stuck at its first installment. Well, that is probably just life; run away......
Welcome to follow my public account: Niuyi Guliu (ID: DesertsX), and to join the QQ group: Python Friends Entertainment Club (613176398). It is an entertainment club, no young models included.