This summer, Chang'an Twelve Hours became a hit. It stars Yi Yangqianxi and Lei Jiayin and is adapted from Ma Boyong's novel of the same name. The production spent seven months building a Chang'an city set covering more than 70 mu. With rigorous historical research and elegant design, the drama can fairly be called a conscientious production, and it holds a high Douban score of 8.6.
I. Background
The story is set in the third year of the Tang Tianbao era, on the day of the Lantern Festival in Chang'an, the imperial capital. Amid the singing and dancing of a flourishing city, a group of Turkic wolf guards has secretly slipped in, plotting to destroy it. Only one death-row prisoner can save Chang'an, and he has just 12 hours, which sets up a thrilling story.
The big official correspondence technique
II. Function Description
Since we have recently been covering crawlers and data analysis, I want to use modern big data to find out why this TV series is so popular and what everyone thinks of it. (The 900 most frequent words across all danmu)
III. Technical Approach
- Analyze how Youku loads its danmu (bullet comments) and crawl them with the Requests library
- A lot of data needs to be captured, and the volume may be large
- Focus on data cleaning, for example danmu about the plot, character names, and the "High-energy Jun" danmu
- Turn the danmu into a word cloud
IV. Technical Implementation
Brother Pig will explain each step of the process in detail. I hope interested readers will read carefully and then practice on their own, so that they really learn something.
This tutorial is for learning and exchange only, not for commercial profit; you bear the consequences of any other use! If it infringes on or adversely affects any company or individual, please let me know and I will delete it.
1. Analyze and obtain the URL of the danmu interface
Step 1: Open the Youku website, click into the TV series, then right-click the page and choose Inspect (or press F12) to open the browser's developer tools.
In the request's Headers tab, note the Referer and User-Agent request headers. The danmu interface URL is: service.danmu.youku.com/list?jsonca…
2. Crawl the danmu data
Once the URL is found, we can start coding. The approach is the same as before: first fetch, extract, and save a single batch of data, then work out batch fetching.
If you don't know what the Requests library is, see this article: Introduction to the Requests Library.
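As a rough sketch of that first single request, the function below sends one GET with the Referer and User-Agent headers set. The interface URL is truncated in this article, so it is passed in as an argument (copy the full one from your own developer tools). The function names, the default User-Agent string, and the `get` parameter are my own illustrative choices, not part of the original project.

```python
def build_headers(referer, user_agent="Mozilla/5.0"):
    """Return the two request headers highlighted above."""
    return {"Referer": referer, "User-Agent": user_agent}


def fetch_danmu_text(url, referer, get=None):
    """Fetch one danmu response and return its raw text.

    `get` defaults to requests.get. It is a parameter only so the
    function can be tried out without network access (hypothetical design).
    """
    if get is None:
        import requests  # third-party: pip install requests
        get = requests.get
    response = get(url, headers=build_headers(referer), timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Call it with the full danmu URL and the address of the episode page as the referer; the returned text is still wrapped in the JSONP callback at this point.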
3. Data extraction
Step 1: Extract the JSON data. Looking at the returned data, we see that, as in the previous article, the cross-domain request uses JSONP. So we need to trim the response: strip the leading jQuery111203412576115734338_1562833192066( and the final ), and keep only the JSON in the middle.
r.text.index('(')
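Building on that index call, here is a minimal sketch of the trimming step. It assumes the response has the shape callback(...) with valid JSON between the first opening parenthesis and the last closing one; the function name is hypothetical.

```python
import json


def strip_jsonp(text):
    """Drop the JSONP callback wrapper and parse the JSON inside it."""
    start = text.index("(") + 1   # first "(" ends the callback name
    end = text.rindex(")")        # last ")" closes the callback call
    return json.loads(text[start:end])
```

Using rindex for the closing parenthesis keeps the function safe even when a danmu itself contains a ")".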
Step 2: Extract the danmu data. Once we have the JSON, we work out where the danmu sit inside it. We can inspect the structure in the Preview tab of the browser's developer tools.
The danmu are under result, and the text of each one is in its content field.
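A small sketch of this extraction, assuming the parsed JSON is a dict whose result key holds a list of objects, each with a content string. Only those two field names come from the article; the exact nesting is my assumption, so check it against the Preview tab.

```python
def extract_contents(payload):
    """Return the text of every danmu found under payload["result"]."""
    contents = []
    for item in payload.get("result", []):
        text = item.get("content")
        if text:  # skip entries with missing or empty text
            contents.append(text)
    return contents
```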
4. Save the data
Once the desired data is extracted, we can save it. We again save to a file, because it is simple to do and meets our needs.
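Saving to a file could look something like this: one danmu per line, appended so that later requests add to the same file. The file name danmu.txt and the function name are placeholders of mine.

```python
def save_danmu(contents, path="danmu.txt"):
    """Append each danmu to `path`, one per line, and return how many were written."""
    with open(path, "a", encoding="utf-8") as f:
        for text in contents:
            # keep one danmu per line even if the text contains a newline
            f.write(text.replace("\n", " ") + "\n")
    return len(contents)
```

Opening the file with encoding="utf-8" matters here, since the danmu are mostly Chinese text.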
5. Batch crawl
Now that a single request has been crawled, extracted, and saved, let's look at how to crawl data in batches. This differs from our earlier batch crawls: how do we crawl batches of data across multiple episodes?
When Brother Pig runs into problems and difficulties, he always likes to quantify the task, break it down, and then solve it step by step!
Here we split the batch crawl into two steps. Step 1: crawl all the danmu in one episode. Step 2: crawl multiple episodes!
The key to batch crawling is finding the paging parameter. The trick is to compare the parameters of two request URLs and see what differs.
Comparing the first request and the second request within the same episode, the parameter that changes is mat.
Change the paging parameter in the original URL into a variable passed in to the method. Then create a batch crawl method that calls the single-request crawl method in a loop, passing in the page number on each call!
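The loop described here can be sketched as follows. fetch_page stands for the single-request method (it takes a mat value and returns a list of danmu); it, the pages argument, and the delay are hypothetical names and values, and you need to work out the real range of mat for an episode yourself.

```python
import time


def crawl_episode(fetch_page, pages, delay=0.5):
    """Call fetch_page(mat) for every mat in `pages` and collect the danmu."""
    all_danmu = []
    for mat in pages:
        all_danmu.extend(fetch_page(mat))  # a page may legitimately be empty
        time.sleep(delay)                  # pause between requests to be polite
    return all_danmu
```

I deliberately do not stop at the first empty page, since a stretch of the episode with no danmu would otherwise end the crawl early.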
Step 2: Crawl all episodes
Compare the first danmu request URL of episode 1 with that of episode 2:
iid=1061156738
iid=1061112026
The parameter that differs is iid.
At this point we need to go back to the web page to find the answer. Copy the iid value of episode 1, 1061156738, and search for it in the browser's developer tools; it turns out that iid is the vid value of another interface.
Another request header needs to be introduced here: Cookie. What are cookies for?
Because HTTP is stateless, the server does not know who you are the next time you send a request, so cookies and sessions are used to record state. The simplest example: after a user logs in, the server issues an encrypted string (a key) to the browser and caches a key-value pair on its own side. The browser then carries this key on every request, and the server knows which user you are!
Space is limited, so today is only a brief introduction. Given how important cookies are, Brother Pig will write a dedicated article about them later.
So where do we find the cookie? The answer is the browser.
However, cookies shown in this table form cannot be copied directly. Is there a trick for copying cookies? Yes: click the Console tab in the browser's developer tools and type document.cookie to see all the cookies.
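The string printed by document.cookie is a list of name=value pairs separated by semicolons. If you prefer a dict over one long string, a small helper like this hypothetical one splits it up:

```python
def parse_cookie_string(raw):
    """Turn 'a=1; b=2' (the document.cookie format) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(";"):
        # split on the first "=" only, because values may contain "=" themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The resulting dict can be passed to Requests through its cookies argument; sending the raw string unchanged as the Cookie request header works as well.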
Now that we have the episode ids, we can use a double loop to crawl all the episodes. On to the code.
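A minimal sketch of the double loop, assuming a hypothetical fetch_page(iid, mat) that performs one request and returns a list of danmu. The outer loop walks the episode ids and the inner loop walks the pages of each episode. This is my own illustration of the idea, not the code from the project repository.

```python
def crawl_all_episodes(episode_ids, pages, fetch_page):
    """Return {iid: [danmu, ...]} for every episode id.

    `pages` must be re-iterable (a range or list), because it is
    walked once per episode.
    """
    danmu_by_episode = {}
    for iid in episode_ids:            # outer loop: one pass per episode
        episode_danmu = []
        for mat in pages:              # inner loop: one request per page
            episode_danmu.extend(fetch_page(iid, mat))
        danmu_by_episode[iid] = episode_danmu
    return danmu_by_episode
```

Keeping the danmu grouped by iid makes it easy to rerun a single episode if its requests fail, for example after the cookie expires.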
6. Data cleaning + word cloud generation
What data should we clean? It is hard to guess in advance, so we skip cleaning at first, generate the word cloud to see what comes out, and then adjust. Brother Pig already covered word cloud generation in the last article, on crawling JD.com (Jingdong) product reviews and generating a word cloud!
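One way to sketch the "generate first, clean afterwards" loop: count word frequencies, look at the result, add unwanted words to a stop list, and count again. The names below are hypothetical. The default tokenizer only splits on whitespace, so for Chinese danmu you would pass a real segmenter such as jieba.lcut. Dropping single-character tokens is my own assumption about what counts as noise.

```python
from collections import Counter


def top_words(lines, tokenize=str.split, stop_words=(), limit=900):
    """Return the `limit` most frequent words as a {word: count} dict."""
    stop = set(stop_words)
    counts = Counter()
    for line in lines:
        for word in tokenize(line):
            word = word.strip()
            # assumption: single characters and stop words are noise
            if len(word) > 1 and word not in stop:
                counts[word] += 1
    return dict(counts.most_common(limit))


def render_word_cloud(frequencies, font_path, out_path="danmu_cloud.png"):
    """Draw the frequencies as an image (needs the third-party wordcloud package)."""
    from wordcloud import WordCloud  # pip install wordcloud
    cloud = WordCloud(font_path=font_path, max_words=900, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
```

The font_path should point to a font that contains Chinese glyphs, otherwise the words are drawn as empty boxes.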
7. Analyze the word cloud
From the word cloud picture above, we can see that:
- Some of the main characters of the series appear: Zhang Xiaojing, Li Bi, Cui Qi, Long Bo, Xu Bin, and there are even some people who like Cao Piaoyan.
- Some people say it is good and some say they don't understand it, which suggests the plot may be a bit deep
- It may look a bit like Assassin's Creed
- "Four-character brother" and "Qianxi" show that Yi Yangqianxi is in the series
- There may be a surprise in the ending song
- "Datang" and "Chang'an" indicate the setting
- "Danmu" and "IQ": maybe everyone is reminding you to turn off the danmu and protect your IQ!
So far the first season (20 episodes) has been released. It really is a conscientious original production: the picture quality, costumes, etiquette, cinematography, script, and acting can all be rated first-class. I recommend it to everyone!
V. Summary
Let's summarize today's article from a technical perspective. The process looks similar to the last article's (crawling JD.com product reviews and generating a word cloud), but it is a bit harder:
- This time we need to find not only the paging parameter but also the episode parameter
- Cookies are required, and they expire
- The large volume of data may tax your computer's performance a bit
- Data cleaning is needed when generating the word cloud
On the weekend: melon seeds, peanuts, and beer, watching TV and programming without neglecting either. Isn't life enjoyable!
Project address: github.com/pig6/youku_…