Winning someone over can't rely on "moving" them.
Punctual good-morning and good-night messages, and meals delivered rain or shine, won't make someone like you; they may even come to resent it.
It's better to improve yourself, so your partner finds you attractive and grounded.
The person a woman falls in love with is someone she loves, not someone who grovels at her feet.
Being considerate but having no ambition, and refusing to face reality, gets you nowhere.
Dreaming up new ways to please her every day may make her happy, but it won't make her love you, let alone fall for you.
A man like this can't really give the other person a sense of security; even if she says "I love you," it won't last long, and all that fawning ends in nothing.
Scraping video data from Bilibili (station B)
Today's hands-on project is scraping video data from Bilibili. What are the main pieces of data for a video?
- Number of likes
- Number of coins
- Number of favorites
- Number of danmu (bullet comments)
- Video duration
- Release date
These are the main data points for a video; our goal is to scrape them and save them to a database.
Target Page Analysis
The data we're crawling is all the video data from the uploader Watcher.
Every video on Bilibili has an AV number.
For example, if we click into the first video, we can see its various data: likes, coins, favorites, danmu, duration, and release date.
See? The main video data is presented right on the page.
To make sure the data is really there, open Network in the developer tools and check the Response to see whether it contains the same values. If it does, the data is embedded directly in the page rather than loaded asynchronously via JS or Ajax. (Personally, I actually prefer data rendered through those two channels, since JSON is easier for me to handle.)
As you can see above, the data is indeed contained in the page, which means we can fetch and extract it with Python as a normal request.
With the analysis done, let me first sketch out the code framework:
import re
import time

import jsonpath
import pymysql
import requests
from lxml import etree

# Get web page information
def get_html(url):
    pass

# Parse data
def parse_html(html):
    pass

# Save data
def save_data(data):
    pass
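Before filling these in, it helps to see how the three stubs will be wired together. Here is a minimal, runnable sketch of the flow; the stub bodies are placeholders of my own, not the real implementations that appear later in the post:

```python
def get_html(url):
    # Placeholder: the real version fetches the page with requests.
    return f"<html>{url}</html>"

def parse_html(html):
    # Placeholder: the real version extracts fields with XPath.
    return (html,)

saved = []
def save_data(data):
    # Placeholder: the real version inserts a row into MySQL.
    saved.append(data)

# Driver: fetch -> parse -> save, one video at a time.
for url in ['https://www.bilibili.com/video/BV-example']:
    save_data(parse_html(get_html(url)))
```

Each video URL flows through the same three steps, so the rest of the article is just a matter of filling in each stub.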
Get web page information
The next step is extracting the data we need from the page. I chose XPath for the extraction.
The XPath expressions are fairly simple, as shown below:
def parse_html(html):
    html = etree.HTML(html)
    title = html.xpath('//span[contains(@class, "tit")]/text()')[0]
    danmu = int(re.findall(r'\d+', html.xpath('//span[@class="dm"]/text()')[0])[0])
    times = html.xpath('//div[@class="video-data"]/span[3]/text()')[0]
    like = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[@class="like"]/@title')[0])[0])
    coin = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[2]/@title')[0])[0])
    collect = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[@class="collect"]/@title')[0])[0])
    return title, danmu, times, like, coin, collect
A brief note on the code above:
- title is the video's title;
- danmu is the number of danmu (bullet comments);
- times is the release date;
- like is the number of likes;
- coin is the number of coins given;
- collect is the number of favorites.
I’m sure xpath syntax doesn’t need any more explanation from me.
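As a quick sanity check on the `re.findall(r'\d+', ...)` pattern used above, here is what it extracts from strings shaped like the page's attribute values (the sample strings below are made up for illustration, not copied from Bilibili):

```python
import re

# Hypothetical samples resembling the @title / text values on the page.
like_title = '点赞数2453'    # "like count 2453"
dm_text = '1024条弹幕'       # "1024 danmu"

# findall returns every run of digits; [0] takes the first one.
like = int(re.findall(r'\d+', like_title)[0])
danmu = int(re.findall(r'\d+', dm_text)[0])
```

Note that this keeps only the first run of digits, which is exactly what the parsing code above relies on.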
If you've read this far carefully, you'll notice one piece of data is still missing: the video duration. After a careful search, I could not find the duration anywhere in the response. To be honest, I was briefly stumped.
So I went back to the uploader's video list page, which looks like this:
I refreshed the page and, luckily, finally found it.
There are 30 videos per page, and the title, ID, and duration of each video are all right here. As you can see from the figure above, this is JSON data loaded asynchronously via Ajax.
Get links and times for all videos
From the analysis above, I have a general idea: the API returns the IDs of 30 videos at a time, along with each video's basic information. First, I need to know how the API URL changes from page to page.
# Page 1
https://api.bilibili.com/x/space/arc/search?mid=10330740&ps=30&tid=0&pn=1&keyword=&order=pubdate&jsonp=jsonp
# Page 2
https://api.bilibili.com/x/space/arc/search?mid=10330740&ps=30&tid=0&pn=2&keyword=&order=pubdate&jsonp=jsonp
# Page 3
https://api.bilibili.com/x/space/arc/search?mid=10330740&ps=30&tid=0&pn=3&keyword=&order=pubdate&jsonp=jsonp
The only parameter that varies is pn, which changes with the page number.
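Since pn is the only moving part, the page URLs can be generated with a simple template; this snippet just demonstrates the pattern:

```python
# Only pn changes from page to page; everything else is fixed.
base = ('https://api.bilibili.com/x/space/arc/search'
        '?mid=10330740&ps=30&tid=0&pn={}&keyword=&order=pubdate&jsonp=jsonp')
page_urls = [base.format(page) for page in range(1, 4)]
```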
So I defined two methods here: one to get the video links and one to get the video durations.
The specific code is as follows:
# Get links to the videos on pages 1-10
def get_url():
    urls = []
    for page in range(1, 11):
        api_url = f'https://api.bilibili.com/x/space/arc/search?mid=10330740&ps=30&tid=0&pn={page}&keyword=&order=pubdate&jsonp=jsonp'
        data = requests.get(api_url, headers=headers).json()
        bvids = jsonpath.jsonpath(data, '$.data..vlist..bvid')
        time.sleep(0.5)
        url = ['https://www.bilibili.com/video/' + bvid for bvid in bvids]
        urls.extend(url)
    return urls
# Get the video durations for pages 1-10
def get_length():
    times = []
    for page in range(1, 11):
        api_url = f'https://api.bilibili.com/x/space/arc/search?mid=10330740&ps=30&tid=0&pn={page}&keyword=&order=pubdate&jsonp=jsonp'
        data = requests.get(api_url, headers=headers).json()
        length = jsonpath.jsonpath(data, '$.data..vlist..length')
        times.extend(length)
        time.sleep(0.5)
    return times
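One practical note: the length field comes back as a string like "05:31" (minutes:seconds). The article stores it as-is, but if you ever want durations in seconds for analysis, a small helper of my own (not part of the original code) does the conversion:

```python
def length_to_seconds(length):
    # "mm:ss" (or "hh:mm:ss" for long videos) -> total seconds
    seconds = 0
    for part in length.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds
```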
Modify the code for obtaining web page information
Now that I can get the video durations, I need to call that method from the page-parsing code so everything can be stored together. The modified code looks like this:
def parse_html(html):
    global pages
    lengths = get_length()
    html = etree.HTML(html)
    title = html.xpath('//span[contains(@class, "tit")]/text()')[0]
    danmu = int(re.findall(r'\d+', html.xpath('//span[@class="dm"]/text()')[0])[0])
    times = html.xpath('//div[@class="video-data"]/span[3]/text()')[0]
    like = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[@class="like"]/@title')[0])[0])
    coin = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[2]/@title')[0])[0])
    collect = int(re.findall(r'\d+', html.xpath('//div[@class="ops"]/span[@class="collect"]/@title')[0])[0])
    return title, danmu, times, like, coin, collect, lengths[pages]
Since the durations come back as a list, I defined a global variable called pages, which I increment by 1 after each call to **save_data()**.
pages is really just the index of the current video, used to pick its duration out of the list.
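The counter pattern looks roughly like this in the driver loop (a sketch with stand-in values; the real loop calls parse_html and save_data):

```python
# The global index advances after each video is saved, so the next
# parse picks up the next duration from the list.
pages = 0
lengths = ['05:31', '12:04', '00:58']       # stand-in durations
rows = []

for url in ['url1', 'url2', 'url3']:        # stand-in video URLs
    rows.append((url, lengths[pages]))      # stand-in for parse + save
    pages += 1                              # move to the next video
```

This only works if the videos are processed in the same order as the duration list, which is why both come from the same API pages.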
Save the data
This time the data is saved to a MySQL database; the specific code is as follows:
def save_data(data):
    db = pymysql.connect(host='localhost', user='root', password='password',
                         port=3306, db='bilibli')
    cursor = db.cursor()
    sql = 'insert into data3(title, danmu, vediotime, likecount, coin, collect, vediolength) values (%s, %s, %s, %s, %s, %s, %s)'
    print(data)
    try:
        cursor.execute(sql, data)
        db.commit()
        print('Inserted successfully')
    except Exception as e:
        print(e)
        db.rollback()
        print('Insert failed')
    finally:
        cursor.close()
        db.close()
Note that the database bilibli and the table data3 must already exist before save_data() is called.
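For reference, a schema matching the insert statement above could be created like this. The column types are my own guesses; the original post does not show its DDL (and it spells the database name bilibli):

```sql
-- Hypothetical DDL matching the INSERT statement above.
CREATE DATABASE IF NOT EXISTS bilibli;
USE bilibli;
CREATE TABLE IF NOT EXISTS data3 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    danmu INT,
    vediotime VARCHAR(32),    -- release date, stored as text
    likecount INT,
    coin INT,
    collect INT,
    vediolength VARCHAR(16)   -- "mm:ss" duration string
);
```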
This article covers only basic crawler techniques, not data analysis.
Taking it as a starting point, you can crawl the detailed data of your favorite UP host's videos. Add one more field, such as play count, and you could do a Top 100 analysis of Bilibili videos, consolidating your crawling skills while also picking up some simple data visualization.
Final words
Other people's advice is only a reference answer; life has no standard answer.
You don't have to copy others or force yourself to be like them; just do what you believe is right.
If this article helped you, tap "Watching"; I'll keep working hard, and we'll grow together.
Every word here was typed out with care; I only hope to live up to everyone who follows me.
Tapping "Watching" also lets me know you're doing your best at life.
I'm Book-learning, someone who focuses on learning. The more you know, the more you realize you don't know. See you next time for more great content!
Respect.