Web crawling is an important application of Python. With a Python crawler, we can easily grab the data we want from the Internet. This article walks through the basic workflow of a Python crawler in detail, using the example of scraping and storing the data from Bilibili's hot video ranking list. If you are still at the beginning stage, or don't know how crawlers work, you should read this article carefully!

Step 1: Try the request

First go to the Bilibili homepage, click the ranking page and copy the link:

https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3

Now launch Jupyter Notebook and run the following code

import requests

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
res = requests.get(url)
print(res.status_code)
# 200

In the code above, we did three things:

(1) imported requests

(2) constructed the request using the get method

(3) obtained the web page status code with status_code

You can see that the return value is 200, indicating that the server is responding properly, which means we can continue.
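In practice, some sites reject requests that don't look like they come from a browser, and a non-200 status is easy to miss if you only print it. Here is a minimal, more defensive sketch; the User-Agent value is just an illustrative browser-like header, not something this particular site is known to require:

import requests

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'

headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative browser-like header

res = requests.get(url, headers=headers, timeout=10)
res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
print(res.status_code)
# 200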


Step 2: Parse the page

After requesting data from the website through requests in the previous step, we successfully got a Response object containing the server's resources, which we can now view using .text.
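For example, printing the first few hundred characters is usually enough to confirm that we received real HTML (res is the Response object from step 1):

print(res.text[:500])  # peek at the start of the returned HTML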

You will see that .text returns a string, and it does contain the hot-list video data we need, but extracting content directly from a raw string is complicated and inefficient. We therefore need to parse it, converting the string into structured web-page data, so that we can easily query HTML tags, attributes and content.

There are many ways to parse web pages in Python: regular expressions, BeautifulSoup, PyQuery or lxml. This article is based on BeautifulSoup.

Beautiful Soup is a third-party library that extracts data from HTML or XML files. It is imported with from bs4 import BeautifulSoup. Here is a simple example of how it works:

import requests
from bs4 import BeautifulSoup

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.title.text  # text of the <title> tag
print(title)
# (゜-゜)つロ Cheers~ -bilibili

In the code above, we convert the HTML string from the previous step into a BeautifulSoup object using the BeautifulSoup class from bs4. Note that we need to specify which parser to use; here it is html.parser.

You can then fetch structured elements and their attributes: soup.title.text for the page title, and soup.body, soup.p and so on for any element you want.
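A few more quick accessors, sketched here; the exact output depends on the live page:

print(soup.title.text)   # page title
first_p = soup.p         # the first <p> tag in the document, or None
print(first_p.text if first_p else None)
first_a = soup.find('a') # the first link, or None
print(first_a['href'] if first_a else None)  # its href attribute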

Step 3: Extract the content

In the previous two steps, we used requests to ask the website for data and parsed the page with bs4. Now we come to the most important step: how to extract the content we want from the parsed page. In Beautiful Soup we can use find/find_all to locate elements, but I prefer the CSS selector method select, because it lets you walk the DOM tree the same way you would select elements with CSS.
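As a quick illustration of the equivalence, the two calls below locate the same elements (the class name is the one we find on the ranking page in a moment):

items_a = soup.find_all('li', class_='rank-item')  # find_all style
items_b = soup.select('li.rank-item')              # CSS selector style
print(len(items_a) == len(items_b))  # True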

Now let's use code to explain how to extract Bilibili's hot-list data from the parsed page. First we need to find the tag that stores the data: press F12 on the ranking page and inspect the list elements in the developer tools.

You can see that each video's information is wrapped in an li tag with class="rank-item", so the code looks like this:

all_products = []

products = soup.select('li.rank-item')
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    url = product.select('div.info > a')[0].attrs['href']

    all_products.append({
        "Video Ranking": rank,
        "Video Name": name,
        "View Count": play,
        "Danmaku Count": comment,
        "Uploader": up,
        "Video Link": url
    })

In the code above, soup.select('li.rank-item') returns a list of the individual video entries. We then iterate over each entry, still using CSS selectors to extract the field information, and store each entry as a dictionary in the empty list defined at the beginning.

Note that I used several different selection patterns to extract the elements; this is the flexibility of the select method, and interested readers can explore it further.
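For reference, here is a small sketch of selector patterns that select understands; div.num, div.info > a and span.data-box above are the real ones used in this article, the last line is just a generic illustration:

soup.select('div.num')         # tag plus class
soup.select('div.info > a')    # direct-child combinator
soup.select('li.rank-item a')  # descendant combinator
soup.select('a[href]')         # attribute selector: <a> tags that have an href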

Step 4: Store data

Through the previous three steps, we successfully extracted the data from the site using requests + bs4. Now we simply write the data to a CSV file (which Excel can open) for storage.

If you are unfamiliar with pandas, you can use the csv module to write the data to a CSV file. Note that the encoding is set to 'utf-8-sig' so that Excel displays the non-ASCII characters correctly.

import csv

keys = all_products[0].keys()

with open('Bilibili Video Hot List Top100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

If you are familiar with pandas, you can easily convert the list of dictionaries to a DataFrame and save it in one line of code:

import pandas as pd

keys = all_products[0].keys()

pd.DataFrame(all_products, columns=keys).to_csv('Bilibili Video Hot List Top100.csv', encoding='utf-8-sig')
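One design note: by default to_csv also writes the DataFrame's integer index as an extra first column; pass index=False if you only want the data columns in the file.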

Summary

At this point we have successfully used Python to store the hot video list data locally, and most requests-based crawlers basically follow the same four steps.

Although it looks simple, each step can be much harder in real scenarios. Starting from the request itself, target websites deploy various forms of anti-crawling measures and encryption, and parsing, extracting and even storing the data all leave plenty to explore and learn.
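As a small taste of what less friendly sites call for, here is a minimal sketch of two common courtesies: sending a browser-like User-Agent and pausing between requests. The header value and the 1-second delay are illustrative choices, not requirements of any particular site:

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative browser-like header
res = requests.get('https://www.bilibili.com/ranking', headers=headers, timeout=10)
time.sleep(1)  # pause before the next request to avoid hammering the server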

This article chose Bilibili's video hot list simply because it is easy enough. I hope this case helps you understand the basic workflow of a crawler. The complete code is attached below:

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('li.rank-item')
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    url = product.select('div.info > a')[0].attrs['href']

    all_products.append({
        "Video Ranking": rank,
        "Video Name": name,
        "View Count": play,
        "Danmaku Count": comment,
        "Uploader": up,
        "Video Link": url
    })


keys = all_products[0].keys()

with open('Bilibili Video Hot List Top100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

### Write the data with pandas
pd.DataFrame(all_products, columns=keys).to_csv('Bilibili Video Hot List Top100.csv', encoding='utf-8-sig')
