“This is the second day of my participation in the November Gengwen Challenge. For details, see: The Last Gengwen Challenge of 2021.”

bs4 (Beautiful Soup) is mainly used for extracting data from web pages. Before using bs4, you need to install the library for your Python environment; the installation steps are not covered here.
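For reference, a typical installation looks like the line below (assuming pip is available; lxml is included because it is the parser used later in this article):

pip install beautifulsoup4 lxml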

1. Use bs4 to parse the Juejin home page

Beautiful Soup makes it easy to parse web page information. It is distributed as the bs4 package and is imported from there when needed. The basic usage is as follows.

from bs4 import BeautifulSoup
import requests

url = 'https://juejin.cn/'
strhtml = requests.get(url)
print(strhtml.text)

soup = BeautifulSoup(strhtml.text, 'lxml')
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
print(data)

Running this prints the raw HTML of the page, followed by the list of elements matched by the selector.

Note: the HTML document is first converted to Unicode, and then Beautiful Soup parses it with the most appropriate parser; here we explicitly specify the lxml library. Parsing transforms a complex HTML document into a tree structure in which each node is a Python object. The parsed document is stored in the newly created variable soup, as follows.

soup = BeautifulSoup(strhtml.text, 'lxml')
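Because every node of the parse tree is a Python object, you can also walk the tree directly instead of using CSS selectors. A minimal sketch using the soup variable from above (the tag names are just examples of what might exist on the page):

soup.title                       # the <title> tag object
soup.title.string                # the text inside <title>
first_link = soup.find('a')      # first <a> tag in the document
all_links = soup.find_all('a')   # every <a> tag
first_link.get('href')           # value of the href attribute, or None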

At this point we have matched some HTML elements, but we still haven’t extracted the data from them, so enter the following code in PyCharm.

for item in data:
    result = {
        'title':item.get_text(),
        'link':item.get('href')
    }
    print(result)
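Here get_text() returns the visible text of each matched element and get('href') returns its link attribute. If the extracted links turn out to be relative paths, they can be resolved into absolute URLs; a small sketch using the standard library and the url and data variables from above (assuming the links are relative to the site root):

from urllib.parse import urljoin

for item in data:
    href = item.get('href')           # may be relative, e.g. '/post/...'
    full_link = urljoin(url, href)    # resolve against the page URL
    print(item.get_text(), full_link)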

2. Regular expressions

The following regular-expression symbols are used:

  • \d : matches a digit
  • + : matches the preceding character one or more times

Regular expressions are used in Python through the re module, which is part of the standard library, so it does not need to be installed and can be imported directly.

import re

for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href'),
        'ID': re.findall(r'\d+', item.get('href'))
    }
    print(result)

Here we use the findall() method of the re module: the first argument is the regular expression and the second argument is the string to search in; the method returns a list of all matches.
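For example, applied to a link that ends in a numeric ID (the path below is just an illustrative value, not taken from the page):

import re

print(re.findall(r'\d+', '/news/12345.html'))   # ['12345']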

3. Proxy IP

A crawler simulates human browsing behavior and fetches data in batches. If too much data is fetched too quickly, it puts strain on the server and may even bring it down. In other words, servers don’t like people scraping their data in bulk, so websites adopt anti-crawling strategies against such crawlers.

There are usually two solutions. One is to add a delay, for example waiting three seconds between requests, with code as follows:

import time
time.sleep(3)
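In practice the delay goes between successive requests; a minimal sketch (the list of URLs is just a placeholder):

import time
import requests

urls = ['https://juejin.cn/', 'https://juejin.cn/']   # placeholder URLs
for page in urls:
    response = requests.get(page)
    # ... parse the response here ...
    time.sleep(3)   # wait three seconds before the next request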

The other option is to build a proxy pool as follows:

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
response = requests.get(url, proxies=proxies)
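A single proxies dict only routes traffic through fixed proxies; a pool usually means picking a different proxy for each request. A minimal sketch, assuming you have a list of working proxy addresses (the addresses below are placeholders):

import random
import requests

proxy_pool = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
]   # placeholder proxy addresses

proxy = random.choice(proxy_pool)   # pick one proxy at random per request
response = requests.get(url, proxies={'http': proxy, 'https': proxy})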

That’s all for today. Tomorrow we’ll try to crawl NetEase Cloud Music comments!