
Modern professionals should be affectionate, interesting, useful, and tasteful. To become the office humor expert, you need material, and that material is jokes: read more jokes so you can tell more jokes, and even tell some sophisticated ones.

Analysis before crawling

The target site for this crawl is: www.wllxy.net/gxqmlist.as…

The crawl itself is not difficult, and the analysis step can largely be skipped. After all, having come this far in the series, you have already mastered 70–80% of Requests.

This article focuses on introducing proxy support in Requests.

Time for some basic crawler knowledge

What is a proxy

A proxy obtains network resources on behalf of the user. In plain English, it hides the user’s own IP address and other network details from the target site.

Types of proxies

High-anonymity proxy: forwards the packet untouched, so to the target site’s server it looks like an ordinary real user is visiting, and the IP it sees is the proxy server’s address. This perfectly hides the user’s original IP, which makes the high-anonymity proxy the first choice for crawlers.

Ordinary anonymous proxy: makes some changes to the packet, adding fixed HTTP header fields. Because of these fixed fields, the target server can tell that a proxy is in use and may trace the user’s real IP, so anti-crawler sites can easily judge that the visitor is a crawler.

Transparent proxy: needs no detailed explanation; using one is proxying in vain, since the target server can detect it easily.
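To see for yourself which category a proxy falls into, you can compare what a target server observes with and without it. Here is a minimal sketch using httpbin.org, a public request-echo service; the proxy address is a placeholder, not a working server:

import requests

# httpbin.org/get echoes back the origin IP and the headers it received,
# which reveals exactly what the proxy exposes about you.
print(requests.get("https://httpbin.org/get").json()["origin"])  # your real IP

proxies = {"https": "http://10.10.1.10:3128"}  # placeholder proxy address
resp = requests.get("https://httpbin.org/get", proxies=proxies)
print(resp.json()["origin"])   # the IP the server actually saw
print(resp.json()["headers"])  # any proxy-added headers (e.g. Via)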

Proxies are also sometimes classified by protocol, HTTP versus HTTPS. Most websites have by now upgraded to HTTPS, but HTTP has not been abandoned, and HTTP sites can generally still be crawled. Note that HTTPS requires multiple handshakes and is therefore slower, and it becomes slower still through a proxy. So whenever a site can be crawled over HTTP, prefer HTTP, including when using a proxy.

Using a proxy with Requests

Setting a proxy in Requests is simple: pass a dictionary to the proxies parameter of any request method.

import requests

proxies = {
  "http": "http://10.10.1.10:3128"."https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Note that proxies is a dictionary parameter whose keys are URL schemes; it can contain an entry for HTTP, for HTTPS, or both.
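One more detail worth knowing (an addition to the original write-up): if no proxies argument is supplied, Requests falls back to the standard proxy environment variables. A minimal sketch with a placeholder address:

import os
import requests

# With no proxies= argument, Requests honors HTTP_PROXY / HTTPS_PROXY
# (and ALL_PROXY) from the environment, unless trust_env is disabled.
os.environ["HTTP_PROXY"] = "http://10.10.1.10:3128"  # placeholder address
requests.get("http://example.org")  # routed through the proxy above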

Note that Requests also supports SOCKS proxies; this requires an optional dependency, installed with pip install requests[socks].
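A minimal sketch of the SOCKS variant; the local address below is a placeholder, assuming a SOCKS5 service is running there:

import requests

# Requires the optional extra: pip install requests[socks]
proxies = {
    "http": "socks5://127.0.0.1:1080",   # placeholder SOCKS5 address
    "https": "socks5://127.0.0.1:1080",
}
requests.get("http://example.org", proxies=proxies)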

Coding time

With the proxy background covered, let’s move on to the actual coding.

import requests
import re
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

flag_page = 0

# Regular-expression parsing; the three result sequences are merged
# at the end with the zip function
def anay(html):
    # The regex matching runs three times here; making it more efficient
    # is left as an exercise for the reader.
    pattern = re.compile(
        r'<td class="diggtdright">[.\s]*<a href=".*?"  target="_blank">\s*(.*?)</a>')
    titles = pattern.findall(html)
    # The literal text in the two patterns below must match the site's actual markup.
    times = re.findall(r'date of issue :(\d+[-]\d+[-]\d+)', html)
    diggt = re.findall(r'votes :(\d+) people', html)
    return zip(titles, times, diggt)

def save(data):
    with open("newdata.csv", "a+", encoding="utf-8-sig") as f:
        f.write(f"{data[0]},{data[1]},{data[2]}\n")

def get_page():
    global flag_page
    # flag_page is shared by all threads without a lock, so a page may
    # occasionally be fetched twice; good enough for this demo.
    while flag_page < 979:
        flag_page += 1
        url = f"http://www.wllxy.net/gxqmlist.aspx?p={flag_page}"
        print(f"Crawling {url}")
        r = requests.get(url=url, headers=headers)

        ok_data = anay(r.text)
        for data in ok_data:
            print(data)
            # Saving to a local file is left for you to complete
            # save(data)

if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_page)
        t.start()

Note that the zip function takes iterables as arguments and packs their corresponding elements into tuples. In Python 3, zip returns a zip object (an iterator); to display the result as a list, you have to convert it manually with list().

If the iterables have different numbers of elements, the result is as long as the shortest one.
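A quick illustration of both points:

a = [1, 2, 3]
b = ["x", "y"]          # shorter iterable

print(zip(a, b))        # a zip object (an iterator), not a list
print(list(zip(a, b)))  # [(1, 'x'), (2, 'y')] -- truncated to the shortest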

What remains is the data-saving part, the save function in the code above, which you can complete yourself.
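Since this article is about proxies, note that plugging one into the crawler above takes only one extra argument. A minimal sketch, assuming you have a working proxy (the address and the fetch helper below are placeholders for illustration):

import requests

proxies = {
    "http": "http://10.10.1.10:3128",  # placeholder; substitute a live proxy
}

def fetch(url, headers):
    # Same request as in get_page, but routed through the proxy.
    # A timeout keeps a dead proxy from hanging the thread.
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)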

Last but not least, a few words

This series of crawler tutorials will keep its focus on the Requests library, so that by the end you have a complete picture of it.