Modern professionals are supposed to be affectionate, interesting, useful, and tasteful. To become the workplace humor expert, you need material, and jokes are that material: the more jokes you read, the more jokes you can tell, including the advanced ones.
Analysis work before crawling
The target website for this crawl is: www.wllxy.net/gxqmlist.aspx
The crawl itself is not difficult, and the analysis step can largely be skipped. After all, having come this far in the series, you have already mastered 70~80% of Requests.
This article focuses on introducing you to proxies in Requests.
Time for some basic crawler knowledge
What is a proxy
A proxy obtains network information on behalf of the network user. In plain English, it hides the user's own IP and other network-related information from the target site.
Types of proxies
High-anonymity proxy: a high-anonymity proxy forwards the packet unchanged, so to the target website's server it looks like an ordinary real user is visiting, and the IP it sees is the proxy server's address. This perfectly hides the user's original IP, which is why high-anonymity proxies are the first choice for crawlers.
Ordinary anonymous proxy: an ordinary anonymous proxy makes some changes to the packet, adding fixed parameters to the HTTP headers. Because of those fixed parameters, the target server can detect that a proxy is in use, so anti-crawler websites can easily judge that the visitor is a crawler.
Transparent proxy: this one needs no elaboration; using it is about as good as using no proxy at all, since the target server can easily detect the user's real IP.
Proxies are also sometimes classified by protocol, HTTP versus HTTPS. Most websites have upgraded to HTTPS by now, but HTTP has not been abandoned and can generally still be crawled. Note that HTTPS requires multiple handshakes, which makes it slow, and it becomes slower still once a proxy is added. So whenever a site can be crawled over HTTP, crawl it over HTTP, including when you use a proxy.
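As a small illustration of "prefer HTTP when you can", here is a minimal sketch of my own (not from the original tutorial) that tries the plain-HTTP URL first and falls back to HTTPS only if that fails; the URL and the timeout value are placeholders:

import requests

url = "http://www.wllxy.net/gxqmlist.aspx"  # prefer the plain-HTTP endpoint when the site serves one

try:
    r = requests.get(url, timeout=5)
    r.raise_for_status()
except requests.RequestException:
    # fall back to HTTPS only if the HTTP endpoint is unavailable
    r = requests.get(url.replace("http://", "https://", 1), timeout=5)

print(r.status_code)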
Using proxies in Requests
Requests enables proxies through the proxies parameter, which you can pass to any request method.
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
Note that proxies is a dictionary parameter and can contain entries for both HTTP and HTTPS.
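One related detail worth knowing: if your proxy requires HTTP Basic Auth, Requests accepts credentials embedded in the proxy URL. A minimal sketch; user, password, and the addresses below are placeholders for your own proxy:

import requests

# user/password and the host:port pairs are placeholders
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)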
Note that Requests supports SOCKS proxies.
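SOCKS support is not installed by default; it comes from the requests[socks] extra. A minimal sketch, assuming a local SOCKS5 proxy listening on port 1080 (a placeholder):

# pip install requests[socks]
import requests

proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}
requests.get("http://example.org", proxies=proxies)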
Coding time
Now that the proxy background has been covered, let's move on to the actual coding.
import requests
import re
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}
flag_page = 0  # page counter shared by all crawler threads


# Regular-expression parsing; the three result lists are merged with the zip function
def anay(html):
    # The regex runs three times over the page; making this more efficient is left as an exercise
    pattern = re.compile(
        r'<td class="diggtdright">[.\s]*<a href=".*?" target="_blank">\s*(.*?)</a>')
    titles = pattern.findall(html)
    times = re.findall(r'date of issue :(\d+[-]\d+[-]\d+)', html)
    diggt = re.findall(r'votes :(\d+) people', html)
    return zip(titles, times, diggt)


def save(data):
    with open("newdata.csv", "a+", encoding="utf-8-sig") as f:
        f.write(f"{data[0]},{data[1]},{data[2]}\n")


def get_page():
    global flag_page
    while flag_page < 979:
        flag_page += 1
        url = f"http://www.wllxy.net/gxqmlist.aspx?p={flag_page}"
        print(f"Crawling {url}")
        r = requests.get(url=url, headers=headers)
        ok_data = anay(r.text)
        for data in ok_data:
            print(data)
            # Saving to disk is left for you to complete
            # save(data)


if __name__ == "__main__":
    # start five crawler threads
    for i in range(1, 6):
        t = threading.Thread(target=get_page)
        t.start()
Note that the zip function takes iterable objects as arguments, packs their corresponding elements into tuples, and returns a zip object; to display it as a list you need to convert it manually with list().
If the iterables differ in length, the result is as long as the shortest one.
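A quick demonstration of both points in the Python REPL:

>>> zip([1, 2, 3], ["a", "b"])        # without list(), you only see the zip object
<zip object at 0x...>
>>> list(zip([1, 2, 3], ["a", "b"]))  # zip stops at the shortest iterable
[(1, 'a'), (2, 'b')]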
The rest involves the data-saving part, the save function in the code above, which you can write yourself.
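A caveat about the code above: flag_page is shared by five threads, and flag_page += 1 is not an atomic operation, so two threads can occasionally read the same value and crawl the same page twice. Below is a minimal sketch of one way to guard the counter with threading.Lock; this is my own addition for illustration, not part of the original tutorial:

import threading

page_lock = threading.Lock()
flag_page = 0

def next_page():
    """Atomically increment and return the shared page counter."""
    global flag_page
    with page_lock:  # only one thread may update the counter at a time
        flag_page += 1
        return flag_page

Inside get_page you would then call next_page() once per loop instead of incrementing flag_page directly.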
Last but not least, a word or two
This series of crawler tutorials focuses on the Requests library; follow along and you will come away with a complete understanding of it.