Preface
I believe every crawler enthusiast has had this experience: you spend half a day writing a crawler, finally get it to run, and then it turns out to be slow. I certainly have. We are in the era of big data now, and merely being able to write a crawler is not competitive; learning to optimize your crawler code, whether for robustness or for crawl speed, is what sets you apart. If this post helps you, don't forget to ==give it a like, a bookmark, and a follow== (^▽^)!
This article uses the Qiushi Encyclopedia (qiushibaike.com) as an example to compare the ==running time== of an ordinary crawler with that of a multithreaded one. I believe you will come away appreciating how powerful multithreading is!!
If you're not familiar with XPath, check out this blog post first:
Introduction to Xpath
After reading it, you can move on to a hands-on multithreading project; see my follow-up post:
Hands-on: multithreaded crawling and classification of emoji packs
Enough talk, let's get to it!!
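Before diving in, here is a minimal, self-contained sketch (my addition, using a made-up HTML snippet) of the two XPath calls both crawlers below rely on: selecting nodes by their class value, and extracting all text under a node with `string(.)`.

```python
from lxml import etree

# a tiny made-up snippet shaped like the joke list we crawl below
html = """
<div class="article">
  <div class="content"><span>first joke text</span></div>
  <div class="content"><span>second joke text</span></div>
</div>
"""

e = etree.HTML(html)
# take the first <span> inside every <div class="content">
for span in e.xpath("//div[@class='content']/span[1]"):
    # string(.) concatenates all text beneath the node
    print(span.xpath("string(.)"))
```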
1. Ordinary crawler
import requests
from lxml import etree
import time

# request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36"
}

# crawl one page and append the extracted text to a file
def Crawl(response):
    e = etree.HTML(response.text)
    # select nodes by their class value
    span_text = e.xpath("//div[@class='content']/span[1]")
    with open("duanzi.txt", "a", encoding="utf-8") as f:
        for span in span_text:
            # string(.) concatenates all text under the node
            info = span.xpath("string(.)")
            f.write(info)

# main
if __name__ == '__main__':
    # write down the start time
    start = time.time()
    base_url = "https://www.qiushibaike.com/text/page/{}"
    for i in range(1, 14):
        # print the page currently being crawled
        print("Crawling page {}".format(i))
        new_url = base_url.format(i)
        # send a GET request
        response = requests.get(new_url, headers=headers)
        Crawl(response)
    # write down the end time
    end = time.time()
    # run time = end - start
    print(end - start)
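One easy speed-up that doesn't need threads at all (my addition, not part of the original code): all 13 requests hit the same host, so a `requests.Session` reuses one keep-alive TCP connection instead of opening a new one per page. A minimal sketch, assuming the `headers`, `base_url`, and `Crawl` function defined above:

```python
import requests

session = requests.Session()
session.headers.update(headers)

for i in range(1, 14):
    # the keep-alive connection is reused across iterations
    response = session.get(base_url.format(i))
    Crawl(response)
```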
2. Multithreaded crawler
import requests
from lxml import etree
# Queue is a thread-safe first-in, first-out (FIFO) queue
from queue import Queue
from threading import Thread
import time

# request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36"
}

# data acquisition: inherits from Thread
class CrawlInfo(Thread):
    # override __init__
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    # override the run method
    def run(self):
        while not self.url_queue.empty():
            url = self.url_queue.get()
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                # put the page source into the queue
                self.html_queue.put(response.text)

# data parsing and saving: inherits from Thread
class ParseInfo(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        # keep going as long as the queue is not empty
        while not self.html_queue.empty():
            # take one page source out of the queue
            e = etree.HTML(self.html_queue.get())
            span_text = e.xpath("//div[@class='content']/span[1]")
            with open("duanzi.txt", "a", encoding="utf-8") as f:
                for span in span_text:
                    info = span.xpath("string(.)")
                    f.write(info)

# start
if __name__ == '__main__':
    start = time.time()
    # instantiate the queues
    url_queue = Queue()
    html_queue = Queue()
    base_url = "https://www.qiushibaike.com/text/page/{}"
    for i in range(1, 14):
        print("Crawling page {}".format(i))
        new_url = base_url.format(i)
        url_queue.put(new_url)

    crawl_list = []
    for i in range(3):
        crawl = CrawlInfo(url_queue, html_queue)
        crawl_list.append(crawl)
        crawl.start()
    for crawl in crawl_list:
        # join() blocks until the thread has finished
        crawl.join()

    parse_list = []
    for i in range(3):
        parse = ParseInfo(html_queue)
        parse_list.append(parse)
        parse.start()
    for parse in parse_list:
        parse.join()

    end = time.time()
    print(end - start)
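One design point worth spelling out: the parse threads are only started after every crawl thread has been join()ed, which guarantees html_queue is fully populated before parsing begins; if both kinds of threads ran at once, a parse thread could find the queue momentarily empty and exit early.

As a side note (my addition, not from the original post), the standard library's `concurrent.futures` expresses the same fan-out at a higher level. A minimal sketch that fetches and parses each page in a pool of three worker threads:

```python
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

headers = {"User-Agent": "Mozilla/5.0"}  # same headers as above
base_url = "https://www.qiushibaike.com/text/page/{}"

def fetch_and_parse(url):
    # fetch one page and return its extracted text
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return ""
    e = etree.HTML(response.text)
    spans = e.xpath("//div[@class='content']/span[1]")
    return "".join(span.xpath("string(.)") for span in spans)

urls = [base_url.format(i) for i in range(1, 14)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs the calls concurrently but yields results in input order
    results = list(pool.map(fetch_and_parse, urls))

with open("duanzi.txt", "a", encoding="utf-8") as f:
    for text in results:
        f.write(text)
```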
3. Run-time comparison
==Ordinary crawler== (run-time screenshot not shown)
==Multithreaded crawler== (run-time screenshot not shown)
Some of you may feel a few seconds' difference is no big deal. But again, this is the ==era of big data==, where datasets of ==millions or tens of millions== of records are commonplace. If you can still only write an ordinary crawler, you won't win on ==speed== and you won't win on ==technique== either, and that is a sad place to be!!
Finally
I am ==Code Pipi Shrimp==, a coder who loves sharing knowledge. I will keep publishing useful posts, and I look forward to your follow!!
Creating content is not easy. If this post helped you, I hope you will ==give it a like, a bookmark, and a follow==! Thank you for your support, and see you next time!
Sharing outline
Big-company interview questions column
Python Crawler Column
The source code for this article is available on GitHub at github.com/2335119327/…, where it has already been included (the repo contains more than just this blog's crawler content, so interested readers can take a look). It will be updated continuously; Stars are welcome!