1. pyquery
1.1 Introduction
If you are familiar with CSS selectors and jQuery, there is another parser library made for you: pyquery.
Official site: pythonhosted.org/pyquery/
1.2 Installation
pip install pyquery
1.3 Usage Mode
1.3.1 Initialization Mode
- string
from pyquery import PyQuery as pq
doc = pq('<html><title>hello</title></html>')
print(doc('title'))
- url
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('title'))
- file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('title'))
1.3.2 Selecting a Node
- Get the current node
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
doc('#main #top')
- Get child nodes
- Write it in doc layer by layer
- After the parent tag is obtained, use the children method
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
doc('#main #top').children()
- Get the parent node
- After getting the current node, use the parent method
- Get sibling nodes
- After getting the current node, use the siblings method
1.3.3 Obtaining Attributes
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
a = doc('#main #top')
print(a[0].attrib['href'])  # via the underlying lxml element
print(a.attr('href'))
1.3.4 Obtaining Content
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
div = doc('#main #top')
print(div.html())
print(div.text())
1.3.5 Sample
from pyquery import PyQuery as pq

# 1. Load an HTML string, an HTML file, or a URL
d = pq("<html><title>hello</title></html>")
d = pq(filename=path_to_html_file)
d = pq(url='http://www.baidu.com')  # note: the URL apparently has to be written in full

# 2. html() and text() -- get the corresponding HTML block or text block
p = pq("<head><title>hello</title></head>")
p('head').html()  # returns <title>hello</title>
p('head').text()  # returns hello

# 3. Get elements by HTML tag
d = pq('<div><p>test 1</p><p>test 2</p></div>')
d('p')  # returns [<p>, <p>]
print(d('p'))  # returns <p>test 1</p><p>test 2</p>
print(d('p').html())  # returns test 1
# note: when more than one element matches, html() returns only the first element's content

# 4. eq(index) -- get an element by its index
print(d('p').eq(1).html())  # returns test 2

# 5. filter() -- filter elements with a selector
d = pq("<div><p id='1'>test 1</p><p class='2'>test 2</p></div>")
d('p').filter('#1')  # returns [<p#1>]
d('p').filter('.2')  # returns [<p.2>]

# 6. find() -- search descendants
d('div').find('p')  # returns [<p#1>, <p.2>]
d('div').find('p').eq(0)  # returns [<p#1>]

# 7. Get elements directly by id or class name
d('#1').html()  # returns test 1
d('.2').html()  # returns test 2

# 8. Get an attribute value
d = pq("<p id='my_id'><a href='http://hello.com'>hello</a></p>")
d('a').attr('href')  # returns http://hello.com
d('p').attr('id')  # returns my_id

# 9. Set an attribute value
d('a').attr('href', 'http://baidu.com')  # changes the href to baidu

# 10. addClass(value) -- add a class
d = pq('<div></div>')
d.addClass('my_class')  # returns [<div.my_class>]

# 11. hasClass(name) -- whether the element has the given class
d = pq("<div class='my_class'></div>")
d.hasClass('my_class')  # returns True

# 12. children(selector=None) -- get child elements
d = pq("<span><p id='1'>hello</p><p id='2'>world</p></span>")
d.children()  # returns [<p#1>, <p#2>]
d.children('#2')  # returns [<p#2>]

# 13. parents(selector=None) -- get ancestor elements
d('p').parents()  # returns [<span>]
d('#1').parents('span')  # returns [<span>]
d('#1').parents('p')  # returns []

# 14. clone() -- returns a copy of the node
# 15. empty() -- removes the node's content

# 16. nextAll(selector=None) -- returns all following siblings
d = pq("<p id='1'>hello</p><p id='2'>world</p><img src='' />")
d('p:first').nextAll()  # returns [<p#2>, <img>]
d('p:last').nextAll()  # returns [<img>]

# 17. not_(selector) -- returns elements that do not match the selector
d = pq("<p id='1'>test 1</p><p id='2'>test 2</p>")
d('p').not_('#2')  # returns [<p#1>]
Use of multithreading
1. Introduction
All the crawlers we have written so far are single threaded. That is hardly enough: once one request gets stuck, the whole crawler sits there waiting. To avoid this we can use multiple threads or multiple processes.
Multithreading is not particularly recommended here, but it is worth knowing about; if you are not interested, feel free to skip this section rather than waste time on it.
2. How to use it
A crawler uses multiple threads to handle network requests: worker threads take URLs from a URL queue and fetch them, the responses are stored in a second queue, and other threads read from that result queue and write the data to files.
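A minimal runnable sketch of this pipeline. The `fetch()` function here is a stand-in for a real network request (e.g. `requests.get`), and the URLs are made up, so the example runs offline:

```python
import threading
from queue import Queue

def fetch(url):
    # Stand-in for a real network request, so the sketch runs offline.
    return f"<html>content of {url}</html>"

def crawl_worker(url_queue, out_queue):
    # Worker: take URLs off the queue until a sentinel arrives.
    while True:
        url = url_queue.get()
        if url is None:            # sentinel: no more work
            url_queue.task_done()
            break
        out_queue.put(fetch(url))  # store the response in the result queue
        url_queue.task_done()

url_queue = Queue()
out_queue = Queue()
for i in range(5):
    url_queue.put(f'http://example.com/page{i}')

workers = [threading.Thread(target=crawl_worker, args=(url_queue, out_queue))
           for _ in range(2)]
for w in workers:
    w.start()
for _ in workers:
    url_queue.put(None)            # one sentinel per worker
for w in workers:
    w.join()

pages = [out_queue.get() for _ in range(out_queue.qsize())]
print(len(pages))  # 5 pages fetched
```

In a real crawler another thread would drain `out_queue` and write the pages to files instead of collecting them into a list.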
3. Main components
3.1 URL Queue and result queue
Place the URLs to be crawled in a Queue, using the standard library queue module. The results of visiting those URLs are stored in a result queue.
Initialize a URL queue
from queue import Queue
urls_queue = Queue()
out_queue = Queue()
3.2 class wrapping
Use multiple threads to fetch the URL from the URL queue and process it:
import threading

class ThreadCrawl(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.queue.get()
            # ... process item and put the result on the output queue ...
            self.out_queue.put(item)
            self.queue.task_done()
If the queue is empty, get() blocks the thread until the queue is no longer empty. Once an item taken from the queue has been processed, the queue needs to be notified, via task_done(), that the processing is finished.
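A small sketch of this get()/task_done() handshake (the worker function and the numbers are made up for the example):

```python
import threading
from queue import Queue

q = Queue()
processed = []

def worker():
    while True:
        item = q.get()             # blocks while the queue is empty
        processed.append(item * 2)
        q.task_done()              # tell the queue this item is finished

t = threading.Thread(target=worker, daemon=True)
t.start()

for n in range(3):
    q.put(n)
q.join()                           # returns once task_done() was called for every item
print(sorted(processed))           # [0, 2, 4]
```

Without the task_done() call, q.join() would block forever even after every item had been consumed.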
3.3 Functional Packaging
from queue import Queue
from threading import Thread

def func(info_html):
    pass

if __name__ == '__main__':
    info_html = Queue()
    t1 = Thread(target=func, args=(info_html,))
    t1.start()
3.4 the thread pool
import threading
import time
import queue

class Threadingpool():
    def __init__(self, max_num=10):
        self.queue = queue.Queue(max_num)
        for i in range(max_num):
            self.queue.put(threading.Thread)

    def getthreading(self):
        return self.queue.get()

    def addthreading(self):
        self.queue.put(threading.Thread)

def func(p, i):
    time.sleep(1)
    print(i)
    p.addthreading()

if __name__ == "__main__":
    p = Threadingpool()
    for i in range(20):
        thread = p.getthreading()
        t = thread(target=func, args=(p, i))
        t.start()
4. Common methods in the queue module
Python's queue module provides synchronized, thread-safe queue classes: the FIFO (first in, first out) Queue, the LIFO (last in, first out) LifoQueue, and PriorityQueue. These queues implement locking primitives and can be used directly across multiple threads; they can also be used for synchronization between threads.
- Queue.qsize() returns the approximate size of the queue
- Queue.empty() returns True if the queue is empty, False otherwise
- Queue.full() returns True if the queue is full, False otherwise
- Queue.full() corresponds to the maxsize given when the queue was created
- Queue.get([block[, timeout]]) removes and returns an item; timeout is how long to wait
- Queue.get_nowait() is equivalent to Queue.get(False)
- Queue.put(item[, block[, timeout]]) writes an item to the queue; timeout is how long to wait
- Queue.put_nowait(item) is equivalent to Queue.put(item, False)
- Queue.task_done() is called after a task has been completed; it sends a signal to the queue that the task is finished
- Queue.join() blocks until all items in the queue have been processed, i.e. task_done() has been called for every item that was put
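The three queue classes mentioned above differ only in the order get() returns items, which a short comparison makes concrete:

```python
from queue import Queue, LifoQueue, PriorityQueue

# Put the same items into each kind of queue and compare retrieval order.
fifo, lifo, prio = Queue(), LifoQueue(), PriorityQueue()
for q in (fifo, lifo, prio):
    for n in (3, 1, 2):
        q.put(n)

fifo_order = [fifo.get() for _ in range(3)]
lifo_order = [lifo.get() for _ in range(3)]
prio_order = [prio.get() for _ in range(3)]
print(fifo_order)  # [3, 1, 2] -- first in, first out
print(lifo_order)  # [2, 1, 3] -- last in, first out
print(prio_order)  # [1, 2, 3] -- smallest item first
```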