What is pyquery

pyquery is a jQuery-like web page parsing library that lets you traverse XML and HTML documents in jQuery style. It uses lxml to manipulate the documents. Compared with the XPath and Beautiful Soup parsing libraries covered earlier, it is more flexible and simpler to use, and it adds the ability to add classes and remove nodes, which can sometimes make it easier to extract information.

Using pyquery

If you are familiar with web development and prefer CSS selectors, there is a parsing library made for you: pyquery.

The preparatory work

Make sure the pyquery library is installed before using it. It can be installed with pip:

pip install pyquery

Initialization

As with Beautiful Soup, you initialize pyquery by passing in HTML text, which produces a PyQuery object.

There are generally three ways to initialize it: from a string, from a URL, or from an HTML file.

  • String initialization
html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))

Here is a brief description of the above code:

We first import the PyQuery class under the alias pq. We then declare a long HTML string and pass it as an argument to pq(), which initializes a PyQuery object.

Next, we call the initialized object with a CSS selector as its argument. In this case we pass in li, which selects all of the li nodes.

  • URL initialization

The argument does not have to be a string of HTML. It can also be the URL of a web page, passed to the initializer in the same way.

The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq('https://www.baidu.com', encoding='utf-8')
print(doc)
print(type(doc))
print(doc('title'))

Run the above code and you will find that we successfully obtained the page source and the title node of Baidu.

The PyQuery object requests this URL and then initializes it with the resulting HTML content, which is essentially the equivalent of passing the web source code to the initializer as a string.

Therefore, you can also write code like this:

from pyquery import PyQuery as pq
import requests


url = 'https://www.baidu.com'
doc = pq(requests.get(url).content.decode('utf-8'))
print(doc)
print(type(doc))
print(doc('title'))

The result is the same as the result of the above code.

  • File initialization

Besides a URL, you can also pass the name of a local file. In this case, pass the file name through the filename keyword argument.

The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq(filename='baidu.html')
print(doc)
print(type(doc))
print(doc('title'))

All three ways of initializing are valid, but passing a string is the most common.

Basic CSS selector

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))


After initializing the PyQuery object, pass in the CSS selector #container .list li to select every li node inside a node with class list that is itself inside the node with id container. Running the above code shows that the result is still of type PyQuery.

Find nodes

Here are some common query methods. They are used in the same way as the corresponding jQuery functions.

  • Child nodes

The find() method finds the descendant nodes that match the CSS selector passed to it. We use the same HTML as in the previous example.

from pyquery import PyQuery as pq


doc = pq(html)
print(doc.find('li'))
print(type(doc.find('li')))

Call the find() method and pass in the node name li to get everything that meets the criteria. The type is still PyQuery.

Of course we can also write:

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list')
print(type(items))
lis = items.find('li')
print(type(lis))
print(lis)

First, select the node whose class is list, then call find() with a CSS selector to select the li nodes inside it, and finally print them out.

The find() method searches all descendants, not only direct children. To get only the direct children, call the children() method. The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list')
lis = items.children()
print(lis)
print(type(lis))

If you want to keep only the child nodes that meet some condition, you can pass a CSS selector to the children() method. The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list')
lis = items.children('.active')
print(lis)
print(type(lis))

Try running the code above and you will see that you have successfully obtained the node whose class is active.

  • The parent node

We can call the parent() method to get the parent of a node.

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)
print(type(container))

First, a brief explanation of the above code:

We first select the node whose class is list, and then call the parent() method to get its parent, whose type is still PyQuery.

The parent here is the direct parent. If you want the ancestor nodes further up (the grandparent and beyond), call the parents() method.

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list')
container = items.parents()
print(container)
print(type(container))

Running this returns container, wrap, body, and html, which are all ancestors of the list node. The body and html nodes are added when the object is initialized.

  • Sibling nodes

In addition to getting the parent and child nodes, you can also get the sibling nodes. You can call the siblings() method if you want to get the siblings.

The specific code is as follows:

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list .item-0.active')
print(items.siblings())

Select the .item-0.active node and call the siblings() method to get the siblings of the node.

Try running the code above and you’ll see that you get the other four siblings.

Traversal

As you can see from the code above, a pyquery selection may contain several nodes or a single node, and in both cases its type is PyQuery. It does not return a list the way Beautiful Soup does.

For a single node, it can be printed directly or converted to a string.

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('.list .item-0.active')
print(items)
print(str(items))
print(type(items))

For multiple nodes, you can call the items() method, which returns a generator, and then output each node by traversing it.

The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq(html)
lis = doc('li').items()
print(lis)
for li in lis:
    print(li, type(li))

Running the code above shows that the variable lis is a generator, so you can iterate over it to output each node.

Extracting information

Generally speaking, there are two kinds of information we need to extract from web pages: text content and node attribute values.

  • Retrieve attributes

Once a node of type PyQuery is retrieved, the attr() method is used to retrieve its attributes.

The specific code is as follows:

from pyquery import PyQuery as pq


doc = pq(html)
a = doc('.list .item-0.active a')
print(a.attr('href'))

Here we first select the a node inside the node under list whose classes are item-0 and active. The variable a is of type PyQuery. We then call attr() with the attribute name href to get the attribute value.

You can also get attributes through the attr attribute.

print(a.attr.href)

You’ll find the output is the same as the code above.

Of course, we can also obtain the attributes of all a nodes. The specific code is as follows:

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
a = doc('a').items()
for item in a:
    print(item.attr('href'))

But if the code is written like this:

from pyquery import PyQuery as pq


doc = pq(html)
a = doc('a')
print(a.attr('href'))

After you run the code above, you’ll see that you only get the href attribute for the first a node.

This is something to watch out for: to get the attribute of every node, you need to traverse them.

  • Text extraction

The logic for extracting text is the same as for extracting attributes: first retrieve a node of type PyQuery, and then call the text() method to get its text.

Start by getting the text content of a single node. The specific code is as follows:

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
a = doc('.list .item-0.active a')
print(a.text())

Run the above code and you will see that the text content of the a node has been successfully retrieved.

Next, we get the text content of multiple li nodes.

The specific code is as follows:

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('li')
print(items.text())

Run the above code, and you will see that it returns the text of all li nodes as a single string, separated by spaces.

If you want to get them one by one, you still need a generator, which looks like this:

from pyquery import PyQuery as pq


doc = pq(html)
items = doc('li').items()
for item in items:
    print(item.text())

Node operations

pyquery provides a series of methods for dynamically modifying nodes, such as adding a class to a node or removing a node, which can sometimes make it easier to extract information.

  • add_class and remove_class
html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
li = doc('.list .item-0.active')
print(li)
li.remove_class('active')
print(li)
li.add_class('active')
print(li)

The running result is as follows:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

The first step selects an li node. The second step removes its active class, and the third step adds the active class back.

Pseudo-class selectors

Another important reason CSS selectors are powerful is that they can support a wide variety of pseudo-class selectors, such as first-node selection, last-node selection, odd and even nodes, and nodes containing a certain text.

html = ''' 
       '''

from pyquery import PyQuery as pq


doc = pq(html)
li = doc('li:first-child')	# The first li node
print(li)
li = doc('li:last-child')	# The last li node
print(li)
li = doc('li:nth-child(2)')	# the li node in the second position
print(li)
li = doc('li:gt(2)')	# after the third li node
print(li)
li = doc('li:nth-child(2n)')	# Even-numbered li nodes
print(li)
li = doc('li:contains(second)')	# The li node containing the second text
print(li)

Now that we have covered the basics of pyquery, it is time for a practical project where you can apply what you have just learned.

A practical example

This time, the hands-on project is to scrape the Maoyan Movies TOP100 list together with the score of each movie.

Preparation

As the saying goes, to do a good job, one must first sharpen one's tools. First, we need to prepare two libraries: pyquery and requests.

The installation process is as follows:

pip install pyquery
pip install requests

Preface

Winter vacation is coming again. How are you going to spend it?

In winter, nothing is more comfortable than hiding under the quilt and watching shows. I really miss those days.

So today we will scrape the Maoyan TOP100 movies and get ready for hibernation.

Website links:

https://maoyan.com/board/4

Requirements analysis and implementation

Get the movie name

From the above figure, we can see that the information we need sits in the a tag under the div tag whose class is board-item-main, so we need to get the text of that tag.

The core code is as follows:

movie_name = doc('.board-item-main .board-item-content .movie-item-info p a').text()

Get the leading actors

As can be seen from the figure above, the leading actor information is located in a p tag with class star, a child of the movie-item-info node under board-item-main, so we can obtain it in this way.

The core code is as follows:

p = doc('.board-item-main .board-item-content .movie-item-info')
star = p.children('.star').text()

Get the release time

As can be seen from the previous figure, the release time node and the leading actor node are siblings, so we can write the code like this.

p = doc('.board-item-main .board-item-content .movie-item-info')
time = p.children('.releasetime').text()

Get the score

Getting the score of each movie is a bit more complicated. Why? Look at the figure below.

As you can see from the figure above, the integer part and the fractional part of the score are stored in two separate nodes. Therefore, we need to get the two parts separately and then join them.

The core code is as follows:

score1 = doc('.board-item-main .movie-item-number.score-num .integer').text().split()
score2 = doc('.board-item-main .movie-item-number.score-num .fraction').text().split()
score = [score1[i] + score2[i] for i in range(0, len(score1))]
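The joining step can also be written with zip(), which pairs the two lists without index arithmetic. The input strings below are assumptions about what the two text() calls return: the integer part is assumed to carry the decimal point, and the values are placeholders.

```python
# Assumed results of the two text() calls for three movies (placeholder values).
integer_text = '9. 9. 8.'
fraction_text = '5 3 9'

integers = integer_text.split()
fractions = fraction_text.split()

# Guard against a page where one of the two parts is missing for some movie.
if len(integers) != len(fractions):
    raise ValueError('integer and fraction lists differ in length')

scores = [whole + part for whole, part in zip(integers, fractions)]
print(scores)  # ['9.5', '9.3', '8.9']
```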

Pagination

When you open the web page, you will find that the list has 10 pages, and each page has a different URL. What should we do? We cannot change the URL manually every time.

Take a look at the URLs of the first four pages.

https://maoyan.com/board/4            # page 1
https://maoyan.com/board/4?offset=10  # page 2
https://maoyan.com/board/4?offset=20  # page 3
https://maoyan.com/board/4?offset=30  # page 4

The pattern is easy to see: the offset parameter increases by 10 with each page.

Now we can build the URL for each page as follows:

class MaoYan:
    def get_url(self, page):
        # page is the offset value: 0, 10, 20, ...
        url = f'https://maoyan.com/board/4?offset={page}'
        return url


if __name__ == '__main__':
    maoyan = MaoYan()
    for page in range(10):
        url = maoyan.get_url(page * 10)
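The same URL pattern can be checked without a class. The sketch below builds all ten URLs up front; the names BASE_URL, PAGE_SIZE, and build_page_urls are hypothetical and chosen only for this illustration.

```python
BASE_URL = 'https://maoyan.com/board/4'
PAGE_SIZE = 10  # the offset grows by 10 per page, as observed above


def build_page_urls(pages=10):
    """Return the list URL of every page (hypothetical helper)."""
    return [f'{BASE_URL}?offset={page * PAGE_SIZE}' for page in range(pages)]


urls = build_page_urls()
print(urls[0])    # https://maoyan.com/board/4?offset=0
print(urls[-1])   # https://maoyan.com/board/4?offset=90
```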

Results

Closing words

A word from Shujun:

I typed out every word of this article with care, and I only hope to live up to everyone who follows me. Please give the post a like at the end so that I know you, too, are working hard at your own learning.

That is the end of this post. If you have read from the beginning to this point, it has probably been helpful to you, which is exactly why I share this knowledge.

Nothing can be achieved overnight; this is true of life and of learning!

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am Shujun, a person who concentrates on learning. The more you know, the more you realize you do not know. See you next time!