One, what is a crawler

A crawler is a program that reads the content of a website. First, some background: the web pages we see are written in a language called HTML, which can display text in different styles. For example:

<p>hello</p>

is displayed as a paragraph containing the text: hello
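To make the relationship between a tag and its text concrete, here is a minimal sketch using Python's standard-library html.parser module. The class name TextCollector and the snippet fed to it are assumptions chosen for illustration; they do not come from any real page.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Record each opening tag name and the text found between tags."""

    def __init__(self):
        super().__init__()
        self.tags = []   # tag names, in the order they open
        self.texts = []  # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip fragments that contain only whitespace.
        if data.strip():
            self.texts.append(data.strip())


collector = TextCollector()
collector.feed("<p>hello</p>")  # hypothetical one-paragraph page
print(collector.tags)
print(collector.texts)
```

The parser reports the p tag and the text hello separately, which is the distinction a crawler relies on when it extracts content from markup.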

Two, how to get the content of the web page

To see a page's source code in a browser, right-click the page and choose the option to view the page source. How do you download that source code with Python? First install a module by typing the following in CMD:

pip install requests

You can then use the requests module to fetch the page:

import requests  # import the module

url = 'https://sina.com.cn'  # URL of the Sina page to crawl
html = requests.get(url)  # fetch the page; returns a Response object
print(html.text)  # .text is an attribute (not a function) holding the source code

Output:

Observant readers will notice that the Chinese text in this output is garbled, which points to an encoding problem.
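The effect can be reproduced without any network access. The sketch below assumes the page bytes are UTF-8 and that they were decoded as ISO-8859-1, which is the charset requests falls back to when a text response declares none in its headers. Whether that is exactly what happens for this site is an assumption, and the sample string is hypothetical.

```python
# Hypothetical sample text standing in for a Chinese headline.
original = "新浪新闻"

# What a server would send: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding with the wrong charset does not fail; it silently
# produces one wrong character per byte.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the bytes were written in recovers the text.
fixed = raw_bytes.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Because the wrong decoding raises no error, the only symptom is unreadable output, which is why the encoding has to be set explicitly.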

To fix it, set the response encoding to UTF-8, which handles Chinese text:

import requests

url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'  # decode the response body as UTF-8
print(html.text)

Output:
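The fetching steps in this section can be wrapped in one small function. This is only a sketch: fetch_html is a hypothetical helper name, and the timeout value and the choice to return None on failure are assumptions rather than anything the steps above require.

```python
import requests


def fetch_html(url, encoding="utf-8", timeout=10):
    """Return the page source as text, or None if the request fails.

    fetch_html is a hypothetical helper, not part of requests.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    response.encoding = encoding  # decode the body as UTF-8 by default
    return response.text


if __name__ == "__main__":
    page = fetch_html("https://sina.com.cn")
    if page is not None:
        print(page[:200])  # show only the first 200 characters
```

Passing a timeout keeps the script from hanging on a slow server, and catching RequestException covers connection errors, invalid URLs, and bad status codes in one place.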

Three, analyze the source code

Next, to filter the data we want out of the source code, we need the lxml module. Type in CMD:

pip install lxml

Then use lxml to filter the data:

import requests
from lxml import etree

url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'
element = etree.HTML(html.text)  # parse the source code into an element tree
result = element.xpath('//a/text()')  # select the text of every a tag

for i in result:
    print(i)  # output

Output:

The core statement is

result = element.xpath('//a/text()')

Here //a/text() selects the text content of every a tag.

The common XPath syntax is as follows:

nodename            Selects child elements of the current node named nodename
/                   Selects direct children of the current node
//                  Selects descendants of the current node at any depth
.                   Selects the current node
..                  Selects the parent of the current node
@                   Selects attributes
*                   Wildcard; selects every element node regardless of name
@*                  Selects all attributes
[@attrib]           Selects all elements that have the given attribute
[@attrib='value']   Selects all elements whose given attribute has the given value
[tag]               Selects all elements that have a direct child named tag
[tag='text']        Selects all elements that have a child named tag whose text content is text
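Several of these expressions can be tried on a small inline document instead of a live page. The HTML fragment below is hypothetical and exists only to exercise the syntax; the variable names are likewise assumptions.

```python
from lxml import etree

# Hypothetical page fragment, used only to exercise the expressions.
SAMPLE = """
<html><body>
  <ul class="news">
    <li><a href="/a.html">First story</a></li>
    <li><a href="/b.html">Second story</a></li>
  </ul>
  <p><a>No link target</a></p>
</body></html>
"""

root = etree.HTML(SAMPLE)

# //a/text(): the text of every a element in the document
all_text = root.xpath('//a/text()')
# //a/@href: the href attribute value, for a elements that have one
hrefs = root.xpath('//a/@href')
# [@attrib]: only a elements that carry an href attribute
linked_text = root.xpath('//a[@href]/text()')
# [@attrib="value"]: a elements under the ul whose class equals "news"
news_text = root.xpath('//ul[@class="news"]//a/text()')
# [tag="text"]: li elements with an a child whose text matches
second_items = root.xpath('//li[a="Second story"]')
# ..: step from each a element up to its parent
parent_tags = [parent.tag for parent in root.xpath('//a/..')]

for label, value in [
    ("all_text", all_text),
    ("hrefs", hrefs),
    ("linked_text", linked_text),
    ("news_text", news_text),
    ("second_items", len(second_items)),
    ("parent_tags", parent_tags),
]:
    print(label, value)
```

Working against a fixed string like this makes it easy to check what an expression selects before pointing it at a real page whose markup may change.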

Four, a filtering example

Suppose you want to read part of the news on sina.com.cn. Press F12 on the keyboard, then click the button in the upper left corner of the panel that opens.

Hover over a news item and click it to locate that item in the source panel.

Find the parent element of all the news items.

It is a ul whose class is list-a news_top.

import requests
from lxml import etree

url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'
element = etree.HTML(html.text)
result = element.xpath('//ul[@class="list-a news_top"]//a/text()')  # filter

for i in result:
    print(i)

Output:
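A common next step is to keep each headline together with its link. The sketch below runs against a hypothetical fragment that imitates the ul structure described in this section; the real page's markup may differ and can change at any time, and the example.com URLs are placeholders.

```python
from lxml import etree

# Hypothetical fragment imitating the structure described above.
SAMPLE = """
<ul class="list-a news_top">
  <li><a href="https://example.com/1">Story one</a></li>
  <li><a href="https://example.com/2">Story two</a></li>
</ul>
<ul class="list-a">
  <li><a href="https://example.com/3">Other list</a></li>
</ul>
"""

element = etree.HTML(SAMPLE)

# Select the a elements themselves rather than only their text.
links = element.xpath('//ul[@class="list-a news_top"]//a')

# Pair each headline with its link target.
stories = [(a.text, a.get("href")) for a in links]

for title, href in stories:
    print(title, href)
```

Note that @class="list-a news_top" compares the whole attribute string, so the second ul, whose class is only list-a, is not matched.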

For more Python learning notes, follow me for future updates.