One, what is a crawler
A crawler is a program that reads the content of a website. First, a bit of background: the web pages we see are written in a language called HTML, which can display text in different styles. For example:
<p>hello</p>
displays the paragraph: hello
Two, how to get the content of the web page
To view a page's source code in a browser, right-click the page and choose "View page source". How do you download that source code with Python? First install the requests module by typing the following at the command prompt (CMD):
pip install requests
You can then use the requests module to fetch the page:
import requests  # import the module
url = 'https://sina.com.cn'  # the URL to crawl (Sina)
html = requests.get(url)  # fetch the page; returns a Response object
print(html.text)  # note: the text attribute holds the page source
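The fetch above can be made a little more defensive. The following is a minimal sketch rather than the article's own code: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration. It relies only on requests.get, Response.raise_for_status, and Response.text.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its source as a string.

    fetch_html and the default timeout are illustrative choices,
    not something the requests library prescribes.
    """
    # Without a timeout, requests.get can wait indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx replies instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://sina.com.cn') should return the same text as html.text in the snippet above, provided the request succeeds.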
In the output, observant readers will notice garbled characters: the Chinese text was decoded with the wrong encoding.
To fix this, set the response encoding to UTF-8:
import requests
url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'  # set the encoding to UTF-8 so the Chinese text decodes correctly
print(html.text)
The Chinese text in the output now displays correctly.
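To see why setting the encoding matters, the effect can be reproduced offline. This is a small sketch under one assumption: that the garbling comes from UTF-8 bytes being decoded as ISO-8859-1, which is the fallback requests uses for text responses whose headers do not declare a charset. The sample string is made up for illustration.

```python
# The same bytes decoded with two different encodings.
text = "新浪首页"                   # illustrative sample text, not taken from the real page
raw = text.encode("utf-8")          # the bytes a UTF-8 page would send over the network

wrong = raw.decode("iso-8859-1")    # decoding with the wrong encoding gives garbled text
right = raw.decode("utf-8")         # decoding as UTF-8 recovers the original

print(wrong)
print(right)
```

Setting html.encoding = 'utf-8' before reading html.text corresponds to the second decode.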
Three, analyze the source code
Finally, to extract the data we want from the source code, we need the lxml module. Install it by typing the following at the command prompt (CMD):
pip install lxml
Then use lxml to filter the data:
import requests
from lxml import etree
url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'
element = etree.HTML(html.text)  # parse the HTML into an element tree
result = element.xpath('//a/text()')  # filter with an XPath expression
for i in result:
    print(i)  # output each match
This prints the text of every link on the page, one per line.
The core statement is:
result = element.xpath('//a/text()')
Here //a/text() selects the text content of every a tag.
Common XPath syntax is as follows:
Expression | Description |
---|---|
nodename | Selects the child elements of the current node that have this name |
/ | Selects direct child nodes from the current node |
// | Selects descendant nodes from the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
* | Wildcard; selects all element nodes regardless of element name |
@* | Selects all attributes |
[@attrib] | Selects all elements that have the given attribute |
[@attrib='value'] | Selects all elements whose given attribute has the given value |
[tag] | Selects all elements that have a direct child element named tag |
[tag='text'] | Selects all elements that have a direct child element named tag whose text content is text |
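Several of these expressions can be tried on a small inline document without touching the network. The markup below is a hypothetical sample invented for this sketch, and the variable names are arbitrary. Only etree.HTML and the xpath method already used above are involved.

```python
from lxml import etree

# Hypothetical sample markup, made up for this illustration.
sample = """
<html><body>
  <div id="menu">
    <a href="/news">News</a>
    <a href="/sports" class="hot">Sports</a>
  </div>
  <p>No links here</p>
</body></html>
"""

root = etree.HTML(sample)

texts = root.xpath('//a/text()')                     # text of every a tag
hrefs = root.xpath('//a/@href')                      # href attribute of every a tag
with_class = root.xpath('//a[@class]/text()')        # a tags that have a class attribute
hot_href = root.xpath('//a[@class="hot"]/@href')     # a tags whose class equals "hot"
parent_tag = root.xpath('//a[@class="hot"]/..')[0].tag    # parent of that a tag
child_tags = [e.tag for e in root.xpath('//div/*')]       # all element children of div

print(texts)
print(hrefs)
print(with_class)
print(hot_href)
print(parent_tag)
print(child_tags)
```

Each xpath call returns a list, which is why the parent lookup indexes the first result before reading its tag.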
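Four, a filtering example
Suppose you want to read part of the news on sina.com.cn. Press F12 on the keyboard to open the browser's developer tools.
Click the element-picker button in the upper left corner of the developer tools panel.
Hover over a news item and click it to locate that item in the source panel.
Find the parent element that contains all the news items.
Here it is a ul element whose class is list-a news_top.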
import requests
from lxml import etree
url = 'https://sina.com.cn'
html = requests.get(url)
html.encoding = 'utf-8'
element = etree.HTML(html.text)
result = element.xpath('//ul[@class="list-a news_top"]//a/text()') # filter
for i in result:
    print(i)
This prints only the link text found inside that list.
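The same filter can also return each headline together with its link. The sketch below runs on hypothetical markup that only imitates the structure described above. The real page will differ and may change over time, and the example.com URLs are placeholders.

```python
from lxml import etree

# Hypothetical markup imitating the structure described above; the real page differs.
sample = """
<ul class="list-a news_top">
  <li><a href="https://example.com/1">First headline</a></li>
  <li><a href="https://example.com/2">Second headline</a></li>
</ul>
<ul class="list-b">
  <li><a href="https://example.com/3">Unrelated link</a></li>
</ul>
"""

root = etree.HTML(sample)

# Select the a elements themselves instead of their text, so both the
# text and the href attribute can be read from each one.
links = root.xpath('//ul[@class="list-a news_top"]//a')
news = [(a.text, a.get('href')) for a in links]

for title, href in news:
    print(title, href)
```

Note that @class="list-a news_top" compares the whole attribute string, so the match would fail if the site reordered or added class names.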
For more on learning Python, follow me for updates.