The text and images in this article come from the internet and are shared for learning and exchange only, with no commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us.
The following article is from Tencent Cloud; author: user 7678152.
Preface
Python is ideal for developing web crawlers for the following reasons:
1. Python's interface for fetching web pages is more concise than that of static languages such as Java, C#, or C++, and compared with other dynamic scripting languages such as Perl or shell, Python's urllib package provides a fairly complete API for accessing web documents. (Ruby is also a good choice.) In addition, crawling sometimes requires emulating browser behavior, because many sites block poorly behaved crawlers. This is where we need to fake a User-Agent and construct appropriate requests, for example to simulate user login or session/cookie storage and setting. Python has excellent third-party packages for this, such as Requests and mechanize.
2. Pages usually need post-processing after they are crawled, such as filtering HTML tags and extracting text. Python's BeautifulSoup provides concise document processing and can handle most of this in very little code. Many languages and tools can do this, but Python does it quickly and cleanly.
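For instance, here is a minimal sketch of what such third-party help looks like, using the Requests package mentioned above (the URL and header value are only illustrative):

import requests

# A session keeps cookies across requests, which helps when simulating a login flow.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # pretend to be a browser
})
response = session.get('http://www.baidu.com')
print(response.status_code)
print(response.text[:200])  # first 200 characters of the page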
Life is short, you need Python.
PS: Python 2.x and Python 3.x are quite different; this article only discusses the crawler implementation for Python 3.x.
The crawler architecture
(Architecture diagram: URL manager → web page downloader → web page parser)
URL manager: manages the set of URLs to be crawled and the set of URLs that have been crawled, and passes URLs to be crawled to the web page downloader.
Web page downloader (urllib): downloads the web page corresponding to a URL, stores it as a string, and passes it to the web page parser.
Web page parser (BeautifulSoup): parses out valuable data, stores it, and adds new URLs to the URL manager.
Run flow
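As a rough, runnable sketch of that flow (the seed URL and the page limit are illustrative, and error handling is omitted), the three components loop like this:

import urllib.request
from bs4 import BeautifulSoup

new_urls, old_urls = {'http://www.baidu.com'}, set()    # URL manager: two sets
while new_urls and len(old_urls) < 10:                   # stop after 10 pages in this sketch
    url = new_urls.pop()                                 # hand out a URL to crawl
    old_urls.add(url)
    html = urllib.request.urlopen(url).read()            # downloader: fetch the page
    soup = BeautifulSoup(html, 'html.parser')            # parser: extract data and new URLs
    print(soup.title)
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('http') and href not in old_urls:
            new_urls.add(href)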
URL manager
Basic functions (a minimal sketch follows this list):
- Adds a new URL to the set of URLs to be crawled.
- Determines whether a URL to be added is already in the container (either the set of URLs to be crawled or the set of URLs that have been crawled).
- Gets a URL to be crawled.
- Determines whether there are still URLs waiting to be crawled.
- Moves a URL that has just been crawled from the set of URLs to be crawled to the set of URLs that have been crawled.
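A minimal in-memory sketch of such a manager, built on two set() containers (the class and method names are illustrative, not from the original):

# Minimal in-memory URL manager sketch.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # move the URL from the to-be-crawled set to the crawled set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url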
Storage methods
1. Memory (Python memory): set of URLs to be crawled: set(); set of URLs that have been crawled: set()
2. Relational database: store URLs in a table with a flag marking whether each one has been crawled (used when permanent storage is needed).
3. Cache (Redis): set of URLs to be crawled: set; set of URLs that have been crawled: set
Large Internet companies usually store URLs in cache databases because of their high performance. Small companies typically store URLs in memory or, if permanent storage is needed, in a relational database.
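For the cache option, here is a hedged sketch using the redis-py client (the connection settings and key names are assumptions, not from the original):

import redis

# Assumes a local Redis server; host, port, and key names are illustrative.
r = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

def add_new_url(url):
    # only queue URLs not seen in either set
    if not r.sismember('new_urls', url) and not r.sismember('old_urls', url):
        r.sadd('new_urls', url)

def get_new_url():
    url = r.spop('new_urls')      # take one URL to crawl
    if url:
        r.sadd('old_urls', url)   # mark it as crawled
    return url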
Web page downloader (urllib)
Downloads the web page corresponding to a URL and stores it locally as a file or string.
Basic method
Create baidu.py with the following content:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
buff = response.read()
html = buff.decode("utf8")
print(html)

Run python baidu.py on the command line to print out the retrieved page.
Constructing a Request
The above code can be modified to:
import urllib.request
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
With parameters
Create baidu2.py with the following content:
import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8', errors='ignore')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
request = urllib.request.Request(url=url, data=data, headers=headers, method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
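Note that when method='GET' is used, servers generally expect the parameters in the URL query string rather than in the request body; a small variant of the above (same illustrative values) looks like this:

import urllib.parse
import urllib.request

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}
# append the urlencoded parameters to the URL for a GET request
full_url = url + '?' + urllib.parse.urlencode(values)
response = urllib.request.urlopen(full_url)
print(response.geturl())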
Use Fiddler to listen for data
We want to check whether our request actually carries the parameters, so we need Fiddler. Surprisingly, with Fiddler open, both baidu.py and baidu2.py report a 504 error.
Although Python reported an error, in Fiddler we can see that the request message does carry parameters.
After searching for information, it turns out that earlier versions of Python's Request did not support HTTPS access through a proxy, although the latest version should. The easiest workaround is to use an HTTP URL instead, for example www.csdn.net. As a result, an error is still reported, but now it is a 400 error.
And yet... a plot twist! When I changed the URL to www.csdn.net/ (with a trailing slash), the request succeeded.
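To make sure urllib traffic actually goes through Fiddler, one option is to install a proxy handler pointing at Fiddler's local proxy (127.0.0.1:8888 is Fiddler's default port; adjust it if yours differs). A sketch:

import urllib.request

# Route urllib requests through Fiddler's local proxy.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.csdn.net/')
print(response.getcode())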
Add processor
import http.cookiejar
import urllib.request

# create a cookie container
cj = http.cookiejar.CookieJar()
# create an opener that handles cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# install the opener for urllib.request
urllib.request.install_opener(opener)

request = urllib.request.Request('http://www.baidu.com/')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
print(cj)
Web page parser (BeautifulSoup)
Extract valuable data and new URL lists from web pages.
Parser selection
To implement a parser, you can choose regular expressions, html.parser, BeautifulSoup, lxml, and so on. Here we choose BeautifulSoup. Of these, regular expressions are based on fuzzy matching, while the other three are based on structured DOM parsing.
BeautifulSoup
Install and test
1. Install: run pip install beautifulsoup4 on the command line.
2. Test:

import bs4
print(bs4)
Basic usage
1. Create BeautifulSoup object
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML page string
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc)
print(soup.prettify())
2. Access nodes
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
3. Specify a tag, class, or ID
print(soup.find_all('a'))
print(soup.find('a'))
print(soup.find(class_='title'))
print(soup.find(id="link3"))
print(soup.find('p',class_='title'))
4. Find the links of all <a> tags in the document
for link in soup.find_all('a'):
    print(link.get('href'))
A warning appears, prompting us to specify a parser when creating the BeautifulSoup object:
soup = BeautifulSoup(html_doc,'html.parser')
5. Get all the text from the document
print(soup.get_text())
6. Regular matching
import re

link_node = soup.find('a', href=re.compile(r"til"))
print(link_node)
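Putting the pieces together, here is a hedged sketch of a parse step that returns both the new URLs and some page data (the function name and return shape are illustrative, not from the original):

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse(page_url, html_content):
    # Extract new absolute URLs and the page title from downloaded HTML.
    soup = BeautifulSoup(html_content, 'html.parser')
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r'^(http|/)')):
        new_urls.add(urljoin(page_url, link['href']))
    data = {
        'url': page_url,
        'title': soup.title.string if soup.title else None,
    }
    return new_urls, data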
Afterword
This covers the basics of Python crawlers. Next, you can go on to learn more advanced knowledge in the field.