This is the 14th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge
Web crawler
If you want the Markdown file, leave a comment or send me a private message!
1) Step one of crawling: making network requests
A. The urllib library
1.urlopen
Returns a file-like handle object from which the page can be read.
from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.read())
2.urlretrieve
Downloads the page and saves it locally as 'baidu.html'.
request.urlretrieve('http://www.baidu.com', 'baidu.html')
3.urlencode
Converts dictionary data into URL-encoded data.
If a URL contains Chinese characters, the browser encodes them as %-escaped hexadecimal before sending the request; the server cannot receive raw Chinese characters.
from urllib import parse

data = {'name': 'crawlers', 'great': 'hello world', 'age': 100}
qs = parse.urlencode(data)
print(qs)
4.parse_qs
Decodes URL-encoded parameters back into a dictionary.
qs = 'xxxxx'  # a URL-encoded query string, e.g. the output of urlencode above
print(parse.parse_qs(qs))
5.urlparse & urlsplit
urlparse and urlsplit both split a URL into its components and return them.
The only difference is that urlparse returns one extra component, params, which urlsplit does not (see the short sketch below).
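A minimal sketch of the difference, using a made-up URL:

```python
from urllib import parse

url = 'http://www.example.com/path;params?a=1#frag'
print(parse.urlparse(url))   # ParseResult(..., path='/path', params='params', query='a=1', fragment='frag')
print(parse.urlsplit(url))   # SplitResult has no params field; ';params' stays part of the path
```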
6.request.Request
Use request.Request when you need to add request headers (to get around anti-crawler checks), for example a User-Agent.
headers = {
    'User-Agent': 'xxx'  # lets the server think the request comes from a browser rather than a crawler
}
req = request.Request('http://www.baidu.com', headers=headers)  # the request now carries the header information
7.ProxyHandler
How a proxy works: before requesting the target server, the request goes to the proxy server first; the proxy server requests the target website, gets its data, and then forwards it back to our code.
handler = request.ProxyHandler({"http": "xxxxxx"})
# Use ProxyHandler to create a handler from a dict of proxy addresses
opener = request.build_opener(handler)
# Create an opener from the handler
req = request.Request("http://xxxxxx")
resp = opener.open(req)
# Call opener.open to send the request; the page is accessed through the proxy IP
print(resp.read())
Commonly used proxy sources:
1. Xici free proxy IPs (free proxies are unreliable and fail easily)
2. Kuaidaili
3. Yundaili
8.Cookie
The server hands some data to the browser as a cookie; the browser sends it back with later requests so the server knows who the user is (a cookie is typically limited to about 4KB).
Set-Cookie: NAME=VALUE; Expires/Max-Age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE
Meaning of the parameters: NAME: the name of the cookie. VALUE: the value of the cookie. Expires: when the cookie expires.
Using cookies:
from urllib import request

request_url = "http://xxxxxxx"
headers = {
    'User-Agent': "xxxx",  # pretend the request comes from a browser, not a crawler
    'cookie': 'xxxx'
    # put the user's cookie in the header so the request looks like a logged-in browser
}
req = request.Request(url=request_url, headers=headers)  # build the request
resp = request.urlopen(req)  # send it and get the response
html = resp.read().decode('utf-8')
# Remember to decode, otherwise everything returned is raw bytes
print(html)
with open('xxx.html', 'w', encoding='utf-8') as fp:
    # encoding is needed because str must be written to the disk as bytes
    # write() expects a str here
    # resp.read() returns bytes
    # bytes become str through decode
    # str becomes bytes through encode
    fp.write(html)
    # decoded with utf-8 so the saved content is readable
9. The http.cookiejar module
1.CookieJar
Manages cookie objects and stores them in memory.
2.FileCookieJar(filename, delayload=None, policy=None)
Derived from CookieJar; used to store cookies in a file. delayload means file access can be deferred (the file is only read when needed).
3.MozillaCookieJar(filename, delayload=None, policy=None)
Derived from FileCookieJar; stores cookies in the Mozilla browser cookies.txt format. A small sketch of saving and reloading cookies follows.
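A minimal sketch (URL and filename are placeholders), assuming you want to persist cookies to disk with MozillaCookieJar and reuse them later:

```python
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookies.txt')
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))
opener.open('http://www.baidu.com')
cookiejar.save(ignore_discard=True, ignore_expires=True)  # also keep session/expired cookies

# Later, load the saved cookies back into a fresh jar:
cookiejar.load(ignore_discard=True, ignore_expires=True)
```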
from urllib import request, parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent': 'xxxxx'
}

# 1. Log in
def get_opener():
    cookiejar = CookieJar()
    # 1.1 Create a CookieJar object (cookies are kept in memory)
    handler = request.HTTPCookieProcessor(cookiejar)
    # 1.2 Create an HTTPCookieProcessor handler from the CookieJar
    # HTTPCookieProcessor handles cookie objects and builds the handler object
    opener = request.build_opener(handler)
    # 1.3 Call build_opener() with the handler from the previous step to create an opener
    # 1.4 This opener will be used to send the login request
    return opener

def login_the_url(opener):
    data = {"name": "xxxxx", "password": "xxxxxx"}
    data = parse.urlencode(data).encode('utf-8')
    # The data sent to the server must be URL-encoded and then encoded to bytes
    login_url = 'http://xxxx'
    # This is the login page
    req = request.Request(login_url, headers=headers, data=data)
    # Do not create a new opener when visiting the personal page afterwards;
    opener.open(req)
    # just reuse this opener, which now holds the cookies needed for login

# 2. Visit the profile page
def visit_profile(opener):
    # The opener already carries the cookies, so there is no need to create a new one
    url = "http://xxxxxx"
    # This is the page whose information we want to crawl
    req = request.Request(url, headers=headers)
    resp = opener.open(req)
    # request.urlopen cannot be used here: it does not go through our handler and its cookies
    with open('xxx.html', 'w', encoding='utf-8') as fp:
        fp.write(resp.read().decode("utf-8"))
        # decode before writing so the saved file is readable

if __name__ == '__main__':
    opener = get_opener()
    login_the_url(opener)
    visit_profile(opener)
2. The requests library
Step one of crawling with the requests library: import requests
1. Send a GET request:
1. Without parameters
import requests

response = requests.get("http://xxx")  # this requests the page
2. With parameters
import requests

kw = {"wd": "xxx"}
headers = {"User-Agent": "xxx"}
response = requests.get("http://xxx", params=kw, headers=headers)
# params accepts a dictionary or a string as query parameters. A dictionary is automatically
# converted to URL encoding, so urlencode() is not required.
print(response.text)
# response.text returns the data decoded to a str (Unicode); Chinese characters may come out garbled
print(response.content)
# response.content returns the raw byte stream
# response.content.decode('utf-8') decodes it manually
3.response.text & response.content
1. response.content: the data fetched from the network with no decoding applied, so it is of type bytes (strings are stored on disk and transmitted over the network as bytes). 2. response.text: the str that requests produces by decoding response.content. Decoding needs an encoding, and requests guesses which one to use, so it sometimes guesses wrong and the result is garbled. In that case use response.content.decode('utf-8') to decode manually.
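A minimal sketch of the difference, with a placeholder URL:

```python
import requests

response = requests.get("http://xxx")
raw = response.content                      # bytes, exactly as received
guessed = response.text                     # str, decoded with the encoding requests guessed
manual = response.content.decode('utf-8')   # str, decoded with an encoding you choose yourself
```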
4.other
1. print(response.encoding)     # the encoding requests guessed for this response
2. print(response.status_code)  # the HTTP status code
2. Send a POST request:
1. POST needs request data
import requests

url = 'http://xxx'
headers = {
    'User-Agent': 'xxx',
    # the user agent that lets the server think this is a browser, not a crawler
    'Referer': 'http://xxx'
    # tells the server which page linked to the current request, so it is less likely to be treated as a crawler
}
data = {  # the form data observed in the browser's network panel
    'first': 'true', 'pn': 1, 'kd': 'python'
}
resp = requests.post(url, headers=headers, data=data)
print(resp.json())  # parse the response as JSON
3. Adding a proxy:
import requests

proxy = {
    'http': 'xxx'  # proxy IP address
}
response = requests.get("http://xxx", proxies=proxy)
print(response.text)
4. About session
(this session is not the session in web development):
import requests

url = "http://xxx"
data = {
    "name": "xxx", "password": "xxx"
}
headers = {
    'User-Agent': "xxx"
}
session = requests.session()  # unlike a plain requests.get/post, a session keeps cookies between requests
session.post(url, data=data, headers=headers)
response = session.get('http://xxx')
print(response.text)
If you want to share cookies across multiple requests, you should use session
5. Handling untrusted SSL certificates
(Some websites' certificates are not trusted, and the address bar shows "insecure" in red.) Sites with trusted certificates can be requested directly; for the others, pass verify=False.
resp = requests.get('http://xxxxxx', verify=False)
print(resp.content.decode('utf-8'))
2) Step two of crawling: parsing the data
Parsing tool | Parsing speed | Difficulty |
---|---|---|
BeautifulSoup | Slowest | Easiest |
lxml | Fast | Easy |
Regular expressions | Fastest | Hardest |
XPath can search XML and HTML documents for the information you want.
Install the browser plugin: XPath Helper (Chrome)
XPath syntax:
1. Selecting nodes:
1) nodename (selects all children of the named node) eg: bookstore selects all child nodes of the bookstore node
2) / (at the start it means selecting from the root node; otherwise it selects a direct child of a node) eg: /bookstore selects the bookstore node under the root element. eg: /div finds nothing on a web page, because the root element is html and div is not a direct child of it; div sits deeper, inside body, so /html/div does not match either
3) // (selects nodes anywhere in the document, regardless of position) eg: //script selects every script element globally, not only the ones in head but also the ones in body
4) @ (selects an attribute of a node), similar to an attribute of a class in object-oriented programming
<book price="xx">  price is an attribute of book; eg: //book[@price] selects all book nodes that have a price attribute
<div id="xxx">  eg: //div[@id] selects all div nodes that have an id attribute
2. Predicates
Predicates are used to find a specific node, or a node containing a specified value, and are written inside square brackets.
1) eg: /bookstore/book[1] selects the first book child of bookstore; eg: //body/div[1] selects the first div under body
2) eg: /bookstore/book[last()] selects the last book child of bookstore
3) eg: //body/div[position()<3] selects the first two div elements under body
4) eg: //book[@price] selects the book elements that have a price attribute
5) eg: //div[@class='s_position_list'] selects the div elements whose class attribute equals 's_position_list'
contains:
eg: <div class="content_1 f1">  //div[contains(@class,"f1")] uses contains for fuzzy matching, matching the f1 value inside class
3. Wildcards
(* is the wildcard)
1) * matches any element eg: /bookstore/* selects all child elements of bookstore
2) @* matches any attribute eg: //book[@*] selects all book elements that have at least one attribute
4. Selecting multiple paths
(use the | operator to select several paths at once)
1) eg: //bookstore/book | //book/title  # selects all book elements under bookstore and all title elements under book; eg: //dd[@class="job_bt"] | //dd[@class="job-advantage"]  # selects all dd elements whose class is job_bt or job-advantage
There are also other operators such as and and or.
summary (a runnable sketch follows below):
1. Usually start with // to search the whole page, then write the tag name, then add a predicate to narrow it down, eg: //div[@class='abc']  2. / only selects direct children, while // selects all descendants  3. contains: when an attribute holds several values you can use the contains function, eg: //div[contains(@class,'xxx')]
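A short runnable sketch of the syntax above, using lxml on a made-up HTML snippet:

```python
from lxml import etree

text = """
<html><body>
  <div class="abc f1"><a href="/page1">first</a></div>
  <div class="xyz"><a href="/page2">second</a></div>
</body></html>
"""
html = etree.HTML(text)
print(html.xpath("//div[@class='abc f1']/a/text()"))       # ['first']
print(html.xpath("//div[contains(@class, 'f1')]//@href"))  # ['/page1']
print(html.xpath("//a/@href | //a/text()"))                # union: all hrefs and all texts
```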
The lxml library
1. Basic use:
1) Parsing an HTML string: use lxml.etree.HTML
from lxml import etree  # lxml is implemented in C

text = "... HTML source here ..."
# The HTML here may be non-standard or incomplete
html = etree.HTML(text)
# etree.HTML turns the string into an HTML document object and parses it, filling in missing tags
result = etree.tostring(html, encoding='utf-8')
# Serialize the HTML document back to a string; the result is bytes. Pass encoding='utf-8'
# to avoid garbled output, because the default serialization is not utf-8
print(result.decode('utf-8'))
# decode so the output is readable
2. Parsing an HTML file
Use lxml.etree.parse
parser = etree.HTMLParser(encoding='utf-8')
# Build an HTML parser so non-standard pages do not break parsing
html = etree.parse("tencent.html", parser=parser)
# parse can read the file directly, but some pages are incomplete
# (a missing div or similar causes an error); the fix is to pass the HTML parser
result = etree.tostring(html, encoding='utf-8')
print(result.decode('utf-8'))
The effect is the same as above.
etree.parse uses an XML parser by default, so for non-standard HTML code you have to create your own HTML parser, as shown.
from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")  # build an HTML parser so incomplete pages still parse
html = etree.parse("tencent.html", parser=parser)

# The xpath function always returns a list
# 1. Get all tr tags: //tr
trs = html.xpath("//tr")
for tr in trs:
    # print(tr) would only show an Element object, which is not human-readable,
    # so serialize it with etree.tostring and then decode it
    print(etree.tostring(tr, encoding="utf-8").decode("utf-8"))

# 2. Get the second tr tag
trs = html.xpath("//tr[2]")  # returns a list of Element objects
print(trs)
tr = html.xpath("//tr[2]")[0]  # take the first element of that list
print(tr)
print(etree.tostring(tr, encoding='utf-8').decode("utf-8"))
# serialize the element to a string and decode with utf-8 to see the page source

# 3. Get all tr tags whose class equals even
evens = html.xpath("//tr[@class='even']")
for even in evens:
    print(etree.tostring(even, encoding="utf-8").decode("utf-8"))
    # write tr first, then the predicate: all tr tags whose class attribute equals even

# 4. Get the href attribute of all a tags (this returns the attribute values)
hrefs = html.xpath("//a/@href")
for href in hrefs:
    print("http://hr.tencent.com/" + href)  # full URLs that can be opened directly
# 4.1 Get all a tags that have an href attribute (this returns the whole elements)
a_tags = html.xpath("//a[@href]")

# 5. Get all job information (plain text)
"""
<tr>
  <td class="xxx"><a target="xxx" href="xxx">I am the first text</a></td>
  <td>I am the second text</td>
  <td>I am the third text</td>
</tr>
"""
all_things = []
trs = html.xpath("//tr[position()>1]")  # every tr tag except the first (the header row)
for tr in trs:
    # href = tr.xpath("a") would only look at direct children named a,
    # but a is not a direct child of tr; td is
    # href = tr.xpath("//a") would ignore the current tr, because // searches the whole document
    href = tr.xpath(".//a/@href")
    # ".//" searches only the descendants of this tr; here we take the href of the a tag,
    # which is the trailing part of the detail-page URL
    title = tr.xpath(".//a/text()")
    # all the text under the a tag, i.e. "I am the first text"
    # title = tr.xpath("./td/text()") would get only the text directly inside td tags,
    # so "I am the first text" would be missed because it sits inside a, not directly in td
    title1 = tr.xpath("./td[1]//text()")
    # the first td tag; note that xpath indices start at 1, unlike Python's 0-based indices
    # "//text()" takes all the text under that td, including text inside nested tags such as a
    title2 = tr.xpath("./td[2]//text()")  # the second td: "I am the second text"
    all_thing = {
        "first": title1,   # xpath returns lists
        "second": title2
    }
    all_things.append(all_thing)  # collect into the list
    print(href)
    break
# lxml combined with xpath
# practice exercises
3. XPath practice: Douban
import requests
from lxml import etree

# 1. Fetch the page from the target site
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36",
    # copy a real browser UA so the crawler looks like a browser
    'Referer': "https://www.baidu.com/s?wd=%E8%B1%86%E7%93%A3&rsv_spt=1&rsv_iqid=0xded42b9000078acc&issp=1&f=8&rsv_bp"
               "=1&rsv_idx=2&ie=utf-8&tn=62095104_19_oem_dg&rsv_enter=1&rsv_dl=ib&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100"
               "&rsv_sug2=0&inputT=1250&rsv_sug4=1784"
    # tells the server which page this request came from; mostly useful for multi-page crawls
}
url = 'https://movie.douban.com/'
response = requests.get(url, headers=headers)
text = response.text  # the downloaded page
# text = open("douban.text", 'r', encoding="utf-8")
# print(response.text)
# response.text: a decoded str (Unicode); it may be garbled if requests guesses the wrong encoding
# response.content: the raw bytes of the page

# 2. Extract the captured data with xpath rules
html = etree.HTML(text)  # parse the page
print(html)
# html = html.xpath("//ul/li/a/@href")
# html = html.xpath("//ul/li/a/text()")
ul = html.xpath("//ul")[0]
print(ul)
lis = ul.xpath("./li")
for li in lis:
    title = li.xpath("@data-title")
    data_release = li.xpath("@data-release")
    # data_duration = li.xpath("@data-duration")
    data_region = li.xpath("@data-region")
    data_actors = li.xpath("@data-actors")
    post = li.xpath(".//img/@src")
    print(data_actors)
    print(post)
    movie = {
        'title': title,
        'data_release': data_release
    }
4. XPath practice: Movie Paradise (dytt8)
# Crawling Movie Paradise
import requests
from lxml import etree

BASE_URL = 'https://www.dytt8.net/'
url = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
HEADERS = {
    'Referer': 'https://www.dytt8.net/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36'
}
def get_detail_urls(url):
    response = requests.get(url, headers=HEADERS)
    # print(response.text)
    # By default the requests library decodes the page with the encoding it guessed and stores it in .text
    # print(response.content.decode(encoding='gbk', errors='ignore'))
    # Press F12 and type document.charset in the console to check the page encoding.
    # errors='ignore' keeps the program running when GBK meets characters it cannot decode.
    text = response.content.decode(encoding='gbk', errors='ignore')
    html = etree.HTML(text)  # parse the page
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # a page has many tables, but class='tbspan' is specific to the movie list table we want;
    # the a tags inside it hold the relative links to the detail pages
    # for detail_url in detail_urls:
    #     print(BASE_URL + detail_url)
    detail_urls = map(lambda url: BASE_URL + url, detail_urls)
    return detail_urls
    # The map call above is equivalent to:
    # def abc(url):
    #     return BASE_URL + url
    # index = 0
    # for detail_url in detail_urls:
    #     detail_url = abc(detail_url)
    #     detail_urls[index] = detail_url
    #     index += 1
def spider():
    movies = []
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    # leave a {} so the page number can be filled in
    for x in range(1, 7):  # how many list pages to crawl
        print("==================================")
        print(x)
        print("==================================")
        # If a page contains characters GBK cannot decode, decoding would raise an error,
        # which is why get_detail_urls uses decode('gbk', errors='ignore')
        url = base_url.format(x)
        detail_urls = get_detail_urls(url)
        for detail_url in detail_urls:
            # this loop walks through every movie detail URL on one list page
            # print(detail_url)
            movie = parse_detail_page(detail_url)
            movies.append(movie)
    print(movies)  # everything is printed after the crawl finishes; it takes a little while
def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=HEADERS)
    text = response.content.decode('gbk')  # decode
    html = etree.HTML(text)  # returns an element
    # titles = html.xpath("//font[@color='#07519a']")
    # crawling the title by color alone would also match other elements with the same style,
    # so restrict it with a unique parent tag
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    # restricting to the div narrows it down to the actual title; text() extracts the text
    # print(titles)
    # for title in titles:
    #     print(etree.tostring(title, encoding='utf-8').decode('utf-8'))
    # printed as a string, otherwise it would be a byte stream
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]
    # the Zoom div holds most of the information to crawl; xpath returns a list, so take the first element
    post_imgs = zoomE.xpath(".//img/@src")
    movie['post_imgs'] = post_imgs
    # print(post_imgs)
    infos = zoomE.xpath(".//text()")
    # all the text under the Zoom div
    # print(infos)

    def parse_info(info, rule):
        return info.replace(rule, "").strip()
        # helper: strip the field label and surrounding whitespace from a line

    # for info in infos:
    for index, info in enumerate(infos):
        # enumerate yields both the index and the element
        if info.startswith("◎年　代"):
            # print(info)
            # info = info.replace("◎年　代", "").strip() does the same as the function call below
            info = parse_info(info, "◎年　代")
            movie["year"] = info
        elif info.startswith("◎产　地"):
            info = parse_info(info, "◎产　地")
            movie["country"] = info
        elif info.startswith("◎类　别"):
            info = parse_info(info, "◎类　别")
            movie["category"] = info
        elif info.startswith("◎豆瓣评分"):
            info = parse_info(info, "◎豆瓣评分")
            movie["douban_score"] = info
        elif info.startswith("◎片　长"):
            info = parse_info(info, "◎片　长")
            movie["duration"] = info
        elif info.startswith("◎导　演"):
            info = parse_info(info, "◎导　演")
            movie["director"] = info
        elif info.startswith("◎主　演"):
            info = parse_info(info, "◎主　演")
            # the cast list spans several lines in the source, so collect it by index
            actors = [info]
            # include the first actor, which sits on the label line itself
            for x in range(index + 1, len(infos)):
                # index is the line with the first actor, so continue from the next line
                actor = infos[x].strip()
                # remove the whitespace on both sides
                if actor.startswith("◎标　签"):
                    break
                actors.append(actor)
            movie['actors'] = actors
        elif info.startswith("◎简　介"):
            info = parse_info(info, "◎简　介")
            # the synopsis is collected the same way as the cast list
            for x in range(index + 1, len(infos)):
                profile = infos[x].strip()
                if profile.startswith("【下载地址】"):
                    break
                movie["profile"] = profile
    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")
    movie["download_url"] = download_url
    return movie

if __name__ == '__main__':
    spider()
BeautifulSoup4
Like lxml, BeautifulSoup is an HTML and XML parser, and its main job is extracting data from pages.
lxml only does local traversal, while BeautifulSoup works on the HTML DOM: it loads the whole document and parses the complete DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's.
BeautifulSoup is comparatively simple to use for parsing HTML, has a very friendly API, supports CSS selectors, and can use the HTML parser from the Python standard library as well as the XML parser from lxml.
BeautifulSoup does not do the low-level parsing itself; it relies on a third-party parser underneath (much as Python itself is built on C).
The parser | How to use | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, reasonably fast, fault tolerant | Less fault tolerant in versions before Python 3.3 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast and fault tolerant | Needs the C-based lxml library (pip install lxml) |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Fast; the only option that supports XML | Needs the C-based lxml library (pip install lxml) |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does and produces valid HTML5 | Slow; does not rely on external C extensions |
For badly-formed pages it is best to parse with html5lib, which automatically repairs the errors in the markup.
Simple use:
from bs4 import BeautifulSoup

html = """ xxxxxx """
bs = BeautifulSoup(html, "lxml")  # parse as HTML and fill in missing elements
print(bs.prettify())  # print the prettified result
1. Four commonly used objects:
BeautifulSoup turns a complex HTML document into a tree of nodes, each of which is a Python object. All of them boil down to four types:
-
Tag
A Tag is an HTML tag.
-
NavigableString
The text inside a tag.
-
BeautifulSoup
The object representing the whole document.
-
Comment
A special NavigableString for HTML comments.
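A small sketch showing the four object types, on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>hello</b><!-- a comment --></p>"
soup = BeautifulSoup(html, "lxml")

print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(type(soup.b))         # <class 'bs4.element.Tag'>
print(type(soup.b.string))  # <class 'bs4.element.NavigableString'>
comment = soup.p.contents[1]
print(type(comment))        # <class 'bs4.element.Comment'>
```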
2.find & find_all
find:
1) extracts only the first matching tag; even if several match, only one is returned
find_all:
0) extracts all matching tags and returns them as a list
1) the first argument is the tag name; to filter by attributes while extracting, either pass the attribute name and its value as keyword arguments, or pass all the attributes and their values as a dictionary to the attrs parameter
2) if you do not want that many results, use limit to cap how many tags are extracted (see the short sketch below)
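A minimal sketch of the find/find_all variants described above, on made-up HTML:

```python
from bs4 import BeautifulSoup

html = "<div><a class='link' id='a1'>one</a><a class='link'>two</a><a class='link'>three</a></div>"
soup = BeautifulSoup(html, "lxml")

print(soup.find('a'))                                            # only the first a tag
print(soup.find_all('a', class_='link'))                         # keyword-argument filter (class_ because class is reserved)
print(soup.find_all('a', attrs={'class': 'link', 'id': 'a1'}))   # the same filter via attrs
print(soup.find_all('a', limit=2))                               # at most two results
```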
3.string, strings, stripped_strings, get_text
string:
Gets the single non-tag string directly under a tag, returned as a plain string.
strings:
Gets all the non-tag strings among a tag's descendants and returns a generator (wrap it in list() to get a list).
stripped_strings:
Same as strings, but with surrounding whitespace removed; also returns a generator.
get_text:
Gets all the descendant non-tag strings under a tag, joined and returned as one plain string rather than a list.
(A short comparison follows; the full job-table example comes after it.)
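A quick comparison sketch, on made-up HTML:

```python
from bs4 import BeautifulSoup

html = "<div>  <a href='#'> first </a> second </div>"
soup = BeautifulSoup(html, "lxml")
div = soup.div

print(div.a.string)               # ' first '  (the single string inside <a>)
print(list(div.strings))          # all descendant strings, whitespace included
print(list(div.stripped_strings)) # ['first', 'second']
print(div.get_text())             # one combined string
```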
from bs4 import BeautifulSoup

html = """
xxxxxx
"""
soup = BeautifulSoup(html, "lxml")

# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print(type(tr))
    # tr is a Tag object, but Tag's repr method prints it as a string

# 2. Get the first 2 tr tags
trs = soup.find_all('tr', limit=2)
# limit returns at most two elements as a list; add [1] at the end to take the second one

# 3. Get all tr tags whose class equals even
trs = soup.find_all('tr', class_='even')  # class is a Python keyword, so bs4 uses class_ instead
for tr in trs:
    print(tr)
trs = soup.find_all('tr', attrs={'class': "even"})  # the attrs dict works the same way
for tr in trs:
    print(tr)

# 4. Get all a tags whose id equals test and class equals test
aList = soup.find_all('a', id='test', class_='test')  # any number of keyword filters can be stacked
for a in aList:
    print(a)
aList = soup.find_all('a', attrs={"id": "test", "class": "test"})  # same thing via attrs
for a in aList:
    print(a)

# 5. Get the href attribute of all a tags
aList = soup.find_all('a')  # find all a tags
for a in aList:
    # 1. subscript access
    href = a['href']  # the simple way
    print(href)
    # 2. through the attrs attribute
    href = a.attrs['href']  # the href attribute under the a tag
    print(href)

# 6. Get all job information (plain text)
trs = soup.find_all('tr')[1:]  # the job rows are tr tags; the first row is the header, so skip it
infos_ = []
for tr in trs:
    info = {}
    # Method one
    tds = tr.find_all("td")   # all td tags under this tr
    title = tds[0]            # the title element
    print(title.string)       # extract its string
    title = tds[0].string     # the first td holds the title
    category = tds[1].string  # the second td holds the category
    nums = tds[2].string      # the third td holds the number of openings
    city = tds[3].string      # the fourth td holds the city
    pubtime = tds[4].string   # the fifth td holds the publication time
    info['title'] = title
    info['category'] = category
    info['nums'] = nums
    info['city'] = city
    info['pubtime'] = pubtime
    infos_.append(info)
    # Method two
    # infos = tr.strings
    # crawls all the plain text (not the tags) and returns a generator object
    # for x in infos:
    #     print(x)
    infos = list(tr.stripped_strings)  # same, but with the whitespace stripped
    info['title'] = infos[0]
    info['category'] = infos[1]
    info['nums'] = infos[2]
    info['city'] = infos[3]
    info['pubtime'] = infos[4]
    infos_.append(info)  # more concise and simple
import requests
from bs4 import BeautifulSoup
import html5lib
from pyecharts.charts import Bar

ALL_Data = []

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36',
        'Referer': 'http://www.weather.com.cn/forecast/index.shtml'
    }
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')
    conMidTab = soup.find('div', class_='contentboxTab1')  # find the div whose class is contentboxTab1
    # print(conMidTab)
    tables = conMidTab.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):  # enumerate returns the index and the row
            tds = tr.find_all('td')  # the td tags, returned as a list
            city_td = tds[0]  # the city is normally in the first td
            if index == 0:
                city_td = tds[1]
                # in the first row the first td is the province (e.g. Heilongjiang),
                # so take the second td, which is the city (e.g. Harbin)
            city = list(city_td.stripped_strings)[0]  # stripped_strings is a generator, so convert to a list and take element 0
            temp_td = tds[-2]  # the minimum temperature is in the second-to-last td
            min_temp = list(temp_td.stripped_strings)[0]  # grab the text inside it
            ALL_Data.append({"city": city, "min_temp": int(min_temp)})
            # print({"city": city, "min_temp": min_temp})

def main():
    urls = ["hb", "db", "hd", "hn", "xb", "xn", "gat"]
    for id in urls:
        url = f'http://www.weather.com.cn/textFC/{id}.shtml#'
        # The table tags are incomplete in the raw source (a browser normally repairs them),
        # so parse with html5lib, which fixes up the markup
        parse_page(url)

    # Analyse the data: rank by minimum temperature
    # def sort_key(data):
    #     min_temp = data['min_temp']
    #     return min_temp
    # ALL_Data.sort(key=sort_key)
    ALL_Data.sort(key=lambda data: data['min_temp'])  # same as the function above; the value after the colon is what gets returned
    data = ALL_Data[0:10]
    # for value in data:
    #     city = value['city']
    #     cities.append(city)
    # extract the city names
    cities_ = list(map(lambda x: x['city'], data))
    # each item of data is passed to the lambda, which pulls out the city field
    temps_ = list(map(lambda x: x['min_temp'], data))
    chart = Bar()  # a bar chart
    chart.add_xaxis(cities_)                    # city names on the x axis
    chart.add_yaxis("min temperature", temps_)  # the series name and the temperatures
    chart.render('temperature.html')  # render to an HTML file

if __name__ == '__main__':
    main()
The CSS select method
Sometimes a CSS selector is the more convenient way to pick elements.
1) Search by tag name
To search by tag name, just pass the tag: soup.select('a')
2) Search by class name
To search by class name, prefix it with a dot. To find class='sister':
soup.select('.sister')
3) Search by id
To search by id, prefix it with #. To find id='link':
soup.select('#link')
4) Combined search
Combinations follow the same rules as writing CSS: soup.select("p #link1") finds the element with id link1 inside p tags;
a direct child is selected with >, e.g. soup.select("head > title")
5) Search by attribute
Attribute filters are written in square brackets:
soup.select('a[href="www.baidu.com"]')
<!DOCTYPE html>
<html>
<head>
    <title></title>
    <style type="text/css">
        .line1{
            background-color: pink;
        }
        #line2{
            background-color: rebeccapurple;
        }
        .box p{ /* selects all p descendants of .box */
            background-color: azure;
        }
        .box > p{ /* selects only direct p children, not grandchildren */
            background-color: aqua;
        }
        input[name='username']
        {
            background-color: coral;
        }
    </style>
</head>
<body>
    <div class="box">
        <div>
            <p>the zero data</p> <!-- this p is a grandchild element -->
        </div>
        <p class="line1">the first data</p>
        <p class="line1">the second data</p>
        <p id="line2">the third data</p>
        <!-- the three p tags above are direct children -->
    </div>
    <p>
        the fourth data
    </p>
    <form>
        <input type="text" name="username">
        <input type="text" name="password">
    </form>
</body>
</html>
5.soup + select
To use CSS selectors, call select on the soup object: soup.select(...). A short sketch against the HTML above follows.
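A minimal sketch of select(), assuming the HTML document above has been saved as index.html:

```python
from bs4 import BeautifulSoup

with open('index.html', 'r', encoding='utf-8') as fp:  # index.html is the example page above
    soup = BeautifulSoup(fp.read(), 'lxml')

print(soup.select('p'))                        # by tag name
print(soup.select('.line1'))                   # by class name
print(soup.select('#line2'))                   # by id
print(soup.select('.box > p'))                 # direct p children of .box
print(soup.select("input[name='username']"))   # by attribute
```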
Regular expressions
About regular expressions: they match the data you want out of a string according to a set of rules.
Matching a single character
import re

text = 'hello'
ret = re.match('he', text)  # match only matches from the beginning of the string; with 'ahello' it would fail
print(ret.group())  # group() returns the matched text
>>he
Dot (.) matches any single character:
text = "ab"
ret = re.match('.', text)  # match consumes only one character here
print(ret.group())
>>a
But (.) cannot match a newline: with text = "\n" the match fails.
\d matches any digit:
text = "123"
ret = re.match('\d', text)  # matches only one character
print(ret.group())
>>1
\D matches any non-digit
text = "a2"
ret = re.match('\D', text)  # matches only one character
print(ret.group())
>>a
\s matches whitespace characters (\n, \t, \r, space)
text = " "
ret = re.match('\s', text)  # matches only one character
print(ret.group())
>>
There is a match, but it is only a whitespace character, so nothing visible is printed.
\w matches a-z, A-Z, digits and the underscore
text = "_"
ret = re.match('\w', text)  # matches only one character
print(ret.group())
>>_
Any other character will not match:
text = "+"
ret = re.match('\w', text)  # matches only one character
print(ret.group())
>> raises an error (ret is None)
\W is the opposite of \w
text = "+"
ret = re.match('\W', text)  # matches only one character
print(ret.group())
>>+
[] matches a combination: any character listed inside the brackets satisfies the match
text = "0888-88888"
ret = re.match('[\d\-]+', text)  # matches digits and -, and the + keeps matching until a character no longer qualifies
print(ret.group())
>>0888-88888
Character classes can stand in for the shorthand forms:
\d is [0-9] and \D is [^0-9]
\w is [0-9a-zA-Z_] and \W is [^0-9a-zA-Z_]
text = "-0888-88888"
ret = re.match('[^0-9]', text)
print(ret.group())
>>-
Matching multiple characters
* matches 0 or more characters, so it never raises an error
text = "0888-88888"
ret = re.match('\d*', text)
print(ret.group())
>>0888
+ matches 1 or more characters; there must be at least one, otherwise an error is raised
text = "abcd"  # with text = "+abcd" an error is raised, since the first character is not a word character
ret = re.match('\w+', text)
print(ret.group())
>>abcd
? matches zero or one character (either none or exactly one)
text = "abcd"  # text = "+abcd"
ret = re.match('\w?', text)
print(ret.group())
>>a  # with "+abcd" it matches zero characters and prints an empty string
{m} matches exactly m characters
text = "abcd"  # text = "+abcd"
ret = re.match('\w{2}', text)
print(ret.group())
>>ab  # only two characters are matched
{m,n}: matches between m and n characters (as many as possible)
text = "abcd"  # text = "+abcd"
ret = re.match('\w{1,5}', text)  # matches as many as it can, up to 5
print(ret.group())
>>abcd  # with "+abcd" an error is raised
A few small cases
1. Validate a phone number:
text = "13070970070"
ret = re.match('1[34578]\d{9}', text)  # the first digit is 1, the second is one of 3/4/5/7/8, and the last nine can be any digits
print(ret.group())
2. Validate an email address:
text = "[email protected]"
ret = re.match('\w+@[a-z0-9]+\.[a-z]+', text)
# \w+ matches one or more word characters up to the @ (which \w does not match); then exactly one @;
# then one or more characters of the domain; \. matches the literal dot; finally [a-z]+ matches the suffix such as com
print(ret.group())
3. Validate a URL:
text = "http://www.baidu.com"
ret = re.match('(http|https|ftp)://[^\s]+', text)  # one of http, https or ftp, then ://, then one or more non-whitespace characters
print(ret.group())
4. Validate an ID card number:
text = "12345678909876543x"
ret = re.match('\d{17}[\dxX]', text)  # the first seventeen characters are digits; the last is a digit, x or X, hence the character class
print(ret.group())
Bits and pieces
^ (caret): matches the start of the string
text = "hello"
ret = re.match('^h', text)  # match already anchors at the start, so ^h behaves like plain h here
print(ret.group())
>>h
text = "hello"
ret = re.search('o', text)  # search scans the whole string
print(ret.group())
>>o
text = "hello"
ret = re.match('^o', text)  # the first character is not o; '^h' would find h
print(ret.group())
>> raises an error
Inside square brackets, ^ means negation instead.
$: matches at the end of the string
text = "[email protected]"
ret = re.match('\w+@163\.com$', text)  # requiring the string to end with 163.com validates the mailbox more strictly
print(ret.group())
| matches one of several strings or expressions
text = "https"
ret = re.match('http|https|ftp', text)  # use parentheses () to group alternatives inside a larger pattern
print(ret.group())
Greedy and non-greedy modes
text = "123456"
ret = re.match('\d+', text)   # greedy mode: match as many characters as possible -> 123456
ret1 = re.match('\d+?', text) # non-greedy mode: match as few as possible -> 1
print(ret.group())
text = "<h1>title</h1>"
ret = re.match('<.+>', text)    # greedy: matches the whole string "<h1>title</h1>"
ret1 = re.match('<.+?>', text)  # non-greedy: stops at the first >, matching only "<h1>"
print(ret.group())
Matching numbers between 0 and 100
text = "99"
ret = re.match('[1-9]\d?$|100$', text)
# the first digit cannot be 0, so it is 1-9; the second digit is optional, hence the ?;
# the match must end there ($); 100 is the special case, so it is handled as a separate alternative
print(ret.group())
Raw strings and escape characters
text = "the mac book pro is $1999"
ret = re.search('\$\d+', text)  # $ has a special meaning in regex, so escape it with \ to match a literal $
print(ret.group())
>>$1999
r'\n' is a raw (native) string
Printing \n literally:
text = "\\n"  # the string is a backslash followed by n
ret = re.match('\\\\n', text)  # in a normal pattern string two backslashes escape to one, so four are needed for the regex to see \ followed by n
# with a raw string this is simply r'\\n'
print(ret.group())
>>\n
group: grouping
text = "apple's price $99,orange's price is $10"
ret = re.search('.*(\$\d+).*(\$\d+)', text)
print(ret.group())    # ret.group() is the same as ret.group(0): the whole match
print(ret.group(1))   # the first group: $99
print(ret.group(2))   # the second group: $10
print(ret.groups())   # all subgroups as a tuple
findall
Finds everything that satisfies the pattern and returns a list
text = "apple's price $99,orange's price is $10"
ret = re.findall('\$\d+', text)  # finds all matches and returns ['$99', '$10']
print(ret)
sub
text = "apple's price $99,orange's price is $10"
ret = re.sub('\$\d+', "0", text)  # replaces every match with 0
print(ret)  # returns a new string: apple's price 0,orange's price is 0
sub can also be used to strip HTML tags by replacing them with an empty string:
text = "<h1>xxx</h1>"
ret = re.sub('<.+?>', "", text)  # non-greedy, so each tag is matched and removed separately
print(ret)  # xxx
The split function
text = "hello world ni hao"
ret = re.split(' ', text)
print(ret)  # ['hello', 'world', 'ni', 'hao']
compile:
If a pattern is used often, compile it once and reuse it.
text = "the number is 20.50"
r = re.compile('\d+\.?\d*')
r = re.compile(r"""
    \d+    # the digits before the decimal point
    \.?    # the decimal point itself
    \d*    # the digits after the decimal point
    """, re.VERBOSE)  # re.VERBOSE lets you write comments inside the pattern
ret = re.search(r, text)
print(ret.group())
Regex practice: crawling the ancient poetry site
# Regex example: crawl gushiwen.org
import re
import requests

def parse_page(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    # The page contains \n and . does not match \n by default, so the match would stop and return nothing;
    # re.DOTALL makes . match every character including \n. The ? keeps the match non-greedy,
    # so each title is captured separately instead of everything up to the last match.
    dynasties = re.findall(r'<p\sclass="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    # the dynasty is the text of the first a tag inside the source paragraph
    authors = re.findall(r'<p\sclass="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    # the author sits in the second a tag, so skip over the first one before capturing
    content_tags = re.findall(r'<div\sclass="contson".*?>(.*?)</div>', text, re.DOTALL)
    # with regular expressions the page is treated as one big string, not as a tree of parent and child elements
    contents = []
    for content in content_tags:
        # print(content)
        x = re.sub(r'<.*?>', "", content)  # strip the remaining tags
        contents.append(x.strip())
    poems = []
    for value in zip(titles, dynasties, authors, contents):
        title, dynasty, author, content = value
        more_poems = {
            'title': title,
            'dynasty': dynasty,
            'author': author,
            'content': content
        }
        poems.append(more_poems)
    for poem in poems:
        print(poem)

def main():
    for page in range(10):
        url = f"https://www.gushiwen.org/default_{page}.aspx"
        parse_page(url)

if __name__ == '__main__':
    main()
3) Step three of crawling: storing the data
JSON file handling
JSON is a lightweight data-interchange format.
It supports objects (dictionaries), arrays (lists), integers and strings. Strings must use double quotes, not single quotes.
import json

# Convert Python objects to a JSON string
persons = [
    {
        'username': "zhilioa", 'age': 18, 'country': "china"
    },
    {
        'username': "zhaxiaolie", 'age': 20, 'country': "china"
    }
]
json_str = json.dumps(persons)
print(json_str)
print(type(json_str))  # a JSON document is really just a string
with open('person.json', 'w', encoding='utf-8') as fp:
    fp.write(json_str)
# You can also write directly to the file object with json.dump(persons, fp, ensure_ascii=False);
# ensure_ascii=False stops non-ASCII characters from being escaped to \uXXXX sequences
class Person(object):
    country = 'china'
a = {
    'person': Person
}
json.dumps(a)  # raises an error: a class object cannot be converted to JSON
Loading JSON turns it back into the corresponding Python object (a list here):
persons = json.load(fp)  # from an open file object; use json.loads for a string (see the sketch below)
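A minimal sketch of both loading directions, reusing the person.json file written above:

```python
import json

with open('person.json', 'r', encoding='utf-8') as fp:
    persons = json.load(fp)        # load from a file object
print(persons[0]['username'])

json_str = '[{"username": "zhilioa", "age": 18}]'
print(json.loads(json_str))        # loads parses a JSON string instead
```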
CSV file handling
(comma-separated values)
1. Reading a CSV file
import csv

def read_csv_demo1():
    with open('stock.csv', 'r') as fp:
        reader = csv.reader(fp)  # reads the CSV file and returns an iterator of rows
        next(reader)  # skip the header row and start from the second row
        for x in reader:
            print(x)  # each row is printed as a list

def read_csv_demo2():
    with open('stock.csv', 'r') as fp:
        # this way the header row is not returned as data
        reader = csv.DictReader(fp)
        for x in reader:
            print(x)
            value = {"name": x['secShortname'], "volume": x['turnoverVol']}
            print(value)

if __name__ == '__main__':
    read_csv_demo2()
2. Writing a CSV file:
import csv

headers = ['username', 'age', 'height']

def writer_csv_demo1():
    values = [('zhanghan', 12, 180),
              ('wangwu', 16, 170),  # a list of tuples
              ('lisi', 14, 111)]
    with open('class.csv', 'w', encoding="utf-8", newline='') as fp:
        writer = csv.writer(fp)  # newline='' prevents the extra blank lines the default newline handling produces
        writer.writerow(headers)   # writerow writes a single row
        writer.writerows(values)   # writerows writes several rows

def writer_csv_demo2():
    values = [{'username': 'zhanghan', 'age': 12, 'height': 180},  # a list of dicts
              {'username': 'lisi', 'age': 11, 'height': 130},
              {'username': 'wangwu', 'age': 15, 'height': 140}]
    with open('class1.csv', 'w', encoding='utf-8', newline='') as fp:
        writer = csv.DictWriter(fp, headers)
        writer.writeheader()  # DictWriter does not write the header automatically, so call writeheader() yourself
        writer.writerows(values)

if __name__ == '__main__':
    writer_csv_demo2()