This is the 14th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

Web crawler

If you want the Markdown version of these notes, leave a comment or send me a private message!

1) Step one of crawling: making network requests

A. The urllib library

1.urlopen

Returns a file-like response object; calling read() on it gives you the page content

from urllib import request

resp=request.urlopen('http://www.baidu.com')
print(resp.read())
2.urlretrieve

Download the page and save it locally as 'baidu.html'

request.urlretrieve('http://www.baidu.com','baidu.html')
3.urlencode

Convert dictionary data to URL-encoded data

If a URL contains Chinese characters, the browser percent-encodes them (a % followed by hexadecimal digits) before sending the request, because the server cannot handle raw Chinese characters in a URL

from urllib import parse

data={'name':'crawlers','great':'hello world','age':100}
qs=parse.urlencode(data)
print(qs)
4.parse_qs

The encoded URL parameters can be decoded

qs='xxxxx'
print(parse.parse_qs(qs))
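For instance, a quick round trip through urlencode and parse_qs (the dictionary values here are only illustrative); note that parse_qs returns every value as a list:

from urllib import parse

data = {'name': 'crawlers', 'great': 'hello world', 'age': 100}
qs = parse.urlencode(data)
print(qs)                   # name=crawlers&great=hello+world&age=100
print(parse.parse_qs(qs))   # {'name': ['crawlers'], 'great': ['hello world'], 'age': ['100']}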
5.urlparse & urlsplit

urlparse and urlsplit both split a URL into its components and return them

urlparse returns one extra component, params, which urlsplit does not
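A quick sketch of the difference (the URL is just an example):

from urllib import parse

url = 'http://www.baidu.com/s;type=a?wd=python#head'
print(parse.urlparse(url))   # ParseResult(..., path='/s', params='type=a', query='wd=python', fragment='head')
print(parse.urlsplit(url))   # SplitResult has no params field, so the path stays '/s;type=a'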

6. The request.Request class

Use Request when you need to attach request headers (to get past anti-crawler checks), for example a User-Agent

headers={
        'User-Agent':'xxx'                                  # Lets the server think a browser, not a crawler, is making the request
        }
req=request.Request('http://www.baidu.com',headers=headers) # Build the request carrying the headers
resp=request.urlopen(req)                                    # Send it
7.ProxyHandler

How a proxy works: before the destination server is requested, the request first goes to the proxy server; the proxy server requests the destination website, receives its data, and then forwards that data back to our code

handler=request.ProxyHandler({"http":"xxxxxx"})
# Use ProxyHandler (passing in the proxy address) to create a handler
opener=request.build_opener(handler)
# Create an opener from the handler
req=request.Request("http://xxxxxx")
resp=opener.open(req)
# Use the opener to send the request; the page is then fetched through the proxy IP
print(resp.read())

Commonly used proxy providers:

1. Xici free proxy IPs (free proxies are unreliable and fail easily)

2. Kuaidaili (快代理)

3. Proxy Cloud (代理云)

8.Cookie

The server sends this data to the browser to store; on later requests the browser sends it back, so the server knows which user it is dealing with (a cookie is typically limited to about 4KB)

Set-Cookie: NAME=VALUE; Expires/Max-Age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE. Meaning of the parameters: NAME is the name of the cookie; VALUE is its value; Expires/Max-Age is when the cookie expires; Path and Domain limit where the cookie is sent; Secure means it is only sent over HTTPS

Use cookies:

from urllib import request

request_url="http://xxxxxxx"
headers={
    'User-Agent':"xxxx",   # Pretend to be a browser rather than a crawler, to get past anti-crawler checks
    'cookie':'xxxx'        # Put the cookie (the logged-in user's information) in the headers so the request looks like it comes from a real browser
}
req=request.Request(url=request_url,headers=headers)   # Build the request
resp=request.urlopen(req)                              # Send it and parse the response
content=resp.read().decode('utf-8')                    # Remember to decode, otherwise everything returned is raw bytes
print(content)
with open('xxx.html','w',encoding='utf-8') as fp:
    # resp.read() returns bytes, but write() here needs str:
    # bytes are turned into str with decode(), and str into bytes with encode(),
    # because data is written to disk as bytes
    fp.write(content)      # Decoded with utf-8 so the file content is readable
9. The http.cookiejar module
1. CookieJar

Manages the storage of cookie objects; the cookies are kept in memory

2.FileCookieJar(filename, delayload=None, policy=None)

Derived from CookieJar; it is used to store cookies in a file. delayload means delayed (lazy) loading: the file is only accessed when it is needed.

3.MozillaCookieJar(filename, delayload=None, policy=None)

Derived from FileCookieJar; it stores cookies in the Mozilla browser's cookies.txt file format.

from urllib import request,parse
from http.cookiejar import CookieJar

headers={
        'User-Agent':'xxxxx'
        }

#1. Log in to the site
def get_opener():
    cookiejar=CookieJar()
    # Create a CookieJar object to hold the cookies
    handler=request.HTTPCookieProcessor(cookiejar)
    # Create an HTTPCookieProcessor handler from the CookieJar;
    # it stores the cookies returned by the server into the CookieJar
    opener=request.build_opener(handler)
    # Build an opener from the handler created above; this opener carries the cookies on every request
    # and is used to send the login request
    return opener

def login_the_url(opener):
    data={"name":"xxxxx","password":"xxxxxx"}
    data=parse.urlencode(data).encode('utf-8')
    # The form data must be URL-encoded and then encoded to bytes before the server will accept it
    login_url='http://xxxx'
    # This is the login page
    req=request.Request(login_url,headers=headers,data=data)
    opener.open(req)
    # Reuse the opener created above; after this call it already holds the cookies needed for login

#2. Visit the profile page
def visit_profile(opener):
    # The opener already carries the login cookies, so there is no need to create a new one
    url="http://xxxxxx"
    # This is the page whose information we want to crawl
    req=request.Request(url,headers=headers)
    resp=opener.open(req)
    # request.urlopen cannot be used here: it does not go through our handler and therefore has no cookies
    with open('xxx.html','w',encoding='utf-8') as fp:
        fp.write(resp.read().decode("utf-8"))
        # Decode before writing so the file is readable text

if __name__=='__main__':
    opener=get_opener()
    login_the_url(opener)
    visit_profile(opener)

B. The requests library

The first step of the crawl is still the network request; with this library you simply import requests

1. Send get request:
1. No parameters
import requests

response=requests.get("http://xxx")    # This requests the page
2. Take parameters
import requests

kw={"wd":"xxx"}
headers={"User-Agent":"xxx"}
response=requests.get("http://xxx",params=kw,headers=headers)
# params accepts a dictionary or a string as the query parameters; a dictionary is converted to URL encoding automatically, so urlencode() is not needed
print(response.text)
# response.text returns the data decoded into a unicode str; Chinese characters may come out garbled if the guessed encoding is wrong
print(response.content)
# response.content returns the raw byte stream
# use response.content.decode('utf-8') to decode it manually
3.response.text&response.content

response.content: the raw data fetched from the network with no decoding applied, so its type is bytes (data travels over the network and sits on disk as bytes). response.text: the str that requests produces by decoding response.content for you. Decoding needs an encoding, and requests has to guess which one to use, so sometimes the guess is wrong and the text comes out garbled; in that case decode manually with response.content.decode('utf-8')

4. Other useful attributes

print(response.encoding)      # The encoding requests guessed (used to decode response.text)

print(response.status_code)   # The HTTP status code of the response

2. Send post request:

1. A POST request with parameters

import requests

url='http://xxx'

headers={
    'User-Agent':'xxx',      # The user agent tells the server this is a browser, not a crawler
    'Referer':'http://xxx'   # Tells the server which page linked to this request, so it is less likely to be treated as a crawler
}

data={                       # The form data, copied from what the site actually sends
    'first':'true',
    'pn':1,
    'kd':'python'
}

resp=requests.post(url,headers=headers,data=data)
print(resp.json())           # Parse the JSON response into a Python object
3. Adding a proxy:
import requests

proxy={
    'http':'xxx'             # Proxy IP address
}

response=requests.get("http://xxx",proxies=proxy)
print(response.text)
4. About the session

(This session is not the session concept from web development.)

import requests

url="http://xxx"
data={
    "name":"xxx","password":"xxx"
}
headers={
    'User-Agent':"xxx"
}
session=requests.session()        # Unlike plain requests.get/post, a session keeps the cookies between requests automatically
session.post(url,data=data,headers=headers)
response=session.get('http://xxx')
print(response.text)

If you want to share cookies across multiple requests, you should use session

5. Handle the untrusted SSL certificate

(Some websites' certificates are not trusted, and the browser marks the URL in red as insecure.) Pass verify=False so requests skips certificate verification and the page can still be fetched

resp=requests.get('http://xxxxxx',verify=False)
print(resp.content.decode('utf-8'))

2) Step two of crawling: parsing the data

| Parsing tool | Parsing speed | Difficulty |
| --- | --- | --- |
| BeautifulSoup | Slowest | Easiest |
| lxml | Fast | Easy |
| Regular expressions | Fastest | Hardest |

XPath can be used to search XML and HTML documents for the information you want

Helpful tool: the XPath Helper plugin for Chrome

XPath syntax:

1. Select nodes:

1) nodename: selects all child nodes of the named node. e.g. bookstore selects all child nodes of the bookstore node

2) / : at the beginning it means selecting from the root node; otherwise it selects a direct child of a node. e.g. /bookstore selects the bookstore element under the root. Note that /div usually finds nothing on a web page: the root element is html and div normally sits inside body, so even /html/div does not match it

3) // : selects nodes anywhere in the document, regardless of position. e.g. //script selects every script element, whether it sits inside head or body

4) @ : selects an attribute of a node, similar to an attribute of a class in object-oriented programming

              <book price="xx">   price is an attribute of book. e.g. //book[@price] selects all book nodes that have a price attribute
              <div id="xxx">      //div[@id] selects all div nodes that have an id attribute
2. Predicates

Predicates are used to find a specific node, or a node containing a specified value; they are written inside square brackets

1) e.g. //bookstore/book[1] selects the first book child of bookstore; //body/div[1] selects the first div under body

2) e.g. /bookstore/book[last()] selects the last book child of bookstore

3) e.g. //body/div[position()<3] selects the first two div elements under body

4) e.g. //book[@price] selects the book elements that have a price attribute

5) e.g. //div[@class='s_position_list'] selects the div elements whose class attribute equals 's_position_list'

contains():

 e.g. for <div class="content_1 f1">, //div[contains(@class,"f1")] does fuzzy matching, so it matches a class attribute that includes f1
3. Wildcards

(* is the wildcard)

1) * matches any element node. e.g. /bookstore/* selects all child elements of bookstore

2) @* matches any attribute node. e.g. //book[@*] selects all book elements that have at least one attribute

4. Selecting several paths

(use the | operator to select several paths at once)

1) e.g. //bookstore/book | //book/title selects all book elements under bookstore as well as all title elements under book. e.g. //dd[@class="job_bt"] | //dd[@class="job-advantage"] selects all dd elements whose class is job_bt or job-advantage

There are also other operators, such as and and or

summary:

1. Use // plus a tag name to fetch elements anywhere in the page, then add a predicate to narrow them down, e.g. //div[@class='abc']. 2. / only selects direct children, while // selects all descendants. 3. contains(): when an attribute holds several values you can use the contains function, e.g. //div[contains(@class,'xxx')]
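A tiny preview of these selectors in action with lxml (introduced in the next section); the HTML snippet here is made up for illustration:

from lxml import etree

html = etree.HTML("<div class='abc f1'><p>hi</p></div><div id='x'><p>yo</p></div>")
print(html.xpath("//p/text()"))                    # ['hi', 'yo'] - any p anywhere in the document
print(html.xpath("//div[contains(@class,'f1')]"))  # only the first div (fuzzy class match)
print(html.xpath("//div[@id]/p/text()"))           # ['yo'] - p directly under a div that has an id attribute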

LXML library

1. Basic use:

1) Parsing an HTML string: use lxml.etree.HTML

from lxml import etree    # lxml is implemented in C, so it is fast

text="...the HTML source goes here..."
# The HTML here may be incomplete or not fully standard
html=etree.HTML(text)
# etree.HTML turns the string into an HTML document and parses it, but the result is an Element object
result=etree.tostring(html,encoding='utf-8')
# Serialize the HTML document back to a string; this gives bytes. Pass encoding='utf-8',
# otherwise the default serialization escapes Chinese characters and they look garbled
print(result.decode('utf-8'))
# Decode so the output is readable
2) Parsing an HTML file

Use lxml.etree.parse to parse a file directly

parser=etree.HTMLParser(encoding='utf-8')
# Build an HTML parser so incomplete pages do not break parsing
html=etree.parse("tencent.html",parser=parser)
# parse() can read the file directly, but some pages are incomplete
# (a missing div and so on would raise an error); the fix is to pass the HTML parser
result=etree.tostring(html,encoding='utf-8')
print(result.decode('utf-8'))

The effect is the same as above

etree.parse uses an XML parser by default, so for non-standard HTML you have to create an HTML parser yourself

from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")  # Build an HTML parser so incomplete pages can still be parsed
html = etree.parse("tencent.html", parser=parser)


# The xpath() function always returns a list

# 1. Get all tr tags: //tr
trs = html.xpath("//tr")
for tr in trs:
    # print(tr)
    # Printing the element directly only shows an Element object, which is not human-readable,
    # so serialize it with etree.tostring and then decode it
    print(etree.tostring(tr, encoding="utf-8").decode("utf-8"))


# 2. Get the second tr tag
trs = html.xpath("//tr[2]")      # This returns a list containing the matching element
print(trs)
tr = html.xpath("//tr[2]")[0]    # Take the first (and only) element of that list
print(tr)
print(etree.tostring(tr, encoding='utf-8').decode("utf-8"))
# Serialize the element to a utf-8 string and decode it, so the page source is readable


# 3. Get all tr tags whose class equals "even"
evens = html.xpath("//tr[@class='even']")
for even in evens:
    print(etree.tostring(even, encoding="utf-8").decode("utf-8"))
# Write tr first, then the predicate selecting class='even'


# 4. Get the href attribute of all a tags (this returns the attribute values themselves)
hrefs = html.xpath("//a/@href")
for href in hrefs:
    print("http://hr.tencent.com/" + href)  # Prepend the base URL so the link can be opened directly
# 4.1 Get all a tags that have an href attribute (this returns the whole a elements)
links = html.xpath("//a[@href]")


# 5. Get all the job information (plain text)
"""
<tr>
  <td class="xxx"><a target="xxx" href="xxx">I am the first text</a></td>
  <td>I am the second text</td>
  <td>I am the third text</td>
</tr>
"""
trs = html.xpath("//tr[position()>1]")   # Everything except the first tr tag (the header row)
all_things = []
for tr in trs:
    # href = tr.xpath("a")
    # Nothing is found this way, because a is not a direct child of tr (td is)
    # href = tr.xpath("//a")
    # This ignores the current tr, because // searches for a tags in the whole document
    href = tr.xpath(".//a")
    # The leading . restricts the search to descendants of the current tr
    href = tr.xpath(".//a/@href")
    # The href attribute of the a tag; it is the tail part of the detail-page URL
    title = tr.xpath(".//a/text()")
    # All the text under the a tag, i.e. "I am the first text"
    title = tr.xpath("./td/text()")
    # All text directly under the td tags; this only gets "I am the second text" onwards,
    # because "I am the first text" sits inside the a tag and does not belong directly to td
    title1 = tr.xpath("./td[1]//text()")
    # The first td tag; note that XPath indexes start at 1, unlike Python's 0-based indexes
    # //text() fetches all text under that td, including text inside nested tags
    title2 = tr.xpath("./td[2]//text()")   # The second td, i.e. "I am the third text"
    all_thing = {
        "first": title1,                   # These are lists
        "second": title2
    }
    all_things.append(all_thing)           # Collect them in a list
    print(href)
    break

# lxml combined with xpath
# Practice examples follow
3. XPath in practice: Douban
import requests
from lxml import etree

# 1. Fetch the page from the target site
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36",
    # Copy the browser's user agent so the crawler looks like a browser
    'Referer': "https://www.baidu.com/s?wd=%E8%B1%86%E7%93%A3&rsv_spt=1&rsv_iqid=0xded42b9000078acc&issp=1&f=8&rsv_bp"
               "=1&rsv_idx=2&ie=utf-8&tn=62095104_19_oem_dg&rsv_enter=1&rsv_dl=ib&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100"
               "&rsv_sug2=0&inputT=1250&rsv_sug4=1784"
    # Tells the server which page this request was linked from; usually useful when crawling many pages
}
url = 'https://movie.douban.com/'
response = requests.get(url, headers=headers)
text = response.text                              # The downloaded page source
# text = open("Douban.text", 'r', encoding="utf-8")
# print(response.text)

# response.text: returns a decoded str (unicode); it may be garbled if requests guesses the wrong encoding
# response.content: returns the raw bytes of the page

# 2. Extract the data we want with XPath
html = etree.HTML(text)                      # Parse the page
print(html)
# html = html.xpath("//ul/li/a/@href")
# html = html.xpath("//ul/li/a/text()")

ul = html.xpath("//ul")[0]
print(ul)
lis = ul.xpath("./li")
for li in lis:
    title = li.xpath("@data-title")
    data_release = li.xpath("@data-release")
    # data_duration = li.xpath("@data-duration")
    data_region = li.xpath("@data-region")
    data_actors = li.xpath("@data-actors")
    post = li.xpath(".//img/@src")
    print(data_actors)
    print(post)
    movie = {
        'title': title,
        'data_release': data_release
    }
4. XPath in practice: Movie Paradise (dytt8)
# Crawling Movie Paradise (dytt8)
import requests
from lxml import etree

BASE_URL = 'https://www.dytt8.net/'
url = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
HEADERS = {
    'Referer': 'https://www.dytt8.net/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36'
}


def get_detail_urls(url):
    response = requests.get(url, headers=HEADERS)
    # print(response.text)
    # By default requests decodes the page with the encoding it guesses and stores the result in response.text
    # print(response.content.decode(encoding='gbk', errors='ignore'))
    # Press F12 and type document.charset in the console to check the page encoding;
    # errors='ignore' lets the program run past characters gbk cannot decode
    text = response.content.decode(encoding='gbk', errors='ignore')
    html = etree.HTML(text)  # Parse the page
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # Look inside the tables with class='tbspan'; a page contains many classes,
    # and class='tbspan' is specific to the movie tables we want to crawl
    # Each href is only the tail of the URL, so prepend BASE_URL to get the full detail-page address
    # for detail_url in detail_urls:
    #     print(BASE_URL + detail_url)
    detail_urls = map(lambda url: BASE_URL + url, detail_urls)
    return detail_urls
    # The map() call above is equivalent to:
    # def abc(url):
    #     return BASE_URL + url
    # index = 0
    # for detail_url in detail_urls:
    #     detail_url = abc(detail_url)
    #     detail_urls[index] = detail_url
    #     index += 1


def spider():
    movies = []
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    # The {} is left empty so the page number can be filled in with format()
    for x in range(1, 7):   # How many list pages to crawl
        print("==================================")
        print(x)
        print("==================================")
        # If a page contains characters gbk cannot decode, decoding would raise an error,
        # hence the errors='ignore' used in get_detail_urls:
        # text = response.content.decode('gbk', errors='ignore')
        url = base_url.format(x)
        detail_urls = get_detail_urls(url)
        for detail_url in detail_urls:
            # This loop walks through every movie detail URL on one list page
            # print(detail_url)
            movie = parse_detail_page(detail_url)
            movies.append(movie)
    print(movies)   # Everything is printed after the crawl finishes, which takes a while


def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=HEADERS)
    text = response.content.decode('gbk')     # Decode the page
    html = etree.HTML(text)                    # Returns an element
    # titles = html.xpath("//font[@color='#07519a']")
    # Crawling the title this way would also match other elements with the same styling,
    # so restrict the search with a unique enclosing tag
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    # Restricting to the div fetches the specific title; text() extracts the text from the element
    # print(titles)
    # That would print a list of element objects
    # for title in titles:
    #     print(etree.tostring(title, encoding='utf-8').decode('utf-8'))
    #     # Printed as a string instead of a byte stream
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]
    # The Zoom div holds most of the information to crawl; xpath returns a list, so take the first element
    post_imgs = zoomE.xpath(".//img/@src")
    movie['post_imgs'] = post_imgs
    # print(post_imgs)
    infos = zoomE.xpath(".//text()")
    # All the text under the Zoom div
    # print(infos)

    def parse_info(info, rule):
        return info.replace(rule, "").strip()
    # Helper: strip the label from the original string and return the cleaned-up value

    # for info in infos:
    for index, info in enumerate(infos):
        # enumerate yields both the index and the element
        # The labels below (e.g. "◎年　代") are the field markers used on the dytt8 detail pages
        if info.startswith("◎年　代"):
            # print(info)
            # info = info.replace("◎年　代", "").strip()
            # The line above is what the parse_info call does: remove the label, then strip the spaces
            info = parse_info(info, "◎年　代")
            movie["year"] = info
        elif info.startswith("◎产　地"):
            info = parse_info(info, "◎产　地")
            movie["country"] = info
        elif info.startswith("◎类　别"):
            info = parse_info(info, "◎类　别")
            movie["category"] = info
        elif info.startswith("◎豆瓣评分"):
            info = parse_info(info, "◎豆瓣评分")
            movie["douban_score"] = info
        elif info.startswith("◎片　长"):
            info = parse_info(info, "◎片　长")
            movie["duration"] = info
        elif info.startswith("◎导　演"):
            info = parse_info(info, "◎导　演")
            movie["director"] = info
        elif info.startswith("◎主　演"):
            info = parse_info(info, "◎主　演")
            # The actor list spans several lines of the source, one actor per line,
            # so the following lines are collected by index
            actors = [info]
            # Include the actor on the label line itself
            for x in range(index + 1, len(infos)):
                # index is the line holding the first actor, so start from the next line;
                # that first line is already included above
                actor = infos[x].strip()
                # Strip the whitespace on both sides
                if actor.startswith("◎标　签"):
                    break
                actors.append(actor)
            movie['actors'] = actors
        elif info.startswith("◎简　介"):
            # The synopsis is handled like the actor list: its text sits on the following lines
            for x in range(index + 1, len(infos)):
                profile = infos[x].strip()
                if profile.startswith("【下载地址】"):
                    break
            movie["profile"] = profile
    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")
    movie["download_url"] = download_url
    return movie


if __name__ == '__main__':
    spider()

BeautifulSoup4

Like lxml, BeautifulSoup is an HTML and XML parser; its main job is extracting data from documents

lxml only traverses the document locally, while BeautifulSoup works on the HTML DOM: it loads the whole document and builds the whole tree, so its time and memory overhead is much larger and its performance is lower than lxml's

BeautifulSoup makes parsing HTML relatively simple, with a very user-friendly API, and it supports CSS selectors, the HTML parser from the Python standard library, and lxml's parsers

BeautifulSoup still delegates the actual parsing to an underlying parser such as lxml (much as Python ultimately relies on C), so a third-party parser is still needed

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built in, reasonably fast, fault tolerant | Less fault tolerant in Python versions before 3.3 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, fault tolerant | Needs the C-based lxml package (pip install lxml) |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast, the only XML parser here | Needs the C-based lxml package (pip install lxml) |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the way a browser does and produces valid HTML5 | Slow; does not rely on external C extensions |

If it is a strange page, it is recommended to use HTML5Lib to parse the page, to prevent errors, he will automatically fix the existence of the error

Simple use:

from bs4 import BeautifulSoup

html=""" xxxxxx """

bs=BeautifulSoup(html,"lxml")		# Parse the string into an HTML tree, filling in missing elements

print(bs.prettify())				# Print it in a nicely indented form
1. Four commonly used objects:

BeautifulSoup turns a complex HTML document into a tree of nodes; every node is a Python object, and all of them fall into four types:

  1. Tag

    Tags are HTML tags

  2. NavigableString

  3. BeautifulSoup

  4. Comment
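A small sketch showing the four types (the HTML snippet is made up; "html.parser" works as well if lxml is not installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='a'>hello<!--note--></p>", "lxml")
print(type(soup))                 # <class 'bs4.BeautifulSoup'>
print(type(soup.p))               # <class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))   # <class 'bs4.element.NavigableString'>  ('hello')
print(type(soup.p.contents[1]))   # <class 'bs4.element.Comment'>  ('note')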

2.find&find_all
find:

1) Extracts only the first matching tag; it finds one tag and returns it

find_all:

1) Extracts all matching tags and returns them as a list

2) When extracting tags, the first argument is the tag name. To filter by attributes, either pass the attribute name and its value as a keyword argument, or pass all attributes and values as a dictionary to the attrs parameter

3) Sometimes you do not want that many tags; use limit to cap how many are returned
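For example (a minimal sketch; the HTML is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a id='t1'>one</a><a id='t2'>two</a><a id='t2'>three</a>", "lxml")
print(soup.find('a'))                          # only the first a tag
print(soup.find_all('a', limit=2))             # at most two a tags, as a list
print(soup.find_all('a', attrs={'id': 't2'}))  # filter by attribute via attrs
print(soup.find_all('a', id='t2'))             # or via a keyword argument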

3. string, strings, stripped_strings, get_text
string:

Gets the single non-tag string directly under a tag, returned as an ordinary string

strings:

Gets all descendant strings of a tag and returns a generator; wrap it in list() to get a list

stripped_strings:

Gets all descendant strings of a tag with the surrounding whitespace stripped; also returns a generator

get_text:

Gets all descendant non-tag strings under a tag, returned not as a list but as one ordinary string
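A quick sketch of the difference (the HTML is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> <b>hi</b> there </div>", "lxml")
div = soup.div
print(div.b.string)                # 'hi' - the single string under <b>
print(list(div.strings))           # [' ', 'hi', ' there '] - all descendant strings
print(list(div.stripped_strings))  # ['hi', 'there'] - whitespace stripped
print(div.get_text())              # ' hi there ' - one concatenated string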

from bs4 import BeautifulSoup

html = """
xxxxxx
"""
soup = BeautifulSoup(html, "lxml")


# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print(type(tr))
    # Each element is a Tag, but Tag's repr prints it as a string


# 2. Get 2 tr tags
trs = soup.find_all('tr', limit=2)
# limit returns at most two elements as a list; adding [1] at the end would give the second one


# 3. Get all tr tags with class equal to "even"
trs = soup.find_all('tr', class_='even')   # class is a Python keyword, so bs4 uses class_ instead
for tr in trs:
    print(tr)

trs = soup.find_all('tr', attrs={'class': "even"})   # The attrs dictionary works as well
for tr in trs:
    print(tr)


# 4. Get all a tags whose id is "test" and class is "test"
aList = soup.find_all('a', id='test', class_='test')   # Any number of filters can be stacked this way
for a in aList:
    print(a)

aList = soup.find_all('a', attrs={"id": "test", "class": "test"})   # The same thing via attrs
for a in aList:
    print(a)


# 5. Get the href attribute of all a tags
aList = soup.find_all('a')    # Find all a tags
for a in aList:
    # 1. Subscript access
    href = a['href']          # The simple way
    print(href)
    # 2. Through the attrs attribute
    href = a.attrs['href']    # Get the href attribute under the a tag
    print(href)

# 6. Get all job information (plain text)
trs = soup.find_all('tr')[1:]    # The job information sits inside tr tags; the first row is the header
infos_ = []
for tr in trs:
    info = {}
    # Method one
    tds = tr.find_all("td")       # Find all td tags under the tr tag
    title = tds[0]                # The title element is hidden inside
    print(title.string)           # Extract the string
    title = tds[0].string         # The first td holds the title
    category = tds[1].string      # The second td holds the category
    nums = tds[2].string          # The third td holds the number of openings
    city = tds[3].string          # The fourth td holds the city
    pubtime = tds[4].string       # The fifth td holds the publication date
    info['title'] = title
    info['category'] = category
    info['nums'] = nums
    info['city'] = city
    info['pubtime'] = pubtime
    infos_.append(info)

    # Method two
    # infos = tr.strings
    # This crawls all the plain text (not the tags) and returns a generator object
    # for x in infos:
    #     print(x)
    infos = list(tr.stripped_strings)   # The same thing, with the surrounding whitespace removed
    info['title'] = infos[0]
    info['category'] = infos[1]
    info['nums'] = infos[2]
    info['city'] = infos[3]
    info['pubtime'] = infos[4]
    infos_.append(info)                 # More concise and simple
import requests
from bs4 import BeautifulSoup
import html5lib                     # Only needs to be installed; BeautifulSoup selects it by name
from pyecharts.charts import Bar    # pyecharts v1.x style API

ALL_Data = []


def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36',
        'Referer': 'http://www.weather.com.cn/forecast/index.shtml'
    }
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')
    conMidTab = soup.find('div', class_='contentboxTab1')  # Find the div with class='contentboxTab1'
    # print(conMidTab)
    tables = conMidTab.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):  # enumerate returns the index along with the row
            tds = tr.find_all('td')       # Get the td tags; this returns a list
            city_td = tds[0]              # The city is normally in the first td
            if index == 0:
                city_td = tds[1]          # In the first row the first td is the province (e.g. Heilongjiang),
                                          # so take the second td (the city, e.g. Harbin) instead
            city = list(city_td.stripped_strings)[0]  # stripped_strings returns a generator, so make it a list and take element 0
            temp_td = tds[-2]             # The minimum temperature is in the second-to-last td
            min_temp = list(temp_td.stripped_strings)[0]  # Grab the text inside it
            ALL_Data.append({"city": city, "min_temp": int(min_temp)})
            # print({"city": city, "min_temp": min_temp})


def main():
    urls = ["hb", "db", "hd", "hn", "xb", "xn", "gat"]
    for id in urls:
        url = f'http://www.weather.com.cn/textFC/{id}.shtml'
        # Some table tags are missing from the raw source and are only added by the browser,
        # so html5lib is used to repair the markup
        parse_page(url)
    # Analyse the data: rank by lowest temperature
    # def sort_key(data):
    #     min_temp = data['min_temp']
    #     return min_temp
    # ALL_Data.sort(key=sort_key)
    ALL_Data.sort(key=lambda data: data['min_temp'])  # Same job as sort_key above; the value after the colon is what the key returns
    data = ALL_Data[0:10]
    # for value in data:
    #     city = value['city']
    #     cities.append(city)
    #     # Extract the city names
    cities_ = list(map(lambda x: x['city'], data))
    # Each item of the list data is passed to the lambda, which picks out one field
    temps_ = list(map(lambda x: x['min_temp'], data))
    chart = Bar()
    chart.add_xaxis(cities_)                       # City names on the x axis
    chart.add_yaxis("min temperature", temps_)     # Temperatures as the y-axis series
    chart.render('temperature.html')               # Render the chart to an HTML file


if __name__ == '__main__':
    main()

The CSS select method

Sometimes it is more convenient to pick elements with a CSS selector.

1) Find by tag name

soup.select('a')

2) Find by class name

Prefix the class name with a dot. To find elements with class='sister':

soup.select('.sister')

3) Find by id

Prefix the id with a #. To find the element with id='link':

soup.select('#link')

4) Combined lookups

soup.select("p #link1") finds the element with id "link1" inside p tags

soup.select("head > title") uses > to select direct children only

5) Find by attribute

Attributes can also be used in the lookup; they are written inside square brackets.

soup.select('a[href="www.baidu.com"]')

<!DOCTYPE html>
<html>
<head>
    <title></title>
    <style type="text/css">
        .line1{
            background-color: pink;
        }
        #line2{
            background-color: rebeccapurple;
        }
        .box p{ /* selects all descendant p elements */
            background-color: azure;
        }
        .box > p{ /* selects only direct children, not the grandchild p */
            background-color: aqua;
        }
        input[name='username']
        {
            background-color: coral;
        }

    </style>
</head>
<body>
<div class="box">
    <div>
        <p>the zero data</p>  <!-- this p is a grandchild of .box -->
    </div>
    <p class="line1">the first data</p>
    <p class="line1">the second data</p>
    <p id="line2">the third data</p>
    <!-- the three p tags above are direct children of .box -->
</div>
<p>
        the fourth data
</p>
<form>
    <input type="text" name="username">
    <input type="text" name="password">
</form>
</body>
</html>

5.soup+select

To use CSS selectors with BeautifulSoup, call the select method on the soup object: soup.select(...)
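A small sketch applying select to the HTML above (assuming that document is stored in a string named html):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")            # html is the document shown above
print(soup.select('.line1'))                  # by class: the two p tags with class="line1"
print(soup.select('#line2'))                  # by id
print(soup.select('.box > p'))                # only the direct p children of .box
print(soup.select("input[name='username']"))  # by attribute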

Regular expressions

About regular expressions: Match the desired data from a string according to certain rules.

Match a single character
import re

text='hello'

ret=re.match('he',text)  # match starts at the beginning of the string; with text='ahello' this would not match

print(ret.group())  # group() returns the matched text

>>he

The dot (.) matches any single character:

text="ab"

ret=re.match('. ',text)  #match matches only one character

print(ret.group())

>>a

But the dot cannot match a newline: with text="\n" the match fails.

\d matches any digit:

text="123"

ret=re.match('\d',text)   # can only match one character

print(ret.group())

>>1

\D matches any non-digit

text="a2"

ret=re.match('\D',text)   # can only match one character

print(ret.group())

>>a

\s matches whitespace characters (\n, \t, \r, space)

text=" "

ret=re.match('\s',text)   # can only match one character

print(ret.group())

>> 

There is a match, but it is only a whitespace character, so nothing visible is printed

\w matches a-z and A-Z as well as numbers and underscores

text="_"

ret=re.match('\w',text)   # can only match one character

print(ret.group())

>>_

A character outside that set will not match

text="+"

ret=re.match('\w',text)   # can only match one character

print(ret.group())

>> error (nothing matched, so ret is None and .group() raises)

\W is the opposite of \w

text="+"

ret=re.match('\W',text)   # can only match one character

print(ret.group())

>>+

[] builds a character set: any single character listed inside the brackets matches

text="0888-88888"

ret=re.match('[\d\-]+',text)   # matches the number and -, and a + sign matches all of them until no condition is met

print(ret.group())

>>0888-88888
Character-class equivalents:

\d is [0-9]            \D is [^0-9]
\w is [0-9a-zA-Z_]     \W is [^0-9a-zA-Z_]
(inside [], a leading ^ negates the set)

text="0888-88888"

ret=re.search('[^0-9]',text)   # search scans the whole string; the first non-digit character is '-'

print(ret.group())

>>-
Match multiple characters

* matches zero or more occurrences, so it never causes the match to fail

text="0888-88888"

ret=re.match('\d*',text)   

print(ret.group())

>>0888

+ matches one or more occurrences (at least one), otherwise the match fails

text="abcd"   #text="+abcd"

ret=re.match('\w+',text)

print(ret.group())

>>abcd     # with text="+abcd" the match fails, because the string does not start with a \w character

? matches zero or one occurrence (either none or exactly one)

text="abcd"  #text="+abcd"

ret=re.match('\w?',text)

print(ret.group())

>>a			# with text="+abcd" it matches zero characters, so the result is an empty string

{m} matches exactly m occurrences

text="abcd"  #text="+abcd"

ret=re.match('\w{2}',text)   

print(ret.group())

>>ab   # only matches two

{m,n}: matches between m and n occurrences

text="abcd"  #text="+abcd"

ret=re.match('\w{1,5}',text)    # greedy: matches as many characters as possible, up to 5

print(ret.group())

>>abcd    # with text="+abcd" the match fails (error)
A few small cases

1. Verify your phone number:

text="13070970070" 

ret=re.match('1[34578]\d{9}',text)    # verify, the first one is 1, the second one is one of 34578, and the last nine are whatever

print(ret.group())

2. Verify an email address:

text="xxx@163.com" 

ret=re.match('\w+@[a-z0-9]+\.[a-z]+',text)    # \w+ matches the user name before the @ (one or more word characters); then exactly one @; [a-z0-9]+ matches the domain name; \. matches the literal dot; [a-z]+ matches the suffix such as com

print(ret.group())

3. Verify the url:

text="http://www.baidu.com" 

ret=re.match('(http|https|ftp)://[^\s]+',text)    Select one of HTTP, HTTPS, FTP, and // to match a non-empty string

print(ret.group())
Copy the code

4. Verify an ID-card number:

text="12345678909876543x" 

ret=re.match('\d{17}[\dxX]',text)    # the first seventeen characters are digits; the last one may be a digit or x/X, hence the character class

print(ret.group())
Bits of knowledge

^ (the caret): means the match must start with…

text="hello"

ret=re.match('^h',text)   # re.match already anchors at the start, so ^ is optional here

print(ret.group())

>>h
text="hello"

ret=re.search('o',text)   # Search is a global search

print(ret.group())

>>o
text="hello"

ret=re.match('^o',text)   The first one is not o. If it's ^h, you can find h

print(ret. Group ()) > > an errorCopy the code

Inside square brackets, ^ means the opposite: it negates the character set.
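For instance (a minimal sketch):

import re

print(re.match('[^a-z]', '1abc').group())   # '1' - inside [], ^ means "anything except a-z"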

$ : means the match must end with…

text="xxx@163.com"

ret=re.match('\w+@163\.com$',text)   # requiring the string to end with 163.com verifies the mailbox more strictly

print(ret.group())

| : matches one of several strings or expressions

text="https"

ret=re.match('http|https|ftp',text)   # use parentheses () to group alternatives when combining them with other parts;
                                      # note that this prints 'http', because that alternative is tried first

print(ret.group())

Greedy and non-greedy modes

text="https"

ret=re.match('\d+',text)   This is the greedy mode, matching as many characters as possible
ret1=re.match('\d+? ',text)  # here is non-greed mode, match to one is ok

print(ret.group())
Copy the code
text="< h1 > title < / h1 >"

ret=re.match(. '< + >',text)   

ret1=re.match('<. +? > ',text) H1 = h1 h1 = h1 h1 = h1 h1 = h1 h1 print(ret.group()) Copy the code

Matches numbers between 0 and 100

text="99"

ret=re.match('[1-9]\d? $100 | $',text)# The first cannot be 0, so one to nine, the second is not necessary to add one? And you have to end with that number, and 100 is the most special, so you have to consider it alone, and 100 is the number that ends with 100

print(ret.group())
Copy the code

Raw strings and escape characters

text="the mac book pro is $1999"

ret=re.search('\$\d+',text)	# \$ escapes the dollar sign so it is matched literally, then \d+ matches the digits

print(ret.group())

>>$1999

r'\n' denotes a raw (native) string: the backslash is kept as-is instead of being treated as an escape

To match a literal backslash followed by n:

text="\\n"     # in the Python source this is the two characters \ and n

ret=re.match('\\\\n',text)  # Python first turns '\\\\n' into the regex \\n, and the regex engine then
# treats \\ as one literal backslash followed by n (r'\\n' is the simpler way to write this)

print(ret.group())
>>\n     

group: grouping

text="apple's price $99,orange's price is $10"

ret=re.search('.*(\$\d+).*(\$\d+)',text)

print(ret.group())   # group() is the same as group(0): the whole match
print(ret.group(1))  # the first group: $99
print(ret.group(2))  # the second group: $10
print(ret.groups())  # all the subgroups as a tuple

findall

Finds everything that satisfies the pattern and returns the matches as a list

text="apple's price $99,orange's price is $10"

ret=re.findall('\$\d+',text)  # will find all that satisfy and return a list


sub

text="apple's price $99,orange's price is $10"

ret=re.sub('\$\d+'."0",text)  # Replace all matches with 0

print(ret)   # Returns a new string: apple's price 0,orange's price is 0

HTML tags can be stripped out (replaced with an empty string) like this:

text="<h1>xxx</h1>"
ret=re.sub('<.+?>',"",text)
print(ret)   # xxx

The split function

text"hello world ni hao"
ret=re.split(' ',text)
print(ret)  #['hello','world','ni','hao']
Copy the code

compile:

If you use a pattern often, you can compile it once and reuse it.

text="the number is 20.50"
r=re.compile('\d+\.?\d*')

r=re.compile("""
    \d+     # digits before the decimal point
    \.?     # the decimal point itself
    \d*     # digits after the decimal point
    """,re.VERBOSE)    # re.VERBOSE lets the pattern contain whitespace and comments

ret=re.search(r,text)

print(ret.group())
Regex in practice: crawling the ancient-poetry site (gushiwen)
# Regex example: crawling the ancient-poetry site
import re,requests

def parse_page(url):
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36'
    }
    response=requests.get(url,headers=headers)
    text=response.text
    titles=re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>',text,re.DOTALL)
    # The page source contains \n, and . does not match \n by default, so the match would stop and return nothing;
    # re.DOTALL makes . match every character including \n. The ? after .* keeps it non-greedy,
    # otherwise one greedy match would swallow everything instead of stopping at each title
    dynasties=re.findall(r'<p\sclass="source">.*?<a.*?>(.*?)</a>',text,re.DOTALL)
    authors=re.findall(r'<p\sclass="source">.*?<a.*?>.*?<a.*?>(.*?)</a>',text,re.DOTALL)
    # The author sits in the second a tag, so skip over the first a tag and capture the second one
    content_tags=re.findall(r'<div\sclass="contson".*?>(.*?)</div>',text,re.DOTALL)
    # "contson" is the class gushiwen uses for the poem body; with regular expressions the page is treated
    # as one big string, not as a tree of parent and child elements
    contents=[]
    for content in content_tags:
        # print(content)
        x=re.sub(r'<.*?>',"",content)   # Strip the remaining tags
        contents.append(x.strip())
    poems=[]
    for value in zip(titles,dynasties,authors,contents):
        title, dynasty, author, content=value
        more_poems={
            'title':title,
            'dynasty':dynasty,
            'author':author,
            'content':content
        }
        poems.append(more_poems)
    for poem in poems:
        print(poem)

def main():
    for page in range(10):
        url=f"https://www.gushiwen.org/default_{page}.aspx"
        parse_page(url)

if __name__ == '__main__':
    main()

3) Step three of crawling: storing the data

JSON file handling

Json is a lightweight data exchange format.

It supports objects (dictionaries), arrays (lists), integers and strings. Strings must use double quotes, not single quotes

import json

# Convert Python objects to a JSON string
persons=[
    {
        'username':"zhilioa",'age':18,'country':"china"
    },
    {
        'username':"zhaxiaolie",'age':20,'country':"china"
    }
]
json_str=json.dumps(persons)
print(json_str)
print(type(json_str))   # The JSON result is just a string

with open('person.json','w',encoding='utf-8') as fp:
    fp.write(json_str)
    # Alternatively, json.dump(persons, fp, ensure_ascii=False) writes directly to the file fp;
    # ensure_ascii=False stops non-ASCII characters from being escaped to \uXXXX sequences

class Person(object):
    country='china'

a={
    'person':Person
}
json.dumps(a)  # This raises an error: a class object cannot be converted to JSON

Loading the JSON back gives the original structure (here, a list):

persons=json.load(xxxx)    # xxxx is an open file object; json.loads() is the variant that parses a string
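A minimal sketch of reading back the file written above:

import json

with open('person.json', 'r', encoding='utf-8') as fp:
    persons = json.load(fp)     # load reads from a file object; loads parses a string
print(persons[0]['username'])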

CSV file handling

CSV stands for comma-separated values: each line is one record and the fields in it are separated by commas

1. Read the CSV file

import csv

def read_csv_demo1():
    with open('stock.csv','r') as fp:
        reader=csv.reader(fp)   # reader reads the CSV file and returns an iterator
        next(reader)            # Skip the header row and start from the second line
        for x in reader:
            print(x)            # Each row is printed as a list

def read_csv_demo2():
    with open('stock.csv','r') as fp:
        # This way the header row is not treated as data
        reader=csv.DictReader(fp)
        for x in reader:
            print(x)
            value={"name":x['secShortname'],"volume":x['turnoverVol']}
            print(value)

if __name__=='__main__':
    read_csv_demo2()

2. Write to a CSV file:

import csv

headers=['username','age','height']

def writer_csv_demo1():
    values=[('zhanghan',12,1800),
            ('wangwu',16,170),     # list-of-tuples style
            ('lisi',14,111)]
    with open('class.csv','w',encoding="utf-8",newline='') as fp:
        writer=csv.writer(fp)       # newline='' prevents an extra blank line between rows
        writer.writerow(headers)    # writerow writes a single row (the header)
        writer.writerows(values)    # writerows writes several rows at once


def writer_csv_demo2():
    values=[{'username':'zhanghan','age':12,'height':180},    # list-of-dictionaries style
            {'username':'lisi','age':11,'height':130},
            {'username':'wangwu','age':15,'height':140}]
    with open('class1.csv','w',encoding='utf-8',newline='') as fp:
        writer=csv.DictWriter(fp,headers)
        writer.writeheader()        # The header must be written explicitly, otherwise it is left out
        writer.writerows(values)


if __name__=='__main__':
    writer_csv_demo2()