Define the framework
Main functions:
1. Build the initial URL
2. Get the data
3. Save the data
The process is as follows:
from bs4 import BeautifulSoup        # parse web pages
import re                            # regex extraction
import urllib.request, urllib.error  # fetch pages, handle errors
import xlwt                          # spreadsheet handling

def main():
    baseurl = ""                     # the starting URL
    datalist = getData(baseurl)      # get the information before saving
    savepath = ".xls"                # where to save it
    saveData(datalist, savepath)

def getData(baseurl):                # crawl and parse the pages
    return datalist

def saveData(datalist, savepath):    # save the information
    pass

def askURL(url):                     # retrieve one page
    return html

if __name__ == "__main__":           # entry point, equivalent to main() in C
    main()
Extraction of web pages
The page-fetching module built on urllib.request looks like this:
def askURL(url):
    # Disguise the request with a fake header
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""  # Python infers the string type when you assign it
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
head is where we disguise the request sent to the page. Note that it is a dictionary of key-value pairs; usually disguising the User-Agent alone is enough.
request = urllib.request.Request(url,headers = head)
This sends the request. You can read the source of Request for the details, which we won't expand on here.
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
response is the returned response object, and urlopen opens the page so its HTML can be read. At this point print(html) will print the page source, and you can see the tree-structured HTML.
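As a quick usage sketch (the URL below is the Douban Top250 paging URL this post targets; any reachable page works):

html = askURL("https://movie.douban.com/top250?start=0")
print(html[:300])  # peek at the first 300 characters of the page source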
Parse web pages
# Crawl and parse the pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # call the page-fetching function 10 times, 25 movies per page
        url = baseurl + str(i * 25)
        html = askURL(url)  # save the page source we got back
        # parse the page
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            data = []  # holds all the information for one movie
            item = str(item)
            # print(item)
            link = re.findall(findLink, item)[0]
            data.append(link)
            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)
            titles = re.findall(findTitle, item)
            if len(titles) == 2:
                ctitle = titles[0]
                data.append(ctitle)
                otitle = titles[1].replace("/", "")
                data.append(otitle)
            else:
                data.append(titles[0])
                data.append(' ')
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)
            inq = re.findall(findInq, item)
            if len(inq) != 0:
                inq = inq[0].replace("。", "")  # strip the trailing full stop
                data.append(inq)
            else:
                data.append("")
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)
            bd = re.sub('/', "", bd)
            data.append(bd.strip())
            datalist.append(data)  # store one processed movie
    # print(datalist)
    return datalist
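Each entry of datalist holds one movie's fields in order: link, image URL, Chinese title, other title, rating, number of raters, one-line quote, and the details block. A hedged sanity check (the base URL is an assumption matching Douban's paging scheme):

datalist = getData("https://movie.douban.com/top250?start=")
print(len(datalist))  # expect 250 if all ten pages parsed cleanly
print(datalist[0])    # the eight fields of the first movie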
I use BeautifulSoup here; other parsers behave differently, so just pick whichever is convenient for you.
soup = BeautifulSoup(html, "html.parser")
html.parser parses the fetched page. We can then print out pieces of the page, such as:
print(type(soup.head))
print(soup.title)
print(soup.title.string)  # .string prints just the text
.title, .a, and .head are tags in the HTML page. We can also use
print(soup.a.attrs)
to get the tag's attributes (attrs) as a dictionary. Use print(type(...)) to find the exact type of each part.
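Putting these together on a tiny hand-written page (a self-contained sketch, not the Douban page):

from bs4 import BeautifulSoup

doc = ("<html><head><title>Demo</title></head>"
       "<body><a href='https://example.com' id='link1'>hi</a></body></html>")
soup = BeautifulSoup(doc, "html.parser")

print(type(soup.head))    # <class 'bs4.element.Tag'>
print(soup.title)         # <title>Demo</title>
print(soup.title.string)  # Demo
print(soup.a.attrs)       # {'href': 'https://example.com', 'id': 'link1'}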
Searching the document
1.find_all();
(1) String filtering: searches for tags that exactly match the string.
list = soup.find_all("a")
print(list)
(2) Regex search: matches using search().
import re
list = soup.find_all(re.compile("a"))
2. The kwargs parameter
list = soup.find_all(id="head")
3. The text parameter
list = soup.find_all(text="hao123")
That is, searching by the text content itself.
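All three filters side by side on a small demo page (a sketch; 'hao123' is just a sample text node):

from bs4 import BeautifulSoup
import re

doc = ("<html><head id='head'><title>Demo</title></head>"
       "<body><a href='#'>hao123</a><a href='#'>news</a></body></html>")
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all("a"))              # string filter: every <a> tag
print(soup.find_all(re.compile("a")))  # regex filter: tag names containing 'a' (head, a, a)
print(soup.find_all(id="head"))        # kwargs filter: tags whose id is "head"
print(soup.find_all(text="hao123"))    # text filter: matching text nodes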
Regex extraction
I will write a separate blog post on regular expressions and won't expand on them here; for now just remember how to use them. Each webpage needs different things extracted. Taking Douban Top250 as an example, we can see from the page HTML that the information we want lives in each item block. That is:
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_="item"):
    data = []         # holds all the information for one movie
    item = str(item)  # force conversion to string
This puts all the information we need into item. Inside this for loop we then use regexes on item to extract what we want and put it into data. Say we are looking for a movie's link:
link = re.findall(findLink, item)[0]
[0] is required to fetch the first link. Now let’s define findLink as a global variable, i.e.
findLink = re.compile(r'<a href="(.*?)">')  # create the regex rule
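getData above also references several sibling patterns (findImgSrc, findTitle, findRating, findJudge, findInq, findBd). The exact rules depend on the Douban markup at the time you crawl; the set below is my reconstruction of plausible rules, defined as globals next to findLink, and should be checked against the live page:

findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # poster URL; re.S lets . match newlines
findTitle = re.compile(r'<span class="title">(.*)</span>')  # title(s)
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # score
findJudge = re.compile(r'<span>(\d*)人评价</span>')  # number of raters
findInq = re.compile(r'<span class="inq">(.*)</span>')  # one-line quote
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # director/year/genre block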
This regex rule extracts the information from the page HTML. We then use
data.append(link)
to add the extracted information to data, and finally
datalist.append(data)  # store the processed movie information
return datalist
This is the general extraction process; different information requires attention to different details, and saving the data deserves its own explanation. That's it for today; we'll continue next time.
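Since saveData was left as a stub in the framework, here is a minimal hedged sketch of what it could look like with xlwt (the column names are my own choice; the real version comes next time):

def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8")  # create the workbook
    sheet = book.add_sheet("Douban Top250", cell_overwrite_ok=True)
    col = ("link", "img", "ctitle", "otitle", "rating", "judgeNum", "inq", "bd")
    for j in range(len(col)):
        sheet.write(0, j, col[j])            # header row
    for i, data in enumerate(datalist):
        for j in range(len(data)):
            sheet.write(i + 1, j, data[j])   # one movie per row
    book.save(savepath)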