Author: Big Smart Yitai Source: CSDN
Link to this article: blog.csdn.net/weixin_464…
Source of the case: While working through the hands-on exercises of the online course “Python Web Crawler and Information Acquisition” from Beijing Institute of Technology, I found that the code demonstrated in the video no longer runs correctly. After some exploration of my own, I recorded my findings below. The code from the video is:
    import requests
    from bs4 import BeautifulSoup
    import bs4


    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return ""


    def fillUnivList(ulist, html):
        soup = BeautifulSoup(html, "html.parser")
        for tr in soup.find('tbody').children:
            if isinstance(tr, bs4.element.Tag):
                tds = tr('td')
                ulist.append([tds[0].string, tds[1].string, tds[2].string])


    def printUnivList(ulist, num):
        print("{:^10}\t{:^6}\t{:^10}".format("Rank", "School name", "Total score"))
        for i in range(num):
            u = ulist[i]
            print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))


    def main():
        uinfo = []
        url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
        html = getHTMLText(url)
        fillUnivList(uinfo, html)
        printUnivList(uinfo, 20)


    main()
First, the site has changed: the old address now redirects to www.shanghairanking.cn/rankings/bc… In addition, when I copied and ran the code above, I got a variable-name error; uinfo and ulist should be unified into a single variable name.
Even after these problems are fixed, the statement that extracts the key information no longer works, and nothing is captured.
Looking at the source of the new page again, you can see that all the school information is still placed under tbody, one tr per school.
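As a small sketch of why the code filters with isinstance(tr, bs4.element.Tag) while walking tbody: the children of tbody include the whitespace between rows as text nodes, not only the tr tags. The markup, the school names, and the helper tag_children below are made up for illustration and are not taken from the real page.

```python
import bs4

# Hypothetical table; the real page's markup may differ.
SAMPLE_TABLE = """<table><tbody>
<tr><td>1</td><td>Example University A</td></tr>
<tr><td>2</td><td>Example University B</td></tr>
</tbody></table>"""


def tag_children(parent):
    """Return only the Tag children of parent, skipping text nodes."""
    return [child for child in parent.children
            if isinstance(child, bs4.element.Tag)]


soup = bs4.BeautifulSoup(SAMPLE_TABLE, "html.parser")
tbody = soup.find("tbody")
all_children = list(tbody.children)  # tr tags plus the newlines between them
rows = tag_children(tbody)           # tr tags only

print(len(all_children), len(rows))
for tr in rows:
    tds = tr("td")  # calling a tag is shorthand for tr.find_all("td")
    print(tds[0].string, tds[1].string)
```

Without the filter, the loop would hit a text node and calling it like tr('td') would fail.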
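Extract one of these entries and take a look:
tr('td')[1].string cannot retrieve the school name. The name sits inside an a tag within that cell, so it has to be read with tr('td')[1].a.string.
The ranking and the total score cannot be retrieved with .string either, so the solution I came up with is to take the first item of .contents and convert it to a string manually.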
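To illustrate the behaviour, here is a minimal sketch on a made-up row. In BeautifulSoup, .string is None when a tag has more than one child, which is the assumed reason the original extraction fails. The markup, the values, and the helper extract_row are hypothetical; only the column indices 0, 1 and 4 follow this article.

```python
from bs4 import BeautifulSoup

# Hypothetical row, loosely modelled on the structure described above.
SAMPLE_ROW = """
<table><tbody><tr>
<td>
    1
    <span></span>
</td>
<td><a href="#">Example University</a><p>extra</p></td>
<td>Beijing</td>
<td>Comprehensive</td>
<td>
    99.9
    <span></span>
</td>
</tr></tbody></table>
"""


def extract_row(tr):
    """Return [rank, name, score] from one table row."""
    tds = tr("td")
    rank = str(tds[0].contents[0]).strip()   # first child is the text node
    name = tds[1].a.string                   # the name lives in the <a> tag
    score = str(tds[4].contents[0]).strip()
    return [rank, name, score]


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
tr = soup.find("tr")

print(tr("td")[0].string)  # None: this cell has several children
print(tr("td")[1].string)  # None for the same reason
print(extract_row(tr))
```

The strip() call is needed because the text node keeps the surrounding newlines and indentation from the page source.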
To sum up, I changed the key code for obtaining information into:
ulist.append([str(tr('td')[0].contents[0]).strip(), tds[1].a.string, str(tr('td')[4].contents[0]).strip()])
The complete code is as follows
    import requests
    import bs4


    def getHTMLText(url):
        """Fetch the page and return its text, or an empty string on failure."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    def fillUnivList(ulist, html):
        """Append [rank, school name, total score] for every row under tbody."""
        soup = bs4.BeautifulSoup(html, "html.parser")
        for tr in soup.find('tbody').children:
            # Skip the text nodes (whitespace) between rows
            if isinstance(tr, bs4.element.Tag):
                tds = tr('td')
                ulist.append([str(tds[0].contents[0]).strip(),
                              tds[1].a.string,
                              str(tds[4].contents[0]).strip()])
        return ulist


    def printUnivList(ulist, num):
        # Alignment attempt from the video, left commented out because it raised an error:
        # tplt = "{0:^10}\t{1:{3}^6}\t{2:^10}"
        print("{:^10}\t{:^6}\t{:^10}".format("Rank", "School name", "Total score"))
        # print(tplt.format("Rank", "School name", "Total score", chr=(12288)))
        for i in range(num):
            u = ulist[i]
            print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))
            # print(tplt.format(u[0], u[1], u[2], chr=(12288)))


    def main():
        ulist = []
        url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
        html = getHTMLText(url)
        ulist = fillUnivList(ulist, html)
        printUnivList(ulist, 20)


    main()
The running results are as follows:
Finally, regarding the alignment of the output: I tried the solution provided in the video, but it raised an error. I have not solved this yet and will come back to it when I have time.
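One possible cause, assuming the failing attempt was the commented-out call with chr=(12288) in the complete code: that syntax passes a keyword argument named chr, so the {3} field in the template has no positional argument to refer to and str.format raises an IndexError. A minimal sketch that passes chr(12288) (the full-width space) as the fourth positional argument instead is shown below. The names FILL, TPLT and format_row and the sample values are made up for illustration, and whether the columns line up also depends on the terminal font.

```python
# Full-width space (U+3000), used as the fill character for the name column
# so that Chinese school names are padded with characters of the same width.
FILL = chr(12288)

# Field 1 takes its fill character from positional argument 3.
TPLT = "{0:^10}\t{1:{3}^10}\t{2:^10}"


def format_row(rank, name, score):
    """Return one table row with the name column padded by full-width spaces."""
    return TPLT.format(rank, name, score, FILL)


if __name__ == "__main__":
    print(format_row("Rank", "School name", "Total score"))
    # Illustrative values only, not real ranking data.
    print(format_row("1", "清华大学", "100.0"))
```

The only change from the commented-out attempt is chr(12288) as a positional argument rather than chr=(12288) as a keyword.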