I found a lightweight crawler framework on GitHub called requests-html.
I. Introduction to the official website
Chinese documentation
English website
GitHub
- Full JavaScript support!
- CSS selectors (jQuery-style, thanks to PyQuery).
- XPath selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
It looks quite powerful, so let's try it. The 315 Gala reported on fake medical advertising involving 360, which got me thinking about medical data.
www.fudanmed.com/institute/n… Before settling on a final target, we will first crawl the data of China's top 100 hospitals into Excel for practice.
Press F12 to open the developer tools and analyze the page elements.
The whole page is wrapped in a table, and each hospital name is wrapped in an a tag.
The simplest, crudest idea is to grab all the a tags, loop over them and read the text in each href, so let's go straight to the code.
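Before hitting the real site, the "grab every a tag and read its href" idea can be tried offline. The sketch below is only an illustration: it uses the standard library's html.parser instead of requests-html so that it runs without network access, and the sample markup, the HrefCollector class and the collect_hrefs helper are all made up for this example. The real page's structure and href values may differ.

```python
from html.parser import HTMLParser

# Made-up sample markup (an assumption); the real page may look different.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td><a href="#Hospital A">Hospital A</a></td></tr>
  <tr><td>2</td><td><a href="#Hospital B">Hospital B</a></td></tr>
  <tr><td><a name="top">no href here</a></td></tr>
</table>
"""


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Anchors without an href attribute are skipped.
        if href is not None:
            self.hrefs.append(href)


def collect_hrefs(html_text):
    """Parse the markup and return all href values in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    return parser.hrefs


print(collect_hrefs(SAMPLE_HTML))  # ['#Hospital A', '#Hospital B']
```

The third anchor has no href, which is why the collector checks for a missing attribute rather than indexing it directly.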
# coding=UTF-8
from requests_html import HTMLSession
import xlwt

# Connect to the website
session = HTMLSession()
r = session.get('http://www.fudanmed.com/institute/news2019-2.aspx')

# Initialize an Excel file
xl = xlwt.Workbook(encoding='utf-8')
sheet = xl.add_sheet('National Hospital Rankings')
sheet.write(0, 0, 'Rank')
sheet.write(0, 1, 'Name of hospital')

# Initialize the ranking
i = 0


# Crawl the data
def findHospitalName():
    trs = r.html.find("a")
    for item in trs:
        # Get the href value of the a tag
        text = item.find('a', first=True).attrs['href']
        filterData(text)


# Clean the data
def filterData(text):
    global i
    # Keep only hrefs that carry a "#" part
    if "#" in text:
        array = text.split("#", 1)
        # Filter out empty values
        if len(array[1]):
            i += 1
            writeData(i, array[1])


# Write the data
def writeData(sort, data):
    print(sort)
    print(data)
    sheet.write(sort, 0, sort)
    sheet.write(sort, 1, data)
    xl.save('/Users/lsr/Documents/GJProject/py/' + "National Hospital Rankings" + ".xls")


# Start
findHospitalName()
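The cleaning step is easier to check when it is separated from the crawling and the Excel writing. Here is a minimal sketch of the same rule (keep whatever follows the first "#", skip empty results, number the survivors from 1). The function name extract_names and the sample href values are hypothetical and only serve the illustration.

```python
def extract_names(hrefs):
    """Return (rank, name) pairs for hrefs that have a non-empty part after '#'."""
    ranked = []
    for href in hrefs:
        if "#" not in href:
            continue
        # Split once, so any later '#' stays inside the name.
        name = href.split("#", 1)[1]
        if name:
            ranked.append((len(ranked) + 1, name))
    return ranked


# Made-up href values (an assumption about what the page might contain).
sample_hrefs = ["index.aspx", "#", "#Hospital A", "news.aspx#Hospital B"]
print(extract_names(sample_hrefs))  # [(1, 'Hospital A'), (2, 'Hospital B')]
```

Returning a list instead of updating a global counter makes it possible to test the rule without touching the network or the workbook.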
The code looks long, but there are really only two core lines:
trs = r.html.find("a")  # get all a tags; returns a list of Element objects
text = item.find('a', first=True).attrs['href']  # get the href attribute of the a tag