There is hardly a better language for writing a crawler than Python. The Python community provides a wealth of crawler tools that can be used off the shelf, so you can have a working crawler in minutes. This article shows how to download Liao Xuefeng's Python tutorial as a PDF ebook for offline reading.
Before writing the crawler, let's analyze the structure of the website's pages. The left side of each page is the tutorial's table of contents, where each URL corresponds to an article shown on the right. On the right, the top is the article title and the middle is the article body; the body text is what we care about, and it is the data we want to crawl from every page. Below the body is the user comment section, which is of no use to us and can be ignored.
Tools to prepare
Once you understand the basic structure of the site, you can prepare the toolkit the crawler depends on. requests and BeautifulSoup are two powerful crawler libraries: requests handles the web requests, and BeautifulSoup handles the HTML parsing. With these two at hand, we don't need a crawler framework like Scrapy. wkhtmltopdf is an excellent cross-platform tool for converting HTML to PDF, and pdfkit is a Python wrapper around wkhtmltopdf. First install the following dependencies, then install wkhtmltopdf:
pip install requests
pip install beautifulsoup4
pip install pdfkit
Install wkhtmltopdf
On Windows, download the stable build from the wkhtmltopdf official website and install it, then add the program's installation path to the system PATH environment variable. Otherwise pdfkit will not be able to find wkhtmltopdf and will raise the error "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line:
$ sudo apt-get install wkhtmltopdf  # ubuntu
$ sudo yum install wkhtmltopdf      # centos
The crawler implementation
Once everything is ready, we can move on to the code, but before writing any code we should organize our thoughts. The goal of the program is to save the HTML body of every URL locally, and then use pdfkit to convert those files into a single PDF file. Let's split the task in two: first, save the HTML body corresponding to one URL locally; then find all the URLs and perform the same operation on each.
Use Chrome to locate the body of the page: press F12 to open the developer tools and find the corresponding div tag, which has the class x-wiki-content:
import requests
from bs4 import BeautifulSoup

def parse_url_to_html(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    with open("a.html", "wb") as f:
        f.write(html.encode("utf-8"))
The second step is to parse out all the URLs in the directory on the left side of the page. In the same way, inspect the left-hand menu tag.
Because the page has two tags with the class attribute uk-nav uk-nav-side, the actual directory list is the second one. Once all the URLs are retrieved, the URL-to-HTML function written in the first step can be run on each of them.
def get_url_list():
    """Get the list of all URLs in the directory."""
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html5lib")
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls
The final step is to convert the HTML into PDF. This part is easy because pdfkit has encapsulated all the logic; you just need to call pdfkit.from_file:
import pdfkit

def save_pdf(htmls, file_name):
    """Convert all HTML files into a single PDF file."""
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }
    pdfkit.from_file(htmls, file_name, options=options)
Execute save_pdf to generate the ebook PDF file.
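Putting the three steps together, the overall flow can be sketched roughly as follows. This is a minimal sketch, not the author's full script from GitHub: it assumes parse_url_to_html is extended to accept an output file name instead of the hard-coded "a.html", and make_filename is a hypothetical helper for naming the temporary files.

```python
def make_filename(index):
    # Zero-padded names (000.html, 001.html, ...) keep the chapters in order
    # when pdfkit merges the files into one PDF.
    return "{:03d}.html".format(index)

def main():
    # get_url_list() and save_pdf() are the functions defined above;
    # parse_url_to_html() is assumed here to take an output file name.
    urls = get_url_list()
    htmls = []
    for index, url in enumerate(urls):
        file_name = make_filename(index)
        parse_url_to_html(url, file_name)
        htmls.append(file_name)
    save_pdf(htmls, "liaoxuefeng-python-tutorial.pdf")

# To build the ebook, call main() after the three functions above are defined.
```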
Conclusion
The total amount of code adds up to less than 50 lines. But wait, the code above omits some details, such as how to get the article title, and the fact that the img tags in the body use relative paths, which must be changed to absolute paths for the images to display properly in the PDF. The temporary HTML files also need to be deleted after saving. All of these details are handled in the code on GitHub.
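As one sketch of the relative-path fix mentioned above (this is not the exact code from the GitHub repository, just one way to do it with the same BeautifulSoup dependency): rewrite every relative img src in the saved body to an absolute URL before handing the files to wkhtmltopdf.

```python
from bs4 import BeautifulSoup

def fix_img_paths(html, base="http://www.liaoxuefeng.com"):
    # wkhtmltopdf renders the saved HTML from the local disk, so relative
    # image paths such as /files/... no longer resolve; prefix them with
    # the site's domain so the images load in the generated PDF.
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if src.startswith("/"):
            img["src"] = base + src
    return str(soup)
```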
The complete code can be downloaded from GitHub; it has been tested on Windows, and you are welcome to fork it and improve it yourself. Students who cannot access GitHub can use Gitee instead. The PDF ebook of "Liao Xuefeng's Python Tutorial" can be read for free by following the public account "A programmer's micro station" and replying "PDF".
This article was first published on the public account "A programmer's micro station" (ID: VTtalk).