Preface

Baidu, did you think we would be satisfied just because Baidu Netdisk lifted its speed limit? Of course not: there is still Baidu Wenku! Perfectly good documents, and it just won't let us download them… Today, follow along and write a Wenku downloader, Weeker, with me. Saying no to Baidu Wenku starts with me.

Our downloader is a GUI program. We will write the crawler core first (get.py), then the command-line entry point (weeker.py), and finally convert that CLI into a GUI with Gooey. (If you would rather keep a plain CLI, see the note about python-fire at the end.)

Preparation

Installation

  1. Install Python 3.8;
  2. `pip install requests python-docx beautifulsoup4 Gooey` (the `docx` module imported below is provided by the python-docx package).

Directory

Initialize the project (the following commands are for Unix-like systems):

```bash
cd /path/to/project
mkdir Weeker
cd Weeker
touch get.py weeker.py
```

The crawler core

The first step is to open get.py and import the libraries we need:

```python
from os import getcwd, system
from re import sub
import requests
import docx
from bs4 import BeautifulSoup
```

The functions of each module are as follows:

| Module | Role |
| --- | --- |
| os | Get the current working directory |
| re | Replace specific characters in the document |
| requests | Make the network requests, no introduction needed |
| docx | Convert the TXT output to DOCX format |
| bs4 | Parse the text out of the HTML |

Since we need to resolve the save path later, we define a PWD constant that stores the current working directory:

```python
PWD = getcwd()
```

Next, declare a get(url, ua, path, output, convert) function to implement the crawler. Its parameters are:

| Parameter | Role |
| --- | --- |
| url | The document address, e.g. a randomly searched one: wenku.baidu.com/view/11ebd2… |
| ua | The User-Agent. In my tests a normal browser UA does not work: you crawl an ad page telling you to log in first. So we use Googlebot or Baiduspider to pass the UA check and make Wenku believe we are a search engine (which is exactly how search engines get to index it). The latter is recommended; after all, Baidu and Wenku are one family. |
| path | The storage directory, not including the file name. |
| output | The file name, including its suffix. |
| convert | The target format. Because the author is lazy, this field only accepts docx. |
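Putting the parameters together, a minimal skeleton of get() might look like the sketch below. The default values are my own assumption; the article only fixes the parameter names, and the body is filled in step by step in the next sections.

```python
def get(url, ua='Baiduspider', path='.', output='out.txt', convert='docx'):
    # Download the Wenku document at `url` and save it as <path>/<output>,
    # optionally converting the result to DOCX. Defaults are illustrative.
    ...
```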

Write the get() function

Get HTML & parse

Move the cursor into the get() function. First we make the request with requests, as usual, and hand the response to bs4 for parsing:

```python
headers = {'User-Agent': ua}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
everyline = []  # a list to store each line of the document
```

Add the title

We give the document a title, namely the title of the page:

```python
everyline.append(soup.title.string)
```

But there is a problem: the appended title reads "xxxxxx_百度文库", which is unsightly. So bring out sub() (imported from re above) to strip the suffix:

```python
# Strip the "_百度文库" ("_Baidu Wenku") suffix from the page title
everyline.append(sub('_百度文库', '', soup.title.string, 1))
```
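To see what the call does, here is an illustrative run (the document title is made up):

```python
>>> from re import sub
>>> sub('_百度文库', '', 'Python入门教程_百度文库', 1)
'Python入门教程'
```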

Get the body

Inspect the page with the developer tools and you will find the body text inside div elements whose class is bd doc-reader. The text we extract from them is full of \n, \x0c, and spaces (\n is the newline character, \x0c a form feed). We split on \n to get our array of lines, then strip the other two characters:

```python
for doc in soup.find_all('div', attrs={"class": "bd doc-reader"}):
    everyline.extend(doc.get_text().split('\n'))
# Remove spaces and form feeds from every line
everyline = [i.replace(' ', '') for i in everyline]
everyline = [i.replace('\x0c', '') for i in everyline]
```

Save the file

The next step is to save the file. We always write a TXT first, then check the convert parameter: if docx was requested, we convert the TXT and change its suffix to .docx.

```python
final_path = path
# If path is relative, join it with PWD to make it absolute,
# since we build the file path by plain string concatenation.
if not path.startswith('/'):
    final_path = PWD + '/' + final_path
try:
    file = open(final_path + '/' + output, 'w', encoding='utf-8')
    for line in everyline:
        file.write(line)
        file.write('\n')
    file.close()
except FileNotFoundError as err:
    print("wenku: error: " + str(err))
    return  # nothing was written, so stop here
if convert == 'docx':
    doku = docx.Document()
    with open(final_path + '/' + output) as f:
        doku.add_paragraph(f.read())      # put the whole text into one paragraph
    doku.save(final_path + '/' + output + '.' + convert)  # save as DOCX
    system('rm ' + final_path + '/' + output)  # delete the TXT written in the try block
```
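With that, get.py is done. A quick smoke test from the REPL could look like this; the URL below is a hypothetical placeholder, not a working link:

```python
from get import get

# Hypothetical document URL; substitute a real wenku.baidu.com/view/... address.
get('https://wenku.baidu.com/view/xxxxxxxx',
    ua='Baiduspider', path='.', output='mydoc.txt', convert='docx')
```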

Create the GUI

Open weeker.py. First come a couple of import statements; Gooey turns a CLI into a GUI using argparse-like syntax.

```python
from gooey import Gooey, GooeyParser
import get
```

Then add the if __name__ == '__main__' guard:

```python
if __name__ == '__main__':
    main()
```

Let's define this main():

```python
@Gooey(encoding='utf-8', program_name="Weeker", language='chinese')
def main():
    parser = GooeyParser(description="Weeker, Cheers!")
    parser.add_argument("url", metavar='Document address', widget="TextField")
    parser.add_argument("ua", metavar='User-Agent', widget="Dropdown",
                        choices=["Googlebot", "Baiduspider"])
    parser.add_argument("path", metavar='Save path', widget="DirChooser")
    parser.add_argument("output", metavar='File name', widget="TextField")
    parser.add_argument("convert", metavar='Convert to', widget="Dropdown",
                        choices=["docx"])
    args = parser.parse_args()
    get.get(args.url, ua=args.ua, path=args.path,
            output=args.output, convert=args.convert)
```

@Gooey is a decorator that turns main() into a Gooey application. Inside main we call parser.add_argument just as we would with argparse, and finally args = parser.parse_args() collects the user's input; we read each value off args and pass it on to get.py. Run it and something amazing happens:

We have successfully converted the CLI into a GUI!
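For comparison, here is roughly the same interface written with plain argparse and no GUI; this is a sketch to show how closely Gooey mirrors argparse, not part of the project:

```python
import argparse
import get

def main():
    parser = argparse.ArgumentParser(description="Weeker")
    parser.add_argument("url")
    parser.add_argument("ua", choices=["Googlebot", "Baiduspider"])
    parser.add_argument("path")
    parser.add_argument("output")
    parser.add_argument("convert", choices=["docx"])
    args = parser.parse_args()
    get.get(args.url, ua=args.ua, path=args.path,
            output=args.output, convert=args.convert)

if __name__ == '__main__':
    main()
```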

Note I: If you prefer the command line, search GitHub for python-fire, which exposes functions and their arguments directly to the CLI to good effect; a sketch follows below.
Note II: Due to problems with my computer, I could not package the finished product, so please build it yourself if you need it.
Note III: The two .py files are attached.
Note IV: I just noticed a wrong import in the source code; if you downloaded the source, please check it against the code in this article first.
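As promised in Note I, a minimal python-fire entry point might look like this sketch (assuming `pip install fire`; the file name weeker_cli.py is hypothetical):

```python
# weeker_cli.py, a hypothetical pure-CLI alternative to weeker.py
import fire
import get

if __name__ == '__main__':
    # Exposes get.get()'s parameters as CLI arguments, e.g.:
    #   python weeker_cli.py 'https://wenku.baidu.com/view/...' --ua Baiduspider \
    #       --path . --output doc.txt --convert docx
    fire.Fire(get.get)
```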