
0x0, Introduction

Golden September, Silver October, and in the blink of an eye it's almost November. I haven't written a single article in October, which is embarrassing. I happen to have some material, so let me quickly knock one out ~

A solemn declaration:

This article only records the study of crawler techniques; no crawler script will be provided. The crawled data has been deleted and was never distributed. Please do not use any of this for illegal purposes; if someone does and it causes losses, it has nothing to do with this article.

If you are interested in learning about crawlers, you can refer to my earlier article series, Notes on Learning Python Crawlers.


0x1, The Cause

On the way home from work, I was browsing a recruitment app and tapped the Courses tab. There are plenty of courses, and the quality looks good, genuinely good.

Then I suddenly remembered that last year, by forwarding the promotion to N group chats, I got a year of VIP for free. A lot of people probably grabbed one like I did, and then filed it under the "bookmark it, never watch it, let it gather dust" series ~

I'll get around to studying it some day, right? Except... this one-year VIP card is about to expire!!

No panic, no big deal. If it expires, it expires; worst case I renew. A dozen yuan, the price of a milk tea, I can still afford that ~

Then I glanced at how much the renewal costs:

Good lord, that price... Fine. My mouth stayed stubborn, but my hands honestly started taking long screenshots...

Is even knowledge going to leave me because I'm poor? No, you can't leave!!

After a few long screenshots, I started to feel a little off:

It takes a few minutes per long screenshot, and with this many courses and chapters, when would I ever finish? I'd probably wear out the power button before getting through it all.

Besides, I can't do anything else while holding the phone to take screenshots the whole time. And if my finger slips and hits the wrong button, the flash goes off, which would be embarrassing on a crowded subway.

As a lazy developer, I had to find a way to free my hands. Just do it!


0x2, Tap-tap-tap Doesn't Seem So Great

For the tap-tap-tap approach, start from the long-screenshot flow:

Cycle through the three steps above until every course has been captured. The process looks simple, and for automated tapping on the phone there are four options to choose from:

  • Accessibility services
  • Python script + adb commands
  • Automated testing tools: Appium, Airtest
  • Auto.js

Enable Developer options on the Android phone → Show layout bounds, and you can see that ① and ② are native controls, so locating them to simulate clicks or read text is easy.

The real difficulty is the "long screenshot". As far as I know, none of the tools above supports long screenshots natively, so you have to implement them yourself. The general scheme is:

scroll and screenshot repeatedly → stitch the screenshots into one long screenshot

The processing here is tedious: calculating scroll distances, stitching accuracy (overlapping or missing content), and so on. It's hard work and thankless, so let's switch plans and try packet capture instead ~
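For reference, here is a minimal sketch of the scroll-and-screenshot half of that scheme, driving adb from Python via subprocess. The page count, swipe distance, and coordinates below are made-up assumptions, and the stitching step is omitted because that is exactly the tedious part:

import subprocess
import time


def adb(*args):
    # Run an adb command and wait for it to finish
    subprocess.run(['adb'] + list(args), check=True)


def capture_long_page(page_count=5, swipe_px=1000):
    # Screenshot, pull the file, then swipe up; stitching the PNGs is left out
    for i in range(page_count):
        adb('shell', 'screencap', '-p', '/sdcard/shot_{}.png'.format(i))
        adb('pull', '/sdcard/shot_{}.png'.format(i), 'shot_{}.png'.format(i))
        # Swipe from the lower part of the screen upwards to scroll down one page
        adb('shell', 'input', 'swipe', '500', '1600', '500', str(1600 - swipe_px), '300')
        time.sleep(1)


if __name__ == '__main__':
    capture_long_page()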


0x3, Packet Capture Doesn't Seem So Great Either

First I tried capturing on the PC web side. 23333, the request headers are encrypted. Discouraging ~

Then the Android client. 23333, the same encrypted headers. Discouraged again ~

My blood pressure is soaring! Decrypt it, then? A quick look shows the APK is packed with 360 hardening, so what now, unpack that first?

23333, just kidding. The title says simple, so there must be a simpler way, namely: tap-tap + packet capture.


0x4, Tap-tap + Packet Capture Works

Move the tapping from the phone to the PC web page. The conventional technical options:

  • Selenium or Pyppeteer. The latter is a bit more efficient, relies on the Chromium kernel, and saves the environment-configuration hassle; I use the former here purely because I'm more familiar with it.

The gameplay is simple:

Use the element-finding APIs to locate elements → simulate clicks → simulate input → get the text of a particular tag → save to a local file

Seems a little too easy and mindless? Then let's add a bit more technique:

Pair it with a packet-capture tool → intercept the requests made by the page → filter out the data we need → save it to a local file

We'll use BrowserMob-Proxy for the interception. Now let's start the crawl.


1. Prepare tools

  • Selenium → pip install selenium
  • chromedriver.exe → check your Chrome version, download the matching ChromeDriver from the official site, and put it in the project directory;
  • browsermob-proxy-2.1.4 → download from the GitHub repository, unzip it into the project directory as well;

Any other libraries used can be installed directly with pip ~


2. Initialize the proxy server and browser

import os
import time

from browsermobproxy import Server
from selenium import webdriver


# Initialize the proxy
def init_proxy():
    server = Server(os.path.join(os.getcwd(), r'browsermob-proxy-2.1.4\bin\browsermob-proxy'))
    server.start()
    return server.create_proxy()


# Initialize the browser, passing in the proxy object
def init_browser(proxy):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')  # Headless browser
    chrome_options.add_argument('ignore-certificate-errors')  # Ignore certificate validation
    chrome_options.add_argument('--start-maximized')  # Maximize the window on startup
    # Set the user data directory to avoid having to log in again every time
    chrome_options.add_argument(r'--user-data-dir=D:\ChromeUserData')
    chrome_options.add_argument('--proxy-server={0}'.format(proxy.proxy))  # Route requests through the proxy
    return webdriver.Chrome(options=chrome_options)


if __name__ == '__main__':
    server_proxy = init_proxy()
    browser = init_browser(server_proxy)
    server_proxy.new_har("test", options={
        'captureContent': True, 'captureHeaders': True
    })
    browser.get("https://www.baidu.com")
    time.sleep(2)
    catch_result = server_proxy.har
    for entry in catch_result['log']['entries']:
        print(entry['response']['content'])

    # Remember to shut things down
    server_proxy.close()
    browser.close()

After a while, you can see that the console prints the crawl log:


With the proxy and browser lined up, it's time to start tapping.


3. Simulated login

The browser opens the home page and locates the login TAB:

Check whether this node exists. If it does, you are not logged in, so run the login logic: click the button and the following popup appears:

Switch to account-and-password login, locate the input nodes for the phone number and password, type them in, and click Login.

Sometimes captchas pop up for risk control or other reasons, such as:

An easy way to handle this is to allow some time after you click log in and do the verification yourself manually.

Because Chrome's user data directory was set above, the session persists after logging in once, so you don't need to log in again each time the browser opens. Of course, if the account gets logged in elsewhere or the Cookie expires, you may need to call the login method manually.

Write a simple code example ~

from selenium.webdriver.common.by import By


def login():
    browser.get(base_url)
    time.sleep(2)
    # find_elements returns an empty list when the node is absent
    not_login_elements = browser.find_elements_by_class_name("not-login-item")
    if not not_login_elements:
        print("Already logged in, skipping login...")
    else:
        # Click login
        not_login_elements[0].click()
        time.sleep(1)

        # Switch the TAB to account/password login
        browser.find_elements_by_class_name("account-text")[1].click()

        # Enter the account and password
        input_items = browser.find_elements_by_class_name("input-item")
        input_items[2].find_element(By.TAG_NAME, 'input').send_keys("xxx")
        input_items[3].find_element(By.TAG_NAME, 'input').send_keys("xxx")

        # Click login
        browser.find_element_by_class_name("login-btn").click()

        # A captcha may pop up sometimes, so leave enough time to solve it manually
        time.sleep(20)

        # Close things when finished
        proxy.close()
        browser.close()

4. Get all course ids

Find all the courses in the column at the bottom of the home page:


Press F12 to open the developer tools, switch to the Network tab, clear it, refresh the page, then pick a random course name and search for it. The request is easy to locate:

There is no paging; all the data comes back in a single JSON response, so one capture is enough. Copy the whole JSON, save it locally, parse it, and extract all the course IDs. A simple code example follows:

# Course list with ID
def load_all_course_list():
    with open(lg_course_item_json, 'r+', encoding='utf-8') as f:
        content_dict = json.load(f)
        c_list = []
        for course in content_dict['content']['contentCardList'][22]['courseList']:
            c_list.append(str(course['id']))
        cp_utils.write_list_data(c_list, course_id_list_file)

Part of the processing results are as follows:

Fortunately there are only 101 courses, not too many. Next, get the chapters in each course.

5. Get the section ID

Open a course, clear the Network panel, refresh the page, and search for any chapter title. It's also easy to locate:

What was the course ID for? Iterate over the course ID list above and substitute each ID into the URL template, which gives the page URL for every course:

# URL template; filled in later with each course ID, e.g. course_template_url.format(course_id)
course_template_url = 'https://xxx/courseInfo.htm?courseId={}#/content'

browser.get() loads the link above; this time the returned JSON is saved as-is, because some of its fields may be useful later. Here's a simple code example:

# Load the chapter list for every course
def load_course_list():
    course_id_list = cp_utils.load_list_from_file(course_id_list_file)
    for course_id in course_id_list:
        proxy.new_har("course_list", options={
            'captureContent': True, 'captureHeaders': True
        })
        browser.get(course_template_url.format(course_id))
        time.sleep(random.randint(3, 30))
        result = proxy.har
        for entry in result['log']['entries']:
            if entry['request']['url'].find('getCourseLessons?courseId=') != -1:
                content = entry['response']['content']
                # Filter out the correct response
                if len(str(content)) > 200:
                    text = json.loads(content['text'])
                    course_name = text['content']['courseName']
                    json_save_path = os.path.join(course_json_save_dir, course_name + '.json')
                    with open(json_save_path, "w+", encoding='utf-8') as f:
                        f.write(json.dumps(text, ensure_ascii=False, indent=2))
                        print(json_save_path, "File written...")
    proxy.close()
    browser.close()

You can watch the JSON files being saved one after another ~


6. Get content

In the same way, keyword search:

Chapter URL:

# URL template for a chapter page; filled in with course_id and theme_id
article_template_url = 'https://xxx/courseInfo.htm?courseId={}#/detail/pc?id={}'

Loop through in the same way, parse the textContent field out of the returned data, and save it as HTML. It's fairly simple, so the original code isn't posted here.
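For completeness, a minimal sketch of what that loop might look like, following the same proxy-capture pattern as load_course_list above. The JSON nesting, the output directory variable, and the way the right response is picked out are assumptions for illustration, not confirmed details of the site:

# Sketch only: field nesting and the output directory are assumptions
def load_article_content(course_id, theme_id_list):
    for theme_id in theme_id_list:
        proxy.new_har("article", options={
            'captureContent': True, 'captureHeaders': True
        })
        browser.get(article_template_url.format(course_id, theme_id))
        time.sleep(random.randint(3, 30))
        for entry in proxy.har['log']['entries']:
            content = entry['response']['content']
            if 'text' not in content:
                continue
            try:
                text = json.loads(content['text'])
            except ValueError:
                continue
            # Keep only the response that actually carries the article body
            if 'textContent' in str(text):
                html_save_path = os.path.join(article_save_dir, '{}.html'.format(theme_id))
                cp_utils.write_str_data(text['content']['textContent'], html_save_path)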

The data volume isn't large; half a day is basically enough to crawl it all. But opening the saved HTML, it's all garbled:

Small problem: just specify the encoding. Paste the page content into the spot marked by the comment in a template like this:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title></title>
</head>
<body>
    <!-- Copy the page content here -->
</body>
</html>

Well, the courses have been crawled. You might say: that's it? Too simple? I feel the same, so let's add a billion little details!!


0x5, Adding a Billion Little Details

1. HTML to Markdown

Unstyled HTML is ugly to open and awkward to keep, so let's convert it to Markdown.

Life is short, I use Python. When you need a wheel, don't panic and don't reinvent it; a quick search turned up:

  • html2text

Install it directly with pip, write a small demo, and try the conversion:

import cp_utils
import html2text as ht

if __name__ == '__main__':
    text_marker = ht.HTML2Text()
    content = cp_utils.read_content_from_file('test.html')
    cp_utils.write_str_data(text_marker.handle(content), "after.md")

The conversion result looks fine, no problems so far. Next, traverse the files and batch-convert them ~

import cp_utils
import os
import html2text as ht

lg_save_dir = os.path.join(os.getcwd(), "lg_save")
course_json_save_dir = os.path.join(lg_save_dir, "course_json")
article_save_dir = os.path.join(lg_save_dir, "article")
md_save_dir = os.path.join(lg_save_dir, "md")

if __name__ == '__main__':
    text_marker = ht.HTML2Text()
    cp_utils.is_dir_existed(lg_save_dir)
    cp_utils.is_dir_existed(course_json_save_dir)
    cp_utils.is_dir_existed(article_save_dir)
    cp_utils.is_dir_existed(md_save_dir)
    course_dir_list = cp_utils.fetch_all_file(article_save_dir)
    for course_dir in course_dir_list:
        course_name = course_dir.split("\\")[-1]
        for article_path in cp_utils.fetch_all_file(course_dir):
            article_name = article_path.split("\\")[-1]
            for lesson_path in cp_utils.filter_file_type(article_path, ".html"):
                lesson_name = lesson_path.split("\\")[-1].split(".")[0]
                after_save_dir = os.path.join(md_save_dir,
                                              course_name + os.path.sep + article_name + os.path.sep)
                cp_utils.is_dir_existed(after_save_dir)
                md_file_path = os.path.join(after_save_dir, lesson_name + '.md')
                cp_utils.write_str_data(text_marker.handle(cp_utils.read_content_from_file(lesson_path)),
                                        md_file_path)

After a while, all the files are converted. I also ran them through the HZwZ-Markdown-wx Markdown-to-styled-HTML script I wrote earlier:

The typesetting instantly looks classy, 2333. Of course, just kidding; I wouldn't court death like that. Respect the author's work ~


2. Image processing

The images in the MD files all point to the site's own image hosting. Some readers may have the following requirements:

Requirement 1: Need to view documents offline

Easy: parse the MD file, download the images locally, replace the original links, and add a level-1 title (the filename) while we're at it.

Note: Local links in Markdown use relative paths, not absolute paths!

A simple code example is as follows:

# Regex to match image URLs
pic_pattern = re.compile(r'!\[.*?\]\((.*?)\)', re.S)


def process_md(md_path):
    content = cp_utils.read_content_from_file(md_path)

    # Add a level-1 title (the filename)
    title = os.path.basename(md_path).split('.')[0]
    new_content = "# {}\n{}".format(title, content)

    # Find all image links
    pic_result = pic_pattern.findall(new_content)
    for pic in pic_result:
        # Absolute path of the local image
        pic_file_path = os.path.join(pic_save_dir, pic.split('/')[-1])

        # Relative path of the image
        pic_relative_path = "..{}pic{}{}".format(os.path.sep, os.path.sep, pic.split('/')[-1])

        # Download the image
        cp_utils.download_pic(pic_file_path, pic)

        # Replace the resource URL in the MD file with the local relative path
        new_content = new_content.replace(pic, pic_relative_path)

    # Save the MD file
    cp_utils.write_str_data(new_content, os.path.join(md_new_save_dir, title + '.md'))

After running it, an error is reported at a certain image, as shown below:

Holy shit, why does the image name have a carriage return?? Take a look at the error in the MD file:

Fine, it's an html2text conversion quirk. I can raise an issue with the author later, but for now I have to work around it myself.

  • Palliative: avoid the download error by stripping \n from the URL when downloading, but the MD file still contains the broken link and needs fixing;
  • Cure: locate the abnormal image links in the MD content and remove the \n line breaks.

So how do we locate and remove the stray line breaks? Enter the string-processing artifact: regular expressions. There are several options:

  • ① re.findall() + str.replace()
# re.M enables multi-line matching
error_pic_pattern = re.compile(r'http.*?\n.*?\..*?\)', re.M)

# Find all abnormal image links, then replace each one with its \n-stripped version
error_pics = error_pic_pattern.findall(new_content)
for error_pic in error_pics:
    new_content = new_content.replace(error_pic, error_pic.replace("\n", ""))
print(new_content)
  • ② re.sub() + a replacement function applied to each match

re.sub() accepts a function as the replacement, so method ① can be simplified like this:

# Function that strips the line break from a match
def trim_enter(temp):
    return temp.group().replace("\n", "")

new_content = error_pic_pattern.sub(trim_enter, new_content)
  • ③ re.sub() + backreferences

Backreference: building the replacement result by referring to the groups matched in the original string.

re.sub() groups its matches the same way re.match() does, so you only need to refer to the groups in the replacement expression. There are two ways to refer to a group:

  • \number → e.g. \1 refers to the first group in the match
  • \g<number> → e.g. \g<1>, the same group referenced in explicit form (a tiny example follows)

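As a tiny illustration of the two styles (an ad-hoc example, not from the original site data), both calls below produce the same result:

import re

# \1 and \g<1> both refer to the first captured group
print(re.sub(r'(\d+)-(\d+)', r'\1', '2021-10'))      # 2021
print(re.sub(r'(\d+)-(\d+)', r'\g<1>', '2021-10'))   # 2021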

So you can replace it with the following code:

error_pic_pattern_n = re.compile(r'(http.*?)(\n)(.*?\.\w+\))', re.M)

# Split the match into three groups, then join groups 1 and 3 as the replacement
new_content = error_pic_pattern_n.sub(r"\g<1>\g<3>", new_content)

After the fix, open the MD file locally and confirm that the images display normally.

Requirement 2: want to post the notes somewhere, but worried about hotlink protection kicking in later

Just push the images to a third-party CDN and replace the local image URLs. To help out all the way, here's a simple example of uploading images to Qiniu Cloud ~

import time

from qiniu import Auth, put_data

# Qiniu CDN configuration information
qn_access_key = 'xxx'
qn_secret_key = 'xxx'


# Upload an image to Qiniu Cloud
def upload_qn_cdn(file_path):
    with open(file_path, 'rb') as f:
        data = f.read()
        # Create an authentication object
        q = Auth(qn_access_key, qn_secret_key)
        # Name of the upload bucket
        bucket_name = 'Storage space name'
        key = 'lg/' + str(int(round(time.time() * 1000))) + '_' + f.name
        token = q.upload_token(bucket_name, key, 3600 * 24)
        ret, info = put_data(token, key, data)
        print(ret)
        print(info)
        if info.status_code == 200:
            full_url = 'http://qiniupic.coderpig.cn/' + ret["key"]
            return full_url
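As a rough sketch of how it slots in (just an illustration, not the author's exact code): inside a variant of process_md above, the returned CDN URL replaces the image link instead of the local relative path.

# Hypothetical tweak to process_md: upload first, then swap in the CDN URL
cdn_url = upload_qn_cdn(pic_file_path)
if cdn_url:
    new_content = new_content.replace(pic, cdn_url)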

Requirement 3: want a PDF for easy reading

The requirements are getting more and more outrageous... Search "Python Markdown to PDF" and pick a library.
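The article doesn't name the library it settled on; as one possible combination (my assumption, not the author's choice), the markdown package plus pdfkit can do it, provided the wkhtmltopdf binary is installed:

import markdown
import pdfkit

# Convert the Markdown to HTML first, then render the HTML to PDF
# (pdfkit requires wkhtmltopdf to be installed and on PATH)
with open('after.md', encoding='utf-8') as f:
    html = markdown.markdown(f.read())
pdfkit.from_string(html, 'after.pdf')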


0x6, Summary

This article walked through the process of crawling the text portion of a site's courses. It's fairly simple. You may be asking: why no audio or video crawling?

Sorry, maybe I'm just not good enough; I poked at it for two hours without success, and I didn't really want the audio and video that badly, so I let it go.

For those interested in the encryption rules, there is a Java library: lagou-course-downloader.

One more note: don't assume browsing with Selenium goes undetected. A Selenium-launched browser exposes dozens of traits that websites can sniff out through JavaScript.
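A minimal illustration of the best-known such trait (just an example check, not something this particular site is confirmed to do): a Selenium-driven Chrome reports navigator.webdriver as true, which any page can read with one line of JavaScript.

# In a Selenium-driven Chrome this typically prints True; in a normal browser, False/None
print(browser.execute_script("return navigator.webdriver"))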

Well, that’s all. Thank you