Background

A while ago I was stumped by a website's JS anti-crawling scheme, and to this day I still haven't figured out how it works or how to break it. In the end I turned to an automation scripting tool, AutoIt 3, and used a brute-force approach: scripting the actions a person would perform by hand in the browser, so the web data can still be obtained. Once the pages are saved to files, parsing them in code finishes the job.

This article walks through this automated workflow and presents a complete AutoIt 3 crawler script, which I hope gives readers some ideas.

Automated operation analysis

Take the National Information Security Vulnerability Sharing Platform (CNVD) as an example: it answers the first two requests with 521 responses, and only the third request, carrying the Cookie dynamically generated by the browser, actually returns the data.

This time we start directly from the rendered web page: use the keyboard to locate the “Next page” button, then press Enter to issue the request. Locating the “Next page” button with the keyboard works like this:

  1. First, press the End key to jump to the bottom of the page.
  2. Then press Shift+Tab (reverse Tab) 15 times to move focus onto the “Next page” button (see the sketch after this list).
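In AutoIt's Send() syntax, those keystrokes plus the final Enter look like this; a minimal sketch, assuming the Chrome window already has focus:

```autoit
Send("{END}")        ; jump to the bottom of the page
Send("+{TAB 15}")    ; Shift+Tab pressed 15 times to reach the "Next page" button
Send("{ENTER}")      ; trigger the request for the next page
```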

Now the manual actions above can be translated into an automated script:

  1. Switch to the English input method so that what is typed into the browser’s address bar comes out correctly.
  2. Open Chrome.
  3. Type the target URL into the browser address bar.
  4. Press Enter and wait about 2 seconds for the page data to load.
  5. Press Ctrl+S, fill in the file name and storage path, and wait for the “Save” operation to finish (see the sketch after this list for one way to wait reliably).
  6. Press the End key to jump to the bottom of the page.
  7. Press Shift+Tab 15 times to move focus onto the “Next page” button.
  8. Press Enter to request the next page of data.
  9. Repeat steps 5-8 N times, where N is the number of pages to crawl.
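As referenced in step 5, a fixed Sleep() is the simplest way to wait for the save to finish, but a more robust option is to poll for the saved file. This is a hardening idea of my own rather than part of the original script, and the function name is just a placeholder:

```autoit
; Wait until the saved HTML file actually exists, up to a timeout.
Func WaitForSavedFile($sPath, $iTimeoutMs = 15000)
    Local $iWaited = 0
    While Not FileExists($sPath) And $iWaited < $iTimeoutMs
        Sleep(500)
        $iWaited += 500
    WEnd
    Return FileExists($sPath)   ; True if the file showed up in time
EndFunc
```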

This process also works for other information-publishing sites with strong anti-crawling measures.

Write automated scripts

Following the process above, create a myspider.au3 file and write the AutoIt automation script:

```autoit
#include <AutoItConstants.au3>

; Switch the input method of the active window so the URL is typed correctly
; (0x50 is WM_INPUTLANGCHANGEREQUEST)
$hWnd = WinGetHandle("[ACTIVE]")
$ret = DllCall("user32.dll", "long", "LoadKeyboardLayout", "str", "08040804", "int", 1 + 0)
DllCall("user32.dll", "ptr", "SendMessage", "hwnd", $hWnd, "int", 0x50, "int", 1, "int", $ret[0])

$url = "https://www.cnvd.org.cn/flaw/list.htm"
spiderData($url)

Func spiderData($url)
    ; Open Chrome and wait for its window
    $chromePath = "C:\Users\admin\AppData\Local\Google\Chrome\Application\chrome.exe"
    Run($chromePath)
    WinWaitActive("[CLASS:Chrome_WidgetWin_1]")
    Sleep(2000)
    WinMove("[CLASS:Chrome_WidgetWin_1]", "", 0, 0, 1200, 740, 2)
    Sleep(500)

    ; Type the target URL into the address bar and load the page
    Send($url)
    Sleep(500)
    Send("{ENTER}")
    Sleep(3000)

    For $i = 1 To 3 Step 1   ; number of list pages to crawl
        ; Ctrl+S to open the "Save As" dialog
        Send("^s")
        Sleep(2000)
        WinWait("[CLASS:#32770]", "", 10)

        ; Build a timestamped file name and save the page
        $timeNow = @YEAR & @MON & @MDAY & @HOUR & @MIN
        $savePath = "F:\A2021Study\ListData\" & $timeNow & "_page" & $i & ".html"
        ControlSetText("[CLASS:#32770]", "", "Edit1", $savePath)
        ControlClick("[CLASS:#32770]", "", "Button2")   ; click "Save"

        ; If an overwrite confirmation dialog pops up, confirm it (button ID may vary)
        WinWait("[CLASS:#32770]", "", 10)
        ControlClick("[CLASS:#32770]", "", "Button1")
        Sleep(3000)

        ; Locate the "Next page" button and trigger it
        Send("{END}")
        Send("+{TAB 15}")
        Send("{ENTER}")
        Sleep(3000)
    Next

    ; Close the browser when done
    Send("^w")
EndFunc
```

During scripting, there are a few things to note:

  • First, the input-method switch is important; without it, the text typed into the address bar easily gets garbled.
  • Second, Windows file paths must use backslashes (\), otherwise the Save As dialog cannot resolve the path.
  • Third, the closing method suggested in the help documentation is WinClose, but repeated testing showed it is unreliable: the browser may be terminated abnormally, so the next launch restores the previous session's URL, or it may not close at all. A workaround is to close it with the Ctrl+W shortcut instead, which shuts it down cleanly (a small sketch of this follows below).

Since the crawler runs as a scheduled task, the browser needs to be closed at the end of the script; otherwise more and more browser windows pile up.
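A minimal sketch of that Ctrl+W shutdown step; the polling loop is my own framing rather than part of the original script:

```autoit
; Keep sending Ctrl+W until no Chrome window is left, instead of calling
; WinClose(), which proved unreliable here.
While WinExists("[CLASS:Chrome_WidgetWin_1]")
    WinActivate("[CLASS:Chrome_WidgetWin_1]")
    Send("^w")
    Sleep(500)
WEnd
```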

Takeaways

A crawl is generally split into list pages and detail pages. Locating each article's detail link by keyboard is fiddly, so detail-page crawling is kept separate from list-page crawling: the list HTML is parsed in Java to extract all detail-page URLs, and a second AutoIt script then fetches each detail page. You can write that script yourself, so it is not covered in detail here.
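For reference, here is a rough sketch of what that second script might look like, assuming the detail-page URLs have been written one per line to a text file such as F:\A2021Study\DetailData\urls.txt; the paths and file names here are just illustrative:

```autoit
$hFile = FileOpen("F:\A2021Study\DetailData\urls.txt", 0)   ; 0 = read mode
If $hFile = -1 Then Exit

Run("C:\Users\admin\AppData\Local\Google\Chrome\Application\chrome.exe")
WinWaitActive("[CLASS:Chrome_WidgetWin_1]")
Sleep(2000)

$i = 0
While 1
    $sUrl = FileReadLine($hFile)
    If @error Then ExitLoop                 ; end of file reached
    $i += 1
    Send("^l")                              ; Ctrl+L focuses the address bar
    Sleep(500)
    Send($sUrl, 1)                          ; raw mode so special characters are not treated as modifiers
    Send("{ENTER}")
    Sleep(3000)                             ; wait for the page to load
    Send("^s")                              ; open the "Save As" dialog
    Sleep(2000)
    WinWait("[CLASS:#32770]", "", 10)
    ControlSetText("[CLASS:#32770]", "", "Edit1", "F:\A2021Study\DetailData\detail_" & $i & ".html")
    ControlClick("[CLASS:#32770]", "", "Button2")   ; click "Save"
    Sleep(3000)
WEnd
FileClose($hFile)
Send("^w")                                  ; close the browser when done
```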

Finally, let’s summarize the entire crawl process:

  1. Run the AutoIt script that crawls the list pages to obtain the list-page HTML.
  2. Parse the list-page HTML, extract all detail-page URLs, and write them to a file.
  3. Run the AutoIt script that crawls the detail pages; it iterates over the URLs from step 2 and saves each detail page's HTML.
  4. Parse the detail-page HTML files to extract the detail data.

The script is launched with exec("cmd /c E:\A2021Study\Autoit3\myspider.au3"); note that the file path uses backslashes.

This method is a bit clumsy, but because it simply replays manual browser operations, it sidesteps the anti-crawler strategy entirely. Interested readers can run the script and try it out.

DirCreate creates a directory and IniRead reads a configuration item, each in a single line of code that replaces dozens of lines of Java; you have to admit that working with files in Java is the most troublesome!
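For example (the paths and key names below are only illustrative):

```autoit
; Create the output folder (parent directories are created as needed)
DirCreate("F:\A2021Study\ListData")

; Read a configuration item from an .ini file, with "3" as the default value
$sPageCount = IniRead("F:\A2021Study\spider.ini", "spider", "pages", "3")
```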