Preface
Background: one day I was reading technical articles on my phone, and honestly it can be a pain — once there is a lot of code, a small screen makes it dizzying to read. Or I would find an article that seemed really well written, bookmark it intending to read it in a few days, and when I finally opened my bookmarks — ugh, it had been deleted by the author. Or I simply wanted to organize one blogger's articles into categories.
Hence the idea: crawl the articles of a WeChat Official Account and save them to my computer.
Although Python fans like to say "Life is short, I use Python", I still want to back my beloved Java for this crawler.
Part 1: Packet capture
Fiddler is recommended for capturing the phone's traffic (an Android phone in this example). You can download Fiddler from its official site.
Key point: make sure the computer and the phone are connected to the same WiFi on the same LAN; otherwise, don't be surprised when you can't capture anything.
1. Query the current IP address of the PC
Press Win + R to open the [Run] window, type CMD to open a command window, then enter: ipconfig
Note the IPv4 address — you will need it in a moment to configure the phone's WiFi proxy. If you are unsure how, see this article on the Fiddler site: docs.telerik.com/fiddler/Con…
Open the phone's WiFi settings, show the advanced WiFi options, set the proxy to Manual, set the proxy host name to the computer's IPv4 address from the previous step (192.168.0.xxx), and set the port to Fiddler's default, 8888.
2. Install the Fiddler certificate on the mobile phone
WeChat's network requests use HTTPS, which is highly secure, so Fiddler needs its trust certificate installed on the phone before it can capture WeChat's requests (metaphorically, Fiddler acts as a middleman: it sits in the HTTPS handshake and impersonates each side to the other in order to gain trust).
The operation is as follows:
- Open ipv4.fiddler:8888/ in the phone's browser
- Download the FiddlerRoot certificate
- Install the certificate on the phone, which may require setting a screen lock password
- In Fiddler, open [Tools] – [Options] – [HTTPS] and check "Capture HTTPS traffic"
3. HttpCanary
Besides Fiddler, another Android packet capture tool worth recommending: HttpCanary. Install the APK on the phone and it captures packets in real time as well.
HttpCanary — the most powerful Android packet capture and injection tool
Part 2: The crawler
With the capture tool configured, open an Official Account, go to its history of article messages, tap "more messages", and watch the packets come in on the Fiddler side.
Before each capture, it is best to clear the historical capture data first, so the relevant requests are easier to locate.
From the capture we can easily obtain the interface address WeChat uses to fetch an Official Account's articles:
mp.weixin.qq.com/mp/profile_…
Switch to the WebForms tab to see the parameters of the GET request; with those, we can simulate the request ourselves.
In the figure above I have framed several important parameters. They take part in the WeChat server's verification, so take care not to copy them incorrectly, or you will get a session error.
I found through trial and error that for each new Official Account I crawl, only four parameters need to change: __biz, appmsg_token, pass_ticket, and wap_sid2.
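As a sketch, those four values can be kept in a small holder class. The field names below match the MyClass fields referenced later in the core request code, but the values are placeholders — you must copy the real ones from your own Fiddler capture:

```java
// Minimal sketch of a holder for the four per-account parameters.
// The values below are placeholders, not real credentials.
public class MyClass {
    public static String __biz = "placeholder__biz";
    public static String appmsg_token = "placeholder_appmsg_token";
    public static String pass_ticket = "placeholder_pass_ticket";
    public static String wap_sid2 = "placeholder_wap_sid2";
}
```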
How do we crawl all the articles? Anyone who has built a mobile client knows that pull-to-refresh / load-more lists (RecyclerView or ListView) usually need a nextPage-style parameter from the interface. For the WeChat article interface, the offset parameter is the offset into the list and the count parameter is the number of items loaded each time.
Example: if I set offset to 0 and count to 10, the first page loads 10 entries, and the second page should start at offset = 10, with count left at 10. Hopefully the example makes it clear; it really isn't hard.
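The paging arithmetic above can be sketched as follows. The helper names and the three-page loop are illustrative, not part of the WeChat API:

```java
// Sketch of the offset/count paging described above.
public class Paging {
    static final int COUNT = 10; // items loaded per page

    // Build the query-string fragment for one page.
    static String pageQuery(int offset) {
        return "offset=" + offset + "&count=" + COUNT;
    }

    // The next page starts where the previous one ended.
    static int nextOffset(int offset) {
        return offset + COUNT;
    }

    public static void main(String[] args) {
        int offset = 0;
        for (int page = 1; page <= 3; page++) {
            System.out.println("page " + page + ": " + pageQuery(offset));
            offset = nextOffset(offset);
        }
        // prints:
        // page 1: offset=0&count=10
        // page 2: offset=10&count=10
        // page 3: offset=20&count=10
    }
}
```

In the real interface the server also returns a next_offset field, so in practice you take the next offset from the response rather than computing it yourself.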
1. Build the request and call it recursively
For the requests, OkHttp of course — add the relevant JAR or Gradle dependency. Note: the User-Agent should use the value captured by Fiddler, to simulate a request from the mobile client. The core code is as follows:
```java
String url = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=%s&f=json&offset=%d&count=10&is_ok=1&scene=126&uin=777&key=777&pass_ticket=%s&wxtoken=&appmsg_token=%s&f=json";
url = String.format(url, MyClass.__biz, startIndex, MyClass.pass_ticket, MyClass.appmsg_token);
// System.out.println(url);
String cookie = "rewardsn=; wxtokenkey=777; wxuin=777750088; devicetype=android-26; version=2700033c; lang=zh_CN; pass_ticket=%s; wap_sid2=%s";
cookie = String.format(cookie, MyClass.pass_ticket, MyClass.wap_sid2);
Request request = new Request.Builder()
        .url(url)
        .get()
        .addHeader("Host", "mp.weixin.qq.com")
        .addHeader("Connection", "keep-alive")
        .addHeader("User-Agent", "Mozilla/5.0 (Linux; Android 8.0.0; SM-G9500 Build/R16NW; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/66.0.3359.126 MQQBrowser/6.2 TBS/044704 Mobile Safari/537.36 MMWEBID/8994 MicroMessenger/7.0.3.1400(0x2700033C) Process/toolsmp NetType/WIFI Language/zh_CN")
        .addHeader("Accept-Language", "zh-CN,zh-CN;q=0.9,en-US;q=0.8")
        .addHeader("X-Requested-With", "XMLHttpRequest")
        .addHeader("Cookie", cookie)
        .addHeader("Accept", "*/*")
        .build();
Response response = okHttpClient.newCall(request).execute();
if (response.isSuccessful()) {
    String body = response.body().string();
    JSONObject jo = new JSONObject(body);
    if (jo.getInt("ret") == 0) {
        currentTimes++;
        System.out.println("currentTimes: " + currentTimes);
        String general_msg_list = jo.getString("general_msg_list");
        general_msg_list = general_msg_list.replace("\\/", "/");
        JSONObject jo2 = new JSONObject(general_msg_list);
        JSONArray msgList = jo2.getJSONArray("list");
        for (int i = 0; i < msgList.length(); i++) {
            JSONObject j = msgList.getJSONObject(i);
            JSONObject msgInfo = j.getJSONObject("comm_msg_info");
            long datetime = msgInfo.getLong("datetime");
            SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
            String date = sdf.format(new Date(datetime * 1000));
            if (j.has("app_msg_ext_info")) {
                JSONObject app_msg_ext_info = j.getJSONObject("app_msg_ext_info");
                JSONArray multi_app_msg_item_list = app_msg_ext_info.getJSONArray("multi_app_msg_item_list");
                if (multi_app_msg_item_list.length() > 0) {
                    // do nothing: multi-article messages are skipped here
                } else {
                    String content_url = app_msg_ext_info.getString("content_url");
                    String title = app_msg_ext_info.getString("title");
                    int copyright_stat = app_msg_ext_info.getInt("copyright_stat");
                    String record = date + "-@@-" + title + "-@@-" + content_url;
                    System.out.println(record);
                    datas.add(record);
                }
            } else {
                System.out.println("no app_msg_ext_info");
            }
        }
        if (jo.getInt("can_msg_continue") == 1) {
            Thread.sleep(1000);
            startIndex = jo.getInt("next_offset");
            execute(); // recursive call for the next page
        } else {
            System.out.println("Done crawling!");
            // save the result
            saveToFile();
        }
    } else {
        System.out.println("Can't get the articles, parameter error");
    }
}
```
2. Save the article information
Not that much code, ha. Next, save the data to a TXT file. The format I use is: time -@@- title -@@- link (the "-@@-" makes it easy to split the string later). You could of course also store the information in MySQL, but I was lazy and didn't.
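A minimal sketch of saving records in this "time -@@- title -@@- link" format and splitting them back apart (the file name and sample data below are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Save "-@@-"-separated records to a TXT file and parse them back.
public class RecordStore {
    static final String SEP = "-@@-";

    static void save(List<String> records, Path file) throws IOException {
        Files.write(file, records, StandardCharsets.UTF_8);
    }

    // Split one line back into [date, title, url].
    static String[] parse(String line) {
        return line.split(SEP);
    }

    public static void main(String[] args) throws IOException {
        List<String> datas = new ArrayList<>();
        datas.add("20190101" + SEP + "Some article" + SEP + "https://example.com/a");
        Path f = Files.createTempFile("articles", ".txt");
        save(datas, f);
        String[] parts = parse(Files.readAllLines(f).get(0));
        System.out.println(parts[1]); // prints "Some article"
    }
}
```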
3. Java-based crawler Framework — WebMagic (Supplement)
A reader pointed out that WebMagic is a ready-made crawler framework for Java. I'm posting it here for reference; the official site looks quite good ~
Portal: webmagic.io
Part 3: HTML to PDF
Now that you have the URL of each article, saving it as HTML is easy. But how do you convert the HTML to a PDF?
1. The wkhtmltopdf tool
1.1 Download and install wkhtmltopdf
Portal: wkhtmltopdf.org/ — note the system version you pick; mine is Windows.
1.2 Configuring Environment Variables
If you have not configured the system environment variable, you will need to run the command from inside the bin folder of the wkhtmltopdf installation directory.
1.3 How to Use it
For example: You want to convert Google web pages to PDF
```
wkhtmltopdf http://google.com google.pdf
```
2. Fix the missing-image problem in wkhtmltopdf output
When wkhtmltopdf saves a PDF, network images can go missing — they simply don't show. How to solve this? Download each image locally, then replace the img tag's data-src and src attributes with the local path, turning the HTTP link into a local file reference.
The idea: request the article URL to get the HTML, parse the HTML with Jsoup (a great HTML parsing tool), select the img tags, read each img's data-src attribute (the image address), download each image locally in a loop, and after a successful download use Jsoup to rewrite the data-src/src attribute values, replacing them in the original HTML. The core code is as follows:
```java
Request request = new Request.Builder().url(url).get().build();
Response response = okHttpClient.newCall(request).execute();
if (response.isSuccessful()) {
    String html = response.body().string();
    // System.out.println(html);
    Document doc = Jsoup.parse(html);
    Elements img = doc.select("img");
    for (int i = 0; i < img.size(); i++) {
        String imgUrl = img.get(i).attr("data-src");
        if (imgUrl != null && !imgUrl.equals("")) {
            Request request2 = new Request.Builder().url(imgUrl).get().build();
            Response execute = okHttpClient.newCall(request2).execute();
            if (execute.isSuccessful()) {
                String imgPath = imgDir + MD5Utils.MD5Encode(imgUrl, "") + ".png";
                File imgFile = new File(imgPath);
                if (!imgFile.exists()) {
                    InputStream in = execute.body().byteStream();
                    FileOutputStream ot = new FileOutputStream(new File(imgPath));
                    BufferedOutputStream bos = new BufferedOutputStream(ot);
                    byte[] buf = new byte[8 * 1024];
                    int b;
                    while ((b = in.read(buf, 0, buf.length)) != -1) {
                        bos.write(buf, 0, b);
                        bos.flush();
                    }
                    bos.close();
                    ot.close();
                    in.close();
                }
                // reassign to the local path
                img.get(i).attr("data-src", imgPath);
                img.get(i).attr("src", imgPath);
                // export the modified HTML
                html = doc.outerHtml();
            }
            execute.close();
        }
    }
    String htmlPath = dirPath + fileName + ".html";
    final File f = new File(htmlPath);
    if (!f.exists()) {
        Writer writer = new FileWriter(f);
        BufferedWriter bw = new BufferedWriter(writer);
        bw.write(html);
        bw.close();
        writer.close();
    }
    // convert to PDF (the convert method from the next section)
    HtmlToPdf.convert(htmlPath, destPath);
    // delete the intermediate HTML file
    if (f.exists()) {
        f.delete();
    }
    response.close();
}
```
3. Convert to PDF
```java
public static boolean convert(String srcPath, String destPath) {
    StringBuilder cmd = new StringBuilder();
    cmd.append("wkhtmltopdf");
    cmd.append(" ");
    cmd.append("--enable-plugins");
    cmd.append(" ");
    cmd.append("--enable-forms");
    cmd.append(" ");
    cmd.append("--disable-javascript");
    cmd.append(" ");
    cmd.append("\"");
    cmd.append(srcPath);
    cmd.append("\"");
    cmd.append(" ");
    cmd.append(destPath);
    System.out.println(cmd.toString());
    boolean result = true;
    try {
        Process proc = Runtime.getRuntime().exec(cmd.toString());
        HtmlToPdfInterceptor error = new HtmlToPdfInterceptor(proc.getErrorStream());
        HtmlToPdfInterceptor output = new HtmlToPdfInterceptor(proc.getInputStream());
        error.start();
        output.start();
        proc.waitFor();
    } catch (Exception e) {
        result = false;
        e.printStackTrace();
    }
    return result;
}
```
The HtmlToPdfInterceptor used above captures the process's output and error streams:
```java
public class HtmlToPdfInterceptor extends Thread {
    private InputStream is;

    public HtmlToPdfInterceptor(InputStream is) {
        this.is = is;
    }

    @Override
    public void run() {
        try {
            InputStreamReader isr = new InputStreamReader(is, "utf-8");
            BufferedReader br = new BufferedReader(isr);
            String line = null;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
The wkhtmltopdf conversion process is slow, so it is best to run it on multiple threads; I used 5 threads for the conversion. Finally, take a look at the results.
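The multi-threaded conversion can be sketched with a fixed pool of 5 threads. convertOne below is an illustrative stand-in for the actual wkhtmltopdf call:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Convert many HTML files to PDF in parallel with a 5-thread pool.
public class ParallelConvert {
    // Stand-in for the real wkhtmltopdf invocation.
    static String convertOne(String htmlPath) {
        return htmlPath.replace(".html", ".pdf");
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = Arrays.asList("a.html", "b.html", "c.html");
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Future<String>> futures = new ArrayList<>();
        for (String p : pages) {
            futures.add(pool.submit(() -> convertOne(p)));
        }
        for (Future<String> f : futures) {
            System.out.println(f.get()); // block until each conversion finishes
        }
        pool.shutdown();
    }
}
```

Submitting each conversion as a Callable and collecting the Futures lets the slow conversions overlap while still letting you know when every file is done.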
Summary
Thank you for reading. If anything here is wrong, please point it out!