I believe we all know the saying: to do a good job, you must first sharpen your tools. Since we crawler engineers are constantly in a tug of war with the anti-crawler engineers of major websites, it is all the more necessary to make good use of every tool around us in order to break through the other side's defenses faster. Today, following the daily crawling workflow, I will introduce ten tools to you. I believe that once you master them, you will be able to improve your work efficiency by an order of magnitude.

What does the first stage of crawling involve? Target site analysis, of course.

1.Chrome

Chrome is one of the crawler engineer's basic tools. We generally use it for the initial crawl analysis: page logic and redirects, simple JS debugging, the sequence of network requests, and so on. Most of our early-stage work is done in it; without Chrome, to use a slightly inappropriate metaphor, we would be going from the smart age back to the horse-and-carriage age.

Similar tools: Firefox, Safari, Opera

2.Charles

Charles is the App-side counterpart of Chrome: we use it for network analysis on the App side. Compared with the web side, App-side network analysis is relatively simple and focuses on analyzing the parameters of each request. Of course, if the other side encrypts parameters on the server, reverse engineering comes into play, and that involves a whole other basket of tools we won't cover here.

Similar tools: Fiddler, Wireshark, and Anyproxy
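
For example, once Charles shows you an App request's URL and parameters, you can often replay it directly from code. Here is a minimal sketch, assuming a hypothetical endpoint, parameters, and headers; replace them with whatever you actually see in Charles:

```python
import requests

# Hypothetical App API endpoint and parameters as captured in Charles.
url = "https://api.example.com/v1/feed"
params = {"page": 1, "page_size": 20}
headers = {"User-Agent": "ExampleApp/5.2 (iPhone; iOS 16.0)"}

resp = requests.get(url, params=params, headers=headers, timeout=10)
print(resp.status_code)
print(resp.text[:200])
```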

Next, analyze the site's anti-crawler measures

3.cURL

Wikipedia describes it this way:

cURL is a file transfer tool that works on the command line using URL syntax. It was first released in 1997. cURL supports both uploading and downloading files, and it also includes libcurl, a library for use in program development.

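In practice, cURL pairs well with Chrome: in the DevTools Network panel you can right-click a request and choose "Copy as cURL", then replay it in the terminal or port it to code. Below is a rough sketch of what such a request might look like once ported to Python requests; the URL and headers are hypothetical placeholders:

```python
import requests

# Roughly what a "Copy as cURL" request looks like once ported to Python -- hypothetical values.
resp = requests.get(
    "https://www.example.com/api/search",
    params={"q": "python", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "Referer": "https://www.example.com/",
    },
    timeout=10,
)
print(resp.status_code)
print(resp.text[:200])
```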

4.Postman

Of course, on most websites you can't just copy the cURL command, change a parameter or two, and get the data. That's when we need Postman for deeper analysis. Why call it a heavy weapon? Because it really is powerful. You can import the copied cURL request directly, then modify any part of it, ticking checkboxes to enable or disable individual parameters.
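
Postman is a GUI, but the underlying idea, trimming a request down to the headers and parameters the server actually checks, can also be scripted. A rough sketch, with a made-up URL and headers, that drops one header at a time and reports whether the request still succeeds:

```python
import requests

# Hypothetical request captured from the target site -- replace with the real values.
URL = "https://www.example.com/api/list?page=1"
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.example.com/",
    "Cookie": "session=abc123",
    "X-Token": "deadbeef",
}

# Try removing one header at a time to see which ones the server actually requires.
for name in list(HEADERS):
    trimmed = {k: v for k, v in HEADERS.items() if k != name}
    resp = requests.get(URL, headers=trimmed, timeout=10)
    status = "still works" if resp.status_code == 200 else f"breaks ({resp.status_code})"
    print(f"without {name}: {status}")
```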

5.Online JavaScript Beautifier

With the above tools, you can handle most websites and count as a qualified junior crawler engineer. To advance from there, you need to take on more complex sites. At this stage you need not only back-end knowledge but also some front-end knowledge, because many anti-crawler measures live in the front end. You need to pull out the other site's JS, understand it, and reverse it, and since minified native JS code is generally hard to read, this tool formats it for you.
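
If you would rather format JS inside your own scripts instead of pasting it into the website, the jsbeautifier package (the Python port of js-beautify) does roughly the same job. A small sketch, using a made-up snippet of minified code:

```python
# pip install jsbeautifier
import jsbeautifier

# A made-up snippet of minified JS, standing in for code pulled from the target site.
minified = "var a=function(t){return t.split('').reverse().join('')};console.log(a('token'));"

options = jsbeautifier.default_options()
options.indent_size = 2
print(jsbeautifier.beautify(minified, options))
```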

6.EditThisCookie

Crawling and anti-crawling are a tug of war, and you never know what tricks the other side will play on you, Cookies for example. That's where this tool assists your analysis: after installing the EditThisCookie extension in Chrome, you can click the small icon in the upper-right corner and then add, delete, edit, or inspect the Cookie values, which greatly helps when simulating Cookie behaviour.
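
Once you have figured out which cookies matter, carrying them in your own requests is straightforward. A minimal sketch with made-up cookie names and values:

```python
import requests

# Hypothetical cookies identified with EditThisCookie -- replace with the real names and values.
cookies = {"sessionid": "abc123", "csrftoken": "xyz789"}

resp = requests.get(
    "https://www.example.com/protected/page",
    cookies=cookies,
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
print(resp.status_code, len(resp.text))
```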

Next, we start designing the crawler architecture

7.Sketch

When we're sure the site can be crawled, we shouldn't rush to write the crawler. Instead, we should start by designing its structure. Based on the business requirements, we can do a simple crawl analysis, which makes development more efficient; as the saying goes, sharpening the axe does not delay the cutting of firewood. For example, is it a search-driven crawl or a full traversal? BFS or DFS? How many concurrent requests? Once we have thought these questions through, we can use Sketch to draw a simple architecture diagram (a tiny BFS code sketch follows below).

Similar tools: Illustrator, Photoshop
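
To make the BFS-versus-DFS question concrete, here is a tiny single-threaded breadth-first crawl skeleton. The start URL is a hypothetical placeholder, and a real crawler would add politeness delays, persistent deduplication, and concurrency on top of this:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from lxml import html

START_URL = "https://www.example.com/"  # hypothetical start page
MAX_PAGES = 50                          # simple crawl budget

seen = {START_URL}
queue = deque([START_URL])              # FIFO queue -> breadth-first traversal
crawled = 0

while queue and crawled < MAX_PAGES:
    url = queue.popleft()
    try:
        page = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    crawled += 1
    print("crawled:", url)
    # Enqueue in-site links we have not seen yet.
    for href in html.fromstring(page).xpath("//a/@href"):
        link = urljoin(url, href)
        if link.startswith(START_URL) and link not in seen:
            seen.add(link)
            queue.append(link)
```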

Now begin the fun journey of crawler development

Finally, it's time for development. After all the steps above, everything is ready. At this point, we just need to write the code and extract the data.

8.XPath Helper

When extracting web data, we usually use XPath syntax to pull out the information we need. Normally, the only way to check an expression is to write it, send a request to the page, and print the result to see whether the extracted data is correct; on the one hand this fires off a lot of unnecessary requests, and on the other it wastes our time. XPath Helper solves this. After installing the plugin in Chrome, we just click its icon, type the XPath expression, and see the matched result visually on the right. Efficiency up +10086.
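
Once XPath Helper has confirmed your expression, dropping it into code is simple with lxml. A minimal sketch; the page URL and the XPath expressions are hypothetical and should be replaced with the ones you validated:

```python
import requests
from lxml import html

# Hypothetical listing page and XPath -- swap in the expressions you validated with XPath Helper.
page = requests.get("https://www.example.com/articles", timeout=10).text
tree = html.fromstring(page)

titles = tree.xpath("//h2[@class='title']/a/text()")
links = tree.xpath("//h2[@class='title']/a/@href")
for title, link in zip(titles, links):
    print(title.strip(), link)
```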

9.JSONView

Sometimes the data we extract is in JSON format, because it is easy to work with, and more and more websites prefer to use JSON for data transmission. With this plugin installed, you can view JSON data in the browser with ease.

10.JSON Editor Online

JSONView handles pages that return JSON directly. But much of the time, what we request back is HTML rendered by the front end, and the JSON we obtain after firing our own request does not display well in the terminal. What then? JSON Editor Online lets you format your data in a second and thoughtfully folds the JSON for you.
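
If you prefer to pretty-print inside your own script rather than pasting into a web page, Python's standard json module covers the basics. A tiny sketch with a hypothetical API URL:

```python
import json
import requests

# Hypothetical JSON API -- replace with the endpoint you are analysing.
data = requests.get("https://www.example.com/api/items?page=1", timeout=10).json()

# Indented, key-sorted output is far easier to read in a terminal than one long line.
print(json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True))
```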

Since you have read this far, I believe you must be a true fan, so here is a bonus tool for you.

0.ScreenFloat

What can it do? As the name implies, it is a screen-floating tool. I only discovered it recently, but it turns out to be particularly useful: when analysing parameters, we often need to switch back and forth between several windows, and sometimes we need to compare the differences between certain parameters. With ScreenFloat you can simply float a screenshot on top, so there is no need to keep switching between windows. It's very convenient. And here's another hidden way to play with it, like the one above.