When web crawlers come up, most people immediately think of Python; crawling has almost become synonymous with Python, and Java is often seen as inferior in this area. Many people do not realize that Java can also be used to write web crawlers, and it can do the job very well; the open-source community has many excellent Java crawler frameworks, such as WebMagic. Early in my career I took part in building a public opinion analysis system that had to collect news from a large number of websites, and we wrote the collection programs with WebMagic. However, we did not understand its design principles at the time, so we took many detours while using it. WebMagic's design borrows from Scrapy, so it can be just as powerful as Scrapy. We will discuss the WebMagic framework in more detail later.
In the following years I took part in several more crawler projects, most of them in Python. Beyond the choice of language, crawling comes with its own set of ideas. Writing crawlers over these years has greatly helped my technical growth, because the process throws all kinds of problems at you. Web crawling is in fact quite technical: besides keeping my own collection programs available, I also ran into all sorts of strange anti-crawling measures from the target sites. For example, if an HTML page has hardly any class or ID attributes and you want to extract table data from it elegantly, that is a real test of imagination and skill. I was lucky to be exposed to web crawlers at the very start of my career; it accelerated my understanding of the Internet and broadened my horizons.
In recent years web crawlers have become even more popular. If you want to learn Java web crawling, I have summarized, based on my own experience, four pieces of basic knowledge you need before getting started.
1. Be a "moral" crawler
Why do I put this first? Because I think it is important. What is a "moral" crawler? It is one that follows the rules of the crawled server: it does not affect the server's normal operation and does not damage the crawled service.
One of the most frequently debated questions is whether crawlers are legal; I do not know where you stand on this.
Crawling is a computer technology, and technology itself is neutral, so crawlers as such are not prohibited by law. However, using crawler technology to obtain data can carry the risk of violations or even crimes. Each case has to be analyzed on its own merits: a fruit knife is not prohibited by law, but using it to stab someone is.
So is crawling legal or not? It depends on what you do with it. What is the essence of a web crawler? It is using a machine to visit pages instead of a human. It is certainly not illegal for me to read publicly available news, so it is not illegal to collect publicly available news from the Internet either; the major search engines do exactly this, and most sites are happy to be crawled by their spiders. On the contrary, collecting other people's private data is illegal: looking up someone's private information yourself is already against the law, so using a program to collect it is also against the law. As with the fruit knife above, the knife itself is not illegal, but stabbing people with it is.
To be a "moral" crawler, the Robots protocol is something you must know. As Baidu Baike describes it, the Robots protocol (robots.txt) is the convention by which a website tells crawlers which pages may be crawled and which may not.
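As a minimal illustration (my own sketch, not from the original article; a real project would use a dedicated robots.txt parser such as crawler-commons), the following Java code fetches a site's robots.txt and does a very naive check of whether a path is disallowed for all user agents:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Naive robots.txt check: fetch the file and look for "Disallow" rules
// that apply to all user agents ("User-agent: *"). For illustration only.
public class RobotsCheck {

    public static boolean isAllowed(String site, String path) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(site + "/robots.txt")).build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        boolean appliesToUs = false;
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                appliesToUs = trimmed.substring(11).trim().equals("*");
            } else if (appliesToUs && trimmed.toLowerCase().startsWith("disallow:")) {
                String rule = trimmed.substring(9).trim();
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return false; // path matches a Disallow rule for all user agents
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isAllowed("https://www.baidu.com", "/s"));
    }
}
```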
Beyond the Robots protocol, we also need to exercise restraint in how aggressively we collect. Article 16 in Chapter 2 of the Data Security Management Measures (Draft for Comments) points out:
Where network operators use automated means to access and collect website data, they shall not impede the normal operation of the website. If such behavior seriously affects the operation of the website, for example where the automated access and collection traffic exceeds one third of the website's average daily traffic, the operator shall stop the automated access and collection when the website requests it.
This rule points out that crawlers must not interfere with the normal operation of a website. If your crawler brings a website down so that real visitors cannot use it, that is a very unethical behavior and should be avoided entirely.
In addition to how data is collected, we also need to pay attention to how it is used. Even when personal information is collected with authorization, it must never be sold; this is expressly prohibited by law. See also:
According to Article 5 of the Interpretation of the Supreme People's Court and the Supreme People's Procuratorate on Several Issues Concerning the Application of Law in Handling Criminal Cases of Infringing on Citizens' Personal Information, "serious circumstances" are interpreted as follows:
- (1) Illegally obtaining, selling or providing more than 50 pieces of track information, communication content, credit information or property information;
- (2) Illegally obtaining, selling or providing at least 500 pieces of citizens' personal information, such as accommodation information, communication records, health and physiological information, transaction information and other information that may affect personal or property safety;
- (3) Illegally obtaining, selling or providing more than 5,000 pieces of citizens' personal information other than that covered by the preceding items also constitutes the "serious circumstances" required for the crime of infringing on citizens' personal information.
- In addition, providing others with citizens' personal information without the consent of the person it was collected from, even if the information was collected legally, also falls under "providing citizens' personal information" as defined in Article 253 of the Criminal Law and may constitute a crime.
2. Analyze HTTP requests
Every time we interact with a server, it is through the HTTP protocol. (There are of course other protocols; whether that data can be collected I do not know and have never tried, so here we only talk about HTTP.) Analyzing HTTP requests on a web page is relatively simple. Let us take retrieving a piece of news on Baidu as an example and inspect the request in the browser's developer tools.
Request Headers are the request header parameters carried by this HTTP request. Some websites block crawlers based on the request headers, so it is necessary to understand these parameters. Most of the parameters in Request Headers are common to all sites; User-Agent and Cookie are used most often: User-Agent identifies the requesting browser, and Cookie stores the user's login credentials.
Query String Parameters are the request parameters of this HTTP request. They are especially important for POST requests, because this is where you can view the submitted parameters, which is very useful for simulating actions such as login.
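As a rough sketch of how these pieces map onto code (my own illustration, not from the original article; the URL, cookie value and form fields are placeholders), here is a Java 11 HttpClient request that sets the User-Agent and Cookie headers and submits POST parameters:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // A POST request that reproduces what we saw in the developer tools:
        // the headers go into Request Headers, the body carries the form parameters.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/login"))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Cookie", "SESSIONID=placeholder")          // login credential, placeholder value
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("username=foo&password=bar"))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```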
The above is how to analyze an HTTP request in the web version of a site. If you need to collect data from an app, you have to run the app in an emulator and rely on packet-capture tools, because there are no developer tools inside an app. Commonly used tools include:
- Fiddler
- Wireshark
3. Learn HTML page parsing
The pages we collect are HTML pages, and we need to extract the information we want from them. That means parsing the HTML, that is, parsing DOM nodes, and it is a top priority: if you do not know how to do this, you are like a magician without props, able only to stare. The two most common techniques are CSS selectors and XPath; take the following HTML page as an example.
CSS selectors reference manual: https://www.w3school.com.cn/cssref/css_selectors.asp
XPath tutorial: https://www.w3school.com.cn/xpath/xpath_syntax.asp
Written as a CSS selector, the extraction is: #wgt-ask > h1 > span
Written as XPath, it is: //span[@class="wgt-ask"]
Either way we get the span node, and then take its text value. Besides writing CSS selectors and XPath expressions by hand, we can let the browser generate them for us: in Chrome, right-click the node in the Elements panel and choose Copy > Copy selector or Copy XPath.
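For example, here is a minimal Jsoup sketch (my own illustration; the HTML fragment is a made-up stand-in for the real page) that extracts the span text with the CSS selector above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseDemo {
    public static void main(String[] args) {
        // Stand-in HTML fragment; a real page would be fetched with Jsoup.connect(url).get()
        String html = "<div id=\"wgt-ask\"><h1><span>How do I learn Java web crawling?</span></h1></div>";
        Document doc = Jsoup.parse(html);

        // Select the node with the CSS selector from the article and take its text
        String title = doc.select("#wgt-ask > h1 > span").text();
        System.out.println(title);
    }
}
```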
4. Understand anti-crawler strategies
Because crawlers are so widespread now, many websites have anti-crawler mechanisms to filter out crawler programs and keep the site usable for everyone; after all, if the site becomes unusable there are no interests left to talk about. There are many anti-crawler techniques; let us look at a few common ones.
Anti-crawler mechanism based on Headers
This is a common anti-crawler mechanism. A website checks the User-Agent and Referer parameters in the Request Headers to decide whether the request comes from a crawler. To bypass this mechanism, we just need to look up, in the browser's developer tools, which User-Agent and Referer values the website expects, and then set those parameters in the crawler's request headers.
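For instance, a minimal Jsoup sketch of this idea (my own illustration; the URL and header values are placeholders copied from what a real browser would send):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HeaderBypassDemo {
    public static void main(String[] args) throws Exception {
        // Pretend to be a normal browser visit by sending the same headers a real browser sends
        Document doc = Jsoup.connect("https://example.com/news")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .referrer("https://example.com/")
                .header("Accept-Language", "zh-CN,zh;q=0.9")
                .get();
        System.out.println(doc.title());
    }
}
```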
Anti-crawler mechanism based on user behavior
This is also a common anti-crawler mechanism. The most widely used form is IP access limiting: an IP is only allowed to visit a certain number of times within a given period.
For this mechanism, we can work around it with proxy IPs: obtain a batch of proxy IPs from a proxy provider and set one of them on each request.
Besides the IP limit, some sites also look at the interval between visits; if every request arrives at exactly the same interval, it may also be judged to be a crawler. One way around this limit is to randomize the interval between requests, for example sleeping one minute this time and thirty seconds the next.
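Putting the two ideas together, here is a rough Java 11 sketch (my own illustration; the proxy addresses and URL are placeholders) that routes each request through a proxy IP and sleeps a random interval between requests:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ProxyDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder proxy list; in practice these come from a proxy IP provider
        List<InetSocketAddress> proxies = List.of(
                new InetSocketAddress("10.0.0.1", 8080),
                new InetSocketAddress("10.0.0.2", 8080));

        for (int i = 0; i < proxies.size(); i++) {
            // Route this request through one of the proxies
            HttpClient client = HttpClient.newBuilder()
                    .proxy(ProxySelector.of(proxies.get(i)))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/page/" + i)).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());

            // Sleep a random interval (30 to 90 seconds) so visits do not arrive at a fixed rhythm
            Thread.sleep(ThreadLocalRandom.current().nextLong(30_000, 90_000));
        }
    }
}
```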
Anti-crawler mechanism based on dynamic pages
On many websites, the data we need is generated by Ajax requests or rendered by JavaScript, which makes collection quite painful. There are two ways around this. The first is to render the page with a tool that executes JavaScript, such as HtmlUnit, and then parse the resulting HTML. The second is reverse thinking: find the Ajax link the page uses to request its data and access that link directly.
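As an illustration of the second approach (my own sketch; the endpoint URL is a placeholder for whatever link shows up in the browser's Network panel), we skip the HTML page entirely and request the JSON data directly:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AjaxDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder for the Ajax endpoint the page's JavaScript calls to fetch its data
        String ajaxUrl = "https://example.com/api/news/list?page=1";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(ajaxUrl))
                .header("X-Requested-With", "XMLHttpRequest") // some endpoints expect this Ajax header
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The body is JSON rather than HTML; parse it with a JSON library such as Jackson or Gson
        System.out.println(response.body());
    }
}
```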
The above covers some basic knowledge about crawlers, mainly the tools of web crawling and common anti-crawler strategies; it should be helpful for the crawler learning that follows. I have written several crawler projects on and off over the past few years, using Java early on and Python later. Recently I became interested in Java crawlers again, so I am planning a series of blog posts to go through Java web crawling once more, as a summary of my Java crawler experience; if it also helps those who want to write crawlers in Java, all the better. The Java web crawler series is expected to run to six in-depth articles, going from simple to complex and covering the problems I have run into over the years. Here is an outline of the six articles.
1. Web crawling turns out to be this simple
This article is the introduction to web crawling. It fetches pages in two ways, with Jsoup and with HttpClient, and then parses out the data with selectors. You will end up seeing that a crawler is, at its core, just an HTTP request; it is that simple.
2. What should I do when collection requires login?
This article briefly discusses how to obtain data that requires login. Taking Douban personal information as an example, it covers two approaches: setting the Cookie manually and simulating the login.
3. What should I do when collection meets Ajax asynchronously loaded data?
This article briefly discusses the problem of asynchronously loaded data. Taking NetEase news as an example, it covers two approaches: using the HtmlUnit tool to get the rendered page, and reverse thinking to find the Ajax request link and fetch the data directly.
4. What should I do when the collection IP gets blocked?
Having an IP blocked is a common occurrence. Taking Douban movies as an example, this article centers on setting up proxy IPs, briefly discusses solutions to the IP limit, and also touches on how to set up your own IP proxy service.
5. What should I do when collection performance is poor?
Sometimes there are performance requirements that a single-threaded crawler cannot meet; we may need a multi-threaded or even a distributed crawler. This article mainly discusses multi-threaded crawlers and distributed crawler architectures.
6. A case study of the open-source crawler framework WebMagic
I used WebMagic for a crawler before, but at the time I did not really understand the framework. After several years of experience I now see it in a new light, so I plan to build a simple demo following WebMagic's conventions and experience its power.
That is my planned crawler series: a written record, bit by bit, of the pits this non-professional crawler player has fallen into over the years. If you are getting ready to learn Java web crawling, feel free to follow along; I believe you will gain something from it, because I will write each article carefully.
Finally
A small plug: you are welcome to scan the QR code and follow my WeChat official account, "The technical blog of the flathead brother", so we can make progress together.