Politeness strategy
Crawlers can retrieve data much faster and in greater depth than human users, so they can cripple the performance of a site. Needless to say, a single crawler performing multiple requests per second and downloading large files puts a strain on a server, and a server has an even harder time keeping up with requests from many crawler threads running in parallel.
As Koster (Koster, 1995) noted, crawlers are useful for many tasks, but their use comes at a cost to the general community. The costs of using crawlers include:
network resources: a crawler consumes considerable bandwidth and operates with a high degree of parallelism over a long period of time;
server overload: especially when the frequency of accesses to a given server is too high;
poorly written crawlers, which can bring down servers or routers, or which download pages they cannot handle;
personal crawlers that, if deployed by too many users, can disrupt networks and web servers.
A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol (Koster, 1996), a standard that lets administrators indicate which parts of their web servers should not be accessed by crawlers. The standard says nothing about the interval between successive requests to the same server, even though that interval is the most effective way to avoid server overload. Some commercial search engines, such as Ask Jeeves, MSN, and Yahoo, recognize an extra "Crawl-delay" parameter in robots.txt that indicates the number of seconds to wait between requests.
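As a concrete illustration, Python's standard urllib.robotparser module can read both the exclusion rules and a Crawl-delay entry before any page is requested. This is only a sketch: the host name, user-agent string, and fallback delay below are placeholders, not part of the protocol itself.

    import time
    import urllib.robotparser

    ROBOTS_URL = "https://example.com/robots.txt"   # placeholder host
    USER_AGENT = "ExampleBot"                       # placeholder crawler name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()                                       # fetch and parse robots.txt

    page = "https://example.com/some/page.html"
    if rp.can_fetch(USER_AGENT, page):
        # crawl_delay() returns the Crawl-delay value that applies to this
        # agent, or None when the file does not specify one.
        delay = rp.crawl_delay(USER_AGENT) or 10    # assumed 10 s fallback
        time.sleep(delay)
        # ... download the page here ...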
The first proposed value for the interval between connections was 60 seconds (Koster, 1993). At that rate, downloading a site of more than 100,000 pages over a perfect connection with zero latency and unlimited bandwidth would take more than two months, and only a fraction of that server's resources would be used. This does not seem acceptable.
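Working the estimate out explicitly under those ideal assumptions:

    100,000 pages × 60 s/page = 6,000,000 s ≈ 69 days ≈ 2.3 months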
Cho (Cho and Garcia-Molina, 2003) uses 10 seconds as the access interval, and the WIRE crawler (Baeza-Yates and Castillo, 2002) uses 15 seconds as the default. The MercatorWeb crawler (Heydon and Najork, 1999) follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits 10t seconds before requesting the next page from that server. Dill et al. (Dill et al., 2002) use 1 second.
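The adaptive policy attributed to the MercatorWeb crawler is easy to express in code. The sketch below is a minimal single-server loop using only Python's standard library; the function name, generator interface, and factor parameter are illustrative, not taken from Mercator itself.

    import time
    import urllib.request

    def polite_fetch(urls, factor=10):
        """Download URLs from one server, waiting `factor` times the last
        download time before issuing the next request (adaptive delay)."""
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url) as response:
                body = response.read()
            elapsed = time.monotonic() - start      # t seconds for this page
            yield url, body
            time.sleep(factor * elapsed)            # wait 10*t before the next request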
For those who use crawlers for research purposes, a more detailed cost-benefit analysis is necessary, and ethical considerations should be taken into account when deciding which sites to crawl and how fast to crawl them.
Access logs show that the access intervals of known crawlers range from 20 seconds to 3-4 minutes. It is worth noting that even when being polite and taking every safeguard to avoid overloading servers, some web server administrators will still complain. Brin and Page note that running a crawler that connects to more than half a million servers generates a fair amount of email and phone calls, because there are so many people coming online for the first time who do not know what a crawler is (Brin and Page, 1998).
Parallelization strategy
A parallel crawler is a crawler that runs multiple processes in parallel. Its goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page. To avoid downloading a page twice, the crawling system needs a policy for assigning the new URLs discovered while the crawler is running, because the same URL can be found by two different crawling processes.
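One simple assignment policy, given here only as a sketch, is to hash each URL's host name so that every host belongs to exactly one process; each process then deduplicates the URLs it owns locally. The function and variable names below are illustrative and not taken from any particular crawler.

    import hashlib
    from urllib.parse import urlsplit

    def assign_process(url, num_processes):
        """Map a URL to one crawler process by hashing its host name, so
        that all URLs from the same host are handled by the same process."""
        host = urlsplit(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_processes

    def schedule(url, my_id, num_processes, frontier, seen):
        """Add a newly discovered URL to this process's frontier, skipping
        URLs owned by other processes and URLs already scheduled."""
        if assign_process(url, num_processes) != my_id:
            return False        # owned by another process; forward it instead
        if url in seen:
            return False        # already scheduled; avoids a duplicate download
        seen.add(url)
        frontier.append(url)
        return True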
A high-level architecture for web crawlers
As mentioned above, a crawler must not only have a good crawling strategy, it also needs a highly optimized architecture.
Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) point out that while it is easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance crawler that can download millions of pages over several weeks presents many challenges in system design, I/O and network efficiency, robustness, and manageability.
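To make that contrast concrete, the core of a crawler can be written as a very small fetch-parse loop; the challenges listed above are what must be engineered around this core at scale. The sketch below is a deliberately naive single-threaded version with assumed names, omitting politeness, robots.txt handling, and persistent storage.

    import collections
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collect href attributes from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=100):
        """Naive crawl loop: a FIFO frontier, a seen-set, and fetch-parse."""
        frontier = collections.deque([seed])
        seen = {seed}
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                    # skip pages that fail to download
            fetched += 1
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)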
Web crawlers are a central part of search engines, and the details of their algorithms and architecture are kept as trade secrets. When a crawler design is published, there are often important details missing that prevent others from reproducing the work. There are also concerns about "search engine spamming", which is a major reason why the main search engines do not publish their ranking algorithms.