Learning is like rowing upstream; not to advance is to drop back.

The concept of a crawler

A Web crawler is a robot that recursively traverses a site, following hyperlinks across the Web and collecting data as it goes.

The search engines we use are themselves large crawlers. They pull back every document they encounter and process them into a searchable database; when a user runs a query, they return the information the user needs.

Denying robots access

In 1994, a simple, voluntary technique was proposed to keep crawlers out of places they don’t belong and to give webmasters a mechanism for better controlling robot behavior. This standard is called the Robots Exclusion Standard, but it is usually just called robots.txt, after the file that stores the access-control information.

Web sites and the robots.txt file

If a site has a robots.txt file, a crawler must retrieve and process that file before accessing any other links on the site.

How a crawler fetches robots.txt

  • Access

The crawler uses the HTTP GET method to fetch the robots.txt resource. If the file exists, the server returns it in a text/plain body. If the server responds with a 404 Not Found status code, the crawler assumes the server imposes no access restrictions and may request any file.
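For instance, the exchange might look roughly like this; the host is the example domain used elsewhere in this post, and the spider name is made up:

GET /robots.txt HTTP/1.1
Host: 909500.club
User-Agent: example-spider/1.0

HTTP/1.1 200 OK
Content-Type: text/plain

User-Agent: *
Disallow: /tmp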

  • Response codes

Many sites have no robots.txt resource, but a crawler cannot know that in advance, so it must attempt to fetch robots.txt from every site. Depending on the result, the crawler adopts a different strategy (a sketch of this branching follows the list below):

  • If the server responds with a success status (HTTP status code 2xx), the crawler must parse robots.txt and apply its exclusion rules when retrieving content from the site.
  • If the server responds that the resource does not exist (HTTP status code 404), the crawler assumes the server has no exclusion rules and is not restricted in what it retrieves from the site.
  • If the server responds that access to the resource is restricted (HTTP status code 401 or 403), the crawler assumes that retrieving content from the site is completely restricted.
  • If the request fails temporarily (HTTP status code 503), the crawler delays retrieving content from the site until it can obtain robots.txt.
  • If the server responds with a redirect (HTTP status code 3xx), the crawler follows the redirect until it can obtain robots.txt.
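As a rough illustration of this branching, here is a minimal Python sketch. It assumes the common third-party requests library; the spider name is made up, and the way restrictions are signalled (return values and exceptions) is just one possible design, not part of any standard.

import requests

def fetch_robots_txt(site):
    """Fetch robots.txt for a site and branch on the status code, as described above."""
    resp = requests.get(site + "/robots.txt",
                        headers={"User-Agent": "example-spider/1.0"})
    if 200 <= resp.status_code < 300:     # 2xx: parse the rules and obey them
        return resp.text
    if resp.status_code == 404:           # 404: no exclusion rules, crawl freely
        return None
    if resp.status_code in (401, 403):    # 401/403: the whole site is off limits
        raise PermissionError("crawling this site is restricted")
    if resp.status_code == 503:           # 503: back off and retry robots.txt later
        raise RuntimeError("temporary failure, retry later")
    return None                           # 3xx redirects are followed automatically by requests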

The robots.txt file format

The syntax of the robots.txt file is very simple; it looks a lot like the HTTP request headers we normally write.

User-Agent: slurp
User-Agent: webcrawler
Disallow: /user


Each record in the file describes a set of exclusion rules for a particular set of crawlers. In this way, different exclusion rules can be used for different crawlers.
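For example, a file with two records might look like this; the crawler names and paths are made up for illustration. The first record applies only to slurp, the second to every other crawler, and within a record the rules are checked in order.

User-Agent: slurp
Disallow: /private

User-Agent: *
Allow: /tmp/public
Disallow: /tmp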

  • The User-Agent line

Each crawler record begins with one or more User-Agent lines of the following form:

User-Agent: <spider-name>

or

User-Agent: *

The crawler sends its own name in the User-Agent request header of its HTTP GET requests. When processing a robots.txt file, the record that applies to it is chosen by one of the following rules:

  • The first record whose <spider-name> matches the crawler’s name as a case-insensitive string.
  • Failing that, the first record whose <spider-name> is *.

If the crawler can find neither a User-Agent line matching its name nor a wildcard User-Agent line, then no record applies to it and its access is not restricted.
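A minimal Python sketch of this selection rule, assuming the records have already been parsed elsewhere into (user_agent_names, rules) pairs in file order:

def select_record(records, spider_name):
    """Pick the exclusion record that applies to this crawler, if any."""
    for names, rules in records:
        # Rule 1: a record with a User-Agent name equal to the crawler's name,
        # compared case-insensitively.
        if any(name.lower() == spider_name.lower() for name in names):
            return rules
    for names, rules in records:
        # Rule 2: failing that, a record with the wildcard User-Agent "*".
        if "*" in names:
            return rules
    return None  # no record matches: access is not restricted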

  • Disallow and Allow lines

Disallow and Allow lines come immediately after the User-Agent lines of an exclusion record. They indicate which URL paths are forbidden or allowed for the crawlers the record applies to.

The crawler must check each URL it intends to visit against all the Disallow and Allow rules in its exclusion record, in order, and use the first match it finds. If no rule matches, the URL is allowed.

For an Allow/Disallow line to match a URL, the rule path must be a case-sensitive prefix of the URL path. For example, Disallow: /tmp matches all of the following URLs:

http://909500.club/tmp
http://909500.club/tmp/
http://909500.club/tmp/es6.html

If the rule path is an empty string, it matches everything.
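A minimal Python sketch of this matching rule; the rules are assumed to be ("Allow" or "Disallow", path) pairs in file order:

def is_allowed(rules, url_path):
    """The first matching Allow/Disallow rule wins; no match means the URL is allowed."""
    for kind, rule_path in rules:
        # An empty rule path matches everything; otherwise the rule path must be
        # a case-sensitive prefix of the URL path.
        if rule_path == "" or url_path.startswith(rule_path):
            return kind == "Allow"
    return True

print(is_allowed([("Disallow", "/tmp")], "/tmp/es6.html"))   # False
print(is_allowed([("Disallow", "/tmp")], "/blog/es6.html"))  # True

In practice, Python’s standard library module urllib.robotparser already implements this kind of check, so a crawler rarely needs to hand-roll it.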

HTML crawler control tag

When writing HTML, there is a more direct way to control what a crawler may do with a page: the robots meta tag (a sketch of how a crawler might read it follows the list of directives below):

<meta name="robots" content="Crawler control instruction">

Crawler control directives

  • NOINDEX tells the crawler not to process the page’s content and to ignore the document.
<meta name="robots" content="NOINDEX">
  • NOFOLLOW tells the crawler not to crawl any links leading out of this page.
<meta name="robots" content="NOFOLLOW">
  • FOLLOW tells the crawler that it may crawl any links leading out of the page.
<meta name="robots" content="FOLLOW">
  • NOARCHIVE tells the crawler not to cache a local copy of this page.
<meta name="robots" content="NOARCHIVE">
  • ALL is equivalent to INDEX, FOLLOW.
  • NONE is equivalent to NOINDEX, NOFOLLOW.
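As promised above, here is a minimal Python sketch of how a crawler might pick these directives out of a page, using only the standard library’s html.parser; what the crawler then does with each directive is omitted:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().upper() for d in content.split(",") if d.strip()]

parser = RobotsMetaParser()
parser.feed('<meta name="robots" content="NOINDEX, NOFOLLOW">')
print(parser.directives)  # ['NOINDEX', 'NOFOLLOW']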

Conclusion

I believe some of you have come across robots.txt before. In fact, you don’t have to place this file in the site’s directory; you can also have the web server return the same content directly.

For example, WeChat Mini Program and Official Account development require uploading a verification file; many backend developers find the upload a hassle and simply echo the same content from nginx instead.

The robots.txt content can be returned the same way, as sketched below.
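A minimal nginx sketch of this idea, using the core return directive; the rules here are placeholders rather than anything from the original post:

location = /robots.txt {
    default_type text/plain;
    return 200 "User-Agent: *\nDisallow: /tmp\n";
}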

One last word

  1. Move your generous little fingers and give this a “like”.
  2. Move your generous little fingers and “click here”.
  3. Since you’ve read this far, you might as well “add a follow”.
  4. You might as well “share” it too; good things are meant to be shared.