What is a robots.txt file?

When a web crawler runs, it should first check whether the site's root directory contains a robots.txt file. If the file exists, the crawler should follow its rules, that is, crawl only within the scope the site has opened up. Of course, if your site does not want to be crawled in any form, search engines cannot index its content, and the site's SEO will suffer accordingly. The robots.txt file guards against gentlemen, not villains: compliance is voluntary, and in most cases the file is simply ignored. A well-behaved crawler should nevertheless check for a robots.txt file before crawling and restrict its scope according to the rules defined there.
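As a sketch of this pre-crawl check, Python's standard-library `urllib.robotparser` can parse a site's robots.txt and answer per-URL queries. Here the rules are fed in as an in-memory string for illustration; a real crawler would call `set_url(...)` and `read()` to fetch the live file (the crawler name and paths below are made up):

```python
from urllib import robotparser

# Sample rules; a real crawler would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Gate every request through robots.txt before fetching it.
print(rp.can_fetch("MyCrawler", "/private/data"))  # False: disallowed
print(rp.can_fetch("MyCrawler", "/public/page"))   # True: not restricted
```

If `read()` finds no robots.txt at all (HTTP 404), `RobotFileParser` treats the site as fully open, matching the behavior described above.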

What are the rules for defining robots.txt files?

The robots.txt file defines its rules mainly through three keywords: User-agent, Allow, and Disallow. User-agent restricts which crawlers (user agents) the rules apply to, while Allow and Disallow grant or deny access to URL paths.

Example 1

# Allow all robots access
User-agent: *
Allow: /

Example 2

# Deny all robots access
User-agent: *
Disallow: /

Example 3

# Deny all robots access to a specific directory
User-agent: *
Disallow: /user/load/data

Example 4

# Allow all robots to access specific directories
User-agent: *
Allow: /user/load
Allow: /user/excel

Example 5

# Deny all robots access to .html files under a directory
User-agent: *
Disallow: /api/*.html

Example 6

# Allow all robots to access only files ending in .jsp
User-agent: *
Allow: /*.jsp$
Disallow: /

In practice, these three keywords can be combined freely according to a site's needs to configure the scope of crawler access.
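A quick way to sanity-check such a combined configuration is again `urllib.robotparser`. The sketch below pairs an Allow with a broader Disallow; the directory names are invented, and note that this stdlib parser applies the first matching rule and does not understand the `*`/`$` wildcard syntax used in Examples 5 and 6:

```python
from urllib import robotparser

# Combined rules: open up /user/load but close the rest of /user/.
rules = """\
User-agent: *
Allow: /user/load
Disallow: /user/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler", "/user/load/data"))  # True: Allow matches first
print(rp.can_fetch("MyCrawler", "/user/profile"))    # False: caught by Disallow
print(rp.can_fetch("MyCrawler", "/index.html"))      # True: no rule applies
```

Because the more specific Allow line precedes the Disallow, the open subdirectory stays reachable while everything else under /user/ is blocked.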

More content can be found on the WeChat public account "Python Concentration Camp", which focuses on the Python technology stack, resources, community discussion, and shared materials.