Yu Yang, from Aofei Temple. For reprints, please contact QbitAI | Official account QbitAI
Anyone who has worked with web crawlers knows robots.txt. This ASCII text file, stored in a website's root directory, tells crawlers which parts of the site may be fetched and which are off limits.
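For readers who have never opened one, a minimal illustrative robots.txt might look like this (the paths and bot name are made up for the example): it lets every crawler fetch everything except a /private/ directory, and bans one particular bot from the whole site.

```
# Illustrative robots.txt; paths and bot names are hypothetical
User-agent: *
Disallow: /private/
Allow: /

User-agent: BadBot
Disallow: /
```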
This year robots.txt turns 25 years old. To celebrate the birthday of this Internet MVP, Google has open-sourced its robots.txt parser, in a push to make the Robots Exclusion Protocol (REP) an official Internet industry standard.
A non-standard standard
The Robots Exclusion Protocol is a standard proposed by Dutch software engineer Martijn Koster in 1994. Its core idea is to control crawler behavior through a simple text file, robots.txt.
REP won over the Internet industry with its simplicity and effectiveness. More than 500 million websites use robots.txt, and it has effectively become the de facto standard for restricting crawlers: Googlebot, for example, checks robots.txt before crawling a page to make sure it does not violate the site's stated rules.
However, after 25 years of serving the Internet industry, REP remains an unofficial standard.
This has caused a lot of trouble.
Spelling mistakes, for example. Plenty of people leave out the colon in a robots.txt rule, and it is not unheard of for someone to write Disallow as Dis Allow.
In addition, REP itself does not cover every corner case: when a server returns a 500 error, for instance, may the crawler fetch everything, or nothing?
For site owners, an ambiguous de facto standard makes it hard to write the rules correctly. That is troubling enough, not to mention that not all crawlers respect robots.txt.
Birthday present from Google
Google, which started out as a search company, saw REP's awkward position. On REP's 25th anniversary, Google announced that it would work with REP author Martijn Koster, webmasters, and other search engines to submit a draft to the Internet Engineering Task Force (IETF) standardizing how REP is used, in an effort to make it an official standard!
To this end, Google has also open-sourced one of the tools it uses to crawl the web, its robots.txt parser, to help developers build their own parsers, in the hope of fostering a more common format and improving the standard.
This open-source C++ library has existed for 20 years and covers many robots.txt corner cases drawn from Google's production history. The open-source package also includes a testing tool to help developers test their rules.
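Google's parser itself is a C++ library, and its exact API is best taken from the repository. Purely as an illustration of what a robots.txt parser does, here is a minimal sketch using Python's standard-library urllib.robotparser (not Google's code); the rules and URLs are made up for the example.

```python
# Minimal sketch of what a robots.txt parser does. This uses Python's
# standard-library urllib.robotparser, NOT Google's open-sourced C++ library;
# the rules and URLs below are hypothetical.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a crawler identifying as "Googlebot" may fetch each URL.
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))      # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x.html"))  # False
```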
Google says it wants to help website owners and developers create more amazing experiences on the web, rather than spend their time worrying about how to restrict crawlers.
The full text of the draft has not yet been released, but it will broadly focus on the following directions:
- Any URI-based transfer protocol can use robots.txt, not just HTTP; FTP and CoAP can as well.
- Developers must parse at least the first 500 KiB of a robots.txt file. Defining a maximum file size ensures that connections are not held open too long, reducing the strain on servers.
- The new maximum caching time, or the value of a cache directive if one is present, is 24 hours. This gives website owners the flexibility to update robots.txt at any time, while crawlers do not overload websites with robots.txt requests (a rough crawler-side sketch of these two limits follows this list).
- When a previously accessible robots.txt file becomes inaccessible because of a server failure, known disallowed pages are not crawled for a reasonably long period of time.
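The draft has not been published in full, so the following is only a hedged sketch of how a crawler might honor the 500 KiB parsing limit and the 24-hour cache described above; the helper function and in-memory cache are made up for illustration, not taken from the draft.

```python
# Hedged sketch of crawler-side fetching of robots.txt under the draft's limits.
# The helper name and the in-memory cache are hypothetical.
import time
import urllib.request

MAX_PARSE_BYTES = 500 * 1024   # parse at most the first 500 KiB
CACHE_TTL_SECONDS = 24 * 3600  # reuse a cached copy for up to 24 hours

_cache = {}  # robots.txt URL -> (fetched_at, body)

def fetch_robots_txt(url):
    """Return robots.txt content, truncated to 500 KiB and cached for 24 hours."""
    cached = _cache.get(url)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    with urllib.request.urlopen(url) as response:
        body = response.read(MAX_PARSE_BYTES).decode("utf-8", errors="replace")
    _cache[url] = (time.time(), body)
    return body
```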
Netizen comments
This open-source release from Google has once again sparked heated discussion.
Some netizens said that since Google is the leader of the search industry and most search engines are willing to follow its lead, its willingness to take the lead in unifying industry standards is a very meaningful thing.
Others are excited and surprised that Google is willing to open source the robots.txt parser. Will Google open source other search-related modules in the future? It is a little exciting just to think about.
Martijn Koster himself echoed the sentiment of some netizens: Google is awesome!
Portal
Open source: github.com/google/robo…