Google to make robots.txt an Internet standard after 25 years - GoFuckYourself.com

Bladewire · 07-01-2019, 05:50 PM

Google demanding more free work & expense from people to bend to their fucking will

Google to make robots.txt an Internet standard after 25 years

The Robots Exclusion Protocol (REP) — better known as robots.txt — allows website owners to exclude web crawlers and other automatic clients from accessing a site. “One of the most basic and critical components of the web,” Google wants to make robots.txt an Internet standard after 25 years.

Despite its prevalence, REP never became an Internet standard, with developers interpreting the “ambiguous de-facto” protocol “somewhat differently over the years.” Additionally, it doesn’t address modern edge cases, with web devs and site owners ultimately still having to worry about implementation today.

On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?
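For anyone writing a crawler, both corner cases mentioned here can be dealt with before the file is ever parsed. A minimal sketch in Python, assuming a 500 KiB cap on how much of the file gets read. The helper name `normalize_robots_txt` and the constant `MAX_BYTES` are made up for illustration and are not taken from Google's parser:

```python
import codecs

# Assumed cap for illustration: only the first 500 KiB are considered.
MAX_BYTES = 500 * 1024


def normalize_robots_txt(raw: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Return robots.txt text with oversized input truncated and a UTF-8 BOM removed."""
    # Ignore everything past the cap so a huge file cannot exhaust memory downstream.
    data = raw[:max_bytes]
    # Some text editors prepend a UTF-8 byte order mark; drop it if present.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Replace undecodable bytes (e.g. a character cut in half by the cap).
    return data.decode("utf-8", errors="replace")
```

Without the BOM handling, a naive parser could read the first line as something other than `User-agent` and silently skip the whole first group of rules.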

To address this, Google — along with the original author of the protocol from 1994, webmasters, and other search engines — has now documented how REP is used on the modern web and submitted it to the IETF.

The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.

The robots.txt standard is currently a draft, with Google requesting comments from developers. The standard will be adjusted as web creators specify “how much information they want to make available to Googlebot, and by extension, eligible to appear in Search.”

This standardization will result in “extra work” for developers that parse robots.txt files, with Google open sourcing the robots.txt parser used in its production systems.

This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.

brassmonkey · 07-01-2019, 06:00 PM

they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything

trevesty · 07-02-2019, 05:47 AM

Been running websites for over 15 years and making money from it. Tens of thousands of sites at least...

And every single one of them has had a robots.txt file. I don't see the issue.

Bladewire · 07-02-2019, 09:23 AM

Quote:

Originally Posted by trevesty

And every single one of them has had a robots.txt file. I don't see the issue.

It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.

brassmonkey · 07-02-2019, 12:11 PM

Quote:

Originally Posted by Bladewire

It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of, and if any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.

a sitemap is more complex

i have had no issues with google, bing, or yandex telling me to change a thing.

Bladewire · 07-07-2019, 01:07 PM

Quote:

Originally Posted by brassmonkey

a sitemap is more complex

i have had no issues with google, bing, or yandex telling me to change a thing.

We agree

rowan · 07-07-2019, 11:37 PM

Funny how Google is going on about making a de-facto protocol a standard, when they explicitly ignore a fairly important (IMHO) de-facto directive: Crawl-delay.

Website: I'm asking you nicely to please limit your fetching to once per 60 seconds.

GoogleBot: No.
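For crawlers that do choose to honor it, Python's standard library can read the directive. A small sketch, assuming Python 3.6+ (where `RobotFileParser.crawl_delay` exists). The robots.txt body, the `ExampleBot` name, and the helper `crawl_delay_for` are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to wait 60 seconds between fetches.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 60
Disallow: /private/
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
```

Here `delay` comes back as 60, and a polite crawler would sleep that long between requests. Whether a given bot actually does so is entirely up to the bot.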

thommy · 07-08-2019, 12:15 AM

Quote:

Originally Posted by brassmonkey

they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything

I think this is just one reason; the other is so that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this has been a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.

magneto664 · 07-08-2019, 02:30 AM

Quote:

Originally Posted by thommy

a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.

Every day thousands of other bots scan your website: ahrefs, majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories and file names or directory paths for scripts working on your site. If you do not want something to appear on the Internet, do not upload it to the internet. Simple.

Quote:

Originally Posted by thommy

I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

If you list in the robots.txt file which files or directories to skip, it is possible that Google will comply. But for other bots it will be a gift.

Klen · 07-08-2019, 02:45 AM

Quote:

Originally Posted by thommy

I think this is just one reason; the other is so that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapons company uploads the newest secret version of a killer machine to their website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of 255 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this has been a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.

The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution, instead of crawlers crawling everything on a website, would be to have it explicitly stated what should be crawled.
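Something close to this can already be written with today's rules: allow a few paths and disallow the rest. A rough sketch, checked with Python's standard `urllib.robotparser`. The paths, the `ExampleBot` name, and the helper `is_allowed` are invented for illustration. The Allow line is placed first because this particular parser applies the first matching rule, while other parsers may prefer the most specific match; that ordering is an assumption chosen so both readings give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list: only /public/ may be crawled, everything else is off limits.
ALLOW_LIST_ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


public_ok = is_allowed(ALLOW_LIST_ROBOTS_TXT, "ExampleBot", "/public/page.html")
secret_ok = is_allowed(ALLOW_LIST_ROBOTS_TXT, "ExampleBot", "/secret/plan.pdf")
```

`public_ok` is True and `secret_ok` is False. This still only covers sites that have a robots.txt at all; a site without one is treated as fully crawlable.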

thommy · 07-08-2019, 02:46 AM

Quote:

Originally Posted by magneto664

Every day thousands of other bots scan your website: ahrefs, majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories and file names or directory paths for scripts working on your site. If you do not want something to appear on the Internet, do not upload it to the internet. Simple.

but these bots are not google. nobody will try to sue them.

I know very well how a robots.txt works, but the point is that millions who have an internet presence don't.

if google crawls something from their site WITHOUT AN EXPLICIT request to do so, one judge or another may see them as a "victim" and they can sue Google for millions.

this is why it would make sense to make robots.txt THE rule for crawling your site, and sites without a robots.txt would not be touched.

Quote:

If you list in the robots.txt file which files or directories to skip, it is possible that Google will comply. But for other bots it will be a gift.

as i said - if there are no clear rules for that, it will open big doors for lawsuits. and it would not be the other bots that have to fight them - it would be the one that has the money to pay.

thommy · 07-08-2019, 02:55 AM

Quote:

Originally Posted by KlenTelaris

The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow it. A better solution, instead of crawlers crawling everything on a website, would be to have it explicitly stated what should be crawled.

that is exactly what i meant.

the laws in the various countries are so different that you cannot even decide who is a professional who HAS to know it and who is not.

when the internet started nobody ever thought about such things as privacy and permission to crawl a page. it was simply assumed that everyone who posts something on the internet wants others to find it. this has changed a lot in the meantime, and the views on right and wrong around the world are so completely different that everything has to be EXPLICITLY allowed and not just assumed.
