Welcome to the GoFuckYourself.com - Adult Webmaster Forum forums.

Old 07-01-2019, 05:50 PM   #1
Bladewire
StraightBro
 
Join Date: Aug 2003
Location: Monarch Beach, CA USA
Posts: 56,232
Google to make robots.txt an Internet standard after 25 years


Google demanding more free work & expense from people to bend to their fucking will

Google to make robots.txt an Internet standard after 25 years

The Robots Exclusion Protocol (REP) — better known as robots.txt — allows website owners to exclude web crawlers and other automatic clients from accessing a site. “One of the most basic and critical components of the web,” Google wants to make robots.txt an Internet standard after 25 years.
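To make the exclusion mechanism concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the paths and the bot names (`ExampleBot`, `SomeCrawler`) are made up for illustration and are not taken from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and bot names are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A generic crawler may fetch the tour page but not the members area;
# ExampleBot is shut out of the whole site.
print(is_allowed("SomeCrawler", "https://example.com/tour/"))
print(is_allowed("SomeCrawler", "https://example.com/members/video1"))
print(is_allowed("ExampleBot", "https://example.com/tour/"))
```

Note that this only tells a well-behaved crawler what the site owner asked for; nothing in the file enforces it.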

Despite its prevalence, REP never became an Internet standard, with developers interpreting the “ambiguous de-facto” protocol “somewhat differently over the years.” Additionally, it doesn’t address modern edge cases, with web devs and site owners ultimately still having to worry about implementation today.

On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?
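Both corner cases can be handled defensively before any parsing happens. The sketch below is one possible approach, not what Googlebot does: the helper name `load_robots_lines` is hypothetical, and the 500 KiB cap is an assumed value picked for illustration since the article names no limit.

```python
import codecs

# Assumption: a 500 KiB cap chosen for illustration; the article gives no number.
MAX_ROBOTS_BYTES = 500 * 1024


def load_robots_lines(raw: bytes, max_bytes: int = MAX_ROBOTS_BYTES) -> list[str]:
    """Decode robots.txt bytes defensively: cap the size, drop a UTF-8 BOM."""
    raw = raw[:max_bytes]  # ignore everything past the cap (last line may be cut)
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # a text editor may have prepended a BOM
    text = raw.decode("utf-8", errors="replace")
    return text.splitlines()


# A file saved with a BOM still yields a clean first line.
print(load_robots_lines(codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /x\n"))
```

Without the BOM check, the first line would start with an invisible character and a strict parser might not recognise it as a `User-agent` line.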

To address this, Google — along with the original author of the protocol from 1994, webmasters, and other search engines — has now documented how REP is used on the modern web and submitted it to the IETF.

The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.

The robots.txt standard is currently a draft, with Google requesting comments from developers. The standard will be adjusted as web creators specify “how much information they want to make available to Googlebot, and by extension, eligible to appear in Search.”

This standardization will result in “extra work” for developers that parse robots.txt files, with Google open sourcing the robots.txt parser used in its production systems.

This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
Old 07-01-2019, 06:00 PM   #2
brassmonkey
Pay It Forward
 
Join Date: Sep 2005
Location: Yo Mama House
Posts: 75,548
they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything
__________________
EMAIL ==>[email protected] ==> #NOBIDEN2024
TRUMP 2024!!! | END DACA!!!! | HCR2060 <= ILLEGAL ALIENS!!!!...👮
=> TRUMPS PAYDAY!!!!... - Support The Laken Riley Act!!! - Trump Nobel Prize...
Old 07-02-2019, 05:47 AM   #3
trevesty
Confirmed User
 
Join Date: Aug 2006
Location: Midwest
Posts: 3,788
Been running websites for over 15 years and making money from it. Tens of thousands of sites at least...

And every single one of them has had a robots.txt file. I don't see the issue.
Old 07-02-2019, 09:23 AM   #4
Bladewire
StraightBro
 
Join Date: Aug 2003
Location: Monarch Beach, CA USA
Posts: 56,232
Quote:
Originally Posted by trevesty View Post
And every single one of them has had a robots.txt file. I don't see the issue.
It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of. If any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.
__________________


Skype: CallTomNow

Old 07-02-2019, 12:11 PM   #5
brassmonkey
Pay It Forward
 
Join Date: Sep 2005
Location: Yo Mama House
Posts: 75,548
Quote:
Originally Posted by Bladewire View Post
It's not going to be the robots.txt that it's always been.

It will be mandatory and you'll have to add all sorts of parameters that you don't currently have, and likely aren't aware of. If any of them are null, or if you don't have the robots.txt file exactly how Google wants it, you will be dinged and your SE placement will suffer.
a sitemap is more complex, and i have had no issues with google, bing, or yandex telling me to change a thing.
Old 07-07-2019, 01:07 PM   #6
Bladewire
StraightBro
 
Join Date: Aug 2003
Location: Monarch Beach, CA USA
Posts: 56,232
Quote:
Originally Posted by brassmonkey View Post
a sitemap is more complex, and i have had no issues with google, bing, or yandex telling me to change a thing.
We agree
Old 07-07-2019, 11:37 PM   #7
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,373
Funny how Google is going on about making a de-facto protocol a standard, when they explicitly ignore a fairly important (IMHO) de-facto directive: Crawl-delay.

Website: I'm asking you nicely to please limit your fetching to once per 60 seconds.

GoogleBot: No.
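For crawlers that do choose to honour it, reading the directive is simple. A minimal sketch with Python's standard `urllib.robotparser` follows; the robots.txt content, the bot name `PoliteBot`, the helper `delay_for` and the one-second fallback are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking for one fetch per 60 seconds.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 60
Disallow: /cgi-bin/
"""


def delay_for(user_agent: str, default: float = 1.0) -> float:
    """Seconds a polite crawler should wait between fetches."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no directive applies
    return float(delay) if delay is not None else default


# A crawler loop would call time.sleep(delay_for("PoliteBot")) between requests.
print(delay_for("PoliteBot"))
```

Whether the value is respected is entirely up to the crawler, which is the whole complaint.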
Old 07-08-2019, 12:15 AM   #8
thommy
Confirmed User
 
Join Date: Jun 2003
Location: Switzerland / Germany / Thailand
Posts: 5,469
Quote:
Originally Posted by brassmonkey View Post
they want to reduce page removal right?? this is something they have to pay for currently. i think they are trimming the fat to focus on tech items. i use robots on everything
I think this is just one reason. the other is that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapons company uploads the newest secret version of a killer machine to its website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of roughly 200 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this has been a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.
__________________
Open for handpicked publishers and advertisers:
www.trafficfabrik.com
Old 07-08-2019, 02:30 AM   #9
magneto664
God Bless You
 
Join Date: Aug 2014
Location: Glasgow, $cotland
Posts: 1,467
Quote:
Originally Posted by thommy View Post
a weapons company uploads the newest secret version of a killer machine to its website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.
Every day thousands of other bots scan your website: Ahrefs, Majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories and file names, or directory paths for scripts working on your site. If you do not want something to appear on the Internet, you do not upload it to the Internet. Simple.

Quote:
Originally Posted by thommy View Post
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.
If you list in the robots.txt file which files or directories to skip, it is possible that Google will comply. But for the other bots that list will be a gift.
__________________
magneto664 📧 gmail.com
Adult Backlinks 💘Best Website Stats 💘 Best CDN for Adult Content
My Fav: 👍Chaturbate 👍 Stripchat 👍 Dateprofits 👍 AdultFriendFinder
Old 07-08-2019, 02:45 AM   #10
Klen
 
Join Date: Aug 2006
Location: Little Vienna
Posts: 32,234
Quote:
Originally Posted by thommy View Post
I think this is just one reason. the other is that they don't get fined for what they show.

actually Google shows many documents and websites that do not have a robots.txt

now let's imagine a funny example:

a weapons company uploads the newest secret version of a killer machine to its website - Google crawls it and publishes it without any explicit request to do so - they would also be in trouble.

there is no single INTERNET law, and google works worldwide under the laws of roughly 200 different countries.
I think that robots.txt would be the simplest way to allow or deny crawling and publishing stuff from a site.

we can see everywhere on the internet that rules and laws are going to an excessive point. users have to agree to cookies (even though this has been a common technique for the past 25 years).

in addition, an internet presence is not necessarily a privilege of companies. consumer protection can also apply here to the site operator.
The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow. A better solution would be, instead of crawlers crawling everything on a website, to have explicitly stated what should be crawled.
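An explicit "only crawl this" file can already be approximated with today's syntax. The sketch below is illustrative only: the paths and the bot name `AnyBot` are invented, and it relies on Python's standard `urllib.robotparser`, which applies the first rule that matches, so the `Allow` lines are placed before the catch-all `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist-style file: name what may be crawled, block the rest.
# Allow lines come first because Python's parser uses the first matching rule.
ALLOWLIST_ROBOTS = """\
User-agent: *
Allow: /tour/
Allow: /blog/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALLOWLIST_ROBOTS.splitlines())


def crawlable(path: str) -> bool:
    """True only for paths the site owner explicitly opened up."""
    return parser.can_fetch("AnyBot", path)


print(crawlable("/tour/index.html"))  # explicitly allowed section
print(crawlable("/members/"))         # everything else falls under Disallow: /
```

Other parsers may resolve conflicting rules differently, so the ordering trick should be treated as specific to this library.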
__________________
For GFY administration inquiries- email info at gfy.com or send PM.
For advertising inquiries - email marketing at gfy.com

Inquiries which are not related to administration or advertising on GFY wont be processed.
Old 07-08-2019, 02:46 AM   #11
thommy
Confirmed User
 
Join Date: Jun 2003
Location: Switzerland / Germany / Thailand
Posts: 5,469
Quote:
Originally Posted by magneto664 View Post
Every day thousands of other bots scan your website: Ahrefs, Majestic, exploit-hunting bots, advert bots, other shit bots. Most of them come loaded with default directories and file names, or directory paths for scripts working on your site. If you do not want something to appear on the Internet, you do not upload it to the Internet. Simple.
but these bots are not google. nobody will try to sue them.

I really do know how a robots.txt works, but the point is that millions who have an internet presence don't.

if google crawls something from their site WITHOUT AN EXPLICIT request to do so, one judge or another can see them as a "victim" and they can sue Google for millions.

this is why it would make sense to make robots.txt THE rule for crawling your site, and sites without a robots.txt would not be touched.
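As a rough sketch of what that opt-in rule could look like on the crawler side: the function name `opt_in_can_fetch` is hypothetical, `robots_body` is assumed to be `None` when the site has no robots.txt (for example an HTTP 404), and this is the reverse of the usual convention where a missing file means everything may be crawled.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def opt_in_can_fetch(robots_body: Optional[str], user_agent: str, url: str) -> bool:
    """Hypothetical opt-in policy: no robots.txt means no crawling at all."""
    if robots_body is None:  # file missing, so the site never opted in
        return False
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


# A site without robots.txt is left alone; one that allows crawling is fetched.
print(opt_in_can_fetch(None, "Bot", "https://example.com/"))
print(opt_in_can_fetch("User-agent: *\nAllow: /\n", "Bot", "https://example.com/page"))
```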




Quote:
If you list in the robots.txt file which files or directories to skip, it is possible that Google will comply. But for the other bots that list will be a gift.
as i said - if there are no clear rules for that it will open big doors for lawsuits. and it would not be the others who have to fight them - it would be the one who has the money to pay.
Old 07-08-2019, 02:55 AM   #12
thommy
Confirmed User
 
Join Date: Jun 2003
Location: Switzerland / Germany / Thailand
Posts: 5,469
Quote:
Originally Posted by KlenTelaris View Post
The average internet user does not have any knowledge about robots and crawling, so you can't really expect everyone to follow. A better solution would be, instead of crawlers crawling everything on a website, to have explicitly stated what should be crawled.
that is exactly what i meant.

the laws in the various countries are so different that you cannot even decide who is a professional who HAS to know it and who is not.

when the internet started nobody ever thought about such things as privacy and permission to crawl a page. it was simply assumed that everyone who posts something on the internet wants others to find it. this has changed a lot in the meantime, and the views on right or wrong in the world are so completely different that everything has to be EXPLICITLY allowed and not just assumed.