GoFuckYourself.com - Adult Webmaster Forum

GoFuckYourself.com - Adult Webmaster Forum (https://gfy.com/index.php)
-   Fucking Around & Business Discussion (https://gfy.com/forumdisplay.php?f=26)
-   -   Y'all ever had problems with google crawl rate? (https://gfy.com/showthread.php?t=1320184)

wankawonk 11-26-2019 12:19 AM

Y'all ever had problems with google crawl rate?
 
My sites have millions of pages because they're tube aggregators and CJ tubes

Google though...Google will ruin my servers, hitting them hundreds of thousands of times a day (times many sites). It's a serious problem because their crawler doesn't follow normal caching patterns. The average user hits my front page or a page that ranks, makes a common search query, and clicks on the same video the last 100 users did. Everything gets served from redis, no problem. Cheap.

Google crawls queries users never make and hits videos users never click on...their crawler never hits the cache. They account for like 80% of my database load because they never. hit. the cache.
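To put it concretely, the serving path is basically cache-aside (simplified sketch, not my actual code; db_lookup stands in for the expensive query):

Code:

import hashlib
import json

import redis  # redis-py client, assuming that's what fronts the DB

r = redis.Redis()

def search_results(query, db_lookup, ttl=3600):
    # db_lookup is a stand-in for the expensive database query
    key = "search:" + hashlib.sha1(query.encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # real users mostly land here
    results = db_lookup(query)     # Googlebot's long-tail requests always land here
    r.setex(key, ttl, json.dumps(results))  # cached, but a crawler never asks twice
    return results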

For years I just used the search console to slow their crawl rate. They have never respected crawl-delay in robots.txt.
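(For anyone following along, the directive they ignore is just this; crawlers that do honor it, Bing for one, treat the value as seconds to wait between requests:)

Code:

# ask crawlers to wait 10 seconds between requests -- Google ignores this entirely
User-agent: *
Crawl-delay: 10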

Lately it's even worse--with the new console I can't set their crawl rate limit anymore.

I've had to block them from parts of my sites just to keep my shit running.

Driving me nuts. Anyone struggled with this? Any tips?

Mr Pheer 11-26-2019 12:25 AM

Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much same scenario as yours, a few dozen domains and it was crawling pages like 40 - 50 per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.

wankawonk 11-26-2019 12:28 AM

Quote:

Originally Posted by Mr Pheer (Post 22567470)
Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much same scenario as yours, a few dozen domains and it was crawling pages like 40 - 50 per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.

YES

tail -f <nginx log path> | grep -i 'google'

HOLY FUCKING SHIT

wankawonk 11-26-2019 12:34 AM

Quote:

Originally Posted by Mr Pheer (Post 22567470)
Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much same scenario as yours, a few dozen domains and it was crawling pages like 40 - 50 per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.

Yandex has done it to me too, but for that shit I can 404 'em before they hit the database because they send no traffic anyway...with google I gotta seriously think about how to balance letting them molest my servers against how much traffic I think they'll send if I let them.
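(In nginx it's just a UA check so the request never reaches the app or the DB -- rough sketch, assumes Yandex keeps identifying itself as YandexBot:)

Code:

# inside the server {} block: bounce Yandex before the app/DB ever see the request
if ($http_user_agent ~* "YandexBot") {
    return 404;
}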

rowan 11-26-2019 04:48 AM

I've had the same problem.

Basically, Google are arrogant cunts that refuse to follow the "non-standard" Crawl-delay robots.txt directive, even though it's a de-facto standard and a pretty clear indication by the webmaster that a crawler should slow down.

Since Google ignores the directive, you have to log into webmaster tools to manually configure the rate to something lower. Furthering their arrogance, the setting expires after 90 days, reverting to normal behaviour. You have to log in and manually configure the crawl rate again to stop them beating the shit out of your server.

Fuck Google. :mad:

Klen 11-26-2019 04:52 AM

I once had an aggregator tube server taken down by the Semrush bot too. All those crawling bots are really badly configured.

rowan 11-26-2019 04:59 AM

Quote:

Originally Posted by Klen (Post 22567547)
I once had an aggregator tube server taken down by the Semrush bot too. All those crawling bots are really badly configured.

I've had problems with Semrush too. They don't respect a Disallow directive when there's a redirect.

So if url1 redirects to disallowed url2, they'll still load url2... even though robots.txt asks them not to.

They helpfully suggested that I should just disallow url1 as well as url2.
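In other words their "fix" is for robots.txt to spell out both ends of the redirect (url1/url2 standing in for the real paths):

Code:

User-agent: SemrushBot
# url1, the URL that redirects:
Disallow: /url1
# url2, the page it redirects to:
Disallow: /url2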

Klen 11-26-2019 05:01 AM

Quote:

Originally Posted by rowan (Post 22567552)
I've had problems with Semrush too. They don't respect a Disallow directive when there's a redirect.

So if url1 redirects to disallowed url2, they'll still load url2... even though robots.txt asks them not to.

They helpfully suggested that I should just disallow url1 as well as url2.

I banned their IP ranges at the firewall level instead.
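Something like this (placeholder range for illustration -- you'd substitute whatever ranges the bot actually crawls from):

Code:

# drop one of the crawler's address ranges at the firewall
# 198.51.100.0/24 is a documentation placeholder, not a real Semrush range
iptables -A INPUT -s 198.51.100.0/24 -j DROP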

rowan 11-26-2019 05:07 AM

Quote:

Originally Posted by Klen (Post 22567554)
I banned their IP ranges at the firewall level instead.

Better to do it via robots.txt; that way you don't need to keep up with any future changes to their IP ranges.

Code:

User-agent: SemrushBot
Disallow: /

(This assumes their crawler can understand these directives. :1orglaugh )

rowan 11-26-2019 05:23 AM

Quote:

Originally Posted by wankawonk (Post 22567467)
For years I just used the search console to slow their crawl rate. They have never respected crawl-delay in robots.txt.

Lately it's even worse--with the new console I can't set their crawl rate limit anymore.

Okay, I just realised I didn't read the OP properly. You can no longer limit the crawl rate in the webmaster console? Hmmm...

wankawonk 11-26-2019 02:22 PM

Quote:

Originally Posted by rowan (Post 22567565)
Okay, I just realised I didn't read the OP properly. You can no longer limit the crawl rate in the webmaster console? Hmmm...

I cannot find it. It was in the "legacy tools" section for a while but has been removed.

Has anyone else been able to find it since the new console was released? I've looked everywhere, checked on search engines, BHW, GFY...no one is talking about it, I suppose because most people's sites don't exactly have millions of pages that all require cpu-intensive DB queries.

Google really are arrogant cunts--they don't respect crawl-delay and appear to have removed all ability to tell them to slow the fuck down.

The Porn Nerd 11-26-2019 02:33 PM

I hate Google, and it's getting worse. Checked your GMail lately? The damn inbox loads 4 times (by my count) before displaying, and there's always a delay of a few seconds when accessing emails. It's like Big G is copying this, relaying that, storing this bit of data, crawling your ass for that...it's like digital rape.

:helpme

wankawonk 11-26-2019 02:57 PM

I'm always on board with "adapt or die" but I've lost income over the last few months and had to change my current and future strategy, because I can no longer have a site with millions of crawlable pages (and I've had to block google from crawling pages that I could previously let them crawl at an acceptable rate--so of course now they're out of the SERPs and I lost that traffic).

Like how ridiculous is it that now when I build and operate sites I literally have to make sure there aren't too many crawlable pages, or else google will molest my servers to death. And there's literally no way to tell them to stop, other than to forcibly block them.

When I started in this industry my entire business model and strategy was "make sites with millions of crawlable embedded videos and hope a few thousand of them rank." Such a strategy is now borderline unviable.

rowan 11-26-2019 07:15 PM

Quote:

Originally Posted by wankawonk (Post 22567924)
Like how ridiculous is it that now when I build and operate sites I literally have to make sure there's not too many crawlable pages, or else google will molest my servers to death. And there's literally no way to tell them to stop, other than to forcibly block them.

A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html

wankawonk 11-26-2019 10:56 PM

Quote:

Originally Posted by rowan (Post 22568041)
A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html

Perhaps I will try this. Perhaps it will work better than what I have been doing. Thanks.

But shouldn't I just be able to easily tell them, "don't hit my site more than 100k times per day" and they should respect that?

rowan 11-26-2019 11:28 PM

Quote:

Originally Posted by wankawonk (Post 22568080)
But shouldn't I just be able to easily tell them, "don't hit my site more than 100k times per day" and they should respect that?

Absolutely agree. Even if Crawl-delay is non-standard, if that directive appears in robots.txt, it's a clear instruction that should be respected.

Google seems to be following the letter of the law, rather than the spirit, which doesn't always work so well on the internet.

wankawonk 11-27-2019 02:24 AM

Quote:

Originally Posted by rowan (Post 22568041)
A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html

I've been thinking about this and realizing how smart it is.

Hit me more than once in any given second? 429. I can't see how it could hurt SEO...you're literally telling them, "you're hitting me too much, slow down". How could they penalize you for that? Yeah, there might be less-important URLs they end up not hitting, but that was always a factor back when we could use the search console to slow them down.

404'ing them or blocking them from URLs with robots.txt seems stupid in comparison.
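If I go this route it's basically one map plus limit_req in nginx -- rough sketch, untested, needs nginx 1.3.15+ for limit_req_status, hostname and backend are placeholders:

Code:

# http {} context: only requests whose UA looks like Googlebot get a key;
# everything else gets an empty key, which limit_req ignores
map $http_user_agent $limit_bot {
    default        "";
    "~*Googlebot"  "googlebot";
}

# one shared bucket for all Googlebot traffic: 1 request per second
limit_req_zone $limit_bot zone=googlebot:10m rate=1r/s;
limit_req_status 429;    # default rejection status would be 503

server {
    listen 80;
    server_name example.com;               # placeholder

    location / {
        limit_req zone=googlebot burst=5;  # small burst so it isn't too trigger-happy
        proxy_pass http://127.0.0.1:8080;  # placeholder backend
    }
}

Keying the bucket on the UA instead of the IP means every Googlebot address shares the one limit, which is the whole point.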


All times are GMT -7. The time now is 04:48 PM.

Powered by vBulletin® Version 3.8.8
Copyright ©2000 - 2025, vBulletin Solutions, Inc.
©2000 - 2025, AI Media Network Inc