Welcome to the GoFuckYourself.com - Adult Webmaster Forum forums.

Old 08-28-2009, 01:26 AM   #1
CunningStunt
Confirmed User
 
Join Date: Aug 2006
Posts: 5,594
How can you stop THIEVES SCRAPING your SITES?

I have a content rich site that this is happening to more and more.

You can ban their IP addresses, but they soon pop up with another.

Is there any way to lock down the server better, yet still allow the friendly SE bots? It's becoming an increasing problem.
Old 08-28-2009, 01:34 AM   #2
cLin
Registered User
 
Join Date: May 2009
Location: Orange County, CA
Posts: 94
We use strongbox which says it has an anti scraping technology but to be honest, I've never dealt with it. The logic they use makes sense though.

http://bettercgi.com/strongbox/features.html#antislurp
__________________
Chris
The Ex Girlfriend Pics
Old 08-28-2009, 02:59 AM   #3
CunningStunt
Confirmed User
 
Join Date: Aug 2006
Posts: 5,594
Thanks Chris.

I forgot to mention the site is written in ASP.NET. I don't know exactly how the different methods work with different server setups.

Knowing nothing much about server security, I would just have a list of all known "friendly" search engine bots, and everything else gets fooked out if there are multiple session attempts. Is that how it works?

I should contact my dedicated host, I'm sure they'll be able to recommend something, that's their speciality after all.
Old 08-28-2009, 03:59 AM   #4
Klen
 
Join Date: Aug 2006
Location: Little Vienna
Posts: 32,235
Well, one of the methods I use is to delete old content and replace it with completely new content at a different location.
Old 08-28-2009, 04:07 AM   #5
CunningStunt
Confirmed User
 
Join Date: Aug 2006
Posts: 5,594
Quote:
Originally Posted by KlenTelaris View Post
Well, one of the methods I use is to delete old content and replace it with completely new content at a different location.
Almost all my content is ranking and linked to. How can doing 301's etc help? I don't want to be shifting stuff around all the time.

I'd like to simply shut off programmatic attempts to access my sites from unknown IP addresses, period.
Old 08-28-2009, 04:28 AM   #6
Dirty Dane
Sick Fuck
 
Join Date: Feb 2004
Location: www
Posts: 9,491
Is it a paysite or free? And when you say "thieves", are they duplicating your website or just downloading content?

Last edited by Dirty Dane; 08-28-2009 at 04:32 AM..
Old 08-28-2009, 04:41 AM   #7
CunningStunt
Confirmed User
 
Join Date: Aug 2006
Posts: 5,594
Quote:
Originally Posted by Dirty Dane View Post
Is it a paysite or free? And when you say "thieves", are they duplicating your website or just downloading content?
Free site.

In the first instance they had an autoscraper running daily to pull the new entries and post them to a *cough* "new" site, which I soon had closed down via DMCA. This time I've caught it at an earlier stage.
Old 08-28-2009, 04:46 AM   #8
SeanLEE
Confirmed User
 
Join Date: Feb 2006
Location: Miami, FL
Posts: 1,556
Just put advertisements in your player; then they are advertising for you.
__________________
I spammed in threads!
Old 08-28-2009, 04:49 AM   #9
quantum-x
Confirmed User
 
Join Date: Feb 2002
Location: ICQ: 251425 Fr/Au/Ca
Posts: 6,863
Honestly, there's not much you can do.
If you can see it, you can steal it. Cookies, restrictions, CAPTCHAs etc. can all be defeated.

The best suggestions are: watermark, adverts, etc.

One trick that really does stump most spiders, however, is to link to your content via JS or CSS.
Old 08-28-2009, 05:27 AM   #10
Dirty Dane
Sick Fuck
 
Join Date: Feb 2004
Location: www
Posts: 9,491
Quote:
Originally Posted by CunningStunt View Post
Free site.

In the first instance they had an autoscraper running daily to pull the new entries and post them to a *cough* "new" site, which I soon had closed down via DMCA. This time I've caught it at an earlier stage.
Ok....

Banning IPs won't help much, because they might just use rotating IPs.

Banning the user-agents of known offline browsers could help, but that is also easy to get around (they send fake user-agent info).

If you have movies, put them in JavaScript (the agent usually can't read that) and keep the text outside (for the search engines). The downside is surfers who have disabled JavaScript.

If you have a decent CPU on your server, then trick their "browser" into fake links with long delays (like a CGI link) or fake targets that temporarily kill and ban too many attempts. This will lag or cut off their agent temporarily. (It might be a very good idea to list those links in robots.txt, because you do not want to trick Google.)

You can also create something that their agent can't sort out. The more garbage, the better. For instance, if you have invisible links to the "tubegirl", they will end up with all kinds of shit that they have to sort out manually.
You can also structure your site in a way that has no logic and is hard for software to reconstruct.

You can also watermark your stuff with "licensed to...", but talk with your sponsors before doing that. And if they promote the same sponsor, then you should talk about that too.
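
Since user-agent strings can be faked, one way to tell a real search engine crawler from a scraper wearing its name is a reverse DNS lookup followed by a forward lookup that must return the same IP. The sketch below is only an illustration in Python (the site in this thread is ASP.NET, so treat it as pseudocode for the idea). The function name, the resolver parameters and the domain suffix list are assumptions; check each search engine's own documentation for the hostnames it currently uses.

```python
import socket

# Assumed hostname suffixes for "friendly" crawlers; verify against each
# search engine's current documentation before relying on them.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse=None, forward=None, domains=CRAWLER_DOMAINS):
    """Return True if ip reverse-resolves to a crawler host that resolves back to ip.

    reverse and forward are injectable so the check can be tested offline.
    The default forward lookup (gethostbyname_ex) only handles IPv4.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(tuple(domains)):
            return False
        # The forward lookup must include the original IP, otherwise the
        # reverse record could simply have been set by the scraper.
        return ip in forward(host)
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

DNS lookups are slow, so in practice the result would be cached per IP and only checked for requests that claim to be a search engine.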

Last edited by Dirty Dane; 08-28-2009 at 05:31 AM..
Old 08-28-2009, 05:30 AM   #11
faxxaff
Confirmed User
 
Join Date: Dec 2002
Location: Marina Hemingway
Posts: 2,134
You can use trap files, like 1x1-pixel files named 6tjgTTvtfgh.jpg or something like that. If it gets downloaded, you know it's an illegal bot. Now write a script that will block that user based on IP and session ID, or have him download some malicious bullshit ...
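
A rough sketch of the blocking half of this idea, in Python for illustration only. The class name, the /img/ directory and the in-memory sets are assumptions; a real site would keep the block list somewhere persistent. The trap link should be invisible to human visitors and disallowed in robots.txt, so that well-behaved search engine bots never fetch it.

```python
class TrapFilter:
    """Block any client that requests a trap file, by IP and by session ID."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.blocked_ips = set()
        self.blocked_sessions = set()

    def check(self, ip, session_id, path):
        """Return True if the request should be served, False if it is blocked."""
        if path in self.trap_paths:
            # Nobody browsing normally ever sees this link, so whoever
            # fetched it is following links mechanically.
            self.blocked_ips.add(ip)
            if session_id:
                self.blocked_sessions.add(session_id)
            return False
        return ip not in self.blocked_ips and session_id not in self.blocked_sessions


# Hypothetical location for the trap file named in the post above.
trap_filter = TrapFilter(["/img/6tjgTTvtfgh.jpg"])
```

Blocking on the session ID as well as the IP means a scraper that rotates IPs but keeps its cookie stays blocked.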
__________________
Asian Babes
Old 08-28-2009, 06:11 AM   #12
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
I'm facing this problem myself as I have a site with millions of pages. Currently, if an IP downloads too many pages within a short period of time and it's NOT on the whitelist (e.g. the IPs of Google), it gets firewalled for a period of time. It's a pretty aggressive approach but it works (for now).

Most of the people trying to scrape don't bother trying to hide it, so their 3 fetches a second get picked up pretty quickly.
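
The approach described above can be sketched roughly like this, in Python for illustration. The class name, the thresholds and the whitelist range are all assumptions, not the poster's actual setup; 192.0.2.0/24 is a documentation-only range standing in for real search engine IPs, and the "firewall" here is just an in-memory ban table.

```python
import time
from collections import defaultdict, deque
from ipaddress import ip_address, ip_network


class ScrapeThrottle:
    """Temporarily ban non-whitelisted IPs that fetch too many pages too fast."""

    def __init__(self, max_hits, window_s, ban_s, whitelist=()):
        self.max_hits = max_hits          # fetches allowed per window
        self.window_s = window_s          # window length in seconds
        self.ban_s = ban_s                # ban length in seconds
        self.whitelist = [ip_network(n) for n in whitelist]
        self.hits = defaultdict(deque)    # ip -> timestamps of recent fetches
        self.banned_until = {}            # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Record one fetch from ip and return True if it should be served."""
        now = time.monotonic() if now is None else now
        addr = ip_address(ip)
        if any(addr in net for net in self.whitelist):
            return True
        if ip in self.banned_until and self.banned_until[ip] > now:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget fetches that have fallen out of the window.
        while recent and recent[0] <= now - self.window_s:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned_until[ip] = now + self.ban_s
            recent.clear()
            return False
        return True


# Hypothetical settings: more than 5 fetches in 2 seconds earns a 10 minute ban.
throttle = ScrapeThrottle(max_hits=5, window_s=2.0, ban_s=600.0,
                          whitelist=["192.0.2.0/24"])
```

With these numbers a scraper doing 3 fetches a second trips the limit on its sixth request, while a whitelisted crawler is never counted at all.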
Old 08-28-2009, 06:52 AM   #13
Dirty Dane
Sick Fuck
 
Join Date: Feb 2004
Location: www
Posts: 9,491
Please post the domains of these bastards. Without the http
Old 08-28-2009, 07:05 AM   #14
Agent 488
Registered User
 
Join Date: Feb 2006
Posts: 22,511
i was going to say DMCA, but it looks like you did it.

please, no idiots come in here saying that if your site is online you are consenting to it being scraped. fuck off.
Old 08-28-2009, 09:20 AM   #15
SmokeyTheBear
►SouthOfHeaven
 
Join Date: Jun 2004
Location: PlanetEarth MyBoardRank: GerbilMaster My-Penis-Size: extralarge MyWeapon: Computer
Posts: 28,609
Quote:
Originally Posted by budsbabes View Post
please, no idiots come in here saying that if your site is online you are consenting to it being scraped. fuck off.
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples; in fact you like some people picking your apples (Google), even though they never asked to pick your apples, and furthermore they are showing your apples on their site and making money from it. Shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them.
__________________
hatisblack at yahoo.com
Old 08-28-2009, 09:28 AM   #16
DonovanTrent
Confirmed User
 
Join Date: Dec 2006
Location: Las Vegas, NV
Posts: 968
Quote:
Originally Posted by CunningStunt View Post
I have a content rich site that this is happening to more and more.
I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.
__________________
Donovan Trent
Old 08-28-2009, 09:32 AM   #17
Agent 488
Registered User
 
Join Date: Feb 2006
Posts: 22,511
search engines link to my content. in order for someone to read it they have to click through to my site. there is a difference.



Quote:
Originally Posted by SmokeyTheBear View Post
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples; in fact you like some people picking your apples (Google), even though they never asked to pick your apples, and furthermore they are showing your apples on their site and making money from it. Shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them.
Old 08-28-2009, 09:32 AM   #18
seeandsee
Check SIG!
 
Join Date: Mar 2006
Location: Europe (Skype: gojkoas)
Posts: 50,945
Quote:
Originally Posted by Dirty Dane View Post
Ok....

Banning IPs won't help much, because they might just use rotating IPs.

Banning the user-agents of known offline browsers could help, but that is also easy to get around (they send fake user-agent info).

If you have movies, put them in JavaScript (the agent usually can't read that) and keep the text outside (for the search engines). The downside is surfers who have disabled JavaScript.

If you have a decent CPU on your server, then trick their "browser" into fake links with long delays (like a CGI link) or fake targets that temporarily kill and ban too many attempts. This will lag or cut off their agent temporarily. (It might be a very good idea to list those links in robots.txt, because you do not want to trick Google.)

You can also create something that their agent can't sort out. The more garbage, the better. For instance, if you have invisible links to the "tubegirl", they will end up with all kinds of shit that they have to sort out manually.
You can also structure your site in a way that has no logic and is hard for software to reconstruct.

You can also watermark your stuff with "licensed to...", but talk with your sponsors before doing that. And if they promote the same sponsor, then you should talk about that too.
nice tips
__________________
BUY MY SIG - 50$/Year

Contact here
Old 08-28-2009, 09:33 AM   #19
Agent 488
Registered User
 
Join Date: Feb 2006
Posts: 22,511
what is so hard to understand. you can't think of one content rich free site?

Quote:
Originally Posted by DonovanTrent View Post
I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.
Old 08-28-2009, 09:35 AM   #20
Agent 488
Registered User
 
Join Date: Feb 2006
Posts: 22,511
i understand where you are coming from, but there is a difference between a content preview and a full scrape.

Quote:
Originally Posted by SmokeyTheBear View Post
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.

Basically you have an apple tree in a public place. You aren't stopping anyone from picking your apples; in fact you like some people picking your apples (Google), even though they never asked to pick your apples, and furthermore they are showing your apples on their site and making money from it. Shitloads of money.

Can't be too surprised when some fatty comes by and picks all your apples one day, after watching everyone else pick them and you not stopping them.
Old 08-28-2009, 09:44 AM   #21
SmokeyTheBear
►SouthOfHeaven
 
Join Date: Jun 2004
Location: PlanetEarth MyBoardRank: GerbilMaster My-Penis-Size: extralarge MyWeapon: Computer
Posts: 28,609
Quote:
Originally Posted by budsbabes View Post
i understand where you are coming from, but there is a difference between a content preview and a full scrape.
google would like you to think that, anyway. did you know google offers a service that allows users to browse your site without most of your ads? they scrape the entire page on the fly and only display the text.

even without that, all google is doing is cutting your page up and displaying it as separate items. the only difference is they get to show way more ads in the process than someone who just scrapes the page and repukes it up.
Old 08-28-2009, 09:58 AM   #22
DonovanTrent
Confirmed User
 
Join Date: Dec 2006
Location: Las Vegas, NV
Posts: 968
Quote:
Originally Posted by budsbabes View Post
what is so hard to understand. you can't think of one content rich free site?
Depends on the content. I've seen plenty of content-rich sites that should be nowhere near free.
Old 08-28-2009, 10:11 AM   #23
raymor
Confirmed User
 
Join Date: Oct 2002
Posts: 3,745
Quote:
Originally Posted by cLin View Post
We use strongbox which says it has an anti scraping technology but to be honest, I've never dealt with it. The logic they use makes sense though.

http://bettercgi.com/strongbox/features.html#antislurp
We have Throttlebox, specifically designed for this type of thing.
Throttlebox is an Apache module. The OP says he used ASP.net.
I wonder if that means he's hosting on a Windows desktop instead
of a server OS running Apache.
__________________
For historical display only. This information is not current:
support@bettercgi.com ICQ 7208627
Strongbox - The next generation in site security
Throttlebox - The next generation in bandwidth control
Clonebox - Backup and disaster recovery on steroids
Old 08-28-2009, 02:47 PM   #24
CunningStunt
Confirmed User
 
Join Date: Aug 2006
Posts: 5,594
Quote:
Originally Posted by DonovanTrent View Post
I'm just kind of wondering why you have a "content rich site" that is free. But that's just me, I may be missing something here.
Yes you are

There's some good advice in here, thanks for taking the time folks.
Old 08-29-2009, 04:30 PM   #25
raymor
Confirmed User
 
Join Date: Oct 2002
Posts: 3,745
Quote:
Originally Posted by faxxaff View Post
You can use trap files, like 1x1-pixel files named 6tjgTTvtfgh.jpg or something like that. If it gets downloaded, you know it's an illegal bot. Now write a script that will block that user based on IP and session ID, or have him download some malicious bullshit ...
That's one of several techniques we use. That technique works sometimes, but often does not. We don't send people malicious files, of course (we're not criminals), but we do use traps. It's a useful part of a multi-layered approach, but not at all sufficient on its own.
Old 08-29-2009, 04:38 PM   #26
raymor
Confirmed User
 
Join Date: Oct 2002
Posts: 3,745
Quote:
Originally Posted by SmokeyTheBear View Post
lol well i hate to do this but....

Have you given consent for google to scrape you ? yahoo ? bing ? etc.
Yes, he's given Google permission to index, not scrape, the site, and thereby promote it.
If you've been a webmaster for more than a few days, you know about robots.txt.
By choosing not to put up a "no indexing" sign (robots.txt), you've given implied
permission for Google to promote you by adding you to their index. I'd bet the people
scraping (not indexing) the site don't check for robots.txt.

Besides, use an ounce or so of common sense. Obviously webmasters want their
porno sites listed in search engines. Duh.
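
To make the robots.txt point concrete, here is a small sketch using Python's standard urllib.robotparser, which answers the same question a well-behaved crawler asks before it fetches a page. The robots.txt content, the /trap/ directory and the allowed() helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything may be indexed except the trap directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a crawler that obeys robots.txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed("Googlebot", "/gallery/1.html"))
print(allowed("Googlebot", "/trap/6tjgTTvtfgh.jpg"))
```

A bot that respects these rules never requests anything under /trap/, so a hit there is a fair sign that the client did not check robots.txt at all.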
Old 08-29-2009, 04:56 PM   #27
Angry Jew Cat - Banned for Life
(felis madjewicus)
 
Join Date: Jul 2006
Location: In Mom & Dad's Basement
Posts: 20,368
the cat thinks most of the people in this thread have a skewed definition of content.
Old 08-29-2009, 05:27 PM   #28
DonovanTrent
Confirmed User
 
Join Date: Dec 2006
Location: Las Vegas, NV
Posts: 968
Quote:
Originally Posted by CunningStunt View Post
Yes you are
I guess I was just asking, based on the content being so valuable to you as to be concerned about protecting it. That's all.
Old 08-29-2009, 06:25 PM   #29
GrouchyAdmin
Now choke yourself!
 
Join Date: Apr 2006
Posts: 12,085
One of the ways I've dealt with it is with custom webserver-level applications: never post a direct link to the content. Use a custom hash, decode it, and sendfile() the bitch. Otherwise, I've used trivial timestamping and other simple methods to break fuskers. Don't forget to disable support for HTTP TRACE.
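
One common way to get a similar effect is a signed link that expires: the page hands out a URL carrying an expiry time and an HMAC over the path, and the server only serves the file if both check out. This is not necessarily the poster's own scheme, just a sketch of the general idea in Python. SECRET, the query parameter names e and t, and the function names are all assumptions.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret; it must never appear in any page.
SECRET = b"change-me"


def sign_path(path, expires, secret=SECRET):
    """Return a hex HMAC-SHA256 token binding a content path to an expiry time."""
    msg = f"{path}|{expires}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def make_link(path, ttl_s=300, now=None, secret=SECRET):
    """Build a link to path that stops working ttl_s seconds from now."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_s
    return f"{path}?e={expires}&t={sign_path(path, expires, secret)}"


def check_link(path, expires, token, now=None, secret=SECRET):
    """Return True only for an unexpired link with a matching token."""
    now = time.time() if now is None else now
    if now > expires:
        return False
    expected = sign_path(path, expires, secret)
    # Constant-time comparison, so the token cannot be guessed byte by byte.
    return hmac.compare_digest(token.encode(), expected.encode())
```

Only after check_link() passes would the handler hand the real file to the web server (sendfile or similar). A scraper that saved yesterday's links gets nothing, and changing the path or the expiry invalidates the token.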
Old 08-30-2009, 01:50 AM   #30
SmokeyTheBear
►SouthOfHeaven
 
Join Date: Jun 2004
Location: PlanetEarth MyBoardRank: GerbilMaster My-Penis-Size: extralarge MyWeapon: Computer
Posts: 28,609
Quote:
Originally Posted by raymor View Post
Yes, he's given Google permission to index, not scrape, the site, and thereby promote it.
he did? i didn't see the part where he defined his submission to google. i thought google just scraped his site like they scrape every site..

Quote:
Originally Posted by raymor View Post
If you've been a webmaster for more than a few days, you know about robots.txt.
not a few , just a couple days

Quote:
Originally Posted by raymor View Post
By choosing not to put up a "no indexing" sign (robots.txt), you've given implied
permission for Google to promote you by adding you to their index.
lol so by not posting a sign saying " do not break my car windows" you are implying that it's okay to smash your car windows ? ok got it..

by that theory everyone has permission , why would it be implied for google but not implied for others ? is it called the googlerobots.txt ?


Quote:
Originally Posted by raymor View Post
Besides, use an ounce or so of common sense. Obviously webmasters want their
porno sites listed in search engines. Duh.
now they do .. before google it was just software downloading everything on your server

maybe thats what he wants to do is become so rich and well known you will beg him to come scrape your site just like google.