Identify blocks of content in NATS-powered sites? - GoFuckYourself.com

hakkrdan · 05-13-2009, 01:52 AM

Hi -

I'm trying to identify some patterns that would let me spider the contents of a NATS-powered site. I'm talking about the intros or short descriptions on a photo set or gallery.

We'll use http://innocenthigh.com as an example because I know they're fucking rad (and no, Billy, I'm not going to scrape your site's content). For example, hit up their main intro page:

http://www.innocenthigh.com/t1/

As of this writing their most recent update is for "Bree Olsen". It's that "Bree is a really nice girl that ..." text that I'm after. According to Web Developer Tools, this textual content can be found inside of:

html > body > div > table > tbody > tr > td > table #Table_01 > tbody > tr > td > table #Table_01 > tbody > tr > td > table #Table_01 > tbody > tr > td > table > tbody > tr > td > span .student_id_story1

I think I can isolate this text depending on whether it appears in a span or a paragraph, or is simply assigned to a class. Hell, I can use any combination of those, but I know that people template the fuckall out of their sites, so even that is not a sure-fire way to identify this content.
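
A minimal sketch of the class-based approach, assuming the teaser sits in an element carrying a known class such as `student_id_story1` from the path above. It uses only Python's standard `html.parser`; the names `ClassTextExtractor` and `extract_by_class` are made up for illustration, and the class name would have to be found per site, since templates differ.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # > 0 while inside a matching element
        self.tag = None      # tag name of the element being captured
        self.chunks = []     # text fragments of the current element
        self.results = []    # one cleaned string per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and self.target_class in classes:
            self.tag = tag
            self.depth = 1
            self.chunks = []
        elif self.depth and tag == self.tag:
            # Same tag nested inside the match: track it so the
            # matching close tag is found correctly.
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_by_class(html, target_class):
    """Return the text of each element whose class list has target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

Matching on the class alone ignores the long chain of nested tables, so it keeps working if the layout around the span changes but the class name stays the same. It breaks as soon as a site renames the class.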

Anyone got any tips/tricks? How about from the NATS guys themselves, do you guys check out these posts? If I can get this hammered out, I think I'll be on to something big - unfortunately in the beginning it will only support sites generated via NATS or any other system where content is easily machine-identifiable.

I guess on that note, how many people would care that I was doing this? The only way I'd be doing this is to use that same content to promote said sponsor. I would not be doing this otherwise. If you have a problem with me doing that, then you have a problem with me converting sales for you.

Thanks!

swordfih · 05-14-2009, 04:42 AM

How about RSS? I believe NATS has RSSDish?

Anyway, if you are looking to fetch contents this way it'll need constant attention. Even the most clever pattern matching can often break or not match at all.

hakkrdan · 05-22-2009, 10:12 PM

Hi -

I'm pretty sure that the text presented via RSS is not the same text that I've described in these teasers/trailers. Further, everyone is using those RSS feeds, so I'd bet that SEO value is diminished a bit.

I've been able to get some good matches down, primarily by matching something that I *think* is the target string, then qualifying it if it's more than 50 words long. In all honesty, if the content is not 50 words long I don't want it.
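
The length check can be sketched roughly like this, assuming "words" means whitespace-separated tokens and that a block must have strictly more than 50 of them. `qualifies` and `pick_candidates` are hypothetical names, and the threshold is a parameter so it can be tuned.

```python
def qualifies(text, min_words=50):
    """True if text has more than min_words whitespace-separated words."""
    return len(text.split()) > min_words


def pick_candidates(blocks, min_words=50):
    """Keep only the candidate strings that pass the word-count check."""
    return [block for block in blocks if qualifies(block, min_words)]
```

Short strings such as navigation labels or captions fall out at this step, which is the point of the filter. A long block of unrelated text would still pass, so this only narrows the candidates.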
