Tuesday 2 December 2008

Leeching: do we need to worry, and what can we do?

Since we discovered that an IP had been leeching from us, Blue has found that softtester has lots of IPs doing it.

Blue doesn't think there's anything we can do about it, and he's not seen any protection against it.

I'd like to discuss and figure out if it's causing us problems. I know it costs resources like bandwidth.

Also, if it is a problem you'd think it would have happened to other people in the past and there would be some sort of protection.

Comments?

11 comments:

  1. Yeah, I thought a basic session / IP check could be done. Guess we'd have to make sure we got all the bots and redirected them to an appropriate response code page, other than 404. That way, if we screw up, we don't upset the spider.

    The other question is: do we need this protection, are we losing out?
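
A minimal sketch of the session / IP check described above, written in Python for illustration rather than taken from the site's code. The names `status_for_request`, `MAX_REQUESTS`, `WINDOW_SECONDS` and `KNOWN_BOT_TOKENS` are hypothetical, the limits are made-up numbers, and 429 is just one possible "other than 404" response code. Matching spiders by user-agent string is an assumption too, since that string can be spoofed.

```python
import time
from collections import defaultdict, deque

# Hypothetical values, chosen only for illustration.
MAX_REQUESTS = 60        # requests allowed per IP inside one window
WINDOW_SECONDS = 60      # length of the sliding window
KNOWN_BOT_TOKENS = ("Googlebot", "bingbot")  # assumed allowlist of spiders

_recent = defaultdict(deque)  # ip -> timestamps of recently allowed requests


def status_for_request(ip, user_agent, now=None):
    """Return the HTTP status to send: 200 normally, 429 when an IP is over the limit."""
    now = time.time() if now is None else now

    # Recognised spiders are never throttled.
    if any(token in user_agent for token in KNOWN_BOT_TOKENS):
        return 200

    hits = _recent[ip]
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] >= WINDOW_SECONDS:
        hits.popleft()

    if len(hits) >= MAX_REQUESTS:
        # "Too Many Requests": unlike a 404, this does not say the page is gone.
        return 429

    hits.append(now)
    return 200
```

This keeps its counters in memory, so it only works inside one long-running process; a site served by many short-lived requests would need shared storage instead.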

  2. "The other question is: do we need this protection, are we losing out?"

    I guess it depends. Personally, I would not be happy if a lot of my hard work (collecting customers, improving the site, building up the number of PAD submitters) was just leeched to build a rival website.

    One thing to remember is that leechers are not normally that common. Back in the day, eBay had the same problem with leechers and screen scrapers (I was one of them :)). They rewrote the code and made it less easy to leech.

  3. I'm still a little unsure as to what is getting leeched. Is someone setting up a duplicate site by copying the HTML from softtester?

    On DP, I've seen people selling scripts for PAD sites whereby the script "automatically" updates by scouring the net for PADs. Is this an example of a "leecher" script?

    PADs are public property and it's pretty easy to get your hands on 30 to 50,000 of them. What else can a leecher leech? They can't leech backlinks, so a duplicate site still wouldn't perform as well as the original.

    The immediate problem I can see is the bandwidth. What other symptoms are you guys seeing? Can you see a copy of the content on another site?

  4. I guess those massive lists of PAD files for shareware site owners (which I used initially too) are basically just links. The owner still has to read the files, which are on different sites, and all this checking takes time.

    I guess a program / script which could go round getting (leeching) the latest information would be extremely useful and save time.

    I like the idea of rewriting "the code and made it less easy to leech". What kind of things did they do?

    From an SEO point of view the fact that the page has changed could look good to googlebot too.

  5. I'd be happier with a less interactive approach. I mean, what am I going to do when I get an email?

    I'd just like to lock them out.

  6. Any solution to this problem is looking very messy or has numerous issues attached.

    You can't use PHP session variables, as these are lost once the browser / program closes the page. You can't use cookies, as the programs probably won't use them. And using a database will make things slower and impact resources.

    Doomed :(

  7. An idea: what if we use a database and store numbers, e.g.

    iIP1 as int e.g. 200
    iIP2 as int e.g. 1
    iIP3 as int e.g. 78
    iIP4 as int e.g. 56

    iDay as int e.g. 365


    Then we can do a select whatever where iIP1 = CurrentIPFirstPart etc

    Have an index on all 5 fields

    Thoughts?
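
The table idea above could be sketched roughly as follows, using Python's built-in `sqlite3` module purely for illustration (the site's real database and language are not stated here). The column names `iIP1` to `iIP4` and `iDay` come from the comment; the table name `ip_hits`, the counter column `iHits`, the function names and `DAILY_LIMIT` are assumptions added for the example. The composite primary key plays the role of the index on all five fields.

```python
import sqlite3
from datetime import date

DAILY_LIMIT = 1000  # hypothetical threshold, not a measured value


def open_db(path=":memory:"):
    """Open the database and create the (assumed) ip_hits table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ip_hits (
               iIP1 INTEGER NOT NULL, iIP2 INTEGER NOT NULL,
               iIP3 INTEGER NOT NULL, iIP4 INTEGER NOT NULL,
               iDay INTEGER NOT NULL,
               iHits INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (iIP1, iIP2, iIP3, iIP4, iDay))"""
    )
    return conn


def record_hit(conn, ip, day=None):
    """Count one page view for an IPv4 address and return its total for that day."""
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or not all(0 <= p <= 255 for p in parts):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    if day is None:
        day = date.today().timetuple().tm_yday  # day of year, 1..366
    key = (*parts, day)
    where = "iIP1 = ? AND iIP2 = ? AND iIP3 = ? AND iIP4 = ? AND iDay = ?"

    # Create the row on first sight, then bump the counter.
    conn.execute(
        "INSERT OR IGNORE INTO ip_hits (iIP1, iIP2, iIP3, iIP4, iDay) "
        "VALUES (?, ?, ?, ?, ?)",
        key,
    )
    conn.execute(f"UPDATE ip_hits SET iHits = iHits + 1 WHERE {where}", key)
    conn.commit()
    return conn.execute(f"SELECT iHits FROM ip_hits WHERE {where}", key).fetchone()[0]


def is_leeching(conn, ip, day=None):
    """Record the hit and report whether this IP is over the daily limit."""
    return record_hit(conn, ip, day) > DAILY_LIMIT
```

Every page view costs a few statements here, so whether this is cheap enough for a busy site would need measuring.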

  8. Thinking about it, if somebody wanted to leech your site, no amount of protection that you add will stop those who are serious about getting the data.

    Nothing these days is impossible to work around.

    Storing it in a database seems like a good idea; however, I have a feeling that on a site your size, with traffic as it is, it would have a big impact on the system, as your site would be issuing an insert / select command for every page view.

  9. I agree with Blue. To defeat this kind of protection, all
    that has to happen is the IP address changes.

    PAD files are public property anyway so someone can
    recreate a shareware site by a number of means.
    Leeching is a bit lazy but you can buy 50,000 PADs quite
    easily on the web.

    My concern would be the traffic / bandwidth consumed by
    leeching. Can your friendly host simply block access
    from the leech's IP?

  10. Yep, we can block IP addresses; at the moment it is a manual process.

  11. Hmmm, I wonder...

    Can I automate producing this log?

    Could I then automate blocking IP addresses?

    Would this be possible, using a scheduled PHP script?
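
One way the automation asked about above might look, sketched in Python rather than PHP. It assumes an access log in the common Apache style where the client IP is the first whitespace-separated field, and it emits Apache 2.2 style `Deny from` lines as text instead of touching any server config. The function names, `THRESHOLD` and the allowlist parameter are hypothetical.

```python
from collections import Counter

THRESHOLD = 500  # hypothetical: requests per log file before an IP is flagged


def count_requests_by_ip(log_lines):
    """Count requests per client IP, taking the IP to be the first field of each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def deny_rules(log_lines, threshold=THRESHOLD, allowlist=()):
    """Return 'Deny from <ip>' lines for IPs with more than `threshold` requests."""
    counts = count_requests_by_ip(log_lines)
    return [
        f"Deny from {ip}"
        for ip, hits in sorted(counts.items())
        if hits > threshold and ip not in allowlist
    ]
```

A scheduled job could run this over the day's log and hand the output to whoever applies the blocks, with search engine spiders listed in `allowlist` so they are never denied.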
