Why Scraping Will Continue and Will Only Get Worse

I wanted to write this post last week but never got the time. It has actually worked out well, because today Brian Stelter at the New York Times wrote an article about scraping. He (and many others) calls it "excerpting," which is fine when you take a paragraph from an article. When you take a good percentage of an article and place it on your site, it’s scraping. I wrote about scraping and profit last December, and I have more to add after a post I read several weeks ago on Alley Insider. In fact, it’s the same post that Stelter discusses in the Times today.

Let’s start by saying that there are two types of scrapers: those who scrape full RSS feeds and those who scrape just enough to keep you on their site and keep the conversation there as well. In both cases the goal is to profit by keeping the reader interacting and chatting on the scraper site, and the scraper will rarely send any traffic back to the source. I’ve thought about this a lot, and for a long time I thought the full-feed scraper was the "bad" one and the other was "ok," but I’ve changed my mind. Both suck.

Alley Insider is one of the worst scrapers I’ve seen to date in the "conversation" category. It seems you can’t go a day without some post being scraped. Frankly, as I’ve said before, I don’t get it. They have talented writers and a bank of venture capital, so why not write original content all the time?

The story discussed in the Times is one where editor Henry Blodget basically lifted nearly 400 words of an article by Peggy Noonan. And what’s on their site today is actually cut back! Note that Blodget added zero original content to the post. At the bottom Blodget notes, "We thank Dow Jones in advance for allowing us to bring it to you." What I wonder is why not quote one tiny bit, add some original analysis and link to the source?

Last week I had lunch with an NYC entrepreneur and we discussed the Peggy Noonan post at length. He said the reason Blodget does this is that the WSJ will never email him and ask for the post to be removed. The entrepreneur went on: assume the WSJ did ask for the post to be removed. Blodget would post the request on Alley Insider, noting that it’s a great example of old media, yadda yadda. The reasoning makes sense.

Blodget has authored a post today about their excerpting policy. Here’s his policy: "We excerpt others the way we hope others will excerpt us." No, seriously, that’s what he said. What Blodget is saying is that they will scrape/excerpt as much as they please.

The interesting thing is that Alley Insider has been adding a link post a few times a week. This is a good thing; it should continue and replace 100% of the scraping.

Why scraping will continue and will get worse this year

Scraping will continue because it’s easy and drives revenue, SEO credits and pageviews. And it requires no work. Here’s an example of a Digg front-page entry by Digg top user Muhammad Saleem. In this case, he submitted the absolutely 100% scraped story from Alley Insider instead of the source. How many pageviews and users came to the story via Digg and never visited the source?

The goal should always be to excerpt the bare minimum and send the user to the source. It’s ok to include a small piece of another’s writing when needed to make a point, but whole posts shouldn’t be made of someone else’s work.

And understand, this is not just an issue for Alley Insider. I’ve written before about many other scrapers, including so-called aggregators like Socialmedian. In Socialmedian’s case, they too have no reason to scrape so much of the content of a story. Unfortunately, I’ve never received a direct answer as to why they scrape so much content on stories that are shared on their service.

I will never understand why writers feel the need to scrape others’ hard work. You spend hours or days researching a story, only to have someone scrape it in a minute and grab all the associated goodies. All the scraping does is hurt the overall blogging/media industry, so let’s stop it now.

4 COMMENTS
  1. Derek says:

    Here at Tynt (http://www.tynt.com) we’ve been looking at the scraping problem as it isn’t only websites that take your content and spread it around. It is also being very heavily passed around via e-mail and other channels.

    We are beta testing a service which will track when people copy content from your site and automatically add an attribution tag. This, in our tests so far, does drive more traffic to your site, can increase your search engine rankings, and also gives you insight into just how much copying is really going on.

    Ping me if Center Networks would like a Beta account.

    Derek (Tynt)

  2. Michael Fidler says:

    That is an interesting idea that Derek has. I try my best to respect not just you, but all the bloggers, by only clipping enough of a story to entice others to read more on the originating site. I find that usually a paragraph is enough. When it comes from my newsreader, I do not have control over the amount of content that might appear. I’ve noticed that socialmedian does take less than they did before. I have a feeling this might have something to do with your previous posts on the subject. I see it as sharing, not as pilfering a story. We do it on friendfeed too. I know that doesn’t make it right. Last night Dave Morin started a discussion on ff about some remarks he made about Robert Scoble. A discussion quickly began and ran off topic almost immediately. One of the things brought up was whether it would be practical to pull comments together from a post no matter where they might be on the web (I hope I was reading this correctly). By republishing all of the comments here for any post, would this help at all? Here is the link: http://twurl.nl/g24c73. It sounds like he’s thinking of pulling comments together, but so they can be used anywhere.

  3. CR says:

    Likewise… this blog, like most Web 2.0 blogs, could just write a line or two from the “About” page of the web 2.0 sites covered and link to them. It’s news of news. And folks who “scrape” do news from news of news.

    Consider: It’s not feasible for every site that carries news to actually contact the source to do a “story”.

    Consider: the user experience / site preference actually rules. Viewers should not be spoken of, or treated, as sheep to be herded to news of news sites. Folks frequent sites they like, and if they get news from news of news on those sites, that’s fine with them because it’s one less click.

  4. Mike Seidle says:

    I’m glad you posted this. Some of this is simple copyright infringement. Posting whole articles just isn’t right. A lot of these blogs could learn a lot from the Slashdot old school of weblogging – one little quote, a link to the original and if their community discusses on their site, fine. But at least link to the first source wherever possible.
