Scrape, Scrape, Scrape, Let’s All Scrape Our Way to Profit!
When a writer spends hours, days or weeks on a story, should they be mad when a site scrapes (i.e. steals) their content without permission? It doesn’t matter if the story is news about Apple, a tutorial about Ajax or a recipe for a bacon-stuffed burger. Scraping occurs when a site "lifts" content from site A and places it on site B without site A’s authorization. Many times site B monetizes the scraped content. And note that there are certainly times when you need to grab a bit of the source as a quote or to clarify a point in your story.
I’ve written about this topic many times and will continue to write about it until those who scrape or steal are put out of business or change their model. This past week I wrote about Socialmedian and their scraped content. While I don’t believe they are doing it for malicious reasons like some of the bottom-feeding scrapers, they are still participating. I guess it worked well for Socialmedian – they just got a cashout of a few million dollars. I am still waiting for an answer from their CEO as to why they need to scrape any content for their service to work. Note: "Digg does it" is not an answer.
A year ago the only scrapers were the bottom of the barrel scum who took full content and put ads around it. It seems we’ve moved up the food chain to larger sites living off scraping.
Whet Moser, Chicagoland Editor, wrote a post titled "Grand Theft HuffPo". Basically the HuffingtonPost completely scraped one of his writers’ posts without permission.
Ryan Singel at Wired compares some examples of content on site A and on Huffington Post. Singel notes that Gawker publisher Nick Denton also "hates" the Huffington Post.
This morning I see that Huffington Post contributor Silicon Alley Insider (editor Henry Blodget) has jumped up to defend the scraping done over at HuffPo. Let’s get a quick disclosure out of the way: HuffingtonPost Co-founder Ken Lerer is an investor in Silicon Alley Insider.
Silicon Alley Insider has changed their game a few times since they launched. Initially they were all about NYC but then left for the Valley. It appears shortly after editor Peter Kafka left, they moved to a scraping model. Based on my estimations, they scrape 70-80% of the content that appears on SAI. It actually gets even more interesting for their service because many times the scraped content actually makes its way out to their partner Yahoo Finance.
Interesting note… many startups I meet with in NYC are telling me they are sick of Mr. Blodget’s scraping. I can only hope that these people will stop visiting the site because that’s the only way this game will change. As long as Mr. Blodget is making bank from the scrapes with no penalty, he will continue to do it.
His belief is that if the site "aggregating" the stolen content sends visitors back to the source, then the source should shut their mouth (and keyboard?) and like it. The problem is that the "aggregator" (in this case SAI) is only growing because they are stealing content from others! Henry is basically saying that it’s ok for large sites to scrape but not the bottom feeders. What happens many times is that the scraped story on the "aggregator" will be the site to get the massive traffic through the social news sites (e.g. Digg, Techmeme, etc.) while the source (you know, the one who spent the time to make the story) will get close to nothing. Great for the aggregator, bad for the source. And many times, the reader has no actual idea that the content didn’t come from the "aggregator". This is a huge issue as well – but not for the thief.
I can only assume that Mr. Blodget has a deal with the New York Times after seeing a story about Outside.In on his "aggregator" (http://www.alleyinsider.com/2008/12/another-cash-infusion-for-outsidein). Here we see what looks like a full story but it actually is a complete scrape from the Times. And to make matters worse, the story has comments!
Mr. Blodget has a team of talented writers and there’s just no reason they can’t write new content about the stories they want to cover and still provide the links out to the other sources discussing the same topic. Since the link back to the source would be the same whether they scraped the story or wrote new content, his argument holds no water.
Update: to clarify the link point - the reason why most aggregators want to scrape as much content as they can is because of how important SEO and traffic from Google is. Just posting a link to the source won’t get them the traffic from Google.
If you read CN regularly, you know my view on sites and services that steal the conversation. Many of the services that are participating in content scraping are also stealing the conversation. All of the sites we’ve mentioned in this post are contributing to this practice.
So why is scraping becoming more popular? Simple, it’s all about the cash and pageviews. I wonder whether scrapers would still use this method if we were no longer using a pageview monetization model. It’s so easy to take another’s content, change the story title to grab fresh Google juice, and then sit back and profit.
At the end of the day, the only one who may be able to start to save us is Google. If Google stops indexing the sites that regularly scrape content, we may just start to see real change. I’d love to get Matt Cutts’ take on this topic.



allen – of course blodget scrapes content – what do you expect from him?
Scrapers suck – all they are doing is stealing another’s work to profit from it – they don’t give a rat’s ass about sending anything back and it doesn’t matter anyway – it’s still theft.
Disappointed in your analysis, Allen. From what I can tell, SAI’s strategy hasn’t changed since I left: Produce interesting stuff that people want to read. You can quibble about the amount of copy that originally appears on other sites, but to suggest that people wouldn’t only be reading SAI because it’s “scraping” makes no sense.
Peter, not sure what you mean here, "but to suggest that people wouldn’t only be reading SAI because it’s "scraping" makes no sense."
This is an interesting post, Allen. In my opinion, this is a fine line and attribution should be always given. However, I’m not sure if that is just enough and then there are several things which are not clear:
a) I get that when you copy an entire piece, that’s bad. But say you quote from another blog post. I read your blog and you quoted just enough so I don’t have to check out the other blog, is that called scraping as well? I mean, both sources don’t get the traffic they deserve, both don’t make the cash they could have made.
If the quoting is already scraping, then a lot of people scrape. ;-)
b) How do you want to deal with duplicate content, period? I mean, look at the countless sites that mirror wikipedia or programming manuals. They all get more traffic than the original source. And by going to another website — even if they are not as blatant to place Adsense on it — it “harms” the original source. I just don’t see anyone taking any action there.
c) Are feedreaders scraping too? That’s the proclaimed revolution. E.g., I can read my news in Google Reader and be done with it. Don’t have to go to CenterGigaCrunch and read them all there. :D (Just kidding.)
This is all much more obvious when it comes to papers in school. The definition seems to be obvious and non-grey. It’s also unacceptable, period. Whereas in general on the web a lot of people “use” whatever is available without giving back.
When I try to “scrape” the content of this article via the socialmedian bookmarklet I get “414 Request-URI Too Large.”
Perhaps you could try to define some terms, rather than just throw them around.
Quoting a source with citation is not “scraping.”
At the very least you should distinguish a “quotation” that lacks a citation to the source, from one that does. IMO, that’s scraping.
If you want to quibble over how much content is quotable without permission, that’s a different matter.
I dislike the manner in which you present your argument, and I would call your titles “flagrant flame-bait.”
Out of curiosity, are you the same Richard that defends Socialmedian all the time?
scraping is easily defined as past a quote – in the example I provided on SAI – it’s not a quote – it’s a scrape. Actually as for defining the terms as you note, I did that. I know not everyone knows all the terms used in the industry so I agree it’s important for clarity.
I will have more on this matter tonight most likely – but for now I need to work on my startup! Thanks for stopping by.
Let me try again: To suggest that people are reading SAI only because of the content it “scrapes” doesn’t make sense. I guarantee you that the stuff that is best-read on the site is original content.
Agreed, more agreement happened before you left. I noted in my post that they have some talented writers.
I’m surprised by your SAI analysis. I would never have thought the figure that high, but I haven’t counted it either. They also do provide good analysis, and of the new feeds I’m following this year, I must say I get enjoyment and value from their mix. I hadn’t noticed the NYT things…and I’d be surprised to see them reprinting so much without permission.
Re: Huffpost the same comment I left on SAI and in a post: the main issue here to me was that they were pulling the full content, not part of it. Second to that: that they are pulling half content without permission.
I’m actually not against the general idea, and we do it on a very small scale but only with permission. I get the traffic argument from Huffpost for example, but ultimately the only safe way to syndicate this sort of stuff is with permission up front, and why the Huffpost isn’t asking for it is beyond me, given they’d find a cast of thousands happy to sign up. I’d suggest SAI could probably find sites that would play as well.
The legal difference between scraping and quoting is the subject of fair use doctrine (at least in the United States):
http://en.wikipedia.org/wiki/Fair_use#Amount_and_substantiality
But I think the bigger issue isn’t so much the legal question of what constitutes copyright violation as the ethical issue of leeching off of someone’s content by outwitting them at the SEO game. As far as I can tell, the approach that these sites are taking is to grab search traffic that would otherwise have gone to the articles they excerpt.
More at The Noisy Channel:
http://thenoisychannel.com/2008/12/20/fair-use-and-seo/
Daniel – thanks for the link – there’s no doubt in my mind that SAI is scraping – they will continue to do it until readers leave – but they know that won’t happen. And most publishers won’t even realize that they are doing it.
SEO of course is the major factor with scraping but it’s also about generating "fake" content which can be monetized by both the act of the scrape plus the conversation around the scrape.
I love Henry’s argument that as long as he is sending you some traffic you should shut up and like it. It’s like saying the wealthy or the famous shouldn’t serve time for crimes they commit.
I don’t have a blog to flog, but as an outsider (non-participant) and user of techmeme this does not seem like a fair analysis.
I rarely see an SAI post that isn’t being covered by at least 3 other publications. Sometimes I click thru to SAI, sometimes I do not. If they are the “lead” I most certainly do click thru, but I also always check out 1 or 2 of the other posts on the topic.
From my perspective, there is VERY little original content, and even less is worthy of reading. It seems to come down to comments as a metric for which links I will click on for a given topic. I always read the AllthingsD posts, but quickly close the tab because there are rarely comment strings of interest (though when they are present, they are really good). However, on SAI, I may not read the article (because it is a “scrape”), but only check out the comments.
It seems clear to me that bloggers that are focused on providing consistently good original content, rather than bitching about traffic and credit as they relate to content, will receive the attention they deserve. Perhaps that is an optimistic perspective.
Hey Allen,
I just wanted to thank you for fighting the good fight on here. Keep it up.
I read your social median post before this and have been reading all the posts about the HuffPo/Reader situation. I publish a local aggregator and blog network in Chicago called The Windy Citizen and while the HuffPo has never lifted full articles from us, their aggressive approach to aggregation has certainly been awfully tempting to copy on our aggregator. Right now we limit description text our users can paste in to 350 characters like Digg. I feel ok about that.
Anyway, what’s so interesting about the discussion here, on the Wired article, over at SAI and even on the Chicago Reader’s blog entries is how many people respond to a different argument than the one being made. The problem isn’t bloggers lifting quotes or even 1-2 paragraphs. The problem is them lifting whole stories without checking first out of pure greed and laziness.
You’ve iterated that several times in this post and in the social median post, and yet people leave comments like “What’s your problem with people posting excerpts and quotes?”
This sort of reality distortion field makes it very hard to have an intelligent conversation. If you say you don’t like A and the person responds with “What’s your problem with B?” then you’re kind of out of luck.
Look at the quotes in the Wired article from the guy at the HuffPo. The Wired guy tells him people have an issue with them lifting full stories. His response:
“You tease, you pull out a piece of it, and then you have a headline or link out,” Peretti said. “Generally publishers are psyched to have a link.”
He’s not responding to the actual complaint. He’s responding to a stupid complaint that no one’s voicing. It’s irrational.
There was a lot of this in your social median post discussion, too. You asked why they were allowing full article scrapes. And the CEO would tell you why they allow partial summaries. That’s not what you asked about. I’m sure there’s a technical term for this kind of deliberate rhetorical obfuscation but “reality distortion field” seems as good a description as any. By responding to the actual question asked, scrapers would tacitly acknowledge that they’re scraping full stories, which they can never do.
Anyway, keep up the good fight.
And remember, Techcrunch is the biggest kid on the block. They don’t scrape.
Thanks for putting the time and effort into the above comment Brad. People will call it whatever they can to get away with the crime.
There is a difference between an application like Socialmedian scraping and a blog like Alley Insider scraping. Both are bad but the claims will be different.
It’s really not that hard to create good, compelling content. But when you go for the easy, cheap way, you will always lose.
Hilarious that you would want Google’s take on scraping other people’s content for dollars. Talk about exempting the big guys.