I received a free ticket to the Startup 2009 conference after being nominated for an award. I am on the west coast all this week and won’t be able to attend. If you would like my ticket, leave a comment and I will pick a random person at noon on Tuesday. You must be in NYC and available Wednesday.
The event is a mix of startup demos and a couple of interviews with John Battelle and Jason Calacanis. There are also two VC panels with the event sponsors. The event will be held at NYU.
Update – Bastian is the winner.
I wanted to write this post last week but never found the time. It has actually worked out well, because today Brian Stelter at the New York Times wrote an article about scraping. He (and many others) call it "excerpting," which is fine when you take a paragraph from an article; when you take a good percentage of an article and place it on your site, it’s scraping. I wrote about scraping and profit last December, and a post I read several weeks ago on Alley Insider prompted these additional thoughts. In fact, it’s the same post that Stelter discusses in the Times today.
Let’s start by saying that there are two types of scrapers: those who scrape full RSS feeds and those who scrape just enough to keep you, and the conversation, on their site. In both cases the goal is to profit by keeping the reader interacting on the scraper site, which will rarely send any traffic back to the source. I’ve thought about this a lot, and for a long time I thought the full-feed scraper was the "bad" one and the other was "ok," but I’ve changed my mind. Both suck.
Alley Insider is one of the worst scrapers I’ve seen to date in the "conversation" category. It seems you can’t go a day without some post being scraped. Frankly, as I’ve said before, I don’t get it. They have talented writers and a bank of venture capital, so why not write original content all the time?
The story discussed in the Times is one where editor Henry Blodget lifted nearly 400 words of an article by Peggy Noonan. And what’s on their site today is actually cut back! Note that Blodget added zero original content to the post. At the bottom, Blodget notes, "We thank Dow Jones in advance for allowing us to bring it to you." What I wonder is: why not quote one tiny bit, add some original analysis, and link to the source?
Last week I had lunch with an NYC entrepreneur and we discussed the Peggy Noonan post at length. He said that the reason Blodget does this is because the WSJ will never email him and ask for the post to be removed. The entrepreneur went on: suppose the WSJ did ask for the post to be removed. Blodget would post the request on Alley Insider, noting that it’s a great example of old media, yadda yadda. The reasoning makes sense.
Blodget authored a post today about their excerpting policy. Here it is: "We excerpt others the way we hope others will excerpt us." No, seriously, that’s what he said. What Blodget is saying is that they will scrape/excerpt as much as they please.
The interesting thing is that Alley Insider has added a link post that runs a few times a week. This is a good thing; it should continue and replace 100% of the scraping.
Why scraping will continue and will get worse this year
Scraping will continue because it’s easy, requires no original work, and drives revenue, SEO credit and pageviews. Here’s an example of a Digg frontpage entry by top Digg user Muhammad Saleem. In this case, he submitted the absolutely 100% scraped story from Alley Insider instead of the source. How many pageviews and users came to the story via Digg and never visited the source?
The goal should always be to excerpt the bare minimum and send the user to the source. It’s ok to include a small piece of another’s writing when needed to make a point, but whole posts shouldn’t be made of someone else’s work.
And understand, this is not just an issue for Alley Insider. I’ve written before about many other scrapers, including so-called aggregators like Socialmedian. In Socialmedian’s case, they too have no reason to scrape so much of a story’s content. Unfortunately, I’ve never received a direct answer to why they scrape so much content from stories shared on their service.
I will never understand why writers feel the need to scrape others’ hard work. You spend hours or days researching a story only to have someone scrape it in a minute and grab all the associated goodies. All the scraping does is hurt the overall blogging/media industry, so let’s stop it now.
Last night Techcrunch held its 2nd annual Crunchies awards. The evening was a near mirror image of last year’s event. Sixteen awards were given, and big congratulations go out to all of the winners. CBS Interactive’s Josh Lowensohn attended the event and has a good recap of each of the awards handed out in San Francisco.
From my office in NYC, I kept notes on each of the winners on a scratchpad. I’d like to share some general thoughts about the event.
The first award went to Google Reader for best web application. Accepting the award was Marissa Mayer from Google. At a company with many thousands of employees, and a Google Reader team of probably more than a dozen, did Marissa really need to accept the award? It’s clear from past experience that Google is a very controlling company when it comes to its public face, but why not let someone actually on the Google Reader team have the spotlight? One of my Twitter followers wondered if the Crunchies asked her to accept to make sure a woman accepted at least one award. I don’t buy that – I think it’s just Google’s controlling behavior at work.
During the Crunchies event, there were three quick Q&A segments, and VentureBeat’s Matt Marshall spoke with Mayer. It immediately became apparent to me and the Twitter audience that the questions were staged (for all of the Q&A bits), because Matt started to joke around (seemingly off the script) about cupcakes, and when he switched back, Marissa started to answer before Matt even finished his question. Why does Marissa get such fluff interviews? Back at LeWeb, Marc Canter called the interview fluff, and last night’s discussion was fluff as well. I don’t know if Marissa demands easy questions, but there are plenty of topics the people in the audience care about. Techcrunch writer Steve Gillmor has been very vocal lately about FeedBurner – why not ask her about that? There are plenty of other topics as well that the audience and the Ustream viewers care about.
Let’s now move to location… that is, where does your company need to be located to win a Crunchy? Loic Le Meur noted on stage that there was only one award given to an international company, and all of the companies in that category were from Europe. Le Meur has a longer post today about the valley vs. non-valley topic with regard to the awards show.
As each winner was announced, I noted on my scratchpad where the company is located. Every award, outside of the international award (which went to eBuddy), went to a company in California. In fact, all of the winners are located in Silicon Valley except GitHub, which is located near San Diego. Take a minute and think about that… not one company won from NYC, Chicago, Denver, Boulder, St. Louis, Portland or anywhere in between. Internationally, Techcrunch runs blogs in France, the UK and Japan, and there were no winners from any of those locations, or any other city around the world, except the one winner from Amsterdam.
Does all of the “great” Web technology only come out of California? Absolutely not. I am going to have a lot more on this topic early next week.
Dennis Howlett made a comment that’s worth repeating: “If they were being honest then the Crunchies would be renamed as the Consumer Crunchies”. In fact, you should read his Twitter stream for some good, honest commentary on the overall event. He’s right, and when I looked at the nominees last month I was disappointed that the only startups in the running were those that basically get pushed around in the early adopter crowd. Where are all of the (()#@*&^%% companies that are creating real value for their users and have business models? Where are the web utility companies? I could name 100 companies that deserved to win an award last night. We see this behavior on a daily basis from the valley, and I will have more on this topic as well next week.
In closing, it’s great that Techcrunch puts on this award show and gives the community a chance to celebrate their combined success. My hope is that for their 2010 show they will consider some of the points above and make some positive changes which will benefit the Web community worldwide.
Nicholas Carlson at Silicon Alley Insider has an investigative post today in which he analyzed the advertising running on Digg. At the end of his investigation, he noted that Digg is clearly running targeted ads now, because he viewed ads that matched three of Digg’s categories. He concluded the investigation by calling the change a "Christmas miracle."
Clickety Clack disagreed with Carlson’s investigation, noting that the ads were just run-of-network ads. I decided to look into whether the ads were now, in fact, being targeted. If they are, it would certainly be a good step for Digg, because frankly I sometimes wonder who is in charge of their business plan. In our test back in March, 52% of Diggers blocked ads when visiting CN through a link on the Digg frontpage. Ads alone will never be enough to keep Digg afloat.
What did I find? To test, I used three computers (two logged into Digg, one anonymous) on three different Internet connections. Below are a few of the screenshots I grabbed from the pages that Carlson says are now targeted. As you can see, the ads in my testing are not targeted. Looking at the source code, nothing but the typical run-of-network ads appears to be running on Digg. At this point, without any confirmation from Digg, I can only infer that Carlson got lucky with his page-load timing and happened to see targeted-looking ads on the category pages.
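For readers who want to repeat this kind of check themselves, a rough programmatic version of the source-code inspection is to pull the hostnames out of the ad iframes and scripts on several saved page loads and tally them: a targeted campaign would surface category-specific hosts, while run-of-network pages keep serving the same generic ad servers. This is a minimal sketch; the HTML snippets and the `ads.examplenetwork.com` hostname are hypothetical, not Digg’s actual markup.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def ad_hosts(html: str) -> list[str]:
    """Return the hostnames referenced by iframe/script src attributes,
    which is where most ad units load from."""
    srcs = re.findall(r'<(?:iframe|script)[^>]*\bsrc="([^"]+)"', html)
    return [urlparse(u).hostname for u in srcs if urlparse(u).hostname]

# Hypothetical saved page sources from several loads of the same category page.
pages = [
    '<iframe src="http://ads.examplenetwork.com/serve?id=1"></iframe>',
    '<iframe src="http://ads.examplenetwork.com/serve?id=2"></iframe>',
    '<script src="http://ads.examplenetwork.com/tag.js"></script>',
]

# Tally hostnames across loads; one generic host repeating on every
# load is what a run-of-network buy looks like.
tally = Counter(h for page in pages for h in ad_hosts(page))
print(tally)
```

Run against real saved pages, a varied set of category-specific hosts would be the signal Carlson thought he saw; a single repeating network host supports the run-of-network reading.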
Earlier today I provided commentary on the content scraping/stealing issue that has seen a lot of discussion over the past week. I thought it might be interesting on a Saturday night to take a look at the issue from another perspective. So tonight I bring to you the world premiere of "Scrape, Scrape, Scrape".
When a writer spends hours, days or weeks on a story, should they be mad when a site scrapes (i.e. steals) their content without permission? It doesn’t matter if the story is news about Apple, a tutorial about Ajax or a recipe for a bacon-stuffed burger. Scraping occurs when site B "lifts" content from site A and republishes it without site A’s authorization. Many times site B monetizes the scraped content. And note that there are certainly times when you need to grab a bit of the source as a quote or to clarify a point in your story.
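The line between an excerpt and a scrape is really a question of how much of the source reappears on site B. One crude way to put a number on that is to measure what fraction of the source’s words show up verbatim, in order, on the suspect page. A minimal sketch (the sample sentences are made up for illustration; real pages would need their HTML stripped first):

```python
import re
from difflib import SequenceMatcher

def scrape_ratio(source: str, suspect: str) -> float:
    """Estimate the fraction of the source text that reappears
    verbatim in the suspect page, by matching runs of words."""
    src_words = re.findall(r"\w+", source.lower())
    sus_words = re.findall(r"\w+", suspect.lower())
    if not src_words:
        return 0.0
    matcher = SequenceMatcher(None, src_words, sus_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(src_words)

original = "The quick brown fox jumps over the lazy dog near the riverbank today"
excerpt = "The quick brown fox jumps over the lazy dog"  # a short quote
lifted = original                                        # a wholesale copy

print(f"excerpt: {scrape_ratio(original, excerpt):.0%}")  # well under 100%
print(f"lifted:  {scrape_ratio(original, lifted):.0%}")   # 100%
```

A quote used to make a point scores low; a post that reproduces most of the source, like the 400-word lift discussed above, scores near 100%.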
I’ve written about this topic many times and will continue to write about it until those who scrape or steal are put out of business or change their model. This past week I wrote about Socialmedian and their scraped content. While I don’t believe they are doing it for malicious reasons like some of the bottom-feeding scrapers, they are still participating. I guess it worked well for Socialmedian – they just got a cashout of a few million dollars. I am still waiting for an answer from their CEO as to why they need to scrape any content for their service to work. Note: "Digg does it" is not an answer.
A year ago the only scrapers were the bottom of the barrel scum who took full content and put ads around it. It seems we’ve moved up the food chain to larger sites living off scraping.
Whet Moser, Chicagoland editor, wrote a post titled "Grand Theft HuffPo": the Huffington Post completely scraped one of his writers’ posts without permission.
Ryan Singel at Wired compares some examples of content on site A and on Huffington Post. Singel notes that Gawker publisher Nick Denton also "hates" the Huffington Post.
This morning I see that Huffington Post contributor Henry Blodget, editor of Silicon Alley Insider, has jumped up to defend the scraping done over at HuffPo. Let’s get a quick disclosure out of the way: Huffington Post co-founder Ken Lerer is an investor in Silicon Alley Insider.
Silicon Alley Insider has changed its game a few times since launch. Initially they were all about NYC, but then they left for the Valley. It appears that shortly after editor Peter Kafka left, they moved to a scraping model. Based on my estimates, 70-80% of the content that appears on SAI is scraped. It gets even more interesting: many times the scraped content makes its way out to their partner Yahoo Finance.
Interesting note… many startups I meet with in NYC tell me they are sick of Mr. Blodget’s scraping. I can only hope these people will stop visiting the site, because that’s the only way this game will change. As long as Mr. Blodget is making bank from the scrapes with no penalty, he will continue to do it.
His belief is that if the site "aggregating" the stolen content sends visitors back to the source, then the source should shut their mouth (and keyboard?) and like it. The problem is that the "aggregator" (in this case SAI) is only growing because it is stealing content from others! Henry is basically saying that it’s ok for large sites to scrape, just not the bottom feeders. What happens many times is that the scraped story on the "aggregator" gets the massive traffic through the social news sites (e.g. Digg, Techmeme, etc.) while the source (you know, the one who spent the time to make the story) gets close to nothing. Great for the aggregator, bad for the source. And many times the reader has no idea that the content didn’t come from the "aggregator". This is a huge issue as well – but not for the thief.
I can only assume that Mr. Blodget has a deal with the New York Times after seeing a story about Outside.In on his "aggregator" (http://www.alleyinsider.com/2008/12/another-cash-infusion-for-outsidein). Here we see what looks like a full story but it actually is a complete scrape from the Times. And to make matters worse, the story has comments!
Mr. Blodget has a team of talented writers, and there’s just no reason they can’t write new content about the stories they want to cover while still providing links out to the other sources discussing the same topic. Since the link to the source would be the same whether they scrape the story or write new content, his argument holds no water.
Update: to clarify the link point – most aggregators scrape as much content as they can because SEO and traffic from Google matter so much to them. Just posting a link to the source won’t earn them that Google traffic.
If you read CN regularly, you know my view on sites and services that steal the conversation. Many of the services participating in content scraping are also stealing the conversation. All of the sites we’ve mentioned in this post are contributing to this practice.
So why is scraping becoming more popular? Simple: it’s all about the cash and pageviews. I wonder whether scrapers would stick with this method if we moved away from a pageview monetization model. It’s so easy to take another’s content, change the story title to grab fresh Google juice, and then sit back and profit.
At the end of the day, the only one who may be able to start to save us is Google. If Google stops indexing the sites that regularly scrape content, we may just start to see real change. I’d love to get Matt Cutts’ take on this topic.
Last night at the NY Tech Meetup, the companies who presented came back on stage for a 5-minute panel discussion. Below is a video of their advice for startups in NYC. The panelists included: