Where Should The Data Reside?

Allen Stern - May 29th, 2009

Apologies in advance for a semi-technical post on a Friday night, but I think it’s a topic worth discussing. Over the past few months I’ve noticed more and more sites copying pieces of content from one social service and placing them into another social service or a blog/website. Is this a good idea?

If I post a message on Twitter, it is instantly copied to my Friendfeed account. If I delete that tweet, it is not removed from Friendfeed. I chose to have Friendfeed read and aggregate my Twitter account, so the behavior makes sense on the display side. But since Friendfeed can read and write to Twitter, can’t they just read the current status of messages?

I’ve also noticed more blogs sucking in content from Twitter and Friendfeed. It’s a smart move for the blogs because it makes for more monetizable content and can also make a blog appear more active. Some blogs appear to be scraping the content on their own; others are using comment aggregation services like Disqus. I asked Disqus about their social comment aggregation and was told that they store the aggregated comments on Disqus’ servers. Unlike Friendfeed, where I specifically told them to aggregate my content, I didn’t authorize my comments to be aggregated on other blogs. And with regard to Disqus, when I make a comment on Twitter or Friendfeed that is scraped back to the Disqus database, I don’t believe that it’s placed into my Disqus account, which makes it even harder for me to manage. Of course, I have practically zero recourse against the blogs that scrape Friendfeed/Twitter directly.

My take is that it’s fine to display content from other social services but it should be a display only — not/never a store and retain. This way if the content creator decides to delete or edit the content, the updated version will be the one displayed across the Web.
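
To make the "display only" idea concrete, here is a minimal sketch in Python. It assumes the source service exposes some way to ask for the current state of a single item; the `Fetcher` type, the `DisplayOnlyView` class, and the 60-second cache window are all hypothetical names and values chosen for illustration, not anything Twitter, Friendfeed, or Disqus actually offers. The point is only that the displaying site keeps a short-lived cache rather than a permanent copy, so a deletion or edit at the source shows up once the cache expires.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher: returns the current text for an item id,
# or None if the author has deleted the item at the source.
Fetcher = Callable[[str], Optional[str]]


class DisplayOnlyView:
    """Shows remote content by reference, keeping only a short-lived cache."""

    def __init__(self, fetch: Fetcher, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # item id -> (time fetched, text or None if deleted at the source)
        self._cache: Dict[str, Tuple[float, Optional[str]]] = {}

    def render(self, item_id: str) -> Optional[str]:
        now = self._clock()
        cached = self._cache.get(item_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Cache is missing or stale: ask the source of record again.
        text = self._fetch(item_id)
        self._cache[item_id] = (now, text)
        return text
```

With a plain dictionary standing in for the source service, `DisplayOnlyView(source.get)` keeps returning the text while it exists and starts returning `None` within one cache window of the item being deleted. The trade-off is that the displaying site depends on the source being reachable.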

Perhaps this is a data portability topic?

As more social aggregation services pop up and blogs look for more content to monetize, I believe this issue will become a hot topic this year.

4 COMMENTS
  1. anon says:

    can you kill the gravatar on that last post? i wrote it on a cookie-less comp, but apparently it follows IPs or something…it’s not good writing

    …or kill the whole post?

    thanks.

  2. anon says:

    Couple disparate thoughts (it’s late so I’m a bit loopy):
    –Google’s Wave will make this…is it worse? or better? However you look at it…I THINK they’ve got tools to embed Waves into your blog, his blog, her blog, everyone’s blog can have Allen’s Wave…
    –the general motive of these re-re-et-al-re-publishers is to get Google ad cash. They don’t know Google’s already filtering them out and/or they are a dime-a-dozen (search for some freeware and you’ll find hundreds of sites with the same content – it’s the same idea with these blog/tweet re-publishers)…so they get lost in the crowd. Sure Mahalo (etc) has a bigger footprint, but it doesn’t matter…only sources matter (SEO) long term.
    –@epc – good post…also, no way to FORCE an update on a remote site. And, no way to FORCE proper implementation of a protocol (say, if the protocol includes methods to force updates). eg. I did an email gateway years ago following the exact protocol. In the wild, it needed massive modding to accommodate everyone’s goofy/broken protocol implementations (even big names like IBM/Novell/MS/etc)
    –Twitter to FF to FB to X then Y then Z and back to Twitter…infinite loop? Sweeeet…
    –We got one post on another blog and saw it copied – exact copies – on hundreds of sites. Didn’t do anything – 99.99% of traffic came from original source (see above: re-publishers get few hits – I know someone’s going to argue this point…“I made $10 in one month…!!!”, big deal.).
    –once it’s on the inets and Google has it, you’re hosed. cat’s out of the bag
    –this is reflected in the so-called “Real Time” web (a misnomer). It’s only real-time as fast as the consumers are designed to re-consume…and as fast as the publishers can push. And as correct as consumers care to make it correct. Only the source is “real time” (even then, not really).
    –”My take is that it’s fine to display content from other social services but it should be a display only — not/never a store and retain.” In theory I agree that’s how it should work. But it goes back to caring – by BOTH parties. If the publisher doesn’t care to display the data in a format people want (see: Twitter), a second party can come and suck it all into their system to format and re-display how users want to see it (see: Summize). And, if the “real time” publisher crashes, they don’t want to be pulled down too if their data is a mixture.
    –you mention “data portability”. Is this about “data portability”? Isn’t “data portability” moving your private/semi-private account data from one site to another? Off topic, yeah, “data portability” is scary. Ever delve into FB Connect? Wow. Watch out what you wish for.

  3. Allen Stern says:

    thanks for the great reply epc!

  4. epc says:

    There was some great discussion about this at the recent “Glue” conference but no clear-cut answers.

    At a technical level there are a couple of problems: it’s trivial to syndicate the data, but non-trivial to synchronize actions on the data. If the feed is an Atom feed there’s a notion of stubs to reflect that a given bit of content has been unpublished, but this concept doesn’t exist in RSS, which is what most sites use for data exchange.

    There’s also no notion of what I’m going to call “contractual use of data”. There’s no way to obligate a subscribing party to either update a given element of data (maybe I published something in error and I want to push out the correction) or remove it (for whatever reason).

    An author/publisher I know had a hell of a time getting bad data out of “the system” for a book he wrote. Initially (years ago) he’d talked to O’Reilly about getting it published. For whatever reason that didn’t go through. For reasons even O’Reilly admits were in error, the book appeared in a database update of upcoming titles. For the next several years the title showed up as an O’Reilly title complete with erroneous ISBN even though the author and O’Reilly quickly cleaned up the original bad data source. It flowed out to Amazon, then other sites and even to this day resurfaces years later.

    The problem with establishing some sort of contractual obligation on data flow is: isn’t that DRM? And it is, in a way, I guess, but not in the sense of preventing copies or use; rather in requiring some sense of fidelity to the original data.

    Atom tried to achieve a first cut at this both with the stub idea for deletions as well as the requirement of a unique identifier for each chunk of content — the idea being that even if you republish my blog post from my personal site over here on CN, the original id is maintained, but in practice no one does this and the tools don’t really support or enforce it.
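
The two Atom ideas in the comment above (a stub for deletions and a stable unique id per entry) can be sketched roughly as follows. This assumes the "stub" refers to the Atom tombstone element, `deleted-entry`, which was later published as RFC 6721; the sample feed, the `tag:example.org` ids, and the `apply_feed` helper are invented for illustration. It also assumes every timestamp is UTC in the same format, so plain string comparison orders them correctly.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

ATOM = "{http://www.w3.org/2005/Atom}"
TOMB = "{http://purl.org/atompub/tombstones/1.0}"

# Made-up sample feed: one corrected entry and one tombstone.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:at="http://purl.org/atompub/tombstones/1.0">
  <title>Example feed</title>
  <entry>
    <id>tag:example.org,2009:post-1</id>
    <title>A post</title>
    <updated>2009-05-30T09:00:00Z</updated>
    <content>Corrected text</content>
  </entry>
  <at:deleted-entry ref="tag:example.org,2009:post-2"
                    when="2009-05-30T09:05:00Z"/>
</feed>"""


def apply_feed(copies: Dict[str, Tuple[str, str]], feed_xml: str) -> None:
    """Update a republisher's local copies, keyed by the original entry id."""
    root = ET.fromstring(feed_xml)
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        content = entry.findtext(ATOM + "content", default="")
        if entry_id is None or updated is None:
            continue  # missing required elements; skip this entry
        known = copies.get(entry_id)
        # Same id means same piece of content: replace it if the feed is newer.
        if known is None or known[0] < updated:
            copies[entry_id] = (updated, content)
    for stub in root.findall(TOMB + "deleted-entry"):
        # A tombstone names the id of an entry the publisher has removed.
        copies.pop(stub.get("ref"), None)
```

Because the copies are keyed by the original id, a correction replaces the earlier copy instead of creating a duplicate, and a tombstone removes it. Nothing here obliges a republisher to run code like this, which is the gap the comment describes.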
