Open Data

I’m at Seth Goldstein’s Open Data confab at the Reuters building. I love the mission on the wall: “Open data is to media what open source is to technology. Open data is an approach to content creation that explicitly recognizes the value of implicit user data. The internet is the first medium to give a voice to the attention that people pay to it. Successful open data companies listen for and amplify the rich data that their audiences produce.”

Katie Neiderhofer of BuzzMetrics is presenting and is asked about opening up their data (because, of course, in the end, it is our data). She doesn’t quite get it, talking about sharing data with a company. Who owns the wisdom of the crowd?

She shows a chart that associates words with the concept safety and groups them: children, life, police, work, home… Bush, president, American, administration… terrorism, Iraq, military, attacks… And she finds that the emotional words (dangerous, risk, fear, ensure) are associated with the personal words: children, life, etc. This is fascinating data that also becomes useful to associate words and concepts (and, I’d say, behind that the sites and people that talk about them). She shows something called Floodgate with a live view of blog tag clusters; unfortunately, this, too, is closed.

I ask whether they have tied together the work DataMining blogger Matt Hurst did when he was at BuzzMetrics, mapping the social (linking) associations of bloggers, with what she shows: the mapping of topics. In other words, have advertisers come to them to find, for example, the most influential food bloggers? Yes, she says. So, Seth says, this becomes a “media planning tool for social media.” But there is also discussion about this being closed. If there is an influence metric, who owns that? I would benefit by knowing that I am an influential food blogger and if I am not given that information, I might shut off the closed network from exploiting me or I might join in an open, competitive network. See: The open-source ad network.
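To make the idea of combining link maps with topic maps concrete, here is a rough sketch. It is not BuzzMetrics' method, which is not described here; it simply assumes that influence within a topic can be approximated by counting links from other blogs on the same topic. The blog names, topics, and links are all made up for illustration.

```python
# Hypothetical data: which topics each blog covers, and who links to whom.
BLOG_TOPICS = {
    "alice": {"food"},
    "bob": {"food", "travel"},
    "carol": {"politics"},
    "dave": {"food"},
}
LINKS = [  # (source blog, target blog)
    ("bob", "alice"),
    ("dave", "alice"),
    ("carol", "alice"),  # carol does not write about food, so this is ignored below
    ("alice", "bob"),
    ("carol", "dave"),
]


def rank_by_inlinks(topic, blog_topics, links):
    """Rank blogs on `topic` by the number of links from other blogs on that topic.

    This is an assumed, deliberately naive influence metric (in-link count).
    """
    in_topic = {blog for blog, topics in blog_topics.items() if topic in topics}
    counts = {blog: 0 for blog in in_topic}
    for source, target in links:
        if source in in_topic and target in in_topic and source != target:
            counts[target] += 1
    # Most-linked first; ties broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for blog, inlinks in rank_by_inlinks("food", BLOG_TOPICS, LINKS):
    print(blog, inlinks)
```

Even with a metric this simple, the ownership question stands: the ranking is computed entirely from links and posts that the bloggers themselves produced.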

There is much discussion about the sale of our aggregate and/or anonymous behavioral data and issues of both privacy and PR.

Sanjiv Das from Morgan Stanley is about to explain algorithms. He says that one cannot disrupt markets but must anticipate them (hello, Viacom). He says that data will become commoditized but organization will be proprietary. Amen.

Barak Pridor of ClearForest presents text analysis. For example, he shows search results that occur only in documents that meet some test. I ask whether he could give us things that have the tag X but only if it also has the tag Y. This would be extremely valuable for such things as and Edgeio (e.g., show me posts tagged ‘mexican’ but only if they’re also tagged ‘restaurant’ and ‘new york’). I’m dying for that kind of multilayer search and analysis. It enables so much more.
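The tag query I am asking for can be sketched in a few lines. This is only an illustration under simple assumptions: posts are held in memory and each carries a set of tags. The sample posts and the function name `posts_with_all_tags` are hypothetical, not anything ClearForest or Edgeio offers.

```python
# Hypothetical sample posts; titles and tags are made up for illustration.
POSTS = [
    {"title": "Best tacos in the East Village",
     "tags": {"mexican", "restaurant", "new york"}},
    {"title": "Mole recipe", "tags": {"mexican", "recipe"}},
    {"title": "Taqueria in Austin",
     "tags": {"mexican", "restaurant", "austin"}},
]


def posts_with_all_tags(posts, required):
    """Return the posts that carry every tag in `required` (case-insensitive)."""
    wanted = {tag.lower() for tag in required}
    return [
        post for post in posts
        # Subset test: every wanted tag must appear among the post's tags.
        if wanted <= {tag.lower() for tag in post["tags"]}
    ]


matches = posts_with_all_tags(POSTS, ["mexican", "restaurant", "new york"])
for post in matches:
    print(post["title"])
```

The filter is just a set-subset test per post; the hard part in practice is having consistent tags across sites in the first place.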

  • JamesBruni

    Seth’s ideas about putting “threads” in a “vault” to be sold to marketers, advertisers and PR agencies are ambitious (yet a long way off in the future). His presentation at NY Tech Meetup a few months back got a lot of reaction. I don’t know how his “Root Exchange” for mortgage leads is doing, but he’s definitely got some financing, from notables such as Lew Rainieri, and others.

  • Publius

    Why is it our data? Did we spend the time and money to collect it?

  • SixRocks

    I think you’re missing the point on what ClearForest is up to. What you describe as far as filtering tags is a lay up for them. Where their stuff gets interesting is in real time semantic analysis of web content. They are taking a “top down” approach to creating the semantic web by extracting meaning from messy text.

    I’m one of many who have created mashups based on their web service. Take a look at and poke around (and look at the cool mashups listed on the right hand side!).

  • Jeff –

    We do give away a huge amount of data and analytics, primarily through our site, but also through presentation at conferences like the one you are at with Kate (who, by the way, is quite bright and definitely “gets it”), open publication in academic journals, or informal blog posts like the ones you referenced on Matt’s blog.

    In fact, we put many of our analytic techniques on Blogpulse before we put them into client products. Floodgate is a good example of this; the first use of the Floodgate technology was in the “Blogpulse Live” application debuted a number of months ago on We have a whole second iteration of that technology planned for a forthcoming update of Blogpulse, and all of this will happen before those technologies are ever used in client deliverables. Many of our approaches for influencer/social network analysis have also been debuted through Blogpulse.

    Jonathan Carson
    Nielsen BuzzMetrics

  • I think there is some confusion here between object data and meta data which is obtained via analytics. While there is no real debate about who owns the object data (and there are many models which work by simply taking that data and exploiting it with no permission whatsoever, search for example), there seems to be some debate here about who owns the meta data. I suspect that vendors will get value from opening up how they do stuff – e.g. how an influence metric is computed – and derive revenue from the fact that they can do it. In other words, the barrier here may well be simply the scale of the task. For example, it would make sense to disclose how one computes influence, but it still requires a huge amount of infrastructure and historical data to deliver accurate and reliable results.

    With things like text analytics, the key is going to be proving the accuracy of the method. It is one thing to extract a bunch of company names or ticker symbols from social media, but how accurate is it? Is there a bias to one type of blog over another? Convincing people of this is a key challenge.

    As for opening up data, a number of institutions including Intelliseek/BuzzMetrics and TREC have offered data sets for analysis in (academic) research contexts. Again, one needs to consider the ‘owning’ of the data with the cost of aggregation and distribution. Sure, we all own our blog posts, but I don’t own the infrastructure and distribution channels that various institutions invest in to acquire and analyse that data. Thus I can’t perform induction on my ownership and claim ownership of all blog data, the infrastructures that aggregate it and so on.

    (BTW, I’ve worked with Kate in the past but won’t damn her with praise here ;-)

  • Open data is to media what open source is to technology. Open data is an approach to content creation that explicitly recognizes the value of implicit user data.

    The analogy goes further: we can simply port the open source definition to create an open data definition or, more broadly, an open knowledge definition:

    I think this is a nicer way of going about defining open data than talking about ‘recognizing the value implicit in user data’ which is fairly vague (plus what about all the other types of data from geographic to genomic?). Even in the context of companies providing various data services to users it seems to me the main point of open data is to reduce lock-in, not necessarily to recognize the value inherent in the data itself.

  • For over two years my colleagues and I have been developing an extensible web services platform to help people protect and realize the full value of their online information to benefit themselves and the things they care about.

    We call our service KindClicks and we hope it will help convert the web from an institutionally dominated, commercially focused medium to one that is more democratic and serves social good as well.

    Please feel free to join us

    Help yourself.
    Help the world.

  • We also have a presence on Facebook.