future sites, and datamining with teeth

Posted on February 14, 2009
Filed Under Collective Intelligence

SEED magazine had a tiny article about how web datamining is an inefficient way to conduct scientific inquiry with large datasets, the reason being that any such dataset cannot really have been created with the goal of answering the scientific questions that a research team happens to have decided to ask. I’d like to expand upon this idea:

You can go from the ‘understand first, then seek to control’ paradigm used in online marketing to the inverse: a ‘control/direct first, then seek-to-understand’ paradigm that could be used by science for the common good.

Simply put, there’s nothing to stop you from building a site specifically to capture a certain kind of data. Especially if the site masquerades as something else that users would find interesting, entertaining or useful. And there’s no reason the site couldn’t actually do all of the above (have value for the end-user, while still finding a way to get them to fess up relevant data). This is a very different approach to just ‘making do’ with data that a site happens to be able to collect. And it’s an approach that could directly benefit scientific research.

The future

As we leave more and more of a data trail in netspace, and as more and more people become professionally trained at sifting through the metadata of our digital lives (see ‘recognitionist’ in this post), it becomes more tempting and viable for researchers, not just merchants, to point their apparatus at all this data and its associated inferences.

Eventually, these opposing forces of understanding what people are doing online (commerce-driven) vs making people do what you want online (in pursuit of scientific data) are going to start impacting the very architecture of websites.

Just the same way Google came along and, by dint of its analysis of webpages, CAUSED page architectures to change dramatically, so too will datamining on the web begin to impact how sites are built. Just as SEO created site selection pressures (mainly via page design and structure, and slightly less via site structure), datamining – whether it is research-driven or commerce-driven or both – will create similar selective pressures on site design. Why would that happen? Well, follow this trail of logic and see what you think:

1) usable information is currency.

2) to create usable (or useful) information, we need to gather the right data.

3) Up until now, usable information has mainly rested on mere speed and availability of underlying web data. We’ve gotten to a stage now where the internet has grown and we almost take such parameters for granted; they’ve levelled off, in terms of impact. Data relevancy is increasingly the metric that decides how usable information is. In other words, usability of information will begin to correlate more strongly with the relevancy of the gathered data that constitutes it.

Relevancy means we’re going to start paying more attention to (read ‘standardising’ on):
- the core types of data gathered
- the structure of each such type
- convertibility between such data types

4) Sites that harvest relevant, well-structured data are going to end up being rewarded by the two parties mentioned earlier: web marketers and research scientists. Why? Well, there have only ever been two choices:

  1. build a site each time you wish to control / understand user behaviour, and maintain full control of the gathered data. OR,
  2. buy data from other people’s sites.

It makes sense that the two camps will begin to collaborate more: they both want data, but are not necessarily in the site-building business. Eventually they’re going to figure out that if they join forces, they can simply… ahem… ‘encourage’ sites to adhere to an infrastructure that will evolve to meet their data needs.

5) Once the science-and-commerce hybrid organism starts waving its dollars at us, it’s going to start making sense for us to structure our sites so that we can actually give them the data they need. And maybe even the technologies we build our sites with will start having to fall in line. Expect to see a variant of JavaScript that creates gravity wells for your mouse pointer (take that, eyeballs… you and your wayward, wandering behaviour!). OK, maybe not that, but seriously: there’ll be more under-the-hood code to support:

a) – content granularity: every/any element on a site may be an identifiable, trackable entity in its own right. We have to commend CSS, as a technology, for inducing the groundbreaking, collective paradigm shift which allowed us to see HTML entities as… well, entities, period. Things you can do anything with. We’re only just catching up on the ‘anything’ part. Look at how Google Website Optimizer is taking the idea of content granularity to town.
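To make ‘every element is a trackable entity’ a bit more concrete, here is a minimal sketch in Python. It walks a page and hands each element a path-style identifier that a tracker could key its data on. The identifier scheme, the ElementIndexer class and the index_elements helper are all made up for illustration, and the sketch assumes reasonably well-formed markup; it is not how any existing tool does it.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are never pushed on the stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ElementIndexer(HTMLParser):
    """Gives every element a path-style identifier (a hypothetical scheme)."""

    def __init__(self):
        super().__init__()
        self.stack = []       # identifiers of the currently open elements
        self.counters = [{}]  # per-parent count of each tag seen so far
        self.entities = []    # (identifier, attributes) for every element

    def handle_starttag(self, tag, attrs):
        seen = self.counters[-1]
        seen[tag] = seen.get(tag, 0) + 1
        parent = self.stack[-1] if self.stack else ""
        ident = f"{parent}/{tag}[{seen[tag]}]"
        self.entities.append((ident, dict(attrs)))
        if tag not in VOID:
            self.stack.append(ident)
            self.counters.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each close tag matches the last open one.
        if self.stack and tag not in VOID:
            self.stack.pop()
            self.counters.pop()


def index_elements(html):
    """Return (identifier, attributes) for each element, in document order."""
    indexer = ElementIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.entities
```

Each identifier could then serve as the key under which clicks, hovers or experiment variants are recorded for that one element.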

b) – more universal, seamless ‘login’ platforms sold (or if open source, made available) to site owners so that they can hook into a framework that allows them to NEVER AGAIN ask who you are.

c) – universal data models, universal semantics: think of it as RDF on steroids, like a sort of giant, super-flexible class diagram for all the content swimming about on the net, with well-known/public vocabularies and APIs, and a set of toolkits and plugins which allow a site builder to only hook into the bits they need, without worrying about the rest of the massive beast.

Imagine an ‘audio’ entity being used by a site which allows you to download mp3s. Such a site might not care about related aural concepts like time signature or key. But maybe they do make use of attributes like bpm, artist, duration, downloaded-at, downloaded-by, etc.

An online marketer or researcher could marry such skeletal (audio-related) data flows to those from other sites which gather different attributes… all tied back to the same, globally understood (or hell, maybe even locally understood) ‘audio’ entity. They can then build their models based on data that is relevant to their corporate or scientific purpose.
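Here is a rough sketch of that marrying step in Python, assuming each site publishes records keyed by a shared entity identifier and sticks to a common ‘audio’ vocabulary. The vocabulary set, the identifier format and the merge_audio_records function are all hypothetical; no such standard exists that I know of.

```python
from collections import defaultdict

# Hypothetical shared vocabulary for the 'audio' entity.
# Each site is free to use only the attributes it cares about.
AUDIO_VOCABULARY = {
    "bpm", "artist", "duration", "downloaded_at", "downloaded_by",
    "time_signature", "key",
}


def merge_audio_records(*site_feeds):
    """Combine per-site records that describe the same 'audio' entity.

    Each feed is a list of (entity_id, attributes) pairs. Attributes outside
    the shared vocabulary are rejected. If two feeds disagree on a value,
    the later feed wins (a simplification for this sketch).
    """
    merged = defaultdict(dict)
    for feed in site_feeds:
        for entity_id, attributes in feed:
            unknown = set(attributes) - AUDIO_VOCABULARY
            if unknown:
                raise ValueError(f"not in the shared vocabulary: {sorted(unknown)}")
            merged[entity_id].update(attributes)
    return dict(merged)


# Illustrative feeds: a download site and a (made-up) music-theory site.
download_site = [("audio:track-1", {"bpm": 120, "duration": 215})]
theory_site = [("audio:track-1", {"time_signature": "4/4", "key": "A minor"})]
combined = merge_audio_records(download_site, theory_site)
```

The download site never had to care about key or time signature, yet the combined record carries both, because everything hangs off the same entity.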

d) – semantic tracking: currently, web tracking technologies usually funnel everything to a single gateway, where you can then filter the data according to your needs. This is a model that assumes that all your tracking data is yours alone. In the brief history of tracking, we’ve already evolved to a point of wanting to coalesce tracking data from several sites at once. But why stop there?

Future data networks will require that some (or maybe even most) of our tracking data ends up elsewhere, with our data partners. Similarly, our future site analyses will be rendered incomplete until we ‘import’ a whole bunch of data from other networks which make what we’ve got more relevant – perhaps by providing a better semantic context. Right now all we have is Google’s kindness… in letting us know, for example, what search keywords caused people to end up on our site.

Semantic tracking would require that tracking data be dispersed among several networks, with each ‘goal’ of understanding (e.g. a research or marketing question) implementing its own hierarchical set of filters as a lens through which to acquire the necessary insight(s). Reification would become a bigger part of the resulting tracking systems.
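As a toy illustration of the ‘lens’ idea, the Python sketch below gathers events from several networks and runs them through an ordered stack of filters, coarse to fine, one stack per question. The event fields, the network names and the gather and make_lens helpers are all assumptions made up for this example.

```python
def gather(networks):
    """Flatten events from several networks, tagging each with its source."""
    return [
        dict(event, source=name)
        for name, events in networks.items()
        for event in events
    ]


def make_lens(*filters):
    """Build a lens: filters applied in order, each narrowing the last result."""
    def lens(events):
        for keep in filters:
            events = [event for event in events if keep(event)]
        return events
    return lens


# Hypothetical tracking data, split between our own site and a data partner.
networks = {
    "own_site": [
        {"entity": "audio", "action": "download"},
        {"entity": "page", "action": "view"},
    ],
    "partner": [
        {"entity": "audio", "action": "preview"},
        {"entity": "audio", "action": "download"},
    ],
}

# One research question = one lens: first audio only, then downloads only.
audio_downloads = make_lens(
    lambda event: event["entity"] == "audio",
    lambda event: event["action"] == "download",
)
insight = audio_downloads(gather(networks))
```

A different question would simply get a different stack of filters over the same dispersed data, which is the whole point.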

***

So there you have it.

To quote some cylon-esque lore: all this has happened before, and all this will simply happen some more, but on a much bigger, more obvious scale. In other words, there are networks already doing this, mainly via cookies and RDF tags that you or I never have to see. Often too, the whole process is rather surreptitious…

In the future, data networks are going to be able to crawl out from under their myriad rocks and look us straight in the eye, and openly say, “All your metrics are belong to us. To continue, click the ‘gladly’ button to accept datamining under the Metrix 1.9a International Accord.”

And we’ll click it, too, won’t we? It might become as integral to our day as a cup of morning coffee.
